Re: BTree vacuum before page splitting
От | Junji TERAMOTO |
---|---|
Тема | Re: BTree vacuum before page splitting |
Дата | |
Msg-id | 43E090F5.3060603@lab.ntt.co.jp обсуждение исходный текст |
Ответ на | Re: BTree vacuum before page splitting (Simon Riggs <simon@2ndquadrant.com>) |
Список | pgsql-patches |
Thanks! I had misunderstood the content of the comment. I think that I can understand the problem now. I might be able to present the solution of the restarting an index scan. I try to correct the patch. Simon Riggs wrote: > On Tue, 2006-01-31 at 15:18 +0900, Junji TERAMOTO wrote: >> Tom Lane wrote: >>> I think this is quite likely to break things :-(. What sort of >>> conditions have you tested it under? (If this were safe, we'd >>> not have invented the LP_DELETE flag to begin with, but just have >>> deleted known-dead items immediately.) >> I apologize for my insufficient description(and bad english). >> I explain the operation of this patch in detail again. >> >> In _bt_insertonpg(), we insert the tupple by the following methods. >> >> (1) Finds the right place(page) to insert the tuple. >> (2) If free space is insufficient, we operate as follows. >> + (2.1) Confirm whether the lock to the found page is considerable >> super-exclusive lock in BufferSuperLockHeldByMe()[new]. We >> check bufHdr->refcount(=pin), and if pin == 1, we know this >> lock is super-exclusive lock. >> If our lock is not super-exclusive lock, goto (2.4). >> + (2.2) If our lock is super-exclusive lock, and the found page is >> leaf, we look for tupples marked as "LP_DELETE" from the found >> page, and remove found tuples in _bt_vacuum_page()[new]. >> + (2.3) If free space is sufficient after removal, goto (4). >> (2.4) Step right to next non-dead page. >> (2.5) (2) is repeated until an enough free space is found or it >> reaches a right edge at last. >> (3) When an enough free space is not found by processing (2), splits >> the target page (making sure that the split is equitable as far as >> post-insert free space goes). >> (4) Inserts the tuple. >> >> The numbers from 2.1) to 2.3) are new processing. >> >> The pointer's shifting by removing "LP_DELETE" tuples as shown in the >> above-mentioned becomes a problem, when we resume a stopped indexscan. >> Therefore, I am having it confirm the state of the lock by (2.1), and >> process only at super-exclucive lock, same as btbulkdelete(). >> >> However, because own indexscan might be lost because of this removal, >> I also modify _bt_restscan() as follows. >> >> (1) Check for index tuple we stopped on. >> (2) If index tuple is moved, first of all, we search in this page >> right. >> +(3) If not found, we try to look for from the head of this page >> because index tuple may moved left due to removal. >> (4) If still not found, we look for index tuple right. >> >> The number (3) is new processing. >> >> We do not delete the empty page. Therefore, there is no necessity for >> tracing a left page. >> >> I think that no problem occurs by this logic. >> Please point it out if there is a wrong point. Thanks. > > Please read comments in nbtree.c p.557-566 which describe the required > locking protocol for btree pages. > > Just because you hold the lock does *not* mean you are allowed to delete > tuples marked as deleted. This patch doesn't work in all cases, which is > what is required, since the failure cases don't raise an ERROR - they > just miss data - which is unacceptable. So your patch works 99.9% of the > time, but sometimes gives wrong answers to queries without you knowing. > So, patch rejected, but good thinking all the same. > > You will need to argue for a change in the locking protocol before it > could be accepted. That would speed up VACUUM also, so if it were > possible it would be gratefully accepted. > > The only help I can give is this: Why does an index scan ever want to > stop on a deleted tuple? Currently, the scan only remembers one tuple > its seen. If the tuple was dead, but wasn't yet marked as dead the index > scan will have read it and stopped there. Some while later, the index > scan will return and begin scanning right for the tuple. If you delete > it, the index scan *may* not be able to work out where to restart -- > ERROR. So, if you can work out a way of not restarting an index scan on > a deleted tuple (and yet still marking them as deleted when it sees > them) then you may be able to solve the problem. > > I'm looking at the same problem domain also (table bloat), but focusing > on a different approach that avoids this issue. > > Best Regards, Simon Riggs -- Junji Teramoto / teramoto.junji (a) lab.ntt.co.jp
В списке pgsql-patches по дате отправления:
Следующее
От: Bruce MomjianДата:
Сообщение: Re: [HACKERS] FW: PGBuildfarm member snake Branch REL8_0_STABLE Status