Re: btree shrinking again
От | Curtis Faith |
---|---|
Тема | Re: btree shrinking again |
Дата | |
Msg-id | DMEEJMCDOJAKPPFACMPMOEFACFAA.curtis@galtair.com обсуждение исходный текст |
Ответ на | Re: btree shrinking again (Tom Lane <tgl@sss.pgh.pa.us>) |
Список | pgsql-hackers |
> Alvaro Herrera <alvherre@dcc.uchile.cl> writes: > > + Deletions are handled by getting a super-exclusive lock on the target > > page, so that no other backend has a pin on the page when the deletion > > starts. This means no scan is pointing at the page. This is OK for > > deleting leaf items, probably not OK for deleting internal nodes; > > will need to think harder when it's time to support index compaction. > > > In what cases is not OK to delete an item from an internal node, holding > > a super-exclusive lock? > > I believe the thing I was worried about when I wrote that note was the > stack of ancestor pointers maintained by an insert operation: the insert > will not have pins on those pages, but might try to return to them > later (to service a page split). tom lane wrote: > A simple-minded solution might be to keep the pins until the insert is > done, but you'd have to think about possible deadlock conditions as well > as loss of concurrency. I'd prefer to find a solution that didn't > require that. There is an algorithm that work pretty well for this and that don't require holding pins up the tree in most cases. The locking for deletes mirrors that for inserts. In each case, either a delete or an insert could require a change to the parent page recursively up to and including the root page. Traversals pin read-only starting at the root, after each page pin is acquired pins on previous, more interior pages are released. Leaf pages are pinned read/write for insert or delete cases. The strict ordering of lock acquisition pinning from the root toward the more exterior/leaf nodes, and never from the leaf up, will prevent deadlocks. Each page keeps a version number that is incremented after each change (you don't really have to worry about rollover on this since checks against the version number are for equality and inequality only). If an insert will cause a split and hence an insert for the interior page then the parent interior page of the page that has been split is pinned for write without waiting (waiting could cause deadlocks since it would violate the root to leaf acquisition ordering mentioned above). If the pin fails because another lock is held on that page, or the page is not there; or after pinning the version number is different than the version number found when the page was originally pinned on the way down to the leaf, the insert is aborted. In case of abort, the btree is searched again starting at the root going down to the page which needs to be inserted write pinning each of the pages on the way down. Another variation simply retries the insert again in the case of aborts but acquires a write lock on page one level up from the lowest level of the last try. In the case of the first abort for a four-level tree this would acquire a read lock on the root node (level 0) and level 1, and write locks on levels 2 and 3, since level 3 failed the previous time. This method provides better concurrency at the expense of slightly more work for inserts or deletes that require major restructuring of the btrees. Leaf pages can be pinned across the bottom of the tree for index scans but never up the tree. This means that leaf nodes require sibling pointers in order to support index scans. I don't know if PostgreSQL's btree has these already. I'm sure I have missed some detail but this is the general idea. - Curtis
В списке pgsql-hackers по дате отправления: