Race condition in b-tree page deletion

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Race condition in b-tree page deletion
Дата
Msg-id 527E5C7A.3090902@vmware.com
обсуждение исходный текст
Ответы Re: Race condition in b-tree page deletion  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Race condition in b-tree page deletion  (Jim Nasby <jim@nasby.net>)
Список pgsql-hackers
The B-tree page deletion algorithm has a race condition. We don't allow 
a page to be deleted if it's the rightmost child of its parent, but that 
situation can change after we check for it.

Problem
-------

We check that the page to be deleted is not the rightmost child of its 
parent, and then lock its left sibling, the page itself, its right 
sibling, and the parent, in that order. However, if the parent page is 
split after the check but before acquiring the locks, the target page 
might become the rightmost child, if the split happens at the right 
place. That leads to an error in vacuum (I reproduced this by setting a 
breakpoint in debugger):

ERROR:  failed to delete rightmost child 41 of block 3 in index "foo_pkey"

We currently re-check that the page is still the rightmost child, and 
throw the above error if it's not. We could easily just give up rather 
than throw an error, but that approach doesn't scale to half-dead pages. 
To recap, although we don't normally allow deleting the rightmost child, 
if the page is the *only* child of its parent, we delete the child page 
and mark the parent page as half-dead in one atomic operation. But 
before we do that, we check that the parent can later be deleted, by 
checking that it in turn is not the rightmost child of the grandparent 
(potentially recursing all the way up to the root). But the same 
situation can arise there - the grandparent can be split while we're not 
holding the locks. We end up with a half-dead page that we cannot delete.

To make things worse, the keyspace of the deleted page has already been 
transferred to its right sibling. As the README points out, the keyspace 
at the grandparent level is "out-of-whack" until the half-dead page is 
deleted, and if enough tuples with keys in the transferred keyspace are 
inserted, the page might get split and a downlink might be inserted into 
the grandparent that is out-of-order. That might not cause any serious 
problem if it's transient (as the README ponders), but is surely bad if 
it stays that way.

Solutions
---------

1. The simplest solution I can see is to never delete internal pages, 
avoiding the whole concept of half-dead pages. That obviously leads to 
some bloat, though.

2. The second-simplest solution I see is to keep locked the whole chain 
of pages that will be deleted, and delete all of them as one atomic 
WAL-logged operation. Ie. the leaf page, and all the parent pages above 
it that will become half-dead, and the parent of the last half-dead page.

3. Another approach would be to get rid of the "can't delete rightmost 
child" limitation. We currently have that limitation because it ensures 
that we never need to change the high-key of a page. If we delete a page 
that is the rightmost child of its parent, we transfer the deleted 
keyspace from the parent page to its right sibling. To do that, we need 
to update the high key of the parent, as well as the downlink of the 
right sibling at the grandparent level. That's a bit complicated, 
because updating the high key might require splitting the page.

Still, I think it would be doable: When we delete a page that is the 
rightmost child of its parent, we mark the right sibling with an 
"out-of-whack" flag. It indicates that the keyspace of that page is not 
in sync in the parent level. Searches work fine, as there are no keys in 
the keyspace that is out-of-whack, but before allowing any insertions to 
the page, the situation needs to be fixed. That is the responsibility of 
the next insert to the page that comes along. To fix it, the parent's 
left sibling's high key needs to be updated, as well as the parent's 
downlink in the grandparent. We can do that in a few discrete steps; 
first update the high key, then update the downlink, and finally once 
the high key and parent's downlink are in sync with the reality at the 
bottom level, clear the "out-of-whack" flag.

On reflection, approach 2 seems best. It's fairly simple, although it 
potentially requires holding many pages locked simultaneously. In 
practice, B-trees are typically not very deep. We have a limit of 4 
full-page images in a single WAL record, but since all the pages are 
empty, we can just WAL-log all the information needed to reconstruct the 
pages in deleted state without using full-page images. This also might 
be back-patchable, although it requires changing the WAL record format, 
so it would break replication from higher minor version to lower.

- Heikki



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Craig Ringer
Дата:
Сообщение: Re: Row-security writer-side checks proposal
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Race condition in b-tree page deletion