Re: B-tree parent pointer and checkpoints

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: B-tree parent pointer and checkpoints
Дата
Msg-id 4CDBD20F.8090807@enterprisedb.com
обсуждение исходный текст
Ответ на Re: B-tree parent pointer and checkpoints  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On 11.11.2010 00:49, Tom Lane wrote:
> I wrote:
>> What happens if you error out in between?  Or is it assumed that the
>> *entire* sequence is a critical section?  If it has to be that way,
>> one might wonder what's the point of trying to split it into multiple
>> WAL records.
>
> Or, to be more concrete: I'm wondering if this *entire* mechanism isn't
> a bad idea that we should just rip out.
>
> The question that ought to be asked here, I think, is whether it
> shouldn't be required that every inter-WAL-record state is a valid
> consistent state that doesn't require post-crash fixups.  If that
> isn't the case, then a simple ERROR or FATAL exit out of the backend
> that was creating the sequence originally will leave the system in
> an unacceptable state.  We could prevent such an exit by wrapping the
> whole sequence in a critical section, but if you have to do that then
> it's not apparent why you shouldn't fold it into one WAL record.
>
> IOW, forget this patch.  Take out the logic that tries to complete
> pending splits during replay, instead.  I believe this is perfectly safe
> for btree: loss of a parent record isn't fatal, as proven by the fact
> that searches don't have to be locked out while a split proceeds.
> (We might want to make btree_page_del not think that a missing parent
> record is an error, but it shouldn't think that anyway, because of the
> possibility of a non-crashing failure during the original split.)
> This approach might not be safe for GIST or GIN; but if it isn't, they
> need fixes anyway.

GIN is similar to b-tree, the incomplete split logic there is for 
inserting the parent pointers in the b-tree within the GIN index, just 
like nbtree.

GiST is different. When you insert a key to a leaf page, you (sometimes) 
need to adjust the parent pointer to reflect the new key as well. B-tree 
tolerates incomplete splits with the 'next page' pointer, but that is 
not applicable to gist. Teodor described the issue back in 2005 when 
WAL-logging was added to GiST 
(http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):

> The problem with incopmleted inserts is: when new entry is installed into leaf page, all chain (in a worst case) of
keysfrom root page to leaf should be updated. Of course, case of updating all chain is rarely and usially it's updated
onlypart. Each key on inner pages contains union of keys (bounding box in a case of rtree, for example) on page which
itpoints. This union can formed only with a help of user-defined function of opclass, because of GiST doesn't know
somethingabout nature of keys. Returning to WAL, GiST core write xlog entry with all nessary information for
restorationbefore write page, but in this moment it doesn't know it should update keys on parent page or key is
unchanged.So GiST's WAL restoration code should remember this page's update as incompleted insert. When insert
complited,GiST's core write to log that insert is completed and restore code can clean up stored incompleted insert. If
itwas crash, then sign of completed insert can be absent in 
 
log, and GiST's restore code should continue it. While continue, it's know which page was changed and should walk up to
root.On each step of walk it should form union for page and insert it to parent.
 

Reading that I wonder: what harm would an incomplete insert cause if we 
just left it in the tree? Imagine that you insert a key to a leaf page, 
but crash before updating the parent. If you search for the key starting 
from the root, you'll fail to find it, because the parent pointer claims 
that there are no entries with such a key on the child page. But that's 
OK, the inserting transaction aborted with the crash!

Do some of the other GiST algorithms get confused if there's a key on a 
page that's not represented by the parent pointer? It's possible that 
you insert a key to the leaf, update the leaf's immediate parent, but 
crash before updating the parent's parent. As far as search is 
concerned, that's OK as well, but it creates a hazard for subsequent 
inserts. If you later insert a tuple with the same key to the same leaf 
page, the insertion will see that the parent pointer already includes 
the key, and will fail to update the parent's parent. That's a problem.

Would it be hard to change the algorithm to update the parent keys 
top-down rather than bottom-up?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Heikki Linnakangas
Дата:
Сообщение: Re: B-tree parent pointer and checkpoints
Следующее
От: Dave Page
Дата:
Сообщение: Re: improved parallel make support