Re: B-tree parent pointer and checkpoints
От | Heikki Linnakangas |
---|---|
Тема | Re: B-tree parent pointer and checkpoints |
Дата | |
Msg-id | 4CDBD253.6060304@enterprisedb.com обсуждение исходный текст |
Ответ на | Re: B-tree parent pointer and checkpoints (Tom Lane <tgl@sss.pgh.pa.us>) |
Ответы |
Re: B-tree parent pointer and checkpoints
(Tom Lane <tgl@sss.pgh.pa.us>)
|
Список | pgsql-hackers |
On 11.11.2010 00:49, Tom Lane wrote: > I wrote: >> What happens if you error out in between? Or is it assumed that the >> *entire* sequence is a critical section? If it has to be that way, >> one might wonder what's the point of trying to split it into multiple >> WAL records. > > Or, to be more concrete: I'm wondering if this *entire* mechanism isn't > a bad idea that we should just rip out. > > The question that ought to be asked here, I think, is whether it > shouldn't be required that every inter-WAL-record state is a valid > consistent state that doesn't require post-crash fixups. If that > isn't the case, then a simple ERROR or FATAL exit out of the backend > that was creating the sequence originally will leave the system in > an unacceptable state. We could prevent such an exit by wrapping the > whole sequence in a critical section, but if you have to do that then > it's not apparent why you shouldn't fold it into one WAL record. > > IOW, forget this patch. Take out the logic that tries to complete > pending splits during replay, instead. I believe this is perfectly safe > for btree: loss of a parent record isn't fatal, as proven by the fact > that searches don't have to be locked out while a split proceeds. > (We might want to make btree_page_del not think that a missing parent > record is an error, but it shouldn't think that anyway, because of the > possibility of a non-crashing failure during the original split.) > This approach might not be safe for GIST or GIN; but if it isn't, they > need fixes anyway. GIN is similar to b-tree, the incomplete split logic there is for inserting the parent pointers in the b-tree within the GIN index, just like nbtree. GiST is different. When you insert a key to a leaf page, you (sometimes) need to adjust the parent pointer to reflect the new key as well. B-tree tolerates incomplete splits with the 'next page' pointer, but that is not applicable to gist. Teodor described the issue back in 2005 when WAL-logging was added to GiST (http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php): > The problem with incopmleted inserts is: when new entry is installed into leaf page, all chain (in a worst case) of keysfrom root page to leaf should be updated. Of course, case of updating all chain is rarely and usially it's updated onlypart. Each key on inner pages contains union of keys (bounding box in a case of rtree, for example) on page which itpoints. This union can formed only with a help of user-defined function of opclass, because of GiST doesn't know somethingabout nature of keys. Returning to WAL, GiST core write xlog entry with all nessary information for restorationbefore write page, but in this moment it doesn't know it should update keys on parent page or key is unchanged.So GiST's WAL restoration code should remember this page's update as incompleted insert. When insert complited,GiST's core write to log that insert is completed and restore code can clean up stored incompleted insert. If itwas crash, then sign of completed insert can be absent in log, and GiST's restore code should continue it. While continue, it's know which page was changed and should walk up to root.On each step of walk it should form union for page and insert it to parent. Reading that I wonder: what harm would an incomplete insert cause if we just left it in the tree? Imagine that you insert a key to a leaf page, but crash before updating the parent. If you search for the key starting from the root, you'll fail to find it, because the parent pointer claims that there are no entries with such a key on the child page. But that's OK, the inserting transaction aborted with the crash! Do some of the other GiST algorithms get confused if there's a key on a page that's not represented by the parent pointer? It's possible that you insert a key to the leaf, update the leaf's immediate parent, but crash before updating the parent's parent. As far as search is concerned, that's OK as well, but it creates a hazard for subsequent inserts. If you later insert a tuple with the same key to the same leaf page, the insertion will see that the parent pointer already includes the key, and will fail to update the parent's parent. That's a problem. Would it be hard to change the algorithm to update the parent keys top-down rather than bottom-up? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Peter EisentrautДата:
Сообщение: Re: Exposing an installation's default value of unix_socket_directory