Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
From: Peter Geoghegan
Subject: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Msg-id: CAH2-Wz=fUgig7oJjvNEVToFy-G=HWzsbq13zkePsVv2tuk9wgA@mail.gmail.com
In reply to: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum (Andres Freund <andres@anarazel.de>)
Responses: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-bugs
On Wed, Nov 10, 2021 at 7:08 PM Andres Freund <andres@anarazel.de> wrote:
> It's a good optimization to remove the redundant BufferGetBlockNumber(), but
> perhaps it's worth committing that separately?

Ordinarily I would, but the intention here is to avoid new performance
regressions. Is it okay if I backpatch this performance optimization,
trivial though it may be? Or should I just do that on HEAD?

Attached is v3, which works through some of your feedback, though there
are still unresolved issues. I expect that we'll need to review this
bugfix very carefully, since it's a particularly tricky issue. I expect
to produce several more revisions.

> > @@ -256,10 +260,8 @@ heap_page_prune(Relation relation, Buffer buffer,
> >              offnum <= maxoff;
> >              offnum = OffsetNumberNext(offnum))
> >     {
> > -           ItemId          itemid;
> > -
> >             /* Ignore items already processed as part of an earlier chain */
> > -           if (prstate.marked[offnum])
> > +           if (prstate.fromvalidchain[offnum])
> >                     continue;
>
> ISTM that we should now call heap_prune_chain() only for the start of actual
> chains (i.e. redirect or non-hot tuple).

Seems easier to put the "heap-only tuple at rootoffnum, so it might be
disconnected -- not for me to process unless and until I find it during
a later call that uses this item's HOT chain's root offset number" logic
here. That's at least closer to what we have already. Let me know if you
still don't like it in v3, though. I might just change it, but it's
related to the way that the array is now called "visited" -- a question
that I'll consider in the paragraph immediately following this one.

> I think it'd be good to have a comment above the loop explaining that we're
> just iterating over chains starting at a non-HOT element.

You mean like heap_get_root_tuples() does it, with its own root_offsets
array?

I have changed the array -- it's now "visited". It's no longer
specifically about HOT chains.
It's actually more like a "definitely not a disconnected heap-only
tuple" marker for each item, but I didn't call it that, for the obvious
reason. By the end of the first pass over the heap, all items that are
not heap-only tuples (as well as most heap-only tuples) will be marked
visited in v3.

The advantage of not explicitly making it about root-ness is that it
avoids relying on the already-squishy definition of "root item". We want
to visit LP_DEAD and LP_UNUSED items in the first pass over the page --
we must get them out of the way before the second pass (we don't expect
to have to deal with any item that isn't a disconnected heap-only tuple
at that point).

Another advantage of the "visited" array design is that it doesn't
matter if we happen to visit an LP_DEAD/LP_UNUSED item via a heap-only
tuple's bogus-but-valid t_ctid chain. We might as well just mark it
"visited" at that point, avoiding a second visit during the first heap
pass (in addition to the second pass). That micro-optimization may not
be worth much, but it feels kind of natural to me. What do you think?

> > @@ -269,15 +271,29 @@ heap_page_prune(Relation relation, Buffer buffer,
> >             if (off_loc)
> >                     *off_loc = offnum;
> >
> > -           /* Nothing to do if slot is empty or already dead */
> > -           itemid = PageGetItemId(page, offnum);
> > -           if (!ItemIdIsUsed(itemid) || ItemIdIsDead(itemid))
> > -                   continue;
> > -
> >             /* Process this item or chain of items */
> >             ndeleted += heap_prune_chain(buffer, offnum, &prstate);
> >     }
>
> Why did you move this into heap_prune_chain()?

I find it simpler that way. Now both the first and second pass loops
have essentially the same structure as each other.

> >  /*
> > - * XXX Should we update the FSM information of this page ?
> > - *
>
> I don't think this should be part of the commit?

Agreed. Pushed that as a separate commit just now.

> I think we should make it an error to reach the same tuple multiple ways.
> It's not *quite* as trivial as making this an assert/error though, as we
> can only raise an error after checking that the xmin/xmax thing passes.

But as you go on to say, we should *not* do that for LP_DEAD or
LP_UNUSED items (nor for heap-only tuples that don't pass the xmin/xmax
cross-check) -- since in general a heap-only tuple's t_ctid link is
allowed to point to just about anything (e.g., past the end of a heap
page's line pointer array).

Wouldn't it be simpler and more robust if we leaned on
heap_prune_satisfies_vacuum() for this instead? There will end up being
some "disconnected" remaining heap-only tuples when the same tuple is
reachable multiple ways. It's very likely that these
apparently-disconnected-but-who-knows tuples will fail some kind of
validation by heap_prune_satisfies_vacuum() in the second pass over the
page -- we expect to be able to treat them as DEAD, which provides a
good cross-check. (Granted, there is an unresolved question about what
exact HTSV return codes we can expect here, but I imagine that a LIVE
disconnected tuple will be fully impossible, for example.)

Note also that v3 directly asserts for the truly important case: the
case where a HOT chain whose root item is an LP_REDIRECT has only one
valid-looking item (the root item itself). That assertion is near the
end of heap_prune_chain() in v3. This assertion more or less replaces
the very dangerous "We found a redirect item that doesn't point to a
valid follow-on" block from HEAD.

> >     /*
> >      * If we are looking at the redirected root line pointer, jump to the
> >      * first normal tuple in the chain.  If we find a redirect somewhere
> > @@ -619,42 +617,63 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
> >     {
> >             if (nchain > 0)
> >                     break;          /* not at start of chain */
> > +           Assert(rootisredirect);
> >             chainitems[nchain++] = offnum;
> >             offnum = ItemIdGetRedirect(rootlp);
> >             continue;
> >     }
>
> Why are we handling this inside the loop?

Removed in v3.
> > +   if (!ItemIdIsUsed(lp) || ItemIdIsDead(lp))
> > +   {
> > +           /*
> > +            * XXX What if we just came from the root item, which is a plain
> > +            * heap tuple?  Do we need an assert that notices when we reach
> > +            * an LP_DEAD or LP_UNUSED item having started from such a root
> > +            * item?
> > +            */
> > +           Assert(!rootisredirect || nchain > 1);
>
> I don't think we can assert that.

Agreed, it has been removed in v3.

> I don't understand these new asserts? We just set tupdead to false above?

I don't either. Removed in v3.

> I now actually wonder why this is correct. There's another comment about it:
>
> > We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
> > * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
>
> But that doesn't justify very much.
>
> What prevents the scenario that some other backend e.g. has a snapshot with
> xmin=xmax=RECENTLY_DEAD-row. If the RECENTLY_DEAD row has an xid that is later
> than the DEAD row, this afaict would make it perfectly legal to prune the DEAD
> row, but *not* the RECENTLY_DEAD one.

I'll need to think about this very carefully. I didn't think it was
worth blocking v3 on, though naturally it's a big concern.

> This comment can't really be understood without the historical behaviour of
> the function.

Agreed. Removed in v3.

> > +   if (res == HEAPTUPLE_DEAD || res == HEAPTUPLE_RECENTLY_DEAD)
> > +   {
> > +           heap_prune_record_unused(prstate, offnum);
> > +           HeapTupleHeaderAdvanceLatestRemovedXid(htup,
> > +                                                  &prstate->latestRemovedXid);
> > +   }
> > +   else
> > +           Assert(false);
>
> It seems weird to have the if (res == HEAPTUPLE_DEAD ..) branch, but then to
> unconditionally return 1.

I changed that in v3.

> I'm not actually sure the Assert is unreachable. I can imagine cases where
> we'd see e.g. DELETE/INSERT_IN_PROGRESS due to a concurrent subtransaction
> abort or such.

I don't know either. I am leaving this question unresolved for now.

--
Peter Geoghegan
Attachments