Hi,
On 2021-11-10 19:08:21 -0800, Andres Freund wrote:
> On 2021-11-09 19:07:02 -0800, Peter Geoghegan wrote:
> > @@ -700,11 +720,18 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
> > }
> >
> > /*
> > - * Remember the last DEAD tuple seen. We will advance past
> > - * RECENTLY_DEAD tuples just in case there's a DEAD one after them;
> > - * but we can't advance past anything else. We have to make sure that
> > - * we don't miss any DEAD tuples, since DEAD tuples that still have
> > - * tuple storage after pruning will confuse VACUUM.
> > + * Remember the last DEAD tuple seen from this HOT chain.
> > + *
> > + * We expect to sometimes find a RECENTLY_DEAD tuple after a DEAD
> > + * tuple. When this happens, the RECENTLY_DEAD tuple will be treated
> > + * as if it was DEAD all along. Manage that now.
>
> I now actually wonder why this is correct. There's another comment about it:
>
> > We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
> > * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
>
> But that doesn't justify very much.
>
> What prevents the scenario that some other backend e.g. has a snapshot with
> xmin=xmax=RECENTLY_DEAD-row. If the RECENTLY_DEAD row has an xid that is later
> than the DEAD row, this afaict would make it perfectly legal to prune the DEAD
> row, but *not* the RECENTLY_DEAD one.
I think the code is actually correct if awfully commented. I.e. the above
scenario cannot happen in a harmful situation. The relevant piece is that the
set of "valid" snapshots is limited (for a specific cluster history).
A row can only be RECENTLY_DEAD if the deleting / updating transaction
committed.
An intermediary RECENTLY_DEAD row followed by a DEAD row would only be visible
to a snapshot that either had
1) row.xmax >= Snapshot.xmax
2) row.xmax in Snapshot.xip.
There cannot be a snapshot that sees the RECENTLY_DEAD row while also seeing
the subsequent row as DEAD, because the existence of one of the two conditions
would have held back the xid horizon:
1) is not possible, because Snapshot.xmin is <= Snapshot.xmax, and
Snapshot.xmin >= MyProc->xmin. The subsequent row could not be DEAD.
2) is not possible, because .xip only contains xids >= Snapshot.xmin.
IOW, there *can* be a snapshot with xmin=xmax=RECENTLY_DEAD-row, but it'd not
see the RECENTLY_DEAD row anyway.
Greetings,
Andres Freund