Found: some pretty ugly VACUUM bugs

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Found: some pretty ugly VACUUM bugs
Дата
Msg-id 10156.1124394535@sss.pgh.pa.us
обсуждение исходный текст
Ответы Re: Found: some pretty ugly VACUUM bugs  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Список pgsql-hackers
I believe I've traced down the cause of the Assert trap in VACUUM FULL
that Teodor reported here:
http://archives.postgresql.org/pgsql-hackers/2005-06/msg01278.php

The case that VACUUM is tripping up on is one in which some concurrent
transaction (call it X1) updates and then later deletes a row.  By the
time VACUUM starts, X1 is committed, but it's still within the
OldestXmin horizon, so there are open transactions that should not see
its effects.  We have two tuples in the table, call 'em T1 and T2,
representing the original and updated states of the row:
    XMIN    XMAX    t_ctid

T1        X0    X1    points to T2
T2        X1    X1    points to T2 (ie, itself)

(X0 is whichever transaction originally inserted T1; it's old enough
to not be interesting.  The reason T1 must point to T2 is that a READ
COMMITTED transaction that decides to update T1 must be able to find
T2 instead.)

The reason this configuration is troublesome is that
HeapTupleSatisfiesVacuum has this (premature?) optimization in it:
   if (TransactionIdEquals(HeapTupleHeaderGetXmin(tuple),                           HeapTupleHeaderGetXmax(tuple)))   {
     /*        * inserter also deleted it, so it was never visible to anyone        * else        */       return
HEAPTUPLE_DEAD;  }
 

This code causes VACUUM to delete T2, even though T1 is still pointing
at it --- and because T1's XMAX is past the OldestXmin horizon, T1 will
not be deleted.  The Assert that Teodor saw arose from the following
specific series of events (which can only occur if T2 is on a lower-
numbered page than T1):
* T2 is removed and its slot is marked free.* repair_frag moves some unrelated tuple into T2's slot.* repair_frag
visitsT1, decides it has to move a tuple chain,  and moves the new occupant of T2's slot (which is already  wrong
anyway).*when update_hint_bits scans T2's page, it finds the wrong  number of MOVED_IN tuples because the tuple that
wasmoved  into T2's slot is now MOVED_OFF.  This triggers the Assert.
 

However this is merely the least interesting symptom of the problem.
If the state with T1 pointing to a tuple that is actually unrelated
is allowed to persist, then after the VACUUM is done someone else could
find T1 and then update the new occupant of T2's slot under the mistaken
assumption that it's the descendant of T1.

My first instinct for a quick fix was to delete the above-quoted section
of HeapTupleSatisfiesVacuum.  This would ensure that we never remove
a tuple that some other tuple might still be pointing at, unless we are
going to remove the referencing tuple as well.  But this only fixes the
problem for VACUUM FULL, which exclusive-locks the whole table.  With
lazy VACUUM, concurrent transactions can see the intermediate state
with T2 gone and T1 not.  Thus they could write some new tuple into T2's
slot and then make the mistaken update before VACUUM gets a chance to
remove T1.

The only solution I can see (short of abandoning lazy VACUUM) is that
we have to make the code that follows t_ctid chains more wary.  That
code is already aware (at least in the places I looked at) that a t_ctid
link might lead to an empty slot, but if there is a tuple in the slot
it just assumes that that is really a descendant version of the tuple
pointing to it.  That won't do.  But I believe it would work if we also
test that the XMIN of the tuple in the slot equals the XMAX of the
referencing tuple.  If they are unequal, we can conclude that the
original child tuple is dead and has been removed, so there is no
current version of the referencing tuple.

It is clearly true that if the XMIN and XMAX are different, the putative
child tuple isn't really the child.  I believe this test is sufficient,
because VACUUM never removes rows generated by still-in-progress
transactions, and so it is not possible for a single transaction to
store different tuples into the same slot over its lifetime.

This is going to require a number of changes since there are several
places that follow t_ctid chains.  With the change in place, though,
I think it's OK to leave the xmin=xmax optimization in place in
HeapTupleSatisfiesVacuum.

Comments?  Anyone see any flaws in the reasoning?
        regards, tom lane


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Josh Berkus
Дата:
Сообщение: Re: problem with coalesce
Следующее
От: Andrew Dunstan
Дата:
Сообщение: Windows + IP6 progress