Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

Поиск
Список
Период
Сортировка
От Peter Geoghegan
Тема Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Дата
Msg-id CAH2-WzmqBtFwdmBW=AG8ZZW_uF5a4entmv4QmqhsXOT-Fj4L-g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Andres Freund <andres@anarazel.de>)
Ответы Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Peter Geoghegan <pg@bowt.ie>)
Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Andres Freund <andres@anarazel.de>)
Список pgsql-bugs
On Wed, Nov 10, 2021 at 11:20 AM Andres Freund <andres@anarazel.de> wrote:
> The way this definitely breaks - I have been able to reproduce this in
> isolation - is when one tuple is processed twice by heap_prune_chain(), and
> the result of HeapTupleSatisfiesVacuum() changes from
> HEAPTUPLE_DELETE_IN_PROGRESS to DEAD.

I had no idea that that was now possible. I really think that this
ought to be documented centrally.

As you know, I don't like the way that vacuumlazy.c doesn't explain
anything about the relationship between OldestXmin (which still
exists, but isn't used for pruning), and the similar GlobalVisState
state (used only during pruning). Surely this deserves to be
addressed, because we expect these two things to agree in certain
specific ways. But not necessarily in others.

> Note that there are several paths < 14, that cause HTSV()'s answer to change
> for the same xid. E.g. when the transaction inserting a tuple version aborts,
> we go from HEAPTUPLE_INSERT_IN_PROGRESS to DEAD.

Right -- that one I certainly knew about.  After all, the
tupgone-ectomy work from my commit 8523492d specifically targeted this
case.

> But I haven't quite found a
> path to trigger problems with that, because there won't be redirects to a
> tuple version that is HEAPTUPLE_INSERT_IN_PROGRESS (but there can be redirects
> to a HEAPTUPLE_DELETE_IN_PROGRESS or RECENTLY_DEAD).

That explains why the snapshot scalability either made these problems
possible for the first time, or at the very least made them far far
more likely in practice.

The relevant code in pruneheap.c was always incredibly fragile -- no
question. Even still, there is really no good reason to believe that
that was actually a problem before commit dc7420c2. Even if we assume
that there's a problem before 14, the surface area is vastly smaller
than on 14 -- the relevant pruneheap.c code hasn't really ever changed
since HOT went in. And so I think that the most sensible course of
action here is this: commit a fix to Postgres 14 + HEAD only -- no
backpatch to earlier versions.

We could go back further than that, but ISTM that the risk of causing
new problems far outweighs the benefits. Whereas I feel pretty
confident that we need to do something on 14.

-- 
Peter Geoghegan



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum