Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

From Peter Geoghegan
Subject Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Date
Msg-id CAH2-Wzma=Y3O+LRx2Wj_HwGbbbeNwr6FoJzXni8hxOMw55pcZg@mail.gmail.com
In reply to Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Dmitry Dolgov <9erthalion6@gmail.com>)
Responses Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-bugs
On Tue, Nov 9, 2021 at 7:01 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Yes, adding such condition works in this case, no non-heap-only tuples
> were recorded as unused in heap_prune_chain, and nothing else popped up
> afterwards. But now after a couple of runs I could also reproduce (at
> least partially) what Alexander was talking about:
>
>     ERROR:  could not open relation with OID 1056321

I've seen that too.

> Not sure yet where is it coming from.

I think that the additional check that I sketched (in
heap_prune_chain()) is protective in that it prevents a bad situation
from becoming even worse. At the same time it doesn't actually fix
anything.

I've discussed this privately with Andres -- expect more from him
soon. I came up with more sophisticated instrumentation (better
assertions, really) that shows that the problem begins in VACUUM, not
opportunistic pruning (certainly with the test case we have).

The real problem (identified by Andres) seems to be that pruning feels
entitled to corrupt HOT chains by making an existing LP_REDIRECT
continue to point to a DEAD item that it marks LP_UNUSED. That's never
supposed to happen. This seems to occur in the code path for aborted
heap-only tuples at the top of heap_prune_chain(), the one commented
"If the tuple is DEAD and doesn't chain to anything else, mark it
unused immediately". That code path seems wonky, despite not having
changed much in many years.

This wonky heap_prune_chain() code cares about whether the
to-be-set-LP_UNUSED heap-only tuple item chains to other items -- it's
only really needed for aborted tuples, but doesn't discriminate
between aborted tuples and other kinds of DEAD tuples. It doesn't seem
to get all the details right. In particular, it doesn't account for
the fact that it's not okay to break a HOT chain between the root
LP_REDIRECT item and the first heap-only tuple. It's only okay to do
that between two heap-only tuples.
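
To illustrate the distinction, here is a minimal sketch of a guard that
refuses to set a DEAD heap-only item LP_UNUSED while any LP_REDIRECT
still points at it. This is a toy model, not PostgreSQL code and not the
eventual fix: SketchItem, SketchLpState and sketch_can_set_unused() are
hypothetical names, and the tuple_dead flag merely stands in for a real
visibility check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for line pointer states (not ItemIdData). */
typedef enum { SK_UNUSED, SK_NORMAL, SK_REDIRECT, SK_DEAD } SketchLpState;

typedef struct SketchItem
{
    SketchLpState state;
    size_t        redirect_to; /* 1-based offset; only used for SK_REDIRECT */
    bool          heap_only;   /* only meaningful for SK_NORMAL */
    bool          tuple_dead;  /* assumed result of a visibility check */
} SketchItem;

/*
 * Return true if the item at 1-based offset "victim" is a DEAD heap-only
 * tuple that no redirect on the page points to.  Only then would marking
 * it unused avoid breaking the link between a root redirect and the first
 * heap-only tuple of its chain.
 */
static bool
sketch_can_set_unused(const SketchItem *items, size_t nitems, size_t victim)
{
    assert(victim >= 1 && victim <= nitems);

    const SketchItem *v = &items[victim - 1];

    if (v->state != SK_NORMAL || !v->heap_only || !v->tuple_dead)
        return false;

    for (size_t i = 0; i < nitems; i++)
    {
        if (items[i].state == SK_REDIRECT && items[i].redirect_to == victim)
            return false;       /* a root redirect still depends on it */
    }
    return true;
}
```

With a page whose offset 1 redirects to offset 2, the sketch rejects
offset 2 as a candidate but accepts an unreferenced DEAD heap-only item
at offset 3.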

HOT chain traversal code (in places like heap_hot_search_buffer())
knows how to deal with broken HOT chains when the breakage occurs
between two heap-only tuples, but not in this other LP_REDIRECT case.
It's just not possible for HOT chain traversal code to deal with that,
because there is not enough information in the LP_REDIRECT (just a
link to another item, no xmin or xmax) to validate anything on the
fly. It's pretty clear that we need to specifically make sure that
LP_REDIRECT items always point to something sensible. Anything less
risks causing confusion about which HOT chain lives at which TID.
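
The invariant described above can be expressed as a page-level check,
roughly along these lines. Again this is only a sketch under stated
assumptions: ToyItem, ToyLpState and toy_redirects_are_sane() are
made-up names for a simplified model of a heap page's line pointer
array, and "sensible" is assumed here to mean an in-range, normal,
heap-only item.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for line pointer states (not ItemIdData). */
typedef enum { TOY_LP_UNUSED, TOY_LP_NORMAL, TOY_LP_REDIRECT, TOY_LP_DEAD } ToyLpState;

typedef struct ToyItem
{
    ToyLpState state;
    size_t     redirect_to; /* 1-based offset; only used for TOY_LP_REDIRECT */
    bool       heap_only;   /* only meaningful for TOY_LP_NORMAL */
} ToyItem;

/*
 * Return true if every redirect on the page points to an in-range item
 * that is normal and heap-only.  items[0] corresponds to offset 1.
 */
static bool
toy_redirects_are_sane(const ToyItem *items, size_t nitems)
{
    for (size_t i = 0; i < nitems; i++)
    {
        if (items[i].state != TOY_LP_REDIRECT)
            continue;

        size_t target = items[i].redirect_to;

        if (target < 1 || target > nitems)
            return false;       /* points off the page */

        const ToyItem *t = &items[target - 1];

        if (t->state != TOY_LP_NORMAL || !t->heap_only)
            return false;       /* points to something that is not a chain member */
    }
    return true;
}
```

A check of this shape would flag the situation in this bug: a redirect
whose target has been set to the unused state fails the test, since
the redirect itself carries no xmin or xmax to validate against.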

Obviously we were quite right to suspect that there wasn't enough
rigor around the HOT chain invariants. Wasn't specifically expecting
it to help with this bug.

-- 
Peter Geoghegan


