Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From Melanie Plageman
Subject Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Date
Msg-id CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag@mail.gmail.com
In response to Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae  (Melanie Plageman <melanieplageman@gmail.com>)
Responses Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae  (Melanie Plageman <melanieplageman@gmail.com>)
Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae  (Noah Misch <noah@leadboat.com>)
List pgsql-bugs
On Tue, Jun 18, 2024 at 6:51 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> I ended up manually backporting the logic from 1ccc1e05ae as opposed
> to cherry-picking because it relied on a struct introduced in
> 4e9fc3a9762065.  Attached is a patch set with this backport against
> REL_15_STABLE. The first patch is an updated repro (now even more
> minimal) with copious additional comments. I am not proposing we add
> this as an ongoing test. It won't be stable. It is purely for
> illustration.
> The fix's commit message still needs editing and citations.
>
> My repro no longer works against REL_14_STABLE, though I was able to
> backport the fix there. I'll investigate that.

I figured out why it wasn't repro-ing on 14 -- just a timing issue. I
threw in a sleep for now. There is also something I can do with
pg_stat_progress_vacuum's phase column in the repro, but that was
harder without poll_query_until() (added in 16).
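
For illustration only, the difference between a fixed sleep and waiting on
the phase column can be sketched as a generic polling loop. This is not the
repro's code: poll_until() and make_fake_phase_query() are hypothetical
helpers, and the fake query merely stands in for something like
"SELECT phase FROM pg_stat_progress_vacuum".

```python
import time


def poll_until(run_query, expected, timeout_s=5.0, interval_s=0.01):
    """Call run_query() repeatedly until it returns `expected`.

    Returns True on a match and False once timeout_s has elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if run_query() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def make_fake_phase_query(phases):
    """Hypothetical stand-in for querying the vacuum phase.

    Yields each entry of `phases` in turn, then keeps repeating the last one.
    """
    it = iter(phases)
    last = [None]

    def run_query():
        last[0] = next(it, last[0])
        return last[0]

    return run_query
```

Unlike a sleep, the loop returns as soon as the expected phase shows up and
reports failure if it never does.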

> Finally, upthread there is discussion of how we could end up doing a
> catalog lookup after vacuum_get_cutoffs() and before the tuple
> visibility check on 16. Assuming this is true, we would want to
> backport the fix to 16 as well. I could use some help getting a repro
> (using btree index deletion for example) of the infinite loop on 16.

So, I ended up working on a new repro that works by forcing a round of
index vacuuming after the standby reconnects and before pruning a dead
tuple whose xmax is older than OldestXmin.

At the end of the round of index vacuuming, _bt_pendingfsm_finalize()
calls GetOldestNonRemovableTransactionId(), thereby updating the
backend's GlobalVisState and moving maybe_needed backwards.

Then vacuum's first pass continues with pruning and finds our
later-inserted-and-updated tuple to be HEAPTUPLE_RECENTLY_DEAD when
compared to maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
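
To make the disagreement concrete, here is a toy model of it. It is a sketch
under heavy assumptions, not the server code: XIDs are plain integers with no
wraparound, the two status functions reduce each visibility check to a single
comparison, and scan_page() with its max_retries bound is a hypothetical
stand-in for the retry in vacuum's first pass.

```python
from dataclasses import dataclass


@dataclass
class Horizons:
    oldest_xmin: int   # cutoff computed once when VACUUM starts
    maybe_needed: int  # GlobalVisState bound; may be recomputed mid-vacuum


def prune_status(xmax, h):
    # Pruning's view: removable only if xmax precedes maybe_needed.
    return "DEAD" if xmax < h.maybe_needed else "RECENTLY_DEAD"


def vacuum_status(xmax, h):
    # Post-prune check's view: dead if xmax precedes OldestXmin.
    return "DEAD" if xmax < h.oldest_xmin else "RECENTLY_DEAD"


def scan_page(xmax, h, max_retries=5):
    """Return the number of retries needed, or None if the two views
    never agree (the real loop has no retry bound)."""
    for attempt in range(max_retries):
        if prune_status(xmax, h) == "DEAD":
            return attempt  # pruning removed the tuple
        if vacuum_status(xmax, h) != "DEAD":
            return attempt  # both views say RECENTLY_DEAD
        # Pruning kept the tuple, but the check calls it DEAD: retry.
    return None
```

With maybe_needed equal to OldestXmin the two views always agree. Once
maybe_needed has moved behind OldestXmin, any xmax between the two makes
scan_page() exhaust its retries, which corresponds to the infinite loop.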

I make sure that the standby reconnects between vacuum_get_cutoffs()
(vacuum_set_xid_limits() on 14/15) and pruning because I have a cursor
on the page keeping VACUUM FREEZE from getting a cleanup lock.

See the repros for step-by-step explanations of how it works.

With this, I can repro the infinite loop on 14-16.

Backporting 1ccc1e05ae fixes 16 but, with the new repro, 14 and 15
error out with "cannot freeze committed xmax". I'm going to
investigate further why this is happening. It definitely makes me
wonder about the fix.

Attached are the backport and repros for 15 and 16. Note that because
of differences in background_psql, the perl test had to be different in
15 than 16, so you'll have to use the repro targeted at the correct
version.

- Melanie

