snapshot recovery conflict despite hot_standby_feedback set to on

From: Drouvot, Bertrand
Subject: snapshot recovery conflict despite hot_standby_feedback set to on
Msg-id: 9aae233b-72ec-b1b8-5716-2a092909f89f@amazon.com
Responses: Re: snapshot recovery conflict despite hot_standby_feedback set to on (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-bugs

Hi,

TL;DR:

  • Symptom: snapshot recovery conflict on standby despite hot_standby_feedback set to on.
  • Cause: an incorrect 32-bit comparison of xmin when replaying a Btree page reuse WAL record. This should be fixed in version 14 by commit e5d8a99903, which introduces a 64-bit comparison.
  • Mitigation: This bug occurs if there are old deleted Btree pages, so a mitigation is to rebuild the index to remove the old deleted pages.


Detailed Explanation:

We have been able to capture the xids being compared in GetConflictingVirtualXIDs() when a snapshot recovery conflict occurs, as well as the associated WAL record whose replay is blocked. We got:

  • pxmin: 3882856499
  • limitXmin: 1557468379
  • WAL record replay being blocked: Btree/REUSE_PAGE


"Btree/REUSE_PAGE" means that, on the primary, a Btree page deleted some time ago has been reused.
The limitXmin used in that case is the xid at which the page was deleted on the primary (+1): 1557468379

The first question is then: why has a conflict been recorded?

Logically, a conflict is recorded for a backend if its xmin is <= the limitXmin.

But wait, we have pxmin (3882856499) > limitXmin (1557468379), so why is a conflict being recorded?

The function doing the comparison is TransactionIdFollows() (called in GetConflictingVirtualXIDs()), and the comparison is done with an (int32) cast:

diff = (int32) (id1 - id2);
return (diff > 0);


As 3882856499 - 1557468379 = 2325388120 is >= 2^31 + 1, the diff is < 0 (due to the cast), so TransactionIdFollows(3882856499, 1557468379) wrongly reports that 3882856499 does not follow 1557468379, and the associated backend is added to the list of conflicting backends.

This is the bug.
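
The arithmetic can be checked in isolation. The following is a minimal sketch that mirrors only the two lines quoted above; the name xid_follows and the typedef are stand-ins chosen for illustration, and the special-XID handling of the real function is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a 32-bit transaction id (illustration only). */
typedef uint32_t TransactionId;

/*
 * Mirrors the quoted comparison: subtraction modulo 2^32, then the
 * result is reinterpreted as a signed 32-bit value.  Only meaningful
 * when the two ids are less than 2^31 apart.
 */
static bool
xid_follows(TransactionId id1, TransactionId id2)
{
    int32_t diff = (int32_t) (id1 - id2);

    return diff > 0;
}
```

With the values from this report, id1 - id2 is 2325388120, which is larger than 2^31 = 2147483648, so the cast yields 2325388120 - 2^32 = -1969579176 and xid_follows(3882856499, 1557468379) returns false.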

The second question is: does this big difference make sense?

Yes, it does. The Btree page was deleted a long time ago and was reused many transactions later. Logically, that could happen.

More details about the bug circumstances:

It turned out that this bug manifests when there is another replication slot on the primary (that is, one not linked to this standby) with a relatively old xmin.

Indeed, first, let's recall that a Btree deleted page is being reused (on the primary) if (see _bt_page_recyclable()):

TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin)

But TransactionIdPrecedes() also uses an (int32) cast for its comparison, so it can also return a wrong result if the difference is >= 2^31 + 1.

In our case the comparison was done (on the primary) with something like TransactionIdPrecedes(1.5B, 3.8B), so it wrongly returns that 1.5B does not precede 3.8B (as the difference is >= 2^31 + 1).

As a consequence, that Btree deleted page is wrongly not reused: this is the bug fixed by Peter in PG 14 (e5d8a99903 commit).
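
As a rough check of the primary-side comparison, here is a minimal sketch of the same modular test in the "precedes" direction. The name xid_precedes is a stand-in for illustration, and 1500000000 and 3800000000 are round approximations of the "1.5B" and "3.8B" mentioned above, not values taken from the report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a 32-bit transaction id (illustration only). */
typedef uint32_t TransactionId;

/*
 * 32-bit modular "id1 is older than id2" test, normal XIDs only.
 * The special-XID handling of the real function is left out.
 */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    int32_t diff = (int32_t) (id1 - id2);

    return diff < 0;
}
```

Here 1500000000 - 3800000000 wraps to 1994967296 modulo 2^32, which is still positive as a signed 32-bit value, so xid_precedes(1500000000, 3800000000) returns false and the page is considered not yet recyclable.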

So the Btree deleted page is not being reused and as a consequence there is no "wrong" snapshot recovery conflict on the Standby.

So, in that case, the bug mentioned above is somehow "protecting" us from the "false" snapshot recovery conflict on the Standby.

This is where things can change: if another replication slot is also present on the primary with a relatively old xmin, it changes the RecentGlobalXmin.

Indeed, when another replication slot on the primary (one not linked to the standby) reports a relatively old xmin, such that this xmin is the oldest TransactionXmin across all running transactions, that xmin acts as the RecentGlobalXmin.

As a matter of fact, if this xmin coming from the other replication slot is old enough (meaning its difference with the btpo.xact is < 2^31 + 1), then TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin) returns the correct result and the Btree deleted page is now reused.

But as the difference between the btpo.xact and the backend xmin on the Standby is >= 2^31 + 1, the false snapshot recovery conflict mentioned initially is triggered.
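
The whole scenario can be sketched end to end. In the following illustration, page_recyclable and backend_conflicts are hypothetical wrappers, not PostgreSQL functions, and all xid values are round approximations; in particular the other slot's xmin of 3000000000 is an assumed value, since the report does not give it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a 32-bit transaction id (illustration only). */
typedef uint32_t TransactionId;

/* 32-bit modular "id1 is older than id2", normal XIDs only. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* 32-bit modular "id1 is newer than id2", normal XIDs only. */
static bool
xid_follows(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) > 0;
}

/* Primary side: hypothetical wrapper around the recyclability test. */
static bool
page_recyclable(TransactionId btpo_xact, TransactionId recent_global_xmin)
{
    return xid_precedes(btpo_xact, recent_global_xmin);
}

/* Standby side: hypothetical wrapper around the conflict test. */
static bool
backend_conflicts(TransactionId pxmin, TransactionId limit_xmin)
{
    return !xid_follows(pxmin, limit_xmin);
}
```

Without the other slot, page_recyclable(1500000000, 3800000000) is false, so no REUSE_PAGE record is emitted. With the assumed slot xmin, page_recyclable(1500000000, 3000000000) is true because the distance is below 2^31, and on the standby backend_conflicts(3800000000, 1500000001) is then true even though the backend's xmin is logically newer.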

Peter's commit e5d8a99903, added in PG 14, should fix this bug (as it makes use of FullTransactionId for the Btree deleted page), even if the original intent was to avoid "leaking" Btree deleted pages.
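
To illustrate why a 64-bit id avoids the problem, here is a minimal sketch of a plain 64-bit comparison. It is not the code from commit e5d8a99903: FullXid, full_xid_from_epoch_and_xid and full_xid_precedes are stand-in names, and placing both xids in epoch 0 is an assumption, as the report does not state the epochs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a 64-bit transaction id: epoch in the high 32 bits. */
typedef struct
{
    uint64_t value;
} FullXid;

/* Combine an epoch and a 32-bit xid into one 64-bit value. */
static FullXid
full_xid_from_epoch_and_xid(uint32_t epoch, uint32_t xid)
{
    FullXid f;

    f.value = ((uint64_t) epoch << 32) | xid;
    return f;
}

/* Plain unsigned 64-bit ordering: no modular wraparound involved. */
static bool
full_xid_precedes(FullXid a, FullXid b)
{
    return a.value < b.value;
}
```

Under that assumption, full_xid_precedes applied to (epoch 0, 1500000000) and (epoch 0, 3800000000) returns true regardless of how far apart the two ids are, which is the answer the 32-bit comparison gets wrong.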

Recommendation:

Given that the bug described here occurs only in very rare circumstances, we don't think it is worth back-patching Peter's commit e5d8a99903.

The reason for this bug report is rather to describe a scenario where it could happen, in case someone is seeing a snapshot recovery conflict despite hot_standby_feedback set to on.
