RE: How can end users know the cause of LR slot sync delays?

From Zhijie Hou (Fujitsu)
Subject RE: How can end users know the cause of LR slot sync delays?
Date
Msg-id TY4PR01MB169070A4CA8D544ACDFB1191094D1A@TY4PR01MB16907.jpnprd01.prod.outlook.com
In reply to RE: How can end users know the cause of LR slot sync delays?  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List pgsql-hackers
On Tuesday, November 25, 2025 6:30 PM Kuroda, Hayato <kuroda.hayato@fujitsu.com> wrote:
> 
> Dear Hou, Amit,
> 
> > Right, I agree. Here is the patch to release the slot at necessary places.
> 
> Thanks for working on it. However, BF machines have not satisfied the fix yet.
> There are still two failures after 3df4df53b06 [1] [2].
> 
> The reported issue was that standby server failed to synchronize the slot after
> the slot is re-created on the primary. According to [1], slots on standby has
> newer catalog xmin than primary. Like:
> 
> ```
> LOG:  could not synchronize replication slot "lsub1_slot"
> DETAIL:  Synchronization could lead to data loss, because the remote slot
> needs WAL at LSN 0/030163A8 and catalog xmin 758, but the standby has
> LSN 0/030163A8 and catalog xmin 759.
> ```
> 
> Per analysis, the newly created logical slot on primary has the initial
> catalog_xmin as 758 due to the physical slot holding catalog_xmin:758. The
> standby does not have slots, so the new slot will have the latest xid (759) as
> catalog_xmin.
> 
> Anyway, I think this is a test issue.

The issue is that the physical slot on the primary retains a catalog_xmin of
758, causing newly created slots to inherit the same catalog_xmin. In contrast,
the standby, lacking slots, assigns an initial catalog_xmin of 759 to newly
synced slots. The problem arises because the logical slot on the primary isn't
being consumed, preventing the catalog_xmin from advancing, which leads to the
test timing out.
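
To make the comparison concrete, here is a simplified model of the check that
produces the "could not synchronize replication slot" message quoted above. It
is only a sketch, not the server's code: the names SlotState, xid_precedes,
can_sync and parse_lsn are made up for illustration, and the assumption is that
the sync is skipped whenever the remote slot needs older WAL or an older
catalog xmin than the standby has kept.

```python
from dataclasses import dataclass


@dataclass
class SlotState:
    """Hypothetical container for the two values compared in the log message."""
    restart_lsn: int   # oldest WAL position the slot still needs
    catalog_xmin: int  # oldest xid whose catalog rows the slot still needs


def parse_lsn(text: str) -> int:
    """Convert an LSN printed as 'hi/lo' in hex into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def xid_precedes(a: int, b: int) -> bool:
    """Wraparound-aware 'a is older than b' for normal 32-bit xids."""
    # Equivalent to checking that the signed 32-bit difference is negative.
    return ((a - b) & 0xFFFFFFFF) >= 0x80000000


def can_sync(remote: SlotState, local: SlotState) -> bool:
    """Assumed rule: refuse to sync if the remote slot needs something older."""
    if remote.restart_lsn < local.restart_lsn:
        return False
    if xid_precedes(remote.catalog_xmin, local.catalog_xmin):
        return False
    return True


# Values taken from the log message quoted above.
lsn = parse_lsn("0/030163A8")
primary = SlotState(restart_lsn=lsn, catalog_xmin=758)
standby = SlotState(restart_lsn=lsn, catalog_xmin=759)

print(can_sync(primary, standby))  # False: 758 precedes 759
```

Under this model the sync can only succeed once the primary slot's catalog_xmin
reaches 759 or later, which does not happen while the slot is not consumed.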

Previously, we avoided this issue by intentionally preventing xid assignment
during slotsync tests, so that xmin/catalog_xmin remained static in most cases.
However, the new test runs some DDLs between tests, which assigns xids and
causes this issue. Rather than adding more wait events for control, we
discussed relocating the test to the end, after promoting the standby, where
syncing the slot successfully isn't necessary. Since the test's goal is solely
to verify slotsync skip statistics, this approach should suffice.

Here is the patch to modify the test.

> 
> [1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=scorpion&dt=2025-11-25%2009%3A03%3A17
> [2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2025-11-25%2009%3A01%3A08

Best Regards,
Hou zj

