RE: Newly created replication slot may be invalidated by checkpoint
| От | Zhijie Hou (Fujitsu) |
|---|---|
| Тема | RE: Newly created replication slot may be invalidated by checkpoint |
| Дата | |
| Msg-id | TY4PR01MB16907FF0F8D547A9C56B3975C94D7A@TY4PR01MB16907.jpnprd01.prod.outlook.com обсуждение исходный текст |
| Ответ на | RE: Newly created replication slot may be invalidated by checkpoint ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
| Ответы |
RE: Newly created replication slot may be invalidated by checkpoint
|
| Список | pgsql-hackers |
On Wednesday, November 19, 2025 4:24 PM Hou, Zhijie<houzj.fnst@fujitsu.com> wrote: > > On Monday, November 17, 2025 6:50 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > > > Attach v2 patch that improves the fix based on above analysis. > > > > (I am still testing some other scenarios related to slotsync to ensure the > > current fix is complete) > > I have been testing whether taking exclusive ReplicationSlotAllocationLock > can > prevent newly synced slot from being invalidated, but found that it is > insufficient. > > During slotsync, since it fetches the remote LSN (queried from the publisher) > as > the initial restart_lsn for newly synced slot and we sync the slots in a > asynchronous manner, the synced restart_lsn could be behind the standby's > redo > pointer. So, there could be race conditions that the checkpoint removes the > required WALs and invalidates the newly synced slot: > > 1. Assuming there is no slot on standby, the minimum slot LSN would be > invalid, > and the checkpoint reaches InvalidateObsoleteReplicationSlots() with > oldestSegno =segno of redo. > 2. The slotsync successfully sync the slot A that has old restart_lsn. > 3. The checkpoint would then find that the newly synced slot A should be > invalidated due to old restart_lsn. .. > > Here is the V3 patch that fixes all the race conditions found so far. I am attaching a tap-test in 0002 that includes an injection point to reproduce this scenario. The test expects that while syncing a slot to the standby server, if the WAL before the remote restart_lsn is at risk of being removed by a checkpoint, the slot cannot be synced. Without the fix in 0001, the test would fail because the slot is synced to standby but immediately invalidated by the checkpoint. I am not adamant about merging this test but just for reference. Besides, I fixed a bug in v3-0001 where the restart_lsn is set to a wrong value. > > ------ > > Apart from addressing the fix for HEAD, I would like to inquire if anyone holds > a differing opinion regarding the revert of the origin fix 2090edc6f32f652a2c in > the back branches and applying the same fix as HEAD. > > I searched for extensions that rely on the size of ReplicationSlot and did not > find any, so I think it is OK to implement the same fix in the backpatches. > > It there are no alternative views, I will proceed with preparing the patches for > the back branches in upcoming versions. Best Regards, Hou zj
Вложения
В списке pgsql-hackers по дате отправления: