RE: Newly created replication slot may be invalidated by checkpoint
| From | Zhijie Hou (Fujitsu) |
|---|---|
| Subject | RE: Newly created replication slot may be invalidated by checkpoint |
| Date | |
| Msg-id | TY4PR01MB16907DE58B4F5631E0E445E1694D9A@TY4PR01MB16907.jpnprd01.prod.outlook.com |
| In response to | Re: Newly created replication slot may be invalidated by checkpoint (Masahiko Sawada <sawada.mshk@gmail.com>) |
| Responses | Re: Newly created replication slot may be invalidated by checkpoint |
| List | pgsql-hackers |
On Wednesday, December 3, 2025 12:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Dec 1, 2025 at 10:19 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, December 2, 2025 1:03 AM Masahiko Sawada
> > <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Nov 21, 2025 at 12:14 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > OK, I think it makes sense to start separate threads.
> > > >
> > > > I have split the patches based on the different bugs they address
> > > > and am sharing them here for reference.
> > > >
> > >
> > > I'm reviewing the 0001 patch and the problem that can be addressed
> > > by that patch. While the proposed patch addresses the race condition
> > > between a checkpointing and newly created slot, could the same issue
> > > happen between the checkpointing and copying a slot? I'm trying to
> > > understand when we have to acquire ReplicationSlotAllocationLock in
> > > an exclusive mode in the new lock scheme.
> >
> > Thanks for reviewing !
> >
> > I think the situation is somewhat different in the
> > copy_replication_slot(). As noted in the comments[1], it's considered
> > acceptable for WALs preceding the initial restart_lsn to be removed
> > since the latest restart_lsn will be copied again in the second phase, so
> > latest WAL being reserved is safe.
>
> Right. But does it mean that the new slot could be invalidated while being
> copied if the first copied restart_lsn becomes less than a new redo ptr set by a
> concurrent checkpoint? I thought the problem the
> 0001 patch is trying to fix is that the slot could end up being invalidated by a
> concurrent checkpoint even while being created, so I wonder if the same
> problem could occur.

I think the invalidation cannot occur when copying, for the following reasons.

Currently, there are no CHECK_FOR_INTERRUPTS() calls between the initial
restart_lsn copy (first phase) and the latest restart_lsn copy (second phase).
As a result, even if a checkpoint attempts to invalidate a slot and sends a
SIGTERM to the backend, the backend will first update the restart_lsn during
the second phase before responding to the signal. Consequently, during the
next cycle of InvalidatePossiblyObsoleteSlot(), the checkpoint will observe
the updated restart_lsn and skip the invalidation.

For logical slots, invoking the output plugin startup callback does present a
slight chance of processing the signal (when using third-party plugins).
However, a termination in this scenario results in the slot being dropped
immediately, because the slot is still in the RS_EPHEMERAL state, which
prevents the invalidation.

Theoretically, slot invalidation could occur if the code changes in the
future, but addressing that possibility could be considered an independent
improvement task.

What do you think?

Best Regards,
Hou zj