RE: Newly created replication slot may be invalidated by checkpoint

From	Zhijie Hou (Fujitsu)
Subject	RE: Newly created replication slot may be invalidated by checkpoint
Date	December 3, 09:15:32
Msg-id	TY4PR01MB16907DE58B4F5631E0E445E1694D9A@TY4PR01MB16907.jpnprd01.prod.outlook.com
In reply to	Re: Newly created replication slot may be invalidated by checkpoint (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses	Re: Newly created replication slot may be invalidated by checkpoint
List	pgsql-hackers

On Wednesday, December 3, 2025 12:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> 
> On Mon, Dec 1, 2025 at 10:19 PM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Tuesday, December 2, 2025 1:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Fri, Nov 21, 2025 at 12:14 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > OK, I think it makes sense to start separate threads.
> > > >
> > > > I have split the patches based on the different bugs they address
> > > > and am sharing them here for reference.
> > > >
> > >
> > > I'm reviewing the 0001 patch and the problem that can be addressed
> > > by that patch. While the proposed patch addresses the race condition
> > > between a checkpointing and newly created slot, could the same issue
> > > happen between the checkpointing and copying a slot? I'm trying to
> > > understand when we have to acquire ReplicationSlotAllocationLock in
> > > an exclusive mode in the new lock scheme.
> >
> > Thanks for reviewing !
> >
> > I think the situation is somewhat different in the
> > copy_replication_slot(). As noted in the comments[1], it's considered
> > acceptable for WALs preceding the initial restart_lsn to be removed
> > since the latest restart_lsn will be copied again in the second phase, so
> > latest WAL being reserved is safe.
> 
> Right. But does it mean that the new slot could be invalidated while being
> copied if the first copied restart_lsn becomes less than a new redo ptr set by a
> concurrent checkpoint? I thought the problem the
> 0001 patch is trying to fix is that the slot could end up being invalidated by a
> concurrent checkpoint even while being created, so I wonder if the same
> problem could occur.

I think the invalidation cannot occur when copying because:

Currently, there are no CHECK_FOR_INTERRUPTS() calls between the initial
restart_lsn copy (first phase) and the latest restart_lsn copy (second phase).
As a result, even if a checkpoint attempts to invalidate a slot and sends a
SIGTERM to the backend, the backend will first update the restart_lsn during the
second phase before responding to the signal. Consequently, during the next
cycle of InvalidatePossiblyObsoleteSlot(), the checkpoint will observe the
updated restart_lsn and skip the invalidation.
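
To make the ordering argument above concrete, here is a toy model in C. It is only a sketch and not PostgreSQL code: ToyBackend, toy_send_sigterm(), toy_check_for_interrupts(), toy_slot_is_obsolete() and toy_copy_slot() are hypothetical names invented for illustration, and the LSNs are plain integers. It assumes only what is described above: the signal is merely recorded when sent and takes effect only at an explicit check point, and the checkpoint re-examines restart_lsn on its next cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins; none of these are PostgreSQL's real definitions. */
typedef uint64_t ToyLsn;

typedef struct ToyBackend
{
    bool    term_pending;       /* the checkpoint has "signalled" us */
    bool    exited;             /* the pending signal has been acted upon */
    ToyLsn  slot_restart_lsn;   /* restart_lsn of the slot being copied */
} ToyBackend;

/* Checkpoint side: sending the signal only records the request. */
void
toy_send_sigterm(ToyBackend *b)
{
    b->term_pending = true;
}

/* Backend side: the only place where a pending request takes effect. */
void
toy_check_for_interrupts(ToyBackend *b)
{
    if (b->term_pending)
        b->exited = true;
}

/* Checkpoint side: does the slot hold back WAL older than what is kept? */
bool
toy_slot_is_obsolete(const ToyBackend *b, ToyLsn oldest_kept_lsn)
{
    return b->slot_restart_lsn < oldest_kept_lsn;
}

/*
 * Copy a slot in two phases while a checkpoint runs in between.
 * Returns true if the checkpoint's next cycle would still see the slot
 * as obsolete, false if it would skip the invalidation.
 *
 * check_between_phases = false models the current situation described
 * above; true models a hypothetical future code change.
 */
bool
toy_copy_slot(ToyBackend *b, ToyLsn initial_lsn, ToyLsn latest_lsn,
              ToyLsn oldest_kept_lsn, bool check_between_phases)
{
    assert(initial_lsn <= latest_lsn);

    /* First phase: copy the initial restart_lsn. */
    b->slot_restart_lsn = initial_lsn;

    /* Checkpoint, first cycle: the slot looks obsolete, signal the owner. */
    if (toy_slot_is_obsolete(b, oldest_kept_lsn))
        toy_send_sigterm(b);

    /* Hypothetical interrupt check between the two phases. */
    if (check_between_phases)
        toy_check_for_interrupts(b);
    if (b->exited)
        return toy_slot_is_obsolete(b, oldest_kept_lsn);

    /* Second phase: copy the latest restart_lsn, signal still pending. */
    b->slot_restart_lsn = latest_lsn;

    /* Checkpoint, next cycle: re-examine the updated restart_lsn. */
    return toy_slot_is_obsolete(b, oldest_kept_lsn);
}
```

With initial_lsn = 100, latest_lsn = 300 and oldest_kept_lsn = 200, calling toy_copy_slot() with check_between_phases = false leaves the signal pending, advances the slot to 300, and the recheck skips the invalidation. Passing true makes the toy backend act on the signal before the second phase, so the slot stays at 100 and would still be considered obsolete.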

For logical slots, invoking the output plugin startup callback does leave a
slight chance of processing the signal (when using third-party plugins). Even
then, the slot is still in RS_EPHEMERAL state, so processing the signal causes
the slot to be dropped immediately rather than invalidated.

Slot invalidation could theoretically occur if the code changes in the future,
but addressing that possibility could be treated as an independent
improvement. What do you think?

Best Regards,
Hou zj
