Re: promotion related handling in pg_sync_replication_slots()

Поиск

Список

Период

Сортировка

От	shveta malik
Тема	Re: promotion related handling in pg_sync_replication_slots()
Дата	18 апреля 2024 г. 07:05:14
Msg-id	CAJpy0uDahP8YJ7Eh0KtuMRJKx=xXw-sC6ZX-8FpZR56m3i38Mg@mail.gmail.com обсуждение исходный текст
Ответ на	Re: promotion related handling in pg_sync_replication_slots() (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
Ответы	Re: promotion related handling in pg_sync_replication_slots()
Список	pgsql-hackers

Дерево обсуждения

On Tue, Apr 16, 2024 at 2:52 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Tue, Apr 16, 2024 at 02:06:45PM +0530, shveta malik wrote:
> > On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > > I personally feel adding the additional check for sync_replication_slots may
> > > not improve the situation here. Because the GUC sync_replication_slots can
> > > change at any point, the GUC could be false when performing this addition check
> > > and is set to true immediately after the check, so It could not simplify the logic
> > > anyway.
> >
> > +1.
> > I feel doc and "cannot synchronize replication slots concurrently"
> > check should suffice.
> >
> > In the scenario which Hou-San pointed out,  if after performing the
> > GUC check in SQL function, this GUC is enabled immediately and say
> > worker is started sooner than the function could get chance to sync,
> > in that case as well, SQL function will ultimately get error "cannot
> > synchronize replication slots concurrently", even though GUC is
> > enabled.  Thus, I feel we should stick with samer error in all
> > scenarios.
>
> Okay, fine by me, let's forget about checking sync_replication_slots then.

Thanks.

While reviewing and testing this patch further, we encountered 2
race-conditions which needs to be handled:

1) For slot sync worker, the order of cleanup execution was a) first
reset 'syncing' flag (slotsync_failure_callback) b) then reset pid and
syncing (slotsync_worker_onexit). But in ShutDownSlotSync(), we rely
only on the 'syncing' flag for wait-exit logic. So it may so happen
that in the window between these two callbacks, ShutDownSlotSync()
proceeds and calls update_synced_slots_inactive_since() which may then
hit assert  Assert((SlotSyncCtx->pid == InvalidPid).

2) Another problem as described by Hou-San off-list:
When the slotsync worker error out after acquiring a slot, it will
first call slotsync_worker_onexit() and then
ReplicationSlotShmemExit(), so in the window between these two
callbacks, it's possible that the SlotSyncCtx->syncing
SlotSyncCtx->pid has been reset but the slot->active_pid is still
valid. The Assert will be broken in this.
@@ -1471,6 +1503,9 @@ update_synced_slots_inactive_since(void)
        {
            Assert(SlotIsLogical(s));

+           /* The slot must not be acquired by any process */
+           Assert(s->active_pid == 0);
+

To fix above issues, these changes have been made in v7:
1) For worker, replaced slotsync_failure_callback() with
slotsync_worker_disconnect() so that the latter only disconnects and
thus slotsync_worker_onexit() does pid cleanup followed by syncing
flag cleanup. This will make ShutDownSlotSync()'s wait exit reliably.

2) To fix second problem, changes are:

2.1) For worker, moved slotsync_worker_onexit() registration before
BaseInit() (BaseInit is the one doing ReplicationSlotShmemExit
registration). If we do this change in order of registration, then
order of cleanup for worker will be a) slotsync_worker_disconnect() b)
ReplicationSlotShmemExit() c) slotsync_worker_onexit(). This order
ensures that the worker is actually done with slots release and
cleanup before it marks itself as done syncing.

2.2) For SQL function, did ReplicationSlotRelease() and
ReplicationSlotCleanup() as first step in slotsync_failure_callback().

While doing change 2.2, it occurred to us, that it would be a clean
solution to do ReplicationSlotCleanup() even on successful execution
of SQL function. It seems better that the temporary slots are
cleaned-up when SQL function exists, as we do not know when the user
will run this SQL function again and thus leaving temp slots for
longer does not seem a good idea. Thoughts?

thanks
Shveta

Вложения

v7-0001-Handle-stopSignaled-during-sync-function-call.patch

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: promotion related handling in pg_sync_replication_slots()

Вложения