Re: promotion related handling in pg_sync_replication_slots()
От | shveta malik |
---|---|
Тема | Re: promotion related handling in pg_sync_replication_slots() |
Дата | |
Msg-id | CAJpy0uDahP8YJ7Eh0KtuMRJKx=xXw-sC6ZX-8FpZR56m3i38Mg@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: promotion related handling in pg_sync_replication_slots() (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>) |
Ответы |
Re: promotion related handling in pg_sync_replication_slots()
(shveta malik <shveta.malik@gmail.com>)
|
Список | pgsql-hackers |
On Tue, Apr 16, 2024 at 2:52 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote: > > Hi, > > On Tue, Apr 16, 2024 at 02:06:45PM +0530, shveta malik wrote: > > On Tue, Apr 16, 2024 at 1:55 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > I personally feel adding the additional check for sync_replication_slots may > > > not improve the situation here. Because the GUC sync_replication_slots can > > > change at any point, the GUC could be false when performing this addition check > > > and is set to true immediately after the check, so It could not simplify the logic > > > anyway. > > > > +1. > > I feel doc and "cannot synchronize replication slots concurrently" > > check should suffice. > > > > In the scenario which Hou-San pointed out, if after performing the > > GUC check in SQL function, this GUC is enabled immediately and say > > worker is started sooner than the function could get chance to sync, > > in that case as well, SQL function will ultimately get error "cannot > > synchronize replication slots concurrently", even though GUC is > > enabled. Thus, I feel we should stick with samer error in all > > scenarios. > > Okay, fine by me, let's forget about checking sync_replication_slots then. Thanks. While reviewing and testing this patch further, we encountered 2 race-conditions which needs to be handled: 1) For slot sync worker, the order of cleanup execution was a) first reset 'syncing' flag (slotsync_failure_callback) b) then reset pid and syncing (slotsync_worker_onexit). But in ShutDownSlotSync(), we rely only on the 'syncing' flag for wait-exit logic. So it may so happen that in the window between these two callbacks, ShutDownSlotSync() proceeds and calls update_synced_slots_inactive_since() which may then hit assert Assert((SlotSyncCtx->pid == InvalidPid). 2) Another problem as described by Hou-San off-list: When the slotsync worker error out after acquiring a slot, it will first call slotsync_worker_onexit() and then ReplicationSlotShmemExit(), so in the window between these two callbacks, it's possible that the SlotSyncCtx->syncing SlotSyncCtx->pid has been reset but the slot->active_pid is still valid. The Assert will be broken in this. @@ -1471,6 +1503,9 @@ update_synced_slots_inactive_since(void) { Assert(SlotIsLogical(s)); + /* The slot must not be acquired by any process */ + Assert(s->active_pid == 0); + To fix above issues, these changes have been made in v7: 1) For worker, replaced slotsync_failure_callback() with slotsync_worker_disconnect() so that the latter only disconnects and thus slotsync_worker_onexit() does pid cleanup followed by syncing flag cleanup. This will make ShutDownSlotSync()'s wait exit reliably. 2) To fix second problem, changes are: 2.1) For worker, moved slotsync_worker_onexit() registration before BaseInit() (BaseInit is the one doing ReplicationSlotShmemExit registration). If we do this change in order of registration, then order of cleanup for worker will be a) slotsync_worker_disconnect() b) ReplicationSlotShmemExit() c) slotsync_worker_onexit(). This order ensures that the worker is actually done with slots release and cleanup before it marks itself as done syncing. 2.2) For SQL function, did ReplicationSlotRelease() and ReplicationSlotCleanup() as first step in slotsync_failure_callback(). While doing change 2.2, it occurred to us, that it would be a clean solution to do ReplicationSlotCleanup() even on successful execution of SQL function. It seems better that the temporary slots are cleaned-up when SQL function exists, as we do not know when the user will run this SQL function again and thus leaving temp slots for longer does not seem a good idea. Thoughts? thanks Shveta
Вложения
В списке pgsql-hackers по дате отправления: