Re: Improve pg_sync_replication_slots() to wait for primary to advance
From | shveta malik |
---|---|
Subject | Re: Improve pg_sync_replication_slots() to wait for primary to advance |
Date | |
Msg-id | CAJpy0uBf=tY7HZAtBfAFWvFVtVbsNtehJ6s34w_KGDHUHoFKZA@mail.gmail.com |
List | pgsql-hackers |
On Tue, Jun 24, 2025 at 4:11 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Hello,
>
> Creating this thread for a POC based on discussions in thread [1].
> Hou-san had created this patch, and I just cleaned up some documents,
> did some testing and now sharing the patch here.
>
> In this patch, the pg_sync_replication_slots() API now waits
> indefinitely for the remote slot to catch up. We could later add a
> timeout parameter to control maximum wait time if this approach seems
> acceptable. If there are more ideas on improving this patch, let me
> know.

+1 on the idea.

I believe a timeout option may not be necessary here, since the API can be canceled manually if needed; otherwise, the recommended approach is to let it complete. But I would like to know what others think.

A few comments:

1)
When the API is waiting for the primary to advance, the standby fails to handle promotion requests, so promotion fails:

./pg_ctl -D ../../standbydb/ promote -w
waiting for server to promote.................stopped waiting
pg_ctl: server did not promote in time

See the logs at [1].

2)
Also, when the API waits for a long time, it emits the 'waiting for remote_slot..' LOG message only once. Do you think it makes sense to log it at a regular interval until the wait is over? See the logs at [1]: the message was logged only once in 3 minutes.

3)
+	/*
+	 * It is possible to get null value for restart_lsn if the slot is
+	 * invalidated on the primary server, so handle accordingly.
+	 */
+	if (new_invalidated || XLogRecPtrIsInvalid(new_restart_lsn))
+	{
+		/*
+		 * The slot won't be persisted by the caller; it will be cleaned up
+		 * at the end of synchronization.
+		 */
+		ereport(WARNING,
+				errmsg("aborting initial sync for slot \"%s\"",
+					   remote_slot->name),
+				errdetail("This slot was invalidated on the primary server."));

Which case are we referring to here, where a null restart_lsn would mean invalidation? Could you please point me to the code where that happens, or to a test case that does that?
I tried a few invalidation cases, but did not hit it.

[1]: Log file:
2025-07-02 14:38:09.851 IST [153187] LOG: waiting for remote slot "failover_slot" LSN (0/3003F60) and catalog xmin (754) to pass local slot LSN (0/3003F60) and catalog xmin (767)
2025-07-02 14:38:09.851 IST [153187] STATEMENT: SELECT pg_sync_replication_slots();
2025-07-02 14:41:36.200 IST [153164] LOG: received promote request

thanks
Shveta
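To make the wait condition from the log and the periodic-logging suggestion in comment 2 concrete, here is a minimal standalone C sketch. It is not the patch's code: the types `lsn_t`/`xid_t` and the helpers `xid_precedes()`, `remote_slot_caught_up()`, `should_log_wait()` and `simulate_wait()` are hypothetical. It assumes the remote slot counts as caught up once its restart LSN is not behind the local one and its catalog xmin does not precede the local one (using a wraparound-aware comparison in the spirit of `TransactionIdPrecedes()`), and it repeats the "waiting" message at a fixed interval instead of once. The remote xmin advancing by one per poll is purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins (assumption) for XLogRecPtr and TransactionId. */
typedef uint64_t lsn_t;
typedef uint32_t xid_t;

/* Wraparound-aware "a precedes b" for normal XIDs; special XIDs are ignored. */
static bool xid_precedes(xid_t a, xid_t b)
{
    return (int32_t) (a - b) < 0;
}

/* Hypothetical check: the remote slot has caught up once its restart LSN is
 * not behind the local one and its catalog xmin does not precede the local one. */
static bool remote_slot_caught_up(lsn_t remote_lsn, xid_t remote_xmin,
                                  lsn_t local_lsn, xid_t local_xmin)
{
    return remote_lsn >= local_lsn && !xid_precedes(remote_xmin, local_xmin);
}

/* Log on the first call, then at most once every interval_ms. */
static bool should_log_wait(int64_t now_ms, int64_t *last_log_ms, int64_t interval_ms)
{
    if (*last_log_ms < 0 || now_ms - *last_log_ms >= interval_ms)
    {
        *last_log_ms = now_ms;
        return true;
    }
    return false;
}

/* Simulated wait loop using a fake clock; returns how many LOG lines it printed. */
static int simulate_wait(lsn_t remote_lsn, xid_t remote_xmin,
                         lsn_t local_lsn, xid_t local_xmin,
                         int64_t poll_ms, int64_t log_interval_ms)
{
    int64_t now_ms = 0;
    int64_t last_log_ms = -1;
    int logged = 0;

    assert(poll_ms > 0 && log_interval_ms > 0);

    while (!remote_slot_caught_up(remote_lsn, remote_xmin, local_lsn, local_xmin))
    {
        if (should_log_wait(now_ms, &last_log_ms, log_interval_ms))
        {
            /* LSNs printed as high/low 32-bit halves, e.g. 0/3003F60. */
            printf("LOG: waiting for remote slot LSN (%X/%X) and catalog xmin (%u) "
                   "to pass local slot LSN (%X/%X) and catalog xmin (%u)\n",
                   (unsigned) (remote_lsn >> 32), (unsigned) remote_lsn,
                   (unsigned) remote_xmin,
                   (unsigned) (local_lsn >> 32), (unsigned) local_lsn,
                   (unsigned) local_xmin);
            logged++;
        }
        /* A real backend would sleep here and check for interrupts (and, per
         * comment 1, promotion); the sketch only advances the fake clock and
         * pretends the primary advanced catalog_xmin by one. */
        now_ms += poll_ms;
        remote_xmin++;
    }
    return logged;
}
```

For example, `simulate_wait(0x3003F60, 754, 0x3003F60, 767, 1000, 5000)` uses the LSN and xmin values from the log in [1]: the LSNs are already equal, so only the catalog xmin holds the wait, and the message is printed three times (at 0, 5 and 10 simulated seconds) before the remote xmin reaches 767. A time-based interval like this keeps the log readable during long waits while still showing that the call is alive.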