Обсуждение: Re: Improve pg_sync_replication_slots() to wait for primary to advance

Поиск

Список

Период

Сортировка

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Japin Li

Дата:

07 ноября 2025 г., 08:06:26

On Thu, 06 Nov 2025 at 18:53, Ajin Cherian <itsajin@gmail.com> wrote:
> On Tue, Nov 4, 2025 at 5:23 PM shveta malik <shveta.malik@gmail.com> wrote:
>>
>> >
>> > I have addressed the above comments in patch v21.
>> >
>>
>> Thank You. Please find a few comments:
>>
>> 1)
>> + fparams.slot_names = slot_names = NIL;
>>
>> I think it is not needed to set slot_names to NIL.
>>
>> 2)
>> -    WAIT_EVENT_REPLICATION_SLOTSYNC_MAIN);
>> +    WAIT_EVENT_REPLICATION_SLOTSYNC_PRIMARY_CATCHUP);
>>
>> The new name does not seem appropriate. For the slotsync-worker case,
>> even when the primary is not behind, the worker still waits but it is
>> not waiting for primary to catch-up. I could not find a better name
>> except the original one 'WAIT_EVENT_REPLICATION_SLOTSYNC_MAIN'. We can
>> change the explanation to :
>>
>> "Waiting in main loop of slot sync worker and slot sync API."
>> Or
>> "Waiting in main loop of slot synchronization."
>>
>> If anyone has any better name suggestions, we can consider changing.
>
> Changed as suggested above.
>
>>
>> 3)
>>
>> +# Attempt to synchronize slots using API. The API will continue retrying
>> +# synchronization until the remote slot catches up with the locally reserved
>> +# position. The API will not return until this happens, to be able to make
>> +# further calls, call the API in a background process.
>>
>> Shall we remove 'with the locally reserved position', it’s already
>> explained in the test header and the comment is good enough even
>> without it.
>>
>
> Changed.
>
>> 4)
>> +# Confirm log that the slot has been synced after becoming sync-ready.
>>
>> Shall we just say:
>> Confirm from the log that the slot is sync-ready now.
>>
>> 5)
>>  # Synchronize the primary server slots to the standby.
>>  $standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();");
>> @@ -945,6 +1007,7 @@ $subscriber1->safe_psql('postgres',
>>  is( $standby1->safe_psql(
>>   'postgres',
>>   q{SELECT count(*) = 2 FROM pg_replication_slots WHERE slot_name IN
>> ('lsub1_slot', 'snap_test_slot') AND synced AND NOT temporary;}
>> +
>>   ),
>>
>> Redundant change.
>
> Removed.
>
> Attaching patch v22 addressing the above comments.

@@ -62,8 +62,8 @@ LOGICAL_APPLY_MAIN    "Waiting in main loop of logical replication apply process."
 LOGICAL_LAUNCHER_MAIN    "Waiting in main loop of logical replication launcher process."
 LOGICAL_PARALLEL_APPLY_MAIN    "Waiting in main loop of logical replication parallel apply process."
 RECOVERY_WAL_STREAM    "Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
-REPLICATION_SLOTSYNC_MAIN    "Waiting in main loop of slot sync worker."
 REPLICATION_SLOTSYNC_SHUTDOWN    "Waiting for slot sync worker to shut down."
+REPLICATION_SLOTSYNC_MAIN    "Waiting in main loop of slot synchronization."
 SYSLOGGER_MAIN    "Waiting in main loop of syslogger process."
 WAL_RECEIVER_MAIN    "Waiting in main loop of WAL receiver process."
 WAL_SENDER_MAIN    "Waiting in main loop of WAL sender process."

I've noticed that all events are sorted alphabetically.  I think we should keep
the order of REPLICATION_SLOTSYNC_MAIN unchanged.

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

10 ноября 2025 г., 12:31:01

On Fri, Nov 7, 2025 at 10:36 AM Japin Li <japinli@hotmail.com> wrote:
>
> >
> > Attaching patch v22 addressing the above comments.
>
> @@ -62,8 +62,8 @@ LOGICAL_APPLY_MAIN    "Waiting in main loop of logical replication apply process."
>  LOGICAL_LAUNCHER_MAIN  "Waiting in main loop of logical replication launcher process."
>  LOGICAL_PARALLEL_APPLY_MAIN    "Waiting in main loop of logical replication parallel apply process."
>  RECOVERY_WAL_STREAM    "Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
> -REPLICATION_SLOTSYNC_MAIN      "Waiting in main loop of slot sync worker."
>  REPLICATION_SLOTSYNC_SHUTDOWN  "Waiting for slot sync worker to shut down."
> +REPLICATION_SLOTSYNC_MAIN      "Waiting in main loop of slot synchronization."
>  SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
>  WAL_RECEIVER_MAIN      "Waiting in main loop of WAL receiver process."
>  WAL_SENDER_MAIN        "Waiting in main loop of WAL sender process."
>
> I've noticed that all events are sorted alphabetically.  I think we should keep
> the order of REPLICATION_SLOTSYNC_MAIN unchanged.
>

+1.

Few trivial comments:

1)
Since we have always used the term 'SQL function' rather than API in
existing code, shall we change all references of API to 'SQL function'
in current patch:

+ * If the pg_sync_replication API is used to sync the slots, and if the slots
"If the SQL function pg_sync_replication_slots() is used.."

+ * the reasons mentioned above, then the API also waits and retries until the
API --> SQL function

+ * persist. It is utilized by the pg_sync_replication_slots() API.
 pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()

+ * the API can retry.
API --> SQL function

+ /* Set this, so that API can retry */
API --> SQL function

+ * persist. It is utilized by the pg_sync_replication_slots() API.
pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()

+ * slot_persistence_pending - boolean used by pg_sync_replication_slots
+ *   API to track if any slots could not be
pg_sync_replication_slots  API --> SQL function pg_sync_replication_slots()

+ * Interrupt handler for pg_sync_replication_slots() API.
pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()


2)
ProcessSlotSyncAPIInterrupts
slotsync_api_reread_config
-- These also have API in it, but I do not have any better name
suggestions here, we can retain the current ones and see what others
say.

3)
/*
 * Re-read the config file.
 *
 * Exit if any of the slot sync GUCs have changed. The postmaster will
 * restart it.
 */
static void
slotsync_reread_config(void)

Shall we change this existing comment to: Re-read the config file for
slot sync worker.

4)

+/*
+ * Re-read the config file and check for critical parameter changes.
+ *
+ */
+static void
+slotsync_api_reread_config(void)

Shall we change comment to:
/*
 * Re-read the config file for SQL function pg_sync_replication_slots()
 *
 * Emit error if any of the slot sync GUCs have changed.
 */

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ashutosh Bapat

Дата:

10 ноября 2025 г., 13:44:27

On Mon, Nov 10, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
>
>
>
> 2)
> ProcessSlotSyncAPIInterrupts
> slotsync_api_reread_config
> -- These also have API in it, but I do not have any better name
> suggestions here, we can retain the current ones and see what others
> say.

ProcessSlotSyncInterrupts() handles shutdown waiting,
ProcessSlotSyncAPIInterrupts doesn't. Why is this difference? It will
be good to explain why we need two different functions for worker and
SQL function and also explain the difference between them.

$primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub2_slot');");
+$primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub1_slot');");

I think, the intention behind dropping the slot is to be able to
create it again for the next test. But there is no comment here
explaining that. There is also no comment explaining why we are
dropping both slots here; when the test only needs dropping one.
That's going to create confusion. One might think that all the slots
need to be dropped at this stage, and drop and create any future slots
that are used by prior code, for example. At the end of this test, we
recreate the slot using pg_create_logical_replication_slot(), which is
different method of creating slot than this test does. Though I can
understand the reason, it's not apparent. Generally reusing slot names
across multiple tests (in this file) is a source of confusion. But at
least for the test you are adding, you could use a different slot name
to avoid confusion.
+# Create some DDL on the primary so that the slot lags behind the standby
+$primary->safe_psql('postgres', "CREATE TABLE push_wal (a int);");
+
+# Attempt to synchronize slots using API. The API will continue retrying
+# synchronization until the remote slot catches up.
+# The API will not return until this happens, to be able to make
+# further calls, call the API in a background process.
+my $log_offset = -s $standby1->logfile;
+
+my $h = $standby1->background_psql('postgres', on_error_stop => 0);
+
+$h->query_until(qr//, "SELECT pg_sync_replication_slots();\n");

If the standby does not receive the WAL corresponding to the DDL
before this function is executed, the slot will get synchronized
immediately. I think we have to make sure that the standby has
received the DDL before executing this function.

Also most of the code which uses query_until has pattern like this:
$h->query_until(qr/start/, q{\echo start
SQL command});
But we expect an empty string here. Why this difference?

I think we need a similar test to test promotion while the function is
waiting for the slot to become sync-ready.

SyncReplicationSlots() and the main loop in ReplSlotSyncWorkerMain()
are similar with some functional differences. Some part of their code
needs to be kept in sync in future. How do we achieve that? At least
we need a comment saying so in each of those patches and keep those
two codes in proximity.

--
Best Wishes,
Ashutosh Bapat

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

12 ноября 2025 г., 11:24:18

On Mon, Nov 10, 2025 at 8:31 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Nov 7, 2025 at 10:36 AM Japin Li <japinli@hotmail.com> wrote:
> >
> > >
> > > Attaching patch v22 addressing the above comments.
> >
> > @@ -62,8 +62,8 @@ LOGICAL_APPLY_MAIN    "Waiting in main loop of logical replication apply process."
> >  LOGICAL_LAUNCHER_MAIN  "Waiting in main loop of logical replication launcher process."
> >  LOGICAL_PARALLEL_APPLY_MAIN    "Waiting in main loop of logical replication parallel apply process."
> >  RECOVERY_WAL_STREAM    "Waiting in main loop of startup process for WAL to arrive, during streaming recovery."
> > -REPLICATION_SLOTSYNC_MAIN      "Waiting in main loop of slot sync worker."
> >  REPLICATION_SLOTSYNC_SHUTDOWN  "Waiting for slot sync worker to shut down."
> > +REPLICATION_SLOTSYNC_MAIN      "Waiting in main loop of slot synchronization."
> >  SYSLOGGER_MAIN "Waiting in main loop of syslogger process."
> >  WAL_RECEIVER_MAIN      "Waiting in main loop of WAL receiver process."
> >  WAL_SENDER_MAIN        "Waiting in main loop of WAL sender process."
> >
> > I've noticed that all events are sorted alphabetically.  I think we should keep
> > the order of REPLICATION_SLOTSYNC_MAIN unchanged.
> >
>
> +1.
>

Yes, changed.

> Few trivial comments:
>
> 1)
> Since we have always used the term 'SQL function' rather than API in
> existing code, shall we change all references of API to 'SQL function'
> in current patch:
>
> + * If the pg_sync_replication API is used to sync the slots, and if the slots
> "If the SQL function pg_sync_replication_slots() is used.."
>
> + * the reasons mentioned above, then the API also waits and retries until the
> API --> SQL function
>
> + * persist. It is utilized by the pg_sync_replication_slots() API.
>  pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
> + * the API can retry.
> API --> SQL function
>
> + /* Set this, so that API can retry */
> API --> SQL function
>
> + * persist. It is utilized by the pg_sync_replication_slots() API.
> pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
> + * slot_persistence_pending - boolean used by pg_sync_replication_slots
> + *   API to track if any slots could not be
> pg_sync_replication_slots  API --> SQL function pg_sync_replication_slots()
>
> + * Interrupt handler for pg_sync_replication_slots() API.
> pg_sync_replication_slots() API --> SQL function pg_sync_replication_slots()
>
>
> 2)
> ProcessSlotSyncAPIInterrupts
> slotsync_api_reread_config
> -- These also have API in it, but I do not have any better name
> suggestions here, we can retain the current ones and see what others
> say.
>

Changed.

> 3)
> /*
>  * Re-read the config file.
>  *
>  * Exit if any of the slot sync GUCs have changed. The postmaster will
>  * restart it.
>  */
> static void
> slotsync_reread_config(void)
>
> Shall we change this existing comment to: Re-read the config file for
> slot sync worker.
>

> 4)
>
> +/*
> + * Re-read the config file and check for critical parameter changes.
> + *
> + */
> +static void
> +slotsync_api_reread_config(void)
>
> Shall we change comment to:
> /*
>  * Re-read the config file for SQL function pg_sync_replication_slots()
>  *
>  * Emit error if any of the slot sync GUCs have changed.
>  */
>
Changed.




On Mon, Nov 10, 2025 at 9:44 PM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
>
> On Mon, Nov 10, 2025 at 3:01 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> >
> >
> > 2)
> > ProcessSlotSyncAPIInterrupts
> > slotsync_api_reread_config
> > -- These also have API in it, but I do not have any better name
> > suggestions here, we can retain the current ones and see what others
> > say.
>
> ProcessSlotSyncInterrupts() handles shutdown waiting,
> ProcessSlotSyncAPIInterrupts doesn't. Why is this difference? It will
> be good to explain why we need two different functions for worker and
> SQL function and also explain the difference between them.
>

I've updated the function header to explain this. The slot sync worker
is a specific background worker while the API runs in the regular
backend, so special handling is not needed.


> $primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub2_slot');");
> +$primary->psql('postgres', "SELECT pg_drop_replication_slot('lsub1_slot');");
>
> I think, the intention behind dropping the slot is to be able to
> create it again for the next test. But there is no comment here
> explaining that. There is also no comment explaining why we are
> dropping both slots here; when the test only needs dropping one.
> That's going to create confusion. One might think that all the slots
> need to be dropped at this stage, and drop and create any future slots
> that are used by prior code, for example. At the end of this test, we
> recreate the slot using pg_create_logical_replication_slot(), which is
> different method of creating slot than this test does. Though I can
> understand the reason, it's not apparent. Generally reusing slot names
> across multiple tests (in this file) is a source of confusion. But at
> least for the test you are adding, you could use a different slot name
> to avoid confusion.

I've added a comment there that dropping both the slots is required
for the next test. Also I cannot change the name of the slot as the
next tests need the same slot synced.

> +# Create some DDL on the primary so that the slot lags behind the standby
> +$primary->safe_psql('postgres', "CREATE TABLE push_wal (a int);");
> +
> +# Attempt to synchronize slots using API. The API will continue retrying
> +# synchronization until the remote slot catches up.
> +# The API will not return until this happens, to be able to make
> +# further calls, call the API in a background process.
> +my $log_offset = -s $standby1->logfile;
> +
> +my $h = $standby1->background_psql('postgres', on_error_stop => 0);
> +
> +$h->query_until(qr//, "SELECT pg_sync_replication_slots();\n");
>
> If the standby does not receive the WAL corresponding to the DDL
> before this function is executed, the slot will get synchronized
> immediately. I think we have to make sure that the standby has
> received the DDL before executing this function.

Yes, I've added a line to make sure that the stanbdy has caught up.

>
> Also most of the code which uses query_until has pattern like this:
> $h->query_until(qr/start/, q{\echo start
> SQL command});
> But we expect an empty string here. Why this difference?
>

I've modified it as suggested.

> I think we need a similar test to test promotion while the function is
> waiting for the slot to become sync-ready.
>

Unfortunately, that will make this test too long if I add one more
wait loop for slot sync.

> SyncReplicationSlots() and the main loop in ReplSlotSyncWorkerMain()
> are similar with some functional differences. Some part of their code
> needs to be kept in sync in future. How do we achieve that? At least
> we need a comment saying so in each of those patches and keep those
> two codes in proximity.
>

I've added a comment in the header of ReplSlotSyncWorkerMain to suggest this.

Attaching patch v23 addressing these comments.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

v23-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Amit Kapila

Дата:

19 ноября 2025 г., 09:10:42

On Wed, Nov 12, 2025 at 1:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Attaching patch v23 addressing these comments.
>

Few comments:
=============
1.
In contrast, automatic synchronization
       via <varname>sync_replication_slots</varname> provides continuous slot
       updates, enabling seamless failover and supporting high availability.
-      Therefore, it is the recommended method for synchronizing slots.

I think a slotsync worker should still be a recommended method. So, we
shouldn't remove the last line.

2. I think we can unify slotsync_api_reread_config() and
ProcessSlotSyncAPIInterrupts() with corresponding existing functions
for slotsync worker. Having separate functions for API and worker to
handle interrupts looks odd and bug-prone w.r.t future changes in this
area.

3.
+/*
+ * Helper function to extract slot names from a list of remote slots
+ */
+static List *
+extract_slot_names(List *remote_slots)
+{
+ List *slot_names = NIL;
+ MemoryContext oldcontext;
+
+ /* Switch to long-lived TopMemoryContext to store slot names */
+ oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+
+ foreach_ptr(RemoteSlot, remote_slot, remote_slots)

Why did we allocate this memory in TopMemoryContext here? We only need
till this function executes, isn't CurrentMemoryContext (which I think
is ExprContext) sufficient, if not, why? If we use some
function/query-level context then we don't need to make it part of
SlotSyncApiFailureParams. If we can get rid of slot_names from
SlotSyncApiFailureParams then we probably don't need struct
SlotSyncApiFailureParams.

--
With Regards,
Amit Kapila.

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

19 ноября 2025 г., 10:04:54

On Wed, Nov 12, 2025 at 1:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
>
> Attaching patch v23 addressing these comments.
>

Thanks for the patch.

I observed that if the API is taking a nap in between slot sync cycles
and a promotion is triggered during that time, the promotion has to
wait for the entire nap period to finish before slot-sync stops and
the process can continue. There should be a mechanism to wake up the
backend so the API can exit early once stopSignaled is set. How about
doing SetLatch for the process doing synchronization in
ShutDownSlotSync()?

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

21 ноября 2025 г., 06:44:25

On Wed, Nov 19, 2025 at 5:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 12, 2025 at 1:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > Attaching patch v23 addressing these comments.
> >
>
> Few comments:
> =============
> 1.
> In contrast, automatic synchronization
>        via <varname>sync_replication_slots</varname> provides continuous slot
>        updates, enabling seamless failover and supporting high availability.
> -      Therefore, it is the recommended method for synchronizing slots.
>
> I think a slotsync worker should still be a recommended method. So, we
> shouldn't remove the last line.
>

Changed.

> 2. I think we can unify slotsync_api_reread_config() and
> ProcessSlotSyncAPIInterrupts() with corresponding existing functions
> for slotsync worker. Having separate functions for API and worker to
> handle interrupts looks odd and bug-prone w.r.t future changes in this
> area.
>

I’ve refactored and unified slotsync_api_reread_config() and
ProcessSlotSyncAPIInterrupts() with their existing counterparts. As
part of this, I switched the shutdown signal for slot-syncing workers
and backends from SIGINT to SIGUSR1, so that all slot-synchronization
processes use a consistent signaling model. Background workers handle
SIGINT via StatementCancelHandler, which is not suitable for
coordinated shutdowns, but SIGUSR1 reliably sets the process latch and
wakes up both worker types. Once awakened, these processes invoke
ProcessSlotSyncInterrupts(), which now checks
SlotSyncCtx->stopSignaled to perform a clean shutdown with appropriate
logs.

> 3.
> +/*
> + * Helper function to extract slot names from a list of remote slots
> + */
> +static List *
> +extract_slot_names(List *remote_slots)
> +{
> + List *slot_names = NIL;
> + MemoryContext oldcontext;
> +
> + /* Switch to long-lived TopMemoryContext to store slot names */
> + oldcontext = MemoryContextSwitchTo(TopMemoryContext);
> +
> + foreach_ptr(RemoteSlot, remote_slot, remote_slots)
>
> Why did we allocate this memory in TopMemoryContext here? We only need
> till this function executes, isn't CurrentMemoryContext (which I think
> is ExprContext) sufficient, if not, why? If we use some
> function/query-level context then we don't need to make it part of
> SlotSyncApiFailureParams. If we can get rid of slot_names from
> SlotSyncApiFailureParams then we probably don't need struct
> SlotSyncApiFailureParams.
>

In further testing, I've found that a Transaction is always started
when pg_sync_replication_slots() is called, and there was no need to
start new transactions and there was no need for a seperate memory
context for remote_slots. I think the confusion was probably from an
earlier bug in my code. I have added an assert to make sure that
transactions are started when in the API (should I make it an error
instead?).

On Wed, Nov 19, 2025 at 6:05 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Wed, Nov 12, 2025 at 1:54 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> >
> > Attaching patch v23 addressing these comments.
> >
>
> Thanks for the patch.
>
> I observed that if the API is taking a nap in between slot sync cycles
> and a promotion is triggered during that time, the promotion has to
> wait for the entire nap period to finish before slot-sync stops and
> the process can continue. There should be a mechanism to wake up the
> backend so the API can exit early once stopSignaled is set. How about
> doing SetLatch for the process doing synchronization in
> ShutDownSlotSync()?
>

Yes. To address this, I now also store the background worker pid which
is calling pg_sync_replication_slots() in SlotSyncCtx->pid, and I've
modified the ShutDownSlotSync logic to issue a SIGUSR1 which will
SetLatch on SlotSyncCtx->pid.

Attaching patch v24, addressing the above comments.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

v24-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

21 ноября 2025 г., 12:10:14

On Fri, Nov 21, 2025 at 9:14 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
>
> Attaching patch v24, addressing the above comments.
>

Thanks for the patch. Please find a few comments:

1)
Instead of passing an argument to slotsync_reread_config and
ProcessSlotSyncInterrupts, we can use 'AmLogicalSlotSyncWorkerProcess'
to distinguish the worker and API.

2)
Also, since we are not using a separate memory context, we don't need
to make structure 'SlotSyncApiFailureParams' for slot_names failure.
slot_names will be freed with the memory-context itself when
exec_simple_query finishes.

3)
- if (old_sync_replication_slots != sync_replication_slots)
+ /* Worker-specific check for sync_replication_slots change */
+ if (!from_api && old_sync_replication_slots != sync_replication_slots)
  {
  ereport(LOG,
- /* translator: %s is a GUC variable name */
- errmsg("replication slot synchronization worker will shut down
because \"%s\" is disabled", "sync_replication_slots"));
+ /* translator: %s is a GUC variable name */
+ errmsg("replication slot synchronization worker will shut down
because \"%s\" is disabled",
+    "sync_replication_slots"));
  proc_exit(0);
  }

Here, we need not to have different flow for api and worker. Both can
quit sync when this parameter is changed. The idea is if someone
enables 'sync_replication_slots' when API is working, that means we
need to start slot-sync worker, so it is okay if the API notices this
and exits too.

4)
+ if (from_api)
+ elevel = ERROR;
+ else
+ elevel = LOG;

- proc_exit(0);
+ ereport(elevel,
+ errmsg("replication slot synchronization will stop because of a
parameter change"));
+

We can do:
ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR, ...);

5)
 SlotSyncCtx->syncing = true;

+ /* The worker pid must not be already assigned in SlotSyncCtx */
+ Assert(SlotSyncCtx->pid == InvalidPid);
+

We can shift Assert before we set the shared-memory flag 'SlotSyncCtx->syncing'.

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

24 ноября 2025 г., 09:42:28

On Fri, Nov 21, 2025 at 8:10 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Nov 21, 2025 at 9:14 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> >
> > Attaching patch v24, addressing the above comments.
> >
>
> Thanks for the patch. Please find a few comments:
>
>
> 1)
> Instead of passing an argument to slotsync_reread_config and
> ProcessSlotSyncInterrupts, we can use 'AmLogicalSlotSyncWorkerProcess'
> to distinguish the worker and API.
>

Changed as such.

> 2)
> Also, since we are not using a separate memory context, we don't need
> to make structure 'SlotSyncApiFailureParams' for slot_names failure.
> slot_names will be freed with the memory-context itself when
> exec_simple_query finishes.
>

Removed.

> 3)
> - if (old_sync_replication_slots != sync_replication_slots)
> + /* Worker-specific check for sync_replication_slots change */
> + if (!from_api && old_sync_replication_slots != sync_replication_slots)
>   {
>   ereport(LOG,
> - /* translator: %s is a GUC variable name */
> - errmsg("replication slot synchronization worker will shut down
> because \"%s\" is disabled", "sync_replication_slots"));
> + /* translator: %s is a GUC variable name */
> + errmsg("replication slot synchronization worker will shut down
> because \"%s\" is disabled",
> +    "sync_replication_slots"));
>   proc_exit(0);
>   }
>
> Here, we need not to have different flow for api and worker. Both can
> quit sync when this parameter is changed. The idea is if someone
> enables 'sync_replication_slots' when API is working, that means we
> need to start slot-sync worker, so it is okay if the API notices this
> and exits too.
>

changed but used a different error message.

> 4)
> + if (from_api)
> + elevel = ERROR;
> + else
> + elevel = LOG;
>
> - proc_exit(0);
> + ereport(elevel,
> + errmsg("replication slot synchronization will stop because of a
> parameter change"));
> +
>
> We can do:
> ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR, ...);
>

changed as such.

> 5)
>  SlotSyncCtx->syncing = true;
>
> + /* The worker pid must not be already assigned in SlotSyncCtx */
> + Assert(SlotSyncCtx->pid == InvalidPid);
> +
>
> We can shift Assert before we set the shared-memory flag 'SlotSyncCtx->syncing'.
>

Done.

Attaching patch v25 addressing the above comments.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

v25-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

26 ноября 2025 г., 08:01:21

On Mon, Nov 24, 2025 at 12:12 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Attaching patch v25 addressing the above comments.
>

The patch needs rebase due to recent commit 76b78721.

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

26 ноября 2025 г., 11:42:29

 A few comments:

1)
+/*
+ * Structure holding parameters that need to be freed on error in
+ * pg_sync_replication_slots()
+ */
+typedef struct SlotSyncApiFailureParams
+{
+ WalReceiverConn *wrconn;
+ List *slot_names;
+} SlotSyncApiFailureParams;
+

We can get rid of it now as we do not use it.

2)
ProcessSlotSyncInterrupts():

+ if (!AmLogicalSlotSyncWorkerProcess())
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot continue replication slots synchronization"
+    " as standby promotion is triggered"));
+ else
+ {

Can we please reverse the if-else i.e. first worker and then API.
Negated if-condition can be avoided in this case.

3)

slotsync_reread_config():
+ /* Worker-specific check for sync_replication_slots change */

Now since we check for both API and worker, this comment is not needed.

4)
- ereport(LOG,
- errmsg("replication slot synchronization worker will restart because
of a parameter change"));

- /*
- * Reset the last-start time for this worker so that the postmaster
- * can restart it without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
- */
- SlotSyncCtx->last_start_time = 0;
+ ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
+ errmsg("replication slot synchronization will stop because of a
parameter change"));

Here, we should retain the same old message for worker i.e. 'worker
will restart..'. instead of 'synchronization will stop'. I find the
old message better in this case.

5)
slotsync_reread_config() is slightly difficult to follow.
I think in the case of API, we can display a common error message
instead of 2 different messages for 'sync_replication_slot change' and
the rest of the parameters. We can mark if any of the parameters
changed in both 'if' blocks and if the current process has not exited,
then at the end based on 'parameter-changed', we can deal with API by
giving a common message. Something like:

/*
 * If we have reached here with a parameter change, we must be running
 * in SQL function, emit error in such a case.
 */
if (parameter_changed (new variable))
{
Assert (!AmLogicalSlotSyncWorkerProcess);
ereport(ERROR,
errmsg("replication slot synchronization will stop because of a
parameter change"));
}

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

28 ноября 2025 г., 07:46:18

On Wed, Nov 26, 2025 at 7:42 PM shveta malik <shveta.malik@gmail.com> wrote:
>
>  A few comments:
>
> 1)
> +/*
> + * Structure holding parameters that need to be freed on error in
> + * pg_sync_replication_slots()
> + */
> +typedef struct SlotSyncApiFailureParams
> +{
> + WalReceiverConn *wrconn;
> + List *slot_names;
> +} SlotSyncApiFailureParams;
> +
>
> We can get rid of it now as we do not use it.
>

Removed.

> 2)
> ProcessSlotSyncInterrupts():
>
> + if (!AmLogicalSlotSyncWorkerProcess())
> + ereport(ERROR,
> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot continue replication slots synchronization"
> +    " as standby promotion is triggered"));
> + else
> + {
>
> Can we please reverse the if-else i.e. first worker and then API.
> Negated if-condition can be avoided in this case.
>

Changed.

> 3)
>
> slotsync_reread_config():
> + /* Worker-specific check for sync_replication_slots change */
>
> Now since we check for both API and worker, this comment is not needed.
>

Removed.

> 4)
> - ereport(LOG,
> - errmsg("replication slot synchronization worker will restart because
> of a parameter change"));
>
> - /*
> - * Reset the last-start time for this worker so that the postmaster
> - * can restart it without waiting for SLOTSYNC_RESTART_INTERVAL_SEC.
> - */
> - SlotSyncCtx->last_start_time = 0;
> + ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
> + errmsg("replication slot synchronization will stop because of a
> parameter change"));
>
> Here, we should retain the same old message for worker i.e. 'worker
> will restart..'. instead of 'synchronization will stop'. I find the
> old message better in this case.
>
> 5)
> slotsync_reread_config() is slightly difficult to follow.
> I think in the case of API, we can display a common error message
> instead of 2 different messages for 'sync_replication_slot change' and
> the rest of the parameters. We can mark if any of the parameters
> changed in both 'if' blocks and if the current process has not exited,
> then at the end based on 'parameter-changed', we can deal with API by
> giving a common message. Something like:
>
> /*
>  * If we have reached here with a parameter change, we must be running
>  * in SQL function, emit error in such a case.
>  */
> if (parameter_changed (new variable))
> {
> Assert (!AmLogicalSlotSyncWorkerProcess);
> ereport(ERROR,
> errmsg("replication slot synchronization will stop because of a
> parameter change"));
> }
>

Fixed as above.

I've addressed the above comments as well as rebased the patch based
on changes in commit 76b7872 in patch v26

regards,
Ajin Cherian
Fujitsu Australia

Вложения

v26-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

28 ноября 2025 г., 09:25:18

On Fri, Nov 28, 2025 at 10:16 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
>
> Fixed as above.
>
> I've addressed the above comments as well as rebased the patch based
> on changes in commit 76b7872 in patch v26
>

Thanks for the patch. Please find a few trivial comments:

1)
+ if (AmLogicalSlotSyncWorkerProcess())
+ Assert(sync_replication_slots);

Here too we can use 'worker'.

2)
+ /* check for sync_replication_slots change */

check --> Check

3)
Assert (!worker)
Extra space in between.

4)
check_and_set_sync_info() and ShutDownSlotSync() refers to the pid as
worker_pid. But now it could be backend-pid as well.
Using 'worker' in this variable could be misleading. Shall we make it
sync_process_pid?

5)
/*
 * Interrupt handler for main loop of slot sync worker.
 */
static void
ProcessSlotSyncInterrupts()

We can modify the comment to include API as well.

6)
/*
 * Shut down the slot sync worker.
 *
 * This function sends signal to shutdown slot sync worker, if required. It
 * also waits till the slot sync worker has exited or
 * pg_sync_replication_slots() has finished.
 */
void
ShutDownSlotSync(void)

We should change comments to give details on API as well.

7)
+# Remove the standby from the synchronized_standby_slots list and reload the
+# configuration.
+$primary->adjust_conf('postgresql.conf', 'synchronized_standby_slots', "''");
+$primary->reload;

We can update the comment to below for better clarity:
Remove the dropped sb1_slot from the ...

8)
+# Attempt to synchronize slots using API. The API will continue retrying
+# synchronization until the remote slot catches up.
+# The API will not return until this happens, to be able to make
+# further calls, call the API in a background process.

We can move these comments atop:
my $h = $standby2->background_psql('postgres', on_error_stop => 0);

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

02 декабря 2025 г., 11:27:57

On Fri, Nov 28, 2025 at 5:03 PM Japin Li <japinli@hotmail.com> wrote:
>
> 1.
> Initialize slot_persistence_pending to false (to avoid uninitialized values, or
> initialize to true by mistaken) in update_and_persist_local_synced_slot(). This
> aligns with the handling of found_consistent_snapshot and remote_slot_precedes
> in update_local_synced_slot().
>
> diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
> index 20eada3393..c55ba11f17 100644
> --- a/src/backend/replication/logical/slotsync.c
> +++ b/src/backend/replication/logical/slotsync.c
> @@ -617,6 +617,9 @@ update_and_persist_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid,
>         bool            found_consistent_snapshot = false;
>         bool            remote_slot_precedes = false;
>
> +       if (slot_persistence_pending)
> +               *slot_persistence_pending = false;
> +
>         /* Slotsync skip stats are handled in function update_local_synced_slot() */
>         (void) update_local_synced_slot(remote_slot, remote_dbid,
>                                                                         &found_consistent_snapshot,
>

I don't understand what the comment is here.

> 2.
> This change seems unnecessary。
>
>  static void
> -slotsync_reread_config(void)
> +slotsync_reread_config()
>
>  static void
> -ProcessSlotSyncInterrupts(void)
> +ProcessSlotSyncInterrupts()
>

Fixed.

> 3.
> Since we are already caching the result of AmLogicalSlotSyncWorkerProcess() in
> a local worker variable, how about applying this replacement:
> s/if (AmLogicalSlotSyncWorkerProcess())/if (worker)/g?
>
> +       bool            worker = AmLogicalSlotSyncWorkerProcess();
> +       bool            parameter_changed = false;
>
> -       Assert(sync_replication_slots);
> +       if (AmLogicalSlotSyncWorkerProcess())
> +               Assert(sync_replication_slots);
>

Fixed.



On Fri, Nov 28, 2025 at 5:25 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> Thanks for the patch. Please find a few trivial comments:
>
> 1)
> + if (AmLogicalSlotSyncWorkerProcess())
> + Assert(sync_replication_slots);
>
> Here too we can use 'worker'.
>

Fixed.

> 2)
> + /* check for sync_replication_slots change */
>
> check --> Check
>

Fixed.

> 3)
> Assert (!worker)
> Extra space in between.
>

Fixed

> 4)
> check_and_set_sync_info() and ShutDownSlotSync() refers to the pid as
> worker_pid. But now it could be backend-pid as well.
> Using 'worker' in this variable could be misleading. Shall we make it
> sync_process_pid?
>

Changed.

> 5)
> /*
>  * Interrupt handler for main loop of slot sync worker.
>  */
> static void
> ProcessSlotSyncInterrupts()
>
> We can modify the comment to include API as well.

Changed.

>
> 6)
> /*
>  * Shut down the slot sync worker.
>  *
>  * This function sends signal to shutdown slot sync worker, if required. It
>  * also waits till the slot sync worker has exited or
>  * pg_sync_replication_slots() has finished.
>  */
> void
> ShutDownSlotSync(void)
>
> We should change comments to give details on API as well.
>

changed.

> 7)
> +# Remove the standby from the synchronized_standby_slots list and reload the
> +# configuration.
> +$primary->adjust_conf('postgresql.conf', 'synchronized_standby_slots', "''");
> +$primary->reload;
>
> We can update the comment to below for better clarity:
> Remove the dropped sb1_slot from the ...
>

Changed.

> 8)
> +# Attempt to synchronize slots using API. The API will continue retrying
> +# synchronization until the remote slot catches up.
> +# The API will not return until this happens, to be able to make
> +# further calls, call the API in a background process.
>
> We can move these comments atop:
> my $h = $standby2->background_psql('postgres', on_error_stop => 0);
>

Changed.

Attached patch v27 addresses the above comments.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

v27-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

02 декабря 2025 г., 12:35:26

On Tue, Dec 2, 2025 at 1:58 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
>
> Attached patch v27 addresses the above comments.
>

Thanks for the patch. Please find a few comments:

1)
+ /* The worker pid must not be already assigned in SlotSyncCtx */
+ Assert(SlotSyncCtx->pid == InvalidPid);
+

We can mention just 'pid' here instead of 'worker pid'

2)
+ /*
+ * The syscache access in fetch_or_refresh_remote_slots() needs a
+ * transaction env.
+ */
fetch_or_refresh_remote_slots --> fetch_remote_slots

3)
 SyncReplicationSlots(WalReceiverConn *wrconn)
 {
+

We can get rid of this blank line at the start of the function.

4)
 /*
  * Shut down the slot sync worker.
  *
- * This function sends signal to shutdown slot sync worker, if required. It
+ * This function sends signal to shutdown the slot sync process, if
required. It
  * also waits till the slot sync worker has exited or
  * pg_sync_replication_slots() has finished.
  */
Shall we change comment to something like (rephrase if required):

Shut down the slot synchronization.
This function wakes up the slot sync process (either worker or backend
running SQL function) and sets stopSignaled=true
so that worker can exit or SQL function pg_sync_replication_slots()
can finish. It also waits till the slot sync worker has exited or
pg_sync_replication_slots() has finished.

5)
We should change the comment atop 'SlotSyncCtxStruct' as well to
mention that this pid is either the slot sync worker's pid or
backend's pid running the SQL function. It is needed by the startup
process to wake these up, so that they can stop synchronization on
seeing stopSignaled.  <please rephrase as needed>

6)
+ ereport(LOG,
+ errmsg("replication slot synchronization worker is shutting down on
receiving SIGUSR1"));

SIGUSR1 was actually just a wake-up signal. We may change the comment to:
replication slot synchronization worker is shutting down as promotion
is triggered.

7)
update_synced_slots_inactive_since:
/* The slot sync worker or SQL function mustn't be running by now */
- Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);
+ Assert(!SlotSyncCtx->syncing);

Regarding this, I see that 'update_synced_slots_inactive_since' is
only called when we are sure that 'syncing' is false. So shouldn't pid
be also Invalid by that time? Even if it was backend's pid to start
with, but since backend has stopped syncing (finished or error-ed
out),
pid should be reset to Invalid in such a case. And this Assert need
not to be changed.

8)

+ if (sync_process_pid != InvalidPid)
+ kill(sync_process_pid, SIGUSR1);

We can write comments to say wake-up slot sync process.

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

03 декабря 2025 г., 06:21:35

On Tue, Dec 2, 2025 at 8:35 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Tue, Dec 2, 2025 at 1:58 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> >
> > Attached patch v27 addresses the above comments.
> >
>
> Thanks for the patch. Please find a few comments:
>
> 1)
> + /* The worker pid must not be already assigned in SlotSyncCtx */
> + Assert(SlotSyncCtx->pid == InvalidPid);
> +
>
> We can mention just 'pid' here instead of 'worker pid'
>

Changed.

> 2)
> + /*
> + * The syscache access in fetch_or_refresh_remote_slots() needs a
> + * transaction env.
> + */
> fetch_or_refresh_remote_slots --> fetch_remote_slots
>
> 3)
>  SyncReplicationSlots(WalReceiverConn *wrconn)
>  {
> +
>
> We can get rid of this blank line at the start of the function.
>

Fixed.

> 4)
>  /*
>   * Shut down the slot sync worker.
>   *
> - * This function sends signal to shutdown slot sync worker, if required. It
> + * This function sends signal to shutdown the slot sync process, if
> required. It
>   * also waits till the slot sync worker has exited or
>   * pg_sync_replication_slots() has finished.
>   */
> Shall we change comment to something like (rephrase if required):
>
> Shut down the slot synchronization.
> This function wakes up the slot sync process (either worker or backend
> running SQL function) and sets stopSignaled=true
> so that worker can exit or SQL function pg_sync_replication_slots()
> can finish. It also waits till the slot sync worker has exited or
> pg_sync_replication_slots() has finished.
>

Changed.

> 5)
> We should change the comment atop 'SlotSyncCtxStruct' as well to
> mention that this pid is either the slot sync worker's pid or
> backend's pid running the SQL function. It is needed by the startup
> process to wake these up, so that they can stop synchronization on
> seeing stopSignaled.  <please rephrase as needed>
>

Changed.

> 6)
> + ereport(LOG,
> + errmsg("replication slot synchronization worker is shutting down on
> receiving SIGUSR1"));
>
> SIGUSR1 was actually just a wake-up signal. We may change the comment to:
> replication slot synchronization worker is shutting down as promotion
> is triggered.
>

Changed.

> 7)
> update_synced_slots_inactive_since:
> /* The slot sync worker or SQL function mustn't be running by now */
> - Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);
> + Assert(!SlotSyncCtx->syncing);
>
> Regarding this, I see that 'update_synced_slots_inactive_since' is
> only called when we are sure that 'syncing' is false. So shouldn't pid
> be also Invalid by that time? Even if it was backend's pid to start
> with, but since backend has stopped syncing (finished or error-ed
> out),
> pid should be reset to Invalid in such a case. And this Assert need
> not to be changed.
>

Fixed.

> 8)
>
> + if (sync_process_pid != InvalidPid)
> + kill(sync_process_pid, SIGUSR1);
>
> We can write comments to say wake-up slot sync process.
>

Added comments.

Attaching patch v28 addressing these comments.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

v28-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

03 декабря 2025 г., 07:00:49

On Wed, Dec 3, 2025 at 8:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Dec 2, 2025 at 8:35 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Tue, Dec 2, 2025 at 1:58 PM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > >
> > > Attached patch v27 addresses the above comments.
> > >
> >
> > Thanks for the patch. Please find a few comments:
> >
> > 1)
> > + /* The worker pid must not be already assigned in SlotSyncCtx */
> > + Assert(SlotSyncCtx->pid == InvalidPid);
> > +
> >
> > We can mention just 'pid' here instead of 'worker pid'
> >
>
> Changed.
>
> > 2)
> > + /*
> > + * The syscache access in fetch_or_refresh_remote_slots() needs a
> > + * transaction env.
> > + */
> > fetch_or_refresh_remote_slots --> fetch_remote_slots
> >
> > 3)
> >  SyncReplicationSlots(WalReceiverConn *wrconn)
> >  {
> > +
> >
> > We can get rid of this blank line at the start of the function.
> >
>
> Fixed.
>
> > 4)
> >  /*
> >   * Shut down the slot sync worker.
> >   *
> > - * This function sends signal to shutdown slot sync worker, if required. It
> > + * This function sends signal to shutdown the slot sync process, if
> > required. It
> >   * also waits till the slot sync worker has exited or
> >   * pg_sync_replication_slots() has finished.
> >   */
> > Shall we change comment to something like (rephrase if required):
> >
> > Shut down the slot synchronization.
> > This function wakes up the slot sync process (either worker or backend
> > running SQL function) and sets stopSignaled=true
> > so that worker can exit or SQL function pg_sync_replication_slots()
> > can finish. It also waits till the slot sync worker has exited or
> > pg_sync_replication_slots() has finished.
> >
>
> Changed.
>
> > 5)
> > We should change the comment atop 'SlotSyncCtxStruct' as well to
> > mention that this pid is either the slot sync worker's pid or
> > backend's pid running the SQL function. It is needed by the startup
> > process to wake these up, so that they can stop synchronization on
> > seeing stopSignaled.  <please rephrase as needed>
> >
>
> Changed.
>
> > 6)
> > + ereport(LOG,
> > + errmsg("replication slot synchronization worker is shutting down on
> > receiving SIGUSR1"));
> >
> > SIGUSR1 was actually just a wake-up signal. We may change the comment to:
> > replication slot synchronization worker is shutting down as promotion
> > is triggered.
> >
>
> Changed.
>
> > 7)
> > update_synced_slots_inactive_since:
> > /* The slot sync worker or SQL function mustn't be running by now */
> > - Assert((SlotSyncCtx->pid == InvalidPid) && !SlotSyncCtx->syncing);
> > + Assert(!SlotSyncCtx->syncing);
> >
> > Regarding this, I see that 'update_synced_slots_inactive_since' is
> > only called when we are sure that 'syncing' is false. So shouldn't pid
> > be also Invalid by that time? Even if it was backend's pid to start
> > with, but since backend has stopped syncing (finished or error-ed
> > out),
> > pid should be reset to Invalid in such a case. And this Assert need
> > not to be changed.
> >
>
> Fixed.
>
> > 8)
> >
> > + if (sync_process_pid != InvalidPid)
> > + kill(sync_process_pid, SIGUSR1);
> >
> > We can write comments to say wake-up slot sync process.
> >
>
> Added comments.
>
> Attaching patch v28 addressing these comments.
>

Thanks for the patch. A few trivial comments:

1)
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot continue replication slots synchronization"
+    " as standby promotion is triggered"));

slots->slot, since in every error-message we use 'replication slot
synchronization'

2)
+ * The pid is either the slot sync worker's pid or the backend's pid running
+ * the SQL function pg_sync_replication_slots(). It is needed by the startup
+ * process to wake these up, so that they can stop synchronization on seeing
+ * stopSignaled on promotion.
+ * Setting stopSignaled is also used to handle the race condition when the

Can we rephrase slightly to indicate clearly that it is the startup
process which sets 'stopSignaled' during promotion. Suggestion:

The pid is either the slot sync worker’s pid or the backend’s pid running
the SQL function pg_sync_replication_slots(). When the startup process
sets stopSignaled during promotion, it uses this pid to wake the
currently synchronizing process so that the process can immediately stop its
synchronization work upon seeing stopSignaled set to true.
Setting stopSignaled....

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Amit Kapila

Дата:

03 декабря 2025 г., 14:19:03

On Wed, Dec 3, 2025 at 8:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Attaching patch v28 addressing these comments.
>

Can we extract the part of the patch that handles SIGUSR1 signal
separately as a first patch and the remaining as a second patch?
Please do mention the reason in the commit message as to why we are
changing the signal for SIGINT to SIGUSR1.

--
With Regards,
Amit Kapila.

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

04 декабря 2025 г., 08:21:39

On Wed, Dec 3, 2025 at 10:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Dec 3, 2025 at 8:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > Attaching patch v28 addressing these comments.
> >
>
> Can we extract the part of the patch that handles SIGUSR1 signal
> separately as a first patch and the remaining as a second patch?
> Please do mention the reason in the commit message as to why we are
> changing the signal for SIGINT to SIGUSR1.
>

I have extracted out the SIGUSR1 signal handling changes separately
into a patch and sharing. I will share the next patch later.
Let me know if there are any comments for this patch.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

v29-0001-Use-SIGUSR1-to-interrupt-slot-synchronization-wo.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

04 декабря 2025 г., 09:03:50

On Thu, Dec 4, 2025 at 10:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Dec 3, 2025 at 10:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Dec 3, 2025 at 8:51 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > Attaching patch v28 addressing these comments.
> > >
> >
> > Can we extract the part of the patch that handles SIGUSR1 signal
> > separately as a first patch and the remaining as a second patch?
> > Please do mention the reason in the commit message as to why we are
> > changing the signal for SIGINT to SIGUSR1.
> >
>
> I have extracted out the SIGUSR1 signal handling changes separately
> into a patch and sharing. I will share the next patch later.
> Let me know if there are any comments for this patch.
>

I have just 2 trivial comments for v29-001:

1)
-   * receives a SIGINT from the startup process, or when there is an error.
+   * receives a SIGUSR1 from the startup process, or when there is an error.

In above we should mention stopSignaled rather than SIGUSR1, as
SIGUSR1 is just a wakeup signal and not termination signal.

 2)
+    else
+      ereport(ERROR,
+          errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+          errmsg("cannot continue replication slot synchronization"
+               " as standby promotion is triggered"));

Please mention that it is SQL-function in the comment for else-block.

~~

I tested the touched scenarios and here are the LOGs:

a)
When promotion is ongoing and the startup  process has terminated
slot-sync worker but if the postmaster has not noticed that, it may
end up starting slotsync worker again. For that scenario, we get
these:

11:03:19.712 IST [151559] LOG:  replication slot synchronization
worker is shutting down as promotion is triggered
11:03:19.726 IST [151629] LOG:  slot sync worker started
11:03:19.795 IST [151629] LOG:  replication slot synchronization
worker is shutting down as promotion is triggered

b)
On promotion, API gets this (originating from ProcessSlotSyncInterrupts now):
postgres=# SELECT pg_sync_replication_slots();
ERROR:  cannot continue replication slot synchronization as standby
promotion is triggered

c)
If any parameter is changed between ValidateSlotSyncParams() and
ProcessSlotSyncInterrupts() for API, we get this:
postgres=# SELECT pg_sync_replication_slots();
ERROR:  replication slot synchronization will stop because of a parameter change

--on re-run (originating from ValidateSlotSyncParams())
postgres=# SELECT pg_sync_replication_slots();
ERROR:  replication slot synchronization requires
"hot_standby_feedback" to be enabled

~~

The tested scenarios' behaviour looks good to me.

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

04 декабря 2025 г., 12:01:04

On Thu, Dec 4, 2025 at 5:04 PM shveta malik <shveta.malik@gmail.com> wrote:
>
>
> I have just 2 trivial comments for v29-001:
>
> 1)
> -   * receives a SIGINT from the startup process, or when there is an error.
> +   * receives a SIGUSR1 from the startup process, or when there is an error.
>
> In above we should mention stopSignaled rather than SIGUSR1, as
> SIGUSR1 is just a wakeup signal and not termination signal.
>

Fixed.

>  2)
> +    else
> +      ereport(ERROR,
> +          errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +          errmsg("cannot continue replication slot synchronization"
> +               " as standby promotion is triggered"));
>
> Please mention that it is SQL-function in the comment for else-block.
>

Fixed.

> ~~
>
> I tested the touched scenarios and here are the LOGs:
>

Thanks for testing!

Attaching patch v30 with the above changes addressed. I've also run
pgindent on the changes.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

v30-0001-Use-SIGUSR1-to-interrupt-slot-synchronization-wo.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

09 декабря 2025 г., 13:33:59

Hi all,

Since commit 04396ea [1] has been pushed, which included part of the
changes this patch set was addressing, I have updated and rebased the
patch set to incorporate those changes.

The patch set now contains two patches:

0001 – Modify the pg_sync_replication_slots API to also handle
promotion signals and stop synchronization, similar to the slot sync
worker.
0002 – Improve pg_sync_replication_slots to wait for and persist slots
until they are sync-ready.

Please review the updated patch set (v31).

Regards,
Ajin Cherian
Fujitsu Australia

[1] https://github.com/postgres/postgres/commit/04396eacd3faeaa4fa3d084a6749e4e384bdf0db

Вложения

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

09 декабря 2025 г., 14:45:24

On Tue, Dec 9, 2025 at 4:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Hi all,
>
> Since commit 04396ea [1] has been pushed, which included part of the
> changes this patch set was addressing, I have updated and rebased the
> patch set to incorporate those changes.
>
> The patch set now contains two patches:
>
> 0001 – Modify the pg_sync_replication_slots API to also handle
> promotion signals and stop synchronization, similar to the slot sync
> worker.
> 0002 – Improve pg_sync_replication_slots to wait for and persist slots
> until they are sync-ready.
>
> Please review the updated patch set (v31).
>

Thanks. Please find a few comments on 001:

1)
Commit message:
"That meant backends
that perform slot synchronization via the pg_sync_replication_slots()
SQL function were not signalled at all because their PIDs were not
recorded in the slot-sync context."

We should phrase it in the singular ('backend'), since multiple
backends cannot run simultaneously because concurrent sync is not
allowed.
Using the term 'their PIDs' gives an impression that there could be
multiple PIDs, which is not the case.

2)
  primary_slotname_changed = strcmp(old_primary_slotname, PrimarySlotName) != 0;
+
  pfree(old_primary_conninfo);

This change to put blank line is not needed.

3)

+ /* Check for sync_replication_slots change */
+ /* Check for parameter changes common to both API and worker */

IMO, these comments are not needed as it is self-explanatory. Even if
we plan to put these, both should be same, either both mentioning API
and worker or neither.

4)
- * receives a stop request from the startup process, or when there is an
+ * because receives a stop request from the startup process, or when
there is an

I think, this change is done by mistake.

5)
+ * Signal slotsync worker or backend process running
pg_sync_replication_slots()
+ * if running. The process will stop upon detecting that the stopSignaled
+ * flag is set to true.

Comment looks slightly odd. Shall we say:

Signal the slotsync worker or the backend process running
pg_sync_replication_slots(), if either one is active. The process...

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

10 декабря 2025 г., 05:23:10

On Tue, Dec 9, 2025 at 10:45 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Tue, Dec 9, 2025 at 4:04 PM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > Hi all,
> >
> > Since commit 04396ea [1] has been pushed, which included part of the
> > changes this patch set was addressing, I have updated and rebased the
> > patch set to incorporate those changes.
> >
> > The patch set now contains two patches:
> >
> > 0001 – Modify the pg_sync_replication_slots API to also handle
> > promotion signals and stop synchronization, similar to the slot sync
> > worker.
> > 0002 – Improve pg_sync_replication_slots to wait for and persist slots
> > until they are sync-ready.
> >
> > Please review the updated patch set (v31).
> >
>
> Thanks. Please find a few comments on 001:
>
> 1)
> Commit message:
> "That meant backends
> that perform slot synchronization via the pg_sync_replication_slots()
> SQL function were not signalled at all because their PIDs were not
> recorded in the slot-sync context."
>
> We should phrase it in the singular ('backend'), since multiple
> backends cannot run simultaneously because concurrent sync is not
> allowed.
> Using the term 'their PIDs' gives an impression that there could be
> multiple PIDs, which is not the case.
>

Fixed.

> 2)
>   primary_slotname_changed = strcmp(old_primary_slotname, PrimarySlotName) != 0;
> +
>   pfree(old_primary_conninfo);
>
> This change to put blank line is not needed.
>

Removed.

> 3)
>
> + /* Check for sync_replication_slots change */
> + /* Check for parameter changes common to both API and worker */
>
> IMO, these comments are not needed as it is self-explanatory. Even if
> we plan to put these, both should be same, either both mentioning API
> and worker or neither.
>

Removed.

> 4)
> - * receives a stop request from the startup process, or when there is an
> + * because receives a stop request from the startup process, or when
> there is an
>
> I think, this change is done by mistake.
>

Yes, removed.

> 5)
> + * Signal slotsync worker or backend process running
> pg_sync_replication_slots()
> + * if running. The process will stop upon detecting that the stopSignaled
> + * flag is set to true.
>
> Comment looks slightly odd. Shall we say:
>
> Signal the slotsync worker or the backend process running
> pg_sync_replication_slots(), if either one is active. The process...
>

Changed.

Attaching patch v32 addressing the above comments.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Chao Li

Дата:

10 декабря 2025 г., 05:28:39


> On Dec 9, 2025, at 18:33, Ajin Cherian <itsajin@gmail.com> wrote:
>
> Hi all,
>
> Since commit 04396ea [1] has been pushed, which included part of the
> changes this patch set was addressing, I have updated and rebased the
> patch set to incorporate those changes.
>
> The patch set now contains two patches:
>
> 0001 – Modify the pg_sync_replication_slots API to also handle
> promotion signals and stop synchronization, similar to the slot sync
> worker.
> 0002 – Improve pg_sync_replication_slots to wait for and persist slots
> until they are sync-ready.
>
> Please review the updated patch set (v31).
>
> Regards,
> Ajin Cherian
> Fujitsu Australia
>
> [1] https://github.com/postgres/postgres/commit/04396eacd3faeaa4fa3d084a6749e4e384bdf0db
>
<v31-0001-Signal-backends-running-pg_sync_replication_slot.patch><v31-0002-Improve-initial-slot-synchronization-in-pg_sync_.patch>

Hi Ajin,

I’d like to revisit this patch, but looks like 04396eacd3faeaa4fa3d084a6749e4e384bdf0db has some conflicts to this
patch.So can you please rebase this patch? 

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

10 декабря 2025 г., 05:40:10

On Wed, Dec 10, 2025 at 1:29 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> Hi Ajin,
>
> I’d like to revisit this patch, but looks like 04396eacd3faeaa4fa3d084a6749e4e384bdf0db has some conflicts to this
patch.So can you please rebase this patch? 
>
> Best regards,
> --

It's been rebased. Have a look at the latest version.

regards,
Ajin Cherian
Fujitsu Australia

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

10 декабря 2025 г., 07:05:01

On Wed, Dec 10, 2025 at 8:10 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Dec 10, 2025 at 1:29 PM Chao Li <li.evan.chao@gmail.com> wrote:
> >
> > Hi Ajin,
> >
> > I’d like to revisit this patch, but looks like 04396eacd3faeaa4fa3d084a6749e4e384bdf0db has some conflicts to this
patch.So can you please rebase this patch? 
> >
> > Best regards,
> > --
>
> It's been rebased. Have a look at the latest version.
>

Few comments on 001:

1)
/*
* Emit an error if a promotion or a concurrent sync call is in progress.
* Otherwise, advertise that a sync is in progress.
*/
static void
check_and_set_sync_info

We need to change this comment because now this function does not
handle promotion case.

2)
+  if (sync_process_pid!= InvalidPid)
+    kill(sync_process_pid, SIGUSR1);

We need to have space between sync_process_pid and '!='

3)
+ * Exit or throw errors if relevant GUCs have changed depending on whether

errors->error

4)
In slotsync_reread_config(), even when we mark parameter_changed=true
in the first if-block, we still go to the second if-block which was
not needed.  So shall we make second if-block as else-if to avoid
this? Thoughts?

5)
As discussed in [1], we can make this change in ProcessSlotSyncInterrupts():

'replication slot synchronization worker is shutting down because
promotion is triggered'
to
'replication slot synchronization worker will stop because promotion
is triggered'

[1]: https://www.postgresql.org/message-id/6AE56C64-F760-4CBD-BABF-72633D3F7B5E%40gmail.com

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Chao Li

Дата:

10 декабря 2025 г., 07:26:39


> On Dec 10, 2025, at 10:40, Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Dec 10, 2025 at 1:29 PM Chao Li <li.evan.chao@gmail.com> wrote:
>>
>> Hi Ajin,
>>
>> I’d like to revisit this patch, but looks like 04396eacd3faeaa4fa3d084a6749e4e384bdf0db has some conflicts to this
patch.So can you please rebase this patch? 
>>
>> Best regards,
>> --
>
> It's been rebased. Have a look at the latest version.
>

Here are some comments for v32.

1 - 0001
```
- * The slot sync worker's pid is needed by the startup process to shut it
- * down during promotion. The startup process shuts down the slot sync worker
- * and also sets stopSignaled=true to handle the race condition when the
+ * The pid is either the slot sync worker's pid or the backend's pid running
```

I think we should add single quotes for “pid” and “stopSignaled". Looking at other comment lines, structure fields all
havesingle-quoted: 
```
* The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot

* The 'last_start_time' is needed by postmaster to start the slot sync worker
```

2 - 0001 - the same code block as 1

I wonder how to distinct if the “pid” is a slot sync worker or a backend process?

3 - 0001
```
+    bool        worker = AmLogicalSlotSyncWorkerProcess();
```

The variable name “worker” doesn’t indicate a bool type, maybe renamed to “is_slotsync_worker”.

4 - 0001
```
+    /*
+     * If we have reached here with a parameter change, we must be running in
+     * SQL function, emit error in such a case.
+     */
+    if (parameter_changed)
+    {
+        Assert(!worker);
+        ereport(ERROR,
+                errmsg("replication slot synchronization will stop because of a parameter change"));
     }
```

The Assert(!worker) feels redundant, because it then immediately error out.

5 - 0001
```
+ * Exit or throw errors if relevant GUCs have changed depending on whether
+ * called from slotsync worker or from SQL function pg_sync_replication_slots()
```

Let’s change “slotsync” to “slot sync” because elsewhere comments all use “slot sync”, just to keep consistent.

6 - 0001
```
- * Interrupt handler for main loop of slot sync worker.
+ * Interrupt handler for main loop of slot sync worker and
+ * SQL function pg_sync_replication_slots().
```

Missing “the” before “SQL function”. This comment applies to multiple places.

7 - 0001
```
+    if (sync_process_pid!= InvalidPid)
+        kill(sync_process_pid, SIGUSR1);
```

Nit: missing a white space before “!="

8 - 0002
```
+    if (slot_names != NIL)
     {
-        StartTransactionCommand();
-        started_tx = true;
+        bool        first_slot = true;
+
+        /*
+         * Construct the query to fetch only the specified slots
+         */
+        appendStringInfoString(&query, " AND slot_name IN (");
+
+        foreach_ptr(char, slot_name, slot_names)
+        {
+            if (!first_slot)
+                appendStringInfoString(&query, ", ");
+
+            appendStringInfo(&query, "'%s'", slot_name);
+            first_slot = false;
+        }
+        appendStringInfoChar(&query, ')');
     }
```

The logic of appending “, “ can be slightly simplified as:
```
If (slot_names != NIL)
{
    Const char *sep = “”;
    appendStringInfoString(&query, " AND slot_name IN (“);
    foreach_ptr(char, slot_name, slot_names)
    {
        appendStringInfo(&query, “%s'%s'", sep, slot_name);
        sep = “, “;
    }
}
```

That saves a “if” check and a appendStringInfoString().

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

10 декабря 2025 г., 08:07:34

On Wed, Dec 10, 2025 at 3:05 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Wed, Dec 10, 2025 at 8:10 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Wed, Dec 10, 2025 at 1:29 PM Chao Li <li.evan.chao@gmail.com> wrote:
> > >
> > > Hi Ajin,
> > >
> > > I’d like to revisit this patch, but looks like 04396eacd3faeaa4fa3d084a6749e4e384bdf0db has some conflicts to
thispatch. So can you please rebase this patch? 
> > >
> > > Best regards,
> > > --
> >
> > It's been rebased. Have a look at the latest version.
> >
>
> Few comments on 001:
>
> 1)
> /*
> * Emit an error if a promotion or a concurrent sync call is in progress.
> * Otherwise, advertise that a sync is in progress.
> */
> static void
> check_and_set_sync_info
>
> We need to change this comment because now this function does not
> handle promotion case.

Fixed.

>
> 2)
> +  if (sync_process_pid!= InvalidPid)
> +    kill(sync_process_pid, SIGUSR1);
>
> We need to have space between sync_process_pid and '!='
>

Fixed.

> 3)
> + * Exit or throw errors if relevant GUCs have changed depending on whether
>
> errors->error
>

Fixed.

> 4)
> In slotsync_reread_config(), even when we mark parameter_changed=true
> in the first if-block, we still go to the second if-block which was
> not needed.  So shall we make second if-block as else-if to avoid
> this? Thoughts?
>

Changed as suggested.

> 5)
> As discussed in [1], we can make this change in ProcessSlotSyncInterrupts():
>
> 'replication slot synchronization worker is shutting down because
> promotion is triggered'
> to
> 'replication slot synchronization worker will stop because promotion
> is triggered'
>
> [1]: https://www.postgresql.org/message-id/6AE56C64-F760-4CBD-BABF-72633D3F7B5E%40gmail.com
>

Changed.


On Wed, Dec 10, 2025 at 3:27 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
>
>
>
> Here are some comments for v32.
>
> 1 - 0001
> ```
> - * The slot sync worker's pid is needed by the startup process to shut it
> - * down during promotion. The startup process shuts down the slot sync worker
> - * and also sets stopSignaled=true to handle the race condition when the
> + * The pid is either the slot sync worker's pid or the backend's pid running
> ```
>
> I think we should add single quotes for “pid” and “stopSignaled". Looking at other comment lines, structure fields
allhave single-quoted: 
> ```
> * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
>
> * The 'last_start_time' is needed by postmaster to start the slot sync worker
> ```
>

Changed as suggested.

> 2 - 0001 - the same code block as 1
>
> I wonder how to distinct if the “pid” is a slot sync worker or a backend process?
>

No, there is now way currently and iis not really required with the
current logic.


> 3 - 0001
> ```
> +       bool            worker = AmLogicalSlotSyncWorkerProcess();
> ```
>
> The variable name “worker” doesn’t indicate a bool type, maybe renamed to “is_slotsync_worker”.
>

Changed as suggested.

> 4 - 0001
> ```
> +       /*
> +        * If we have reached here with a parameter change, we must be running in
> +        * SQL function, emit error in such a case.
> +        */
> +       if (parameter_changed)
> +       {
> +               Assert(!worker);
> +               ereport(ERROR,
> +                               errmsg("replication slot synchronization will stop because of a parameter change"));
>         }
> ```
>
> The Assert(!worker) feels redundant, because it then immediately error out.
>

I don't think it is redundant as Asserts are used to catch unexpected
code paths in testing.

> 5 - 0001
> ```
> + * Exit or throw errors if relevant GUCs have changed depending on whether
> + * called from slotsync worker or from SQL function pg_sync_replication_slots()
> ```
>
> Let’s change “slotsync” to “slot sync” because elsewhere comments all use “slot sync”, just to keep consistent.
>

Changed.

> 6 - 0001
> ```
> - * Interrupt handler for main loop of slot sync worker.
> + * Interrupt handler for main loop of slot sync worker and
> + * SQL function pg_sync_replication_slots().
> ```
>
> Missing “the” before “SQL function”. This comment applies to multiple places.
>

Changed.


> 7 - 0001
> ```
> +       if (sync_process_pid!= InvalidPid)
> +               kill(sync_process_pid, SIGUSR1);
> ```
>
> Nit: missing a white space before “!="
>

Fixed.

> 8 - 0002
> ```
> +       if (slot_names != NIL)
>         {
> -               StartTransactionCommand();
> -               started_tx = true;
> +               bool            first_slot = true;
> +
> +               /*
> +                * Construct the query to fetch only the specified slots
> +                */
> +               appendStringInfoString(&query, " AND slot_name IN (");
> +
> +               foreach_ptr(char, slot_name, slot_names)
> +               {
> +                       if (!first_slot)
> +                               appendStringInfoString(&query, ", ");
> +
> +                       appendStringInfo(&query, "'%s'", slot_name);
> +                       first_slot = false;
> +               }
> +               appendStringInfoChar(&query, ')');
>         }
> ```
>
> The logic of appending “, “ can be slightly simplified as:
> ```
> If (slot_names != NIL)
> {
>     Const char *sep = “”;
>     appendStringInfoString(&query, " AND slot_name IN (“);
>     foreach_ptr(char, slot_name, slot_names)
>     {
>         appendStringInfo(&query, “%s'%s'", sep, slot_name);
>         sep = “, “;
>     }
> }
> ```
>
> That saves a “if” check and a appendStringInfoString().

I'm not sure if this is much of an improvement, I like the current
approach and matches with similar coding patterns in the code base.

Attaching v34 addressing the above comments.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

"Yilin Zhang"

Дата:

10 декабря 2025 г., 08:22:39

At 2025-12-10 13:07:34, "Ajin Cherian" <itsajin@gmail.com> wrote:

>
>I'm not sure if this is much of an improvement, I like the current
>approach and matches with similar coding patterns in the code base.
>
>Attaching v34 addressing the above comments.
>
Hi,
Few comments for v34.

1 - 0002
```
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -39,6 +39,12 @@
  * the last cycle. Refer to the comments above wait_for_slot_activity() for
  * more details.
  *
+ * If the SQL function pg_sync_replication is used to sync the slots, and if
```

Typo, it should be "pg_sync_replication_slots()" instead of "pg_sync_replication".



2 - 0002
```
+		/*
+		 * The syscache access in fetch_or_refresh_remote_slots() needs a
+		 * transaction env.
+		 */
```

Typo, it should be "fetch_remote_slots()" instead of "fetch_or_refresh_remote_slots()".



3 - 0002
```
+			appendStringInfo(&query, "'%s'", slot_name);
```

Instead of manually add single quotes to slot name, consider using quote_literal_cstr().

While I was reviewing patch v32, Ajin Cherian had already submitted patch v34, but these issues still persisted.



Best regards,
--
Yilin Zhang

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Amit Kapila

Дата:

10 декабря 2025 г., 15:23:50

On Wed, Dec 10, 2025 at 10:37 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> Attaching v34 addressing the above comments.
>

0001 looks mostly good to me. I have made minor edits in the comments
and added error_code for one of the error messages. Please check
attached and let me know what you think?

--
With Regards,
Amit Kapila.

Вложения

v35-0001-Signal-backends-running-pg_sync_replication_slot.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Chao Li

Дата:

10 декабря 2025 г., 16:34:26


> On Dec 10, 2025, at 20:23, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Dec 10, 2025 at 10:37 AM Ajin Cherian <itsajin@gmail.com> wrote:
>>
>> Attaching v34 addressing the above comments.
>>
>
> 0001 looks mostly good to me. I have made minor edits in the comments
> and added error_code for one of the error messages. Please check
> attached and let me know what you think?
>
> --
> With Regards,
> Amit Kapila.
> <v35-0001-Signal-backends-running-pg_sync_replication_slot.patch>

I saw you have integrated the missed change from the last push into v35, and v35 looks good to me.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

11 декабря 2025 г., 06:07:03

On Wed, Dec 10, 2025 at 5:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>
> 0001 looks mostly good to me. I have made minor edits in the comments
> and added error_code for one of the error messages. Please check
> attached and let me know what you think?
>

v35 looks good to me.

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

11 декабря 2025 г., 08:15:16

On Wed, Dec 10, 2025 at 4:23 PM Yilin Zhang <jiezhilove@126.com> wrote:
>
> Hi,
> Few comments for v34.
>

Thanks for your review!

I've addressed your comments.

As patch 0001 has been pushed. I've rebased and created a new version
v36 with the remaining patch.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

v36-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

11 декабря 2025 г., 11:57:38

On Thu, Dec 11, 2025 at 10:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Wed, Dec 10, 2025 at 4:23 PM Yilin Zhang <jiezhilove@126.com> wrote:
> >
> > Hi,
> > Few comments for v34.
> >
>
> Thanks for your review!
>
> I've addressed your comments.
>
> As patch 0001 has been pushed. I've rebased and created a new version
> v36 with the remaining patch.
>

Verified, the patch works well.

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Amit Kapila

Дата:

11 декабря 2025 г., 15:23:22

On Thu, Dec 11, 2025 at 10:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> As patch 0001 has been pushed. I've rebased and created a new version
> v36 with the remaining patch.
>

I have made a number of changes in code comments and docs. Kindly
review and if you are okay with these then include them in the next
version.

--
With Regards,
Amit Kapila.

Вложения

v36_amit_1.patch.txt

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Chao Li

Дата:

12 декабря 2025 г., 01:49:03

> On Dec 11, 2025, at 20:23, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Dec 11, 2025 at 10:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
>>
>> As patch 0001 has been pushed. I've rebased and created a new version
>> v36 with the remaining patch.
>>
>
> I have made a number of changes in code comments and docs. Kindly
> review and if you are okay with these then include them in the next
> version.
>

This diff enhanced docs and comments, overall looks good to me. A few nit comments:

1
```
- * Returns:
- *    List of remote slot information structures. Returns NIL if no slot
- *    is found.
- *
+ * Returns list of remote slot information structures, if any, otherwise,
+ * NIL if no slot is found.
```

I think “a” is needed before “list”, and “if any, otherwise,” looks rarely seen in code comments. So suggesting:
```
 * Returns a list of remote slot information structures, or NIL if none
 * are found.
```

2
```
- * Parameters:
- * wrconn - Connection to the primary server
- * remote_slot_list - List of RemoteSlot structures to synchronize.
- * slot_persistence_pending - boolean used by SQL function
- *                               pg_sync_replication_slots() to track if any slots
- *                               could not be persisted and need to be retried.
+ * If slot_persistence_pending is not NULL, it will be set to true if one or
+ * more slots could not be persisted.
```

The changed version loses the meaning of retry. So, suggesting:
```
 * If slot_persistence_pending is not NULL, it will be set to true if one
 * or more slots could not be persisted.  This allows callers such as
 * pg_sync_replication_slots() to retry those slots.
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Ajin Cherian

Дата:

12 декабря 2025 г., 03:05:04

On Thu, Dec 11, 2025 at 11:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Dec 11, 2025 at 10:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > As patch 0001 has been pushed. I've rebased and created a new version
> > v36 with the remaining patch.
> >
>
> I have made a number of changes in code comments and docs. Kindly
> review and if you are okay with these then include them in the next
> version.
>

I have included these changes as well as comments by Chao. Attaching
v37 with the changes.

regards,
Ajin Cherian
Fujitsu Australia

Вложения

v37-0001-Improve-initial-slot-synchronization-in-pg_sync_.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

shveta malik

Дата:

12 декабря 2025 г., 06:22:55

On Fri, Dec 12, 2025 at 5:35 AM Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Thu, Dec 11, 2025 at 11:23 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Dec 11, 2025 at 10:45 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > As patch 0001 has been pushed. I've rebased and created a new version
> > > v36 with the remaining patch.
> > >
> >
> > I have made a number of changes in code comments and docs. Kindly
> > review and if you are okay with these then include them in the next
> > version.
> >
>
> I have included these changes as well as comments by Chao. Attaching
> v37 with the changes.
>

Thanks. v37 LGTM.

thanks
Shveta

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Amit Kapila

Дата:

15 декабря 2025 г., 14:06:20

On Fri, Dec 12, 2025 at 8:53 AM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Fri, Dec 12, 2025 at 5:35 AM Ajin Cherian <itsajin@gmail.com> wrote:
> >
> >
> > I have included these changes as well as comments by Chao. Attaching
> > v37 with the changes.
> >
>
> Thanks. v37 LGTM.
>

Pushed.

--
With Regards,
Amit Kapila.

RE: Improve pg_sync_replication_slots() to wait for primary to advance

От

"Zhijie Hou (Fujitsu)"

Дата:

17 декабря 2025 г., 13:28:28

On Monday, December 15, 2025 7:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Dec 12, 2025 at 8:53 AM shveta malik <shveta.malik@gmail.com>
> wrote:
> >
> > On Fri, Dec 12, 2025 at 5:35 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > >
> > > I have included these changes as well as comments by Chao. Attaching
> > > v37 with the changes.
> > >
> >
> > Thanks. v37 LGTM.
> >
> 
> Pushed.

My college reported a related BF failure[1] to me off-list.

After analyzing, I think the issue is that the newly added test in
040_standby_failover_slots_sync synchronizes a replication slot to the standby
server without configuring synchronized_standby_slots. This omission allows
logical failover slots to advance beyond the designated physical replication
slot, resulting in intermittent synchronization failures.

I confirmed the same from the log where the slotsync failed due to the
reason mentioned above:

--
2025-12-15 12:30:33.502 CET [3015371][client backend][1/2:0] ERROR:  skipping slot synchronization because the received
slotsync LSN 0/06017C90 for slot "lsub1_slot" is ahead of the standby position 0/06017C58

2025-12-15 12:30:33.502 CET [3015371][client backend][1/2:0] STATEMENT:  SELECT pg_sync_replication_slots();
--

Here is a small patch to fix it.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2025-12-15%2011%3A25%3A38

Best Regards,
Hou zj

Вложения

v1-0001-Fix-an-intermittent-BF-failure.patch

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Andres Freund

Дата:

17 декабря 2025 г., 20:09:25

Hi,

On 2025-12-17 10:28:28 +0000, Zhijie Hou (Fujitsu) wrote:
> On Monday, December 15, 2025 7:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > 
> > On Fri, Dec 12, 2025 at 8:53 AM shveta malik <shveta.malik@gmail.com>
> > wrote:
> > >
> > > On Fri, Dec 12, 2025 at 5:35 AM Ajin Cherian <itsajin@gmail.com> wrote:
> > > >
> > > >
> > > > I have included these changes as well as comments by Chao. Attaching
> > > > v37 with the changes.
> > > >
> > >
> > > Thanks. v37 LGTM.
> > >
> > 
> > Pushed.
> 
> My college reported a related BF failure[1] to me off-list.

FWIW, this also fails semi-regularly in CI. E.g.
https://cirrus-ci.com/task/6281872222715904
https://cirrus-ci.com/task/5243530626465792

Greetings,

Andres Freund

Re: Improve pg_sync_replication_slots() to wait for primary to advance

От

Amit Kapila

Дата:

18 декабря 2025 г., 14:39:37

On Wed, Dec 17, 2025 at 3:58 PM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Here is a small patch to fix it.
>

Thanks, I've pushed the patch. BTW, looking at the code of slot_sync
API code path, I could think of the following improvements.

*
if (remote_slot->confirmed_lsn > latestFlushPtr)
{
update_slotsync_skip_stats(SS_SKIP_WAL_NOT_FLUSHED);

/*
* Can get here only if GUC 'synchronized_standby_slots' on the
* primary server was not configured correctly.
*/
ereport(AmLogicalSlotSyncWorkerProcess() ? LOG : ERROR,
errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),

Can we change this ERROR to LOG even for API as now the API also
retires to sync the slots during initial sync?

* The use of the slot_persistence_pending flag in the internal APIs
seems to be the reverse of what it should be. I mean to say that
initially it should be true and when we actually persist the slot then
we can set it to false.

* We can retry to sync all the slots present in the primary at the
start of API, not only temporary slots. If we do this then the
previous point may not be required. Also, please mention something
like: "It retries cyclically until all the failover slots that existed
on primary at the start of the function call are synchronized." in the
function description [1] as well.

[1] - https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-REPLICATION
--
With Regards,
Amit Kapila.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Re: Improve pg_sync_replication_slots() to wait for primary to advance

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения