Обсуждение: Fix LOCK_TIMEOUT handling in slotsync worker

Поиск
Список
Период
Сортировка

Fix LOCK_TIMEOUT handling in slotsync worker

От
"Zhijie Hou (Fujitsu)"
Дата:
Hi,

Previously, the slotsync worker used SIGINT to receive a graceful shutdown
signal from the startup process on promotion. However, SIGINT is also used by
the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Given that the
slotsync worker can access and lock catalog tables while parsing libpq tuples,
this overlapping use of SIGINT led to the slotsync worker ignoring LOCK_TIMEOUT
signals and consequently waiting indefinitely on locks.

I can reproduce the issue by:

1) create a failover replication slot for slotsync on primary.
2) start slotsync worker on standby and uses gdb to make the slotsync
worker block before accessing pg_type catalog via walrcv_exec -> libpqrcv_exec ->
libpqrcv_processTuples -> TupleDescInitEntry -> SearchSysCache1.
3) take ACCESS EXCLUSIVE lock on pg_type on primary.
4) log standby snapshot to replicate the lock to standby.
5) release the slotsync worker, it will start waiting for the lock on pg_type to
   be released. And on HEAD, it would not be canceled by the lock_timeout
   setting.

Here is a patch to resolve this by replacing the current signal handler with the
appropriate StatementCancelHandler for SIGINT within the slotsync worker.
Furthermore, it updates the startup process to send a SIGUSR1 signal to notify
slotsync of the need to stop during promotion. The slotsync worker now stops
upon detecting that the shared memory flag (stopSignaled) is set to true.

I did not add a tap-test in the patch for now. Although feasible, it requires
a strong lock on a catalog and an injection point to control the
process.

Best Regards,
Hou zj

Вложения

Re: Fix LOCK_TIMEOUT handling in slotsync worker

От
shveta malik
Дата:
On Mon, Dec 8, 2025 at 7:34 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> Hi,
>
> Previously, the slotsync worker used SIGINT to receive a graceful shutdown
> signal from the startup process on promotion. However, SIGINT is also used by
> the LOCK_TIMEOUT handler to trigger a query-cancel interrupt. Given that the
> slotsync worker can access and lock catalog tables while parsing libpq tuples,
> this overlapping use of SIGINT led to the slotsync worker ignoring LOCK_TIMEOUT
> signals and consequently waiting indefinitely on locks.
>
> I can reproduce the issue by:
>
> 1) create a failover replication slot for slotsync on primary.
> 2) start slotsync worker on standby and uses gdb to make the slotsync
> worker block before accessing pg_type catalog via walrcv_exec -> libpqrcv_exec ->
> libpqrcv_processTuples -> TupleDescInitEntry -> SearchSysCache1.
> 3) take ACCESS EXCLUSIVE lock on pg_type on primary.
> 4) log standby snapshot to replicate the lock to standby.
> 5) release the slotsync worker, it will start waiting for the lock on pg_type to
>    be released. And on HEAD, it would not be canceled by the lock_timeout
>    setting.
>
> Here is a patch to resolve this by replacing the current signal handler with the
> appropriate StatementCancelHandler for SIGINT within the slotsync worker.
> Furthermore, it updates the startup process to send a SIGUSR1 signal to notify
> slotsync of the need to stop during promotion. The slotsync worker now stops
> upon detecting that the shared memory flag (stopSignaled) is set to true.
>
> I did not add a tap-test in the patch for now. Although feasible, it requires
> a strong lock on a catalog and an injection point to control the
> process.
>

Thanks for the patch. I agree with the issue mentioned and can
reproduce it on HEAD; verified that the patch fixes it.
The patch looks good to me.

thanks
Shveta



Re: Fix LOCK_TIMEOUT handling in slotsync worker

От
Chao Li
Дата:
Hi Zhijie,

Thanks for the patch. The change looks reasonable. ShutDownSlotSync() has set SlotSyncCtx->stopSignaled, and SIGUSR1’s
procsignal_sigusr1_handlerwill do SetLatch(MyLatch) that will in turn wake up ProcessSlotSyncInterrupts(), then hit if
(SlotSyncCtx->stopSignaled)and the slotsync worker terminates. 

> On Dec 8, 2025, at 10:04, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> <v1-0001-Fix-LOCK_TIMEOUT-handling-in-slotsync-worker.patch>

I have nit comment:

```
-    if (ShutdownRequestPending)
+    if (SlotSyncCtx->stopSignaled)
     {
         ereport(LOG,
-                errmsg("replication slot synchronization worker is shutting down on receiving SIGINT"));
+                errmsg("replication slot synchronization worker is shutting down because promotion is triggered"));
```

In the error message, “because promotion is triggered" sound a little redundant, can be just:

"replication slot synchronization worker is shutting down due to promotion"

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: Fix LOCK_TIMEOUT handling in slotsync worker

От
Amit Kapila
Дата:
On Mon, Dec 8, 2025 at 1:06 PM Chao Li <li.evan.chao@gmail.com> wrote:
>
> I have nit comment:
>
> ```
> -       if (ShutdownRequestPending)
> +       if (SlotSyncCtx->stopSignaled)
>         {
>                 ereport(LOG,
> -                               errmsg("replication slot synchronization worker is shutting down on receiving
SIGINT"));
> +                               errmsg("replication slot synchronization worker is shutting down because promotion is
triggered"));
> ```
>
> In the error message, “because promotion is triggered" sound a little redundant, can be just:
>
> "replication slot synchronization worker is shutting down due to promotion"
>

We have a number of existing similar messages like:  "logical
replication parallel apply worker for subscription \"%s\" will stop
because of a parameter change". So, how about: "replication slot
synchronization worker will stop because promotion is triggered"?

--
With Regards,
Amit Kapila.



Re: Fix LOCK_TIMEOUT handling in slotsync worker

От
Chao Li
Дата:

> On Dec 9, 2025, at 13:34, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Dec 8, 2025 at 1:06 PM Chao Li <li.evan.chao@gmail.com> wrote:
>>
>> I have nit comment:
>>
>> ```
>> -       if (ShutdownRequestPending)
>> +       if (SlotSyncCtx->stopSignaled)
>>        {
>>                ereport(LOG,
>> -                               errmsg("replication slot synchronization worker is shutting down on receiving
SIGINT"));
>> +                               errmsg("replication slot synchronization worker is shutting down because promotion
istriggered")); 
>> ```
>>
>> In the error message, “because promotion is triggered" sound a little redundant, can be just:
>>
>> "replication slot synchronization worker is shutting down due to promotion"
>>
>
> We have a number of existing similar messages like:  "logical
> replication parallel apply worker for subscription \"%s\" will stop
> because of a parameter change". So, how about: "replication slot
> synchronization worker will stop because promotion is triggered"?
>
> --
> With Regards,
> Amit Kapila.

Yeah, I just searched and see similar messages:

```
logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's superuser
privilegeshave been revoked 

logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser privileges
havebeen revoked 
```

I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







RE: Fix LOCK_TIMEOUT handling in slotsync worker

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tuesday, December 9, 2025 1:34 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Dec 8, 2025 at 1:06 PM Chao Li <li.evan.chao@gmail.com> wrote:
> >
> > I have nit comment:
> >
> > ```
> > -       if (ShutdownRequestPending)
> > +       if (SlotSyncCtx->stopSignaled)
> >         {
> >                 ereport(LOG,
> > -                               errmsg("replication slot synchronization worker is shutting
> down on receiving SIGINT"));
> > +                               errmsg("replication slot
> > + synchronization worker is shutting down because promotion is
> > + triggered"));
> > ```
> >
> > In the error message, “because promotion is triggered" sound a little
> redundant, can be just:
> >
> > "replication slot synchronization worker is shutting down due to promotion"
> >
> 
> We have a number of existing similar messages like:  "logical replication
> parallel apply worker for subscription \"%s\" will stop because of a parameter
> change". So, how about: "replication slot synchronization worker will stop
> because promotion is triggered"?

The suggested message looks better. Here is the updated patch which
can be applied to all branches that supports slotsync.

Best Regards,
Hou zj

Вложения

Re: Fix LOCK_TIMEOUT handling in slotsync worker

От
Amit Kapila
Дата:
On Tue, Dec 9, 2025 at 11:23 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
>
> Yeah, I just searched and see similar messages:
>
> ```
> logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's
superuserprivileges have been revoked 
>
> logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser privileges
havebeen revoked 
> ```
>
> I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?
>

My AI tool says:

Both options are grammatically correct, but the nuance differs:
"will stop because promotion is triggered"
This uses the present tense ("is triggered"), which suggests the
promotion event is happening right now, concurrently with the stopping
action.
"will stop because promotion has been triggered"
This uses the present perfect tense ("has been triggered"), which
implies the promotion event already occurred and is the reason for the
upcoming stop.

In this case, because ShutDownSlotSync() will wait for the slotsync
worker to exit, so the first one ("will stop because promotion is
triggered") fits better.

--
With Regards,
Amit Kapila.



Re: Fix LOCK_TIMEOUT handling in slotsync worker

От
Chao Li
Дата:

> On Dec 9, 2025, at 14:12, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Dec 9, 2025 at 11:23 AM Chao Li <li.evan.chao@gmail.com> wrote:
>>
>>
>> Yeah, I just searched and see similar messages:
>>
>> ```
>> logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's
superuserprivileges have been revoked 
>>
>> logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser
privilegeshave been revoked 
>> ```
>>
>> I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?
>>
>
> My AI tool says:
>
> Both options are grammatically correct, but the nuance differs:
> "will stop because promotion is triggered"
> This uses the present tense ("is triggered"), which suggests the
> promotion event is happening right now, concurrently with the stopping
> action.
> "will stop because promotion has been triggered"
> This uses the present perfect tense ("has been triggered"), which
> implies the promotion event already occurred and is the reason for the
> upcoming stop.
>
> In this case, because ShutDownSlotSync() will wait for the slotsync
> worker to exit, so the first one ("will stop because promotion is
> triggered") fits better.
>

Make sense. Then Zhijie’s v2 looks good to me.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: Fix LOCK_TIMEOUT handling in slotsync worker

От
Amit Kapila
Дата:
On Tue, Dec 9, 2025 at 11:50 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
> > On Dec 9, 2025, at 14:12, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Dec 9, 2025 at 11:23 AM Chao Li <li.evan.chao@gmail.com> wrote:
> >>
> >>
> >> Yeah, I just searched and see similar messages:
> >>
> >> ```
> >> logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's
superuserprivileges have been revoked 
> >>
> >> logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser
privilegeshave been revoked 
> >> ```
> >>
> >> I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?
> >>
> >
> > My AI tool says:
> >
> > Both options are grammatically correct, but the nuance differs:
> > "will stop because promotion is triggered"
> > This uses the present tense ("is triggered"), which suggests the
> > promotion event is happening right now, concurrently with the stopping
> > action.
> > "will stop because promotion has been triggered"
> > This uses the present perfect tense ("has been triggered"), which
> > implies the promotion event already occurred and is the reason for the
> > upcoming stop.
> >
> > In this case, because ShutDownSlotSync() will wait for the slotsync
> > worker to exit, so the first one ("will stop because promotion is
> > triggered") fits better.
> >
>
> Make sense. Then Zhijie’s v2 looks good to me.
>

Thanks for the review. Pushed.

--
With Regards,
Amit Kapila.



Re: Fix LOCK_TIMEOUT handling in slotsync worker

От
Amit Kapila
Дата:
On Tue, Dec 9, 2025 at 11:50 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
> > On Dec 9, 2025, at 14:12, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Dec 9, 2025 at 11:23 AM Chao Li <li.evan.chao@gmail.com> wrote:
> >>
> >>
> >> Yeah, I just searched and see similar messages:
> >>
> >> ```
> >> logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's
superuserprivileges have been revoked 
> >>
> >> logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser
privilegeshave been revoked 
> >> ```
> >>
> >> I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?
> >>
> >
> > My AI tool says:
> >
> > Both options are grammatically correct, but the nuance differs:
> > "will stop because promotion is triggered"
> > This uses the present tense ("is triggered"), which suggests the
> > promotion event is happening right now, concurrently with the stopping
> > action.
> > "will stop because promotion has been triggered"
> > This uses the present perfect tense ("has been triggered"), which
> > implies the promotion event already occurred and is the reason for the
> > upcoming stop.
> >
> > In this case, because ShutDownSlotSync() will wait for the slotsync
> > worker to exit, so the first one ("will stop because promotion is
> > triggered") fits better.
> >
>
> Make sense. Then Zhijie’s v2 looks good to me.
>

BTW, by mistake, I ended up pushing 0001 which I think in itself is
not a bad idea. However, we can improve it at least in HEAD as part of
patch[1] where we are making changes in the same part of code. Do you
think that is okay?

[1] - https://www.postgresql.org/message-id/CAFPTHDYHjqq53f1Cbata2MrV2nRBDe6XgxXfqv4tw4rcT2-Y8Q%40mail.gmail.com

--
With Regards,
Amit Kapila.



Re: Fix LOCK_TIMEOUT handling in slotsync worker

От
Chao Li
Дата:

> On Dec 9, 2025, at 19:39, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Dec 9, 2025 at 11:50 AM Chao Li <li.evan.chao@gmail.com> wrote:
>>
>>> On Dec 9, 2025, at 14:12, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>
>>> On Tue, Dec 9, 2025 at 11:23 AM Chao Li <li.evan.chao@gmail.com> wrote:
>>>>
>>>>
>>>> Yeah, I just searched and see similar messages:
>>>>
>>>> ```
>>>> logical replication parallel apply worker for subscription \"%s\" will stop because the subscription owner's
superuserprivileges have been revoked 
>>>>
>>>> logical replication worker for subscription \"%s\" will restart because the subscription owner's superuser
privilegeshave been revoked 
>>>> ```
>>>>
>>>> I think the new phrase is better. Maybe “is triggered” could be “has been triggered”?
>>>>
>>>
>>> My AI tool says:
>>>
>>> Both options are grammatically correct, but the nuance differs:
>>> "will stop because promotion is triggered"
>>> This uses the present tense ("is triggered"), which suggests the
>>> promotion event is happening right now, concurrently with the stopping
>>> action.
>>> "will stop because promotion has been triggered"
>>> This uses the present perfect tense ("has been triggered"), which
>>> implies the promotion event already occurred and is the reason for the
>>> upcoming stop.
>>>
>>> In this case, because ShutDownSlotSync() will wait for the slotsync
>>> worker to exit, so the first one ("will stop because promotion is
>>> triggered") fits better.
>>>
>>
>> Make sense. Then Zhijie’s v2 looks good to me.
>>
>
> BTW, by mistake, I ended up pushing 0001 which I think in itself is
> not a bad idea. However, we can improve it at least in HEAD as part of
> patch[1] where we are making changes in the same part of code. Do you
> think that is okay?
>
> [1] - https://www.postgresql.org/message-id/CAFPTHDYHjqq53f1Cbata2MrV2nRBDe6XgxXfqv4tw4rcT2-Y8Q%40mail.gmail.com
>

Sure, no problem. Thanks for taking care of my comment.


Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/