Discussion: Improving the latch handling between logical replication launcher and worker processes.

Hi,

Currently the launcher's latch is used for the following: a) worker
process attach, b) worker process exit, and c) subscription creation.
Since the same latch is used for all of these cases, the launcher
process cannot handle concurrent scenarios such as: a) the launcher
has started a new apply worker and is waiting for it to attach, and
b) CREATE SUBSCRIPTION sub2 sends the launcher a wakeup signal. In
this scenario both events set the launcher's latch, but the launcher
cannot tell that two things have happened: 1) the worker has attached,
and 2) a subscription has been created and an apply worker should be
started for it. As a result, the apply worker for the newly created
subscription is not started immediately; it only starts after the
timeout of 180 seconds.
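
For reference, the wait in question looks roughly like this (a
simplified sketch of WaitForReplicationWorkerAttach() in launcher.c,
not an exact copy); the ResetLatch() here is where a wakeup from a
concurrent CREATE SUBSCRIPTION can be silently consumed:

    /* Simplified sketch of WaitForReplicationWorkerAttach() */
    for (;;)
    {
        int     rc;

        CHECK_FOR_INTERRUPTS();

        /* ... return once the worker has attached or has died ... */

        /*
         * We need a timeout because we generally don't get notified via
         * latch about the worker attach.
         */
        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                       10L, WAIT_EVENT_BGWORKER_STARTUP);

        if (rc & WL_LATCH_SET)
        {
            /*
             * This also clears a wakeup coming from a concurrent CREATE
             * SUBSCRIPTION, so ApplyLauncherMain() never sees it and
             * sleeps for the full nap time.
             */
            ResetLatch(MyLatch);
            CHECK_FOR_INTERRUPTS();
        }
    }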

I have started a new thread for this based on suggestions at [1].

We could improve this in one of the following ways:
a) Introduce a new latch to handle worker attach and exit.
b) Add a new GUC launcher_retry_time which gives more flexibility to
users as suggested by Amit at [1]. Before commit 5a3a953,
wal_retrieve_retry_interval played a similar role to the suggested new
GUC launcher_retry_time; e.g., even if a worker was launched, the
launcher only waited wal_retrieve_retry_interval before the next round.
c) Don't reset the latch at worker attach and allow launcher main to
identify and handle it. For this there is a patch v6-0002 available at
[2].

I'm not sure which approach is better in this case. I was thinking
we could add a new latch to handle worker attach and exit.
Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1KR29XfBi5rObgV06xcBLn7y%2BLCuxcSMdKUkKZK740L2w%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CALDaNm10R7L0Dxq%2B-J%3DPp3AfM_yaokpbhECvJ69QiGH8-jQquw%40mail.gmail.com

Regards,
Vignesh



RE: Improving the latch handling between logical replication launcher and worker processes.

From: "Zhijie Hou (Fujitsu)"
On Thursday, April 25, 2024 4:59 PM vignesh C <vignesh21@gmail.com> wrote:
> 
> Hi,
> 
> Currently the launcher's latch is used for the following: a) worker process attach,
> b) worker process exit, and c) subscription creation.
> Since the same latch is used for all of these cases, the launcher process cannot
> handle concurrent scenarios such as: a) the launcher has started a new apply worker
> and is waiting for it to attach, and b) CREATE SUBSCRIPTION sub2 sends the
> launcher a wakeup signal. In this scenario both events set the launcher's latch,
> but the launcher cannot tell that two things have happened: 1) the worker has
> attached, and 2) a subscription has been created and an apply worker should be
> started for it. As a result, the apply worker for the newly created subscription
> is not started immediately; it only starts after the timeout of 180 seconds.
> 
> I have started a new thread for this based on suggestions at [1].
> 
> a) Introduce a new latch to handle worker attach and exit.

I found that the startup process also uses two latches (see recoveryWakeupLatch) for
different purposes, so maybe this is OK. But note that both the logical launcher
and the apply worker call logicalrep_worker_launch(); if we only add one new
latch, both processes will wait on the same new latch, and the launcher
may consume a SetLatch that should have been consumed by the apply
worker (although this is rare).
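
For reference, the startup-process precedent looks roughly like this (a
sketch from memory of xlogrecovery.c, not an exact copy): a dedicated
latch lives in shared memory, is owned by the process that waits on it,
and other processes wake it explicitly:

    /* At shared-memory initialization */
    InitSharedLatch(&XLogRecoveryCtl->recoveryWakeupLatch);

    /* In the startup process, before waiting on it */
    OwnLatch(&XLogRecoveryCtl->recoveryWakeupLatch);

    /* Other processes wake the startup process via WakeupRecovery(),
     * which is essentially: */
    SetLatch(&XLogRecoveryCtl->recoveryWakeupLatch);

A launcher-specific latch for worker attach/exit could follow the same
shape, but as noted above, the sharing between the launcher and apply
workers (both call logicalrep_worker_launch()) would need care.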

> b) Add a new GUC launcher_retry_time which gives more flexibility to users as
> suggested by Amit at [1]. Before 5a3a953, the wal_retrieve_retry_interval plays
> a similar role as the suggested new GUC launcher_retry_time, e.g. even if a
> worker is launched, the launcher only wait wal_retrieve_retry_interval time
> before next round.

IIUC, the issue does not happen frequently and may not be noticeable where
apply workers are not restarted often. So I am slightly unsure whether it's
worth adding a new GUC.

> c) Don't reset the latch at worker attach and allow launcher main to identify and
> handle it. For this there is a patch v6-0002 available at [2].

This seems simple. I found we are doing something similar in
RegisterSyncRequest() and WalSummarizerMain().
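
For reference, the retry loop in RegisterSyncRequest() looks roughly
like the following (a simplified sketch, not an exact copy; ftag, type
and retryOnError are the function's parameters): the sleep is a pure
timeout and never waits on or resets a latch, so it cannot swallow a
wakeup aimed at the process:

    /* Simplified sketch of the retry loop in RegisterSyncRequest() (sync.c) */
    for (;;)
    {
        bool    ret = ForwardSyncRequest(ftag, type);

        if (ret || !retryOnError)
            break;

        /* Timeout-only sleep: no latch is passed, so nothing is consumed. */
        WaitLatch(NULL, WL_EXIT_ON_PM_DEATH | WL_TIMEOUT, 10,
                  WAIT_EVENT_REGISTER_SYNC_REQUEST);
    }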

Best Regards,
Hou zj

RE: Improving the latch handling between logical replication launcher and worker processes.

From: "Hayato Kuroda (Fujitsu)"
Dear Vignesh,

Thanks for raising these ideas!

> a) Introduce a new latch to handle worker attach and exit.

Just to confirm - there are three wait events for the launcher, so I feel we may be
able to create a latch per wait event. Is there a reason to introduce just
"a" (single) latch?

> b) Add a new GUC launcher_retry_time which gives more flexibility to
> users as suggested by Amit at [1]. Before commit 5a3a953,
> wal_retrieve_retry_interval played a similar role to the suggested new
> GUC launcher_retry_time; e.g., even if a worker was launched, the
> launcher only waited wal_retrieve_retry_interval before the next round.

Hmm. My concern is how users would estimate the value. Maybe the default will be
3 min, but should users change it? If so, to what? I think even if it becomes
tunable, they cannot control it well.

> c) Don't reset the latch at worker attach and allow launcher main to
> identify and handle it. For this there is a patch v6-0002 available at
> [2].

Do you mean that you want to remove ResetLatch() from
WaitForReplicationWorkerAttach()? If so, what about this scenario:

1) the launcher is waiting in WaitForReplicationWorkerAttach() for the worker
to attach, and 2) a subscription is created before the worker attaches. In this
case, the launcher becomes unable to sleep, because the latch is set but is
never reset. It may waste CPU time.
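
To make the concern concrete, a sketch (hypothetical, assuming only the
ResetLatch() is removed while WL_LATCH_SET is still requested):

    for (;;)
    {
        /* ... worker has not attached yet ... */

        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         10L, WAIT_EVENT_BGWORKER_STARTUP);

        /*
         * The latch was set by the concurrent CREATE SUBSCRIPTION and is
         * never reset here, so WaitLatch() returns immediately with
         * WL_LATCH_SET on every iteration until the worker finally
         * attaches -- effectively a busy loop.
         */
    }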

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 

On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
>
...
> a) Introduce a new latch to handle worker attach and exit.

IIUC there is normally only the “shared” latch or a “process local”
latch, i.e. AFAICT it is quite uncommon to invent new latches like
option a is proposing. I did not see any examples of making new
latches (e.g. MyLatchA, MyLatchB, MyLatchC) in the PostgreSQL source
other than that ‘recoveryWakeupLatch’ mentioned above by Hou-san. So
this option could be OK, but OTOH since there is hardly any precedent,
maybe that should be taken as an indication to avoid doing it this way
(???)

> b) Add a new GUC launcher_retry_time which gives more flexibility to
> users as suggested by Amit at [1].

I'm not sure that introducing a new GUC is a good option, because this
seems a rare problem to have -- so it will be hard to tune, since it
will be difficult to know you even have this problem and then
difficult to know that it is fixed. Anyway, this made me consider more
what the WaitLatch timeout value should be. Here are some thoughts:

Idea 1)

I was wondering where that DEFAULT_NAPTIME_PER_CYCLE value of 180
seconds came from, or was that just a made-up number? AFAICT it just
came into existence in the first pub/sub commit [1], but there is no
explanation for why 180s was chosen. Anyway, I assume a low value
(5s?) would be bad because it incurs unacceptable CPU usage, right?
But if 180s is too long and 5s is too short, then what does a “good”
number even look like? E.g., if 60s is deemed OK, then is there any
harm in just defining DEFAULT_NAPTIME_PER_CYCLE to be 60s and leaving
it at that?
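
In code, Idea 1 is just a one-line change in launcher.c, something like
this (60s is only an example value):

    /* launcher.c today has (roughly): #define DEFAULT_NAPTIME_PER_CYCLE 180000L */
    #define DEFAULT_NAPTIME_PER_CYCLE   60000L      /* 60 s instead of 180 s */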

Idea 2)

Another idea could be to use a “dynamic timeout” in the WaitLatch of
ApplyLauncherMain. Let me try to explain my thought bubble:
- If the preceding foreach(lc, sublist) loop launched any workers then
the WaitLatch timeout can be a much shorter 10s
- If the preceding foreach(lc, sublist) loop did NOT launch any
workers (this would be the most common case) then the WaitLatch
timeout can be the usual 180s.

IIUC this strategy will still give any recently launched workers
enough time to attach to shared memory, but it also means that any
concurrent CREATE SUBSCRIPTION will be addressed within 10s instead of
180s. Maybe this is sufficient to make an already rare problem become
insignificant.
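
A rough sketch of Idea 2 (the launched_any flag is hypothetical;
ApplyLauncherMain already computes a wait_time, so this only changes
how it is chosen):

    /* At the end of the foreach(lc, sublist) loop in ApplyLauncherMain() */
    int         rc;
    long        wait_time;

    if (launched_any)
        wait_time = 10000L;                     /* launched a worker: re-check soon */
    else
        wait_time = DEFAULT_NAPTIME_PER_CYCLE;  /* usual 180s nap */

    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   wait_time,
                   WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);

    if (rc & WL_LATCH_SET)
    {
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }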


> c) Don't reset the latch at worker attach and allow launcher main to
> identify and handle it. For this there is a patch v6-0002 available at
> [2].

This option c seems the easiest. Can you explain what the drawbacks
of using this approach are?

======
[1] github -
https://github.com/postgres/postgres/commit/665d1fad99e7b11678b0d5fa24d2898424243cd6#diff-127f8eb009151ec548d14c877a57a89d67da59e35ea09189411aed529c6341bf

Kind Regards,
Peter Smith.
Fujitsu Australia



On Fri, 10 May 2024 at 07:39, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Vignesh,
>
> Thanks for raising these ideas!
>
> > a) Introduce a new latch to handle worker attach and exit.
>
> Just to confirm - there are three wait events for the launcher, so I feel we may be
> able to create a latch per wait event. Is there a reason to introduce just
> "a" (single) latch?

One latch is enough; we can use the new latch for both worker start
and worker exit. The other existing latch can be used for other
purposes.  Something like the attached patch.
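
Roughly, the shape being discussed could look like the following sketch
(the field name worker_attach_latch is made up for illustration and is
not necessarily what the attached patch uses; latch initialization and
ownership details are glossed over):

    /* Hypothetical extra latch in the launcher's shared state */
    typedef struct LogicalRepCtxStruct
    {
        pid_t       launcher_pid;
        Latch       worker_attach_latch;    /* new: set on worker attach/exit */
        /* ... existing fields ... */
    } LogicalRepCtxStruct;

    /* Launcher waits on the new latch while a just-launched worker attaches
     * (this requires InitSharedLatch()/OwnLatch() on it first) */
    int         rc;

    rc = WaitLatch(&LogicalRepCtx->worker_attach_latch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   10L, WAIT_EVENT_BGWORKER_STARTUP);
    if (rc & WL_LATCH_SET)
        ResetLatch(&LogicalRepCtx->worker_attach_latch);

    /* Workers wake it on attach and on exit */
    SetLatch(&LogicalRepCtx->worker_attach_latch);

This way a wakeup from CREATE SUBSCRIPTION, which goes to the
launcher's process latch (MyLatch), is never consumed while waiting for
a worker to attach.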

Regards,
Vignesh

Attachments
On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> >
>
> > c) Don't reset the latch at worker attach and allow launcher main to
> > identify and handle it. For this there is a patch v6-0002 available at
> > [2].
>
> This option c seems the easiest. Can you explain what the drawbacks
> of using this approach are?

This solution will resolve the issue. However, one drawback to
consider is that because we're not resetting the latch, in this
scenario, the launcher process will need to perform an additional
round of acquiring subscription details and determining whether the
worker should start, regardless of any changes in subscriptions.

Regards,
Vignesh



On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> >
> > > c) Don't reset the latch at worker attach and allow launcher main to
> > > identify and handle it. For this there is a patch v6-0002 available at
> > > [2].
> >
> > This option c seems the easiest. Can you explain what the drawbacks
> > of using this approach are?
>
> This solution will resolve the issue. However, one drawback to
> consider is that because we're not resetting the latch, in this
> scenario, the launcher process will need to perform an additional
> round of acquiring subscription details and determining whether the
> worker should start, regardless of any changes in subscriptions.
>

Hmm. IIUC the WaitLatch of the Launcher.WaitForReplicationWorkerAttach
was not expecting to get notified.

e.g.1. The WaitLatch comment in the function says so:
/*
 * We need timeout because we generally don't get notified via latch
 * about the worker attach.  But we don't expect to have to wait long.
 */

e.g.2 The logicalrep_worker_attach() function (which is AFAIK what
WaitForReplicationWorkerAttach  was waiting for) is not doing any
SetLatch. So that matches what the comment said.

~~~

AFAICT the original problem reported by this thread happened because
the SetLatch (from CREATE SUBSCRIPTION) has been inadvertently gobbled
by the WaitForReplicationWorkerAttach.WaitLatch/ResetLatch which BTW
wasn't expecting to be notified at all.

~~~

Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach:

You said above that one drawback is "the launcher process will need to
perform an additional round of acquiring subscription details and
determining whether the worker should start, regardless of any changes
in subscriptions"

I think you mean if some CREATE SUBSCRIPTION (i.e. SetLatch) happens
during the attaching of other workers then the latch would (now after
option c) remain set, and so the WaitLatch of ApplyLauncherMain would
be notified and/or return immediately, causing an immediate
re-iteration of the "foreach(lc, sublist)" loop.

But I don't understand why that is a problem.

a) I didn't know what you meant "regardless of any changes in
subscriptions" because I think the troublesome SetLatch originated
from the CREATE SUBSCRIPTION and so there *is* a change to
subscriptions.

b) We will need to repeat that sublist loop anyway to start the worker
for the new CREATE SUBSCRIPTION, and we need to do that at the
earliest opportunity because the whole point of the SetLatch is so the
CREATE SUBSCRIPTION worker can get started promptly. So the earlier we
do that the better, right?

c) AFAICT there is no danger of accidentally trying to start workers
that are still in the middle of trying to start (from the previous
iteration), because those cases should be guarded by the
ApplyLauncherGetWorkerStartTime logic.

~~

To summarise, I felt removing the ResetLatch and the WL_LATCH_SET
bitset (like your v6-0002 patch does) is not only an easy way of
fixing the problem reported by this thread, but it also makes that
WaitForReplicationWorkerAttach code behave like the comment ("we
generally don't get notified via latch about the worker attach")
says.

(Unless there are some other problems with it that I can't see)
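
For readers following along, the kind of change being discussed would
make that wait timeout-only, something like this rough sketch (not the
actual v6-0002 patch):

    int         rc;

    /* Before (roughly, in WaitForReplicationWorkerAttach): */
    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   10L, WAIT_EVENT_BGWORKER_STARTUP);
    if (rc & WL_LATCH_SET)
    {
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }

    /* After (sketch): don't wait on -- and don't reset -- the process latch,
     * so a pending CREATE SUBSCRIPTION wakeup survives for ApplyLauncherMain. */
    (void) WaitLatch(NULL, WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     10L, WAIT_EVENT_BGWORKER_STARTUP);
    CHECK_FOR_INTERRUPTS();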

======
Kind Regards,
Peter Smith.
Fujitsu Australia



On Thu, 30 May 2024 at 08:46, Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Wed, May 29, 2024 at 7:53 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Wed, 29 May 2024 at 10:41, Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > On Thu, Apr 25, 2024 at 6:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > >
> > > > c) Don't reset the latch at worker attach and allow launcher main to
> > > > identify and handle it. For this there is a patch v6-0002 available at
> > > > [2].
> > >
> > > This option c seems the easiest. Can you explain what the drawbacks
> > > of using this approach are?
> >
> > This solution will resolve the issue. However, one drawback to
> > consider is that because we're not resetting the latch, in this
> > scenario, the launcher process will need to perform an additional
> > round of acquiring subscription details and determining whether the
> > worker should start, regardless of any changes in subscriptions.
> >
>
> Hmm. IIUC the WaitLatch of the Launcher.WaitForReplicationWorkerAttach
> was not expecting to get notified.
>
> e.g.1. The WaitLatch comment in the function says so:
> /*
>  * We need timeout because we generally don't get notified via latch
>  * about the worker attach.  But we don't expect to have to wait long.
>  */
>
> e.g.2 The logicalrep_worker_attach() function (which is AFAIK what
> WaitForReplicationWorkerAttach  was waiting for) is not doing any
> SetLatch. So that matches what the comment said.
>
> ~~~
>
> AFAICT the original problem reported by this thread happened because
> the SetLatch (from CREATE SUBSCRIPTION) has been inadvertently gobbled
> by the WaitForReplicationWorkerAttach.WaitLatch/ResetLatch which BTW
> wasn't expecting to be notified at all.
>
> ~~~
>
> Your option c removes the ResetLatch done by WaitForReplicationWorkerAttach:
>
> You said above that one drawback is "the launcher process will need to
> perform an additional round of acquiring subscription details and
> determining whether the worker should start, regardless of any changes
> in subscriptions"
>
> I think you mean if some CREATE SUBSCRIPTION (i.e. SetLatch) happens
> during the attaching of other workers then the latch would (now after
> option c) remain set, and so the WaitLatch of ApplyLauncherMain would
> be notified and/or return immediately, causing an immediate
> re-iteration of the "foreach(lc, sublist)" loop.
>
> But I don't understand why that is a problem.
>
> a) I didn't know what you meant "regardless of any changes in
> subscriptions" because I think the troublesome SetLatch originated
> from the CREATE SUBSCRIPTION and so there *is* a change to
> subscriptions.

The latch gets set as follows: upon creating a new subscription, the
launcher process requests the postmaster to start a new apply worker
process. The postmaster then starts the apply worker process and
dispatches a SIGUSR1 signal to the launcher process (this is done from
do_start_bgworker & ReportBackgroundWorkerPID). Upon receiving this
signal, the launcher process sets its latch.

Now, there are two potential scenarios:
a) Another subscription is created concurrently: the launcher
traverses the subscription list, detects the newly created
subscription, and starts a new apply worker for it. This is ok.
b) No subscription is created concurrently: since the latch was set by
the worker-start notification and is not reset, the launcher iterates
through the subscription list and finds no new subscriptions. Here
there is an additional check.

I'm talking about the second scenario, where no subscription is
concurrently created. In this case, because the latch is not reset, we
perform an additional check on the subscription list. There is no
problem with this.
This additional check can occur with the existing code too, if
WaitForReplicationWorkerAttach returns from its initial check, i.e. if
the worker has already started by the time that check happens.
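
For reference, that wakeup path is the standard background-worker
notification mechanism, roughly (a simplified sketch, not exact code):

    BackgroundWorker bgw;
    BackgroundWorkerHandle *bgw_handle;

    /* 1. logicalrep_worker_launch() fills in bgw and asks to be notified
     *    about the worker's state changes: */
    bgw.bgw_notify_pid = MyProcPid;
    RegisterDynamicBackgroundWorker(&bgw, &bgw_handle);

    /* 2. Once the postmaster has started the worker,
     *    ReportBackgroundWorkerPID() (bgworker.c) signals the registrant: */
    kill(rw->rw_worker.bgw_notify_pid, SIGUSR1);

    /* 3. In the launcher, procsignal_sigusr1_handler() ends with: */
    SetLatch(MyLatch);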

Regards,
Vignesh