Discussion: add retry mechanism for achieving recovery target before emitting FATAL error "recovery ended before configured recovery target was reached"

Hi,

The FATAL error "recovery ended before configured recovery target
was reached", introduced in PG 14 by the commit at [1], causes the
standby to go down after it has spent a good amount of time in
recovery. There can be cases where the required WAL (for reaching the
recovery target) takes time to arrive from the archive location to
the standby, and having the standby fail with the FATAL error in the
meantime isn't good. Instead, how about making the standby wait for a
certain amount of time (controlled by a GUC) so that it can keep
looking for the required WAL? If the required WAL arrives during the
wait, the standby reaches the recovery target (no FATAL error, of
course). If it doesn't, the timeout expires and the standby fails
with the FATAL error. The value of the new GUC can probably be set to
the average time it takes for WAL to travel from the primary to the
archive location plus from the archive location to the standby;
default 0, i.e. disabled.
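
Roughly, the shape I have in mind (illustrative only; the GUC name is
made up and this isn't necessarily what the attached patch does):

    /* Instead of immediately erroring out at the end of available WAL: */
    if (recovery_target_retry_timeout > 0)
    {
        TimestampTz start = GetCurrentTimestamp();

        while (!TimestampDifferenceExceeds(start, GetCurrentTimestamp(),
                                           recovery_target_retry_timeout))
        {
            /* Sleep a bit; the latch wakes us early if new WAL arrives. */
            (void) WaitLatch(&XLogCtl->recoveryWakeupLatch,
                             WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                             5000L,
                             WAIT_EVENT_RECOVERY_RETRIEVE_RETRY_INTERVAL);
            ResetLatch(&XLogCtl->recoveryWakeupLatch);

            /*
             * Re-check for the WAL needed to reach the target (e.g. by
             * re-running restore_command for the next segment); if it is
             * now available, resume replay instead of falling through.
             */
        }
    }
    ereport(FATAL,
            (errmsg("recovery ended before configured recovery target was reached")));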

I'm attaching a WIP patch. I've tested it on my dev system and the
recovery regression tests are passing with it. I will provide a better
version later, probably with a test case.

Thoughts?

[1] commit dc788668bb269b10a108e87d14fefd1b9301b793

Author: Peter Eisentraut <peter@eisentraut.org>
Date:   Wed Jan 29 15:43:32 2020 +0100

    Fail if recovery target is not reached

    Before, if a recovery target is configured, but the archive ended
    before the target was reached, recovery would end and the server would
    promote without further notice.  That was deemed to be pretty wrong.
    With this change, if the recovery target is not reached, it is a fatal
    error.

    Based-on-patch-by: Leif Gunnar Erlandsen <leif@lako.no>
    Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
    Discussion:
https://www.postgresql.org/message-id/flat/993736dd3f1713ec1f63fc3b653839f5@lako.no

Regards,
Bharath Rupireddy.

On Wed, 2021-10-20 at 21:35 +0530, Bharath Rupireddy wrote:
> The FATAL error "recovery ended before configured recovery target
> was reached", introduced in PG 14 by the commit at [1], causes the
> standby to go down after it has spent a good amount of time in
> recovery. There can be cases where the required WAL (for reaching the
> recovery target) takes time to arrive from the archive location to
> the standby, and having the standby fail with the FATAL error in the
> meantime isn't good. Instead, how about making the standby wait for a
> certain amount of time (controlled by a GUC) so that it can keep
> looking for the required WAL?

How is archiving configured, and would it be possible to introduce
logic into the restore_command to handle slow-to-arrive WAL?

Regards,
    Jeff Davis





On Fri, Oct 22, 2021 at 5:54 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2021-10-20 at 21:35 +0530, Bharath Rupireddy wrote:
> > The FATAL error "recovery ended before configured recovery target
> > was reached", introduced in PG 14 by the commit at [1], causes the
> > standby to go down after it has spent a good amount of time in
> > recovery. There can be cases where the required WAL (for reaching the
> > recovery target) takes time to arrive from the archive location to
> > the standby, and having the standby fail with the FATAL error in the
> > meantime isn't good. Instead, how about making the standby wait for a
> > certain amount of time (controlled by a GUC) so that it can keep
> > looking for the required WAL?
>
> How is archiving configured, and would it be possible to introduce
> logic into the restore_command to handle slow-to-arrive WAL?

Thanks Jeff!

If the suggestion is to embed the wait-and-retry logic in the
user-written restore_command, IMHO that's not a good idea: the
restore_command is external to core PG, while the FATAL error
"recovery ended before configured recovery target was reached" is an
internal thing. Having the retry logic (controlled by a GUC) within
core, triggered when the startup process reaches the end of available
WAL before the target, is a better way and something core PG can
offer. With this, the work the standby has already spent in recovery
isn't wasted, provided the GUC is set to a sensible value. The
optimal value is the average time it takes for WAL to travel from the
primary to the archive location plus from the archive location to the
standby. By default the new GUC is disabled (value 0), so whoever
wants the behavior can enable it.

Regards,
Bharath Rupireddy.



On Fri, 2021-10-22 at 15:34 +0530, Bharath Rupireddy wrote:
> If the suggestion is to embed the wait-and-retry logic in the
> user-written restore_command, IMHO that's not a good idea: the
> restore_command is external to core PG, while the FATAL error
> "recovery ended before configured recovery target was reached" is an
> internal thing.

It seems likely that you'd want to tweak the exact behavior for the
given system. For instance, if the files are making some progress and
you can estimate that in two more minutes everything will be fine,
then you may be willing to wait those two minutes. But if no progress
has happened since recovery began 15 minutes ago, you may want to
fail immediately.

All of this nuance would be better captured in a specialized script
than a generic timeout in the server code.
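
As a minimal sketch of such a wrapper (assuming the archive is a
locally mounted directory; the path, retry count and interval are
made-up examples, not recommendations):

    /*
     * retry_restore.c -- minimal sketch of a restore_command wrapper
     * that waits for slow-to-arrive WAL.
     *
     *   build:  cc -o retry_restore retry_restore.c
     *   use:    restore_command = '/usr/local/bin/retry_restore %f "%p"'
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ARCHIVE_DIR  "/mnt/archive"
    #define MAX_TRIES    30     /* 30 tries x 10 s = give up after ~5 min */
    #define SLEEP_SECS   10

    int
    main(int argc, char **argv)
    {
        char    src[4096];
        char    cmd[8192];

        if (argc != 3)          /* argv[1] = %f, argv[2] = %p */
            return 1;

        snprintf(src, sizeof(src), "%s/%s", ARCHIVE_DIR, argv[1]);

        for (int i = 0; i < MAX_TRIES; i++)
        {
            /* Segment has arrived: hand it to the server and exit. */
            if (access(src, R_OK) == 0)
            {
                snprintf(cmd, sizeof(cmd), "cp '%s' '%s'", src, argv[2]);
                return system(cmd) == 0 ? 0 : 1;
            }
            sleep(SLEEP_SECS);
        }
        return 1;               /* nonzero tells recovery "not available" */
    }

This is also where a progress-based policy could live (for example,
giving up immediately if nothing has arrived since recovery started),
since the wrapper sees every segment request.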

What do you want to do after the timeout happens? If you want to issue
a WARNING instead of failing outright, perhaps that makes sense for
exploratory PITR cases. That could be a simple boolean GUC without
needing to introduce the timeout logic into the server.

I think it's an interesting point that it can be hard to choose a
reasonable recovery target if the system is completely down. We could
use some better tooling or metadata around the LSNs, XIDs or
timestamp ranges available in a pg_wal directory or an archive. Even
better would be to see the available named restore points. This would
make it easier to calculate how long recovery might take for a given
restore point, or whether it's not going to work at all because
there's not enough WAL.

Regards,
    Jeff Davis





On Sat, Oct 23, 2021 at 1:46 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Fri, 2021-10-22 at 15:34 +0530, Bharath Rupireddy wrote:
> > If the suggestion is to embed the wait-and-retry logic in the
> > user-written restore_command, IMHO that's not a good idea: the
> > restore_command is external to core PG, while the FATAL error
> > "recovery ended before configured recovery target was reached" is an
> > internal thing.
>
> What do you want to do after the timeout happens? If you want to issue
> a WARNING instead of failing outright, perhaps that makes sense for
> exploratory PITR cases. That could be a simple boolean GUC without
> needing to introduce the timeout logic into the server.

If the suggestion is to give the user more control over what happens
to the standby after the timeout, then two new GUCs,
recovery_target_retry_timeout (int) and
recovery_target_continue_after_timeout (bool), would let users choose
what they want. I'm not sure whether it is okay to add two new GUCs;
let's hear what other hackers think about this.
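
For illustration, the guc.c entries could look roughly like this
(entirely hypothetical; the int entry would go in ConfigureNamesInt
and the bool one in ConfigureNamesBool):

    /* Hypothetical; nothing below exists in PostgreSQL today. */
    {
        {"recovery_target_retry_timeout", PGC_POSTMASTER, WAL_RECOVERY_TARGET,
            gettext_noop("Time to keep retrying missing WAL before giving up "
                         "on reaching the recovery target."),
            NULL,
            GUC_UNIT_MS
        },
        &recovery_target_retry_timeout,
        0, 0, INT_MAX,          /* 0 = disabled, i.e. today's behavior */
        NULL, NULL, NULL
    },

    {
        {"recovery_target_continue_after_timeout", PGC_POSTMASTER, WAL_RECOVERY_TARGET,
            gettext_noop("Promote with a WARNING, instead of failing with FATAL, "
                         "when the retry timeout expires."),
            NULL
        },
        &recovery_target_continue_after_timeout,
        false,
        NULL, NULL, NULL
    },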

> I think it's an interesting point that it can be hard to choose a
> reasonable recovery target if the system is completely down. We could
> use some better tooling or metadata around the LSNs, XIDs or
> timestamp ranges available in a pg_wal directory or an archive. Even
> better would be to see the available named restore points. This would
> make it easier to calculate how long recovery might take for a given
> restore point, or whether it's not going to work at all because
> there's not enough WAL.

I think pg_waldump can help here with some exploratory analysis of
the WAL available in the directory where the WAL files are present.
Since it is an independent C program, it can run even when the server
is down, and it can also be pointed at the archive location.
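
For example (path and segment names are placeholders):

    pg_waldump -p /mnt/archive --stats 000000010000000000000001 00000001000000000000000A

summarizes what a range of archived segments contains, and

    pg_waldump -p /mnt/archive -r Transaction 00000001000000000000000A

prints the commit and abort records, whose timestamps are what a
recovery_target_time can actually hit.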

Regards,
Bharath Rupireddy.



On Sat, 2021-10-23 at 09:31 +0530, Bharath Rupireddy wrote:
> If you are suggesting ...

Your complaint seems to be coming from commit dc788668, so the most
direct answer would be to make the old behavior configurable, not to
invent a new timeout behavior.

If I understand correctly, you are doing PITR from an archive, right?
So would restore_command be a reasonable place for the timeout?

And can you provide some approximate numbers to help me understand
where the timeout would be helpful? E.g. you have W GB of WAL to
replay, and restore would take X minutes, but some WAL is missing so
you fail after X-Y minutes, whereas if you had timeout Z everything
would be great.
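
For instance: 100 GB of WAL to replay, restore takes about 60
minutes, one segment is slow to arrive so recovery fails at minute
50, and a timeout of 15 minutes would have bridged the gap. (Numbers
made up; I'd like to see yours.)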

> I think pg_waldump can help here with some exploratory analysis of
> the WAL available in the directory where the WAL files are present.
> Since it is an independent C program, it can run even when the server
> is down, and it can also be pointed at the archive location.

Right, it's possible to do, but I think there's room for improvement so
we don't have to always scan the WAL. I'm getting a bit off-topic from
your proposal though. I'll bring it up in another thread when my
thoughts on this are more clear.

Regards,
    Jeff Davis





At Wed, 20 Oct 2021 21:35:44 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in 
> Hi,
> 
> The FATAL error "recovery ended before configured recovery target
> was reached", introduced in PG 14 by the commit at [1], causes the
> standby to go down after it has spent a good amount of time in
> recovery. There can be cases where the required WAL (for reaching the
> recovery target) takes time to arrive from the archive location to
> the standby, and having the standby fail with the FATAL error in the
> meantime isn't good. Instead, how about making the standby wait for a
> certain amount of time (controlled by a GUC) so that it can keep
> looking for the required WAL? If the required WAL arrives during the
> wait, the standby reaches the recovery target (no FATAL error, of
> course). If it doesn't, the timeout expires and the standby fails
> with the FATAL error. The value of the new GUC can probably be set to
> the average time it takes for WAL to travel from the primary to the
> archive location plus from the archive location to the standby;
> default 0, i.e. disabled.
> 
> I'm attaching a WIP patch. I've tested it on my dev system and the
> recovery regression tests are passing with it. I will provide a better
> version later, probably with a test case.
> 
> Thoughts?

What you describe sounds like starting the server in non-hot standby
mode, fetching WAL only from the archive; the only difference is that
standby mode has no timeout.

Doesn't that configuration meet your requirements?

Or, if the timeout matters, I agree with Jeff: retrying in
restore_command looks fine.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Sat, Oct 23, 2021 at 1:46 AM Jeff Davis <pgsql@j-davis.com> wrote:
> What do you want to do after the timeout happens? If you want to issue
> a WARNING instead of failing outright, perhaps that makes sense for
> exploratory PITR cases. That could be a simple boolean GUC without
> needing to introduce the timeout logic into the server.

Thanks Jeff. I posted a patch in a separate thread [1] for a new GUC
(WARN and promote, or shut down with the FATAL error) for the case
where the recovery target isn't reached.

[1] -
https://www.postgresql.org/message-id/flat/CALj2ACWR4iaph7AWCr5-V9dXqpf2p5B%3D3fTyvLfL8VD_E%2Bx0tA%40mail.gmail.com

Regards,
Bharath Rupireddy.