Discussion: add retry mechanism for achieving recovery target before emitting FATAL error "recovery ended before configured recovery target was reached"
Hi,

The FATAL error "recovery ended before configured recovery target was reached", introduced by commit [1] in PG 14, causes the standby to go down after having spent a good amount of time in recovery. There can be cases where the required WAL (for reaching the recovery target) takes time to arrive from the archive location at the standby, and having the standby fail with the FATAL error in the meantime isn't good. Instead, how about we make the standby wait for a certain amount of time (controlled by a GUC) so that it can keep looking for the required WAL? If it gets the required WAL during the wait, it succeeds in reaching the recovery target (no FATAL error, of course). If it doesn't, the timeout occurs and the standby fails with the FATAL error. The value of the new GUC can probably be set to the average time it takes for the WAL to reach the archive location from the primary plus the time from the archive location to the standby; the default is 0, i.e. disabled.

I'm attaching a WIP patch. I've tested it on my dev system and the recovery regression tests pass with it. I will provide a better version later, probably with a test case.

Thoughts?

[1] commit dc788668bb269b10a108e87d14fefd1b9301b793
    Author: Peter Eisentraut <peter@eisentraut.org>
    Date:   Wed Jan 29 15:43:32 2020 +0100

    Fail if recovery target is not reached

    Before, if a recovery target is configured, but the archive ended
    before the target was reached, recovery would end and the server
    would promote without further notice. That was deemed to be pretty
    wrong. With this change, if the recovery target is not reached, it
    is a fatal error.

    Based-on-patch-by: Leif Gunnar Erlandsen <leif@lako.no>
    Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
    Discussion: https://www.postgresql.org/message-id/flat/993736dd3f1713ec1f63fc3b653839f5@lako.no

Regards,
Bharath Rupireddy.
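For concreteness, the proposal amounts to a setting along these lines. This is only a sketch: the GUC name follows the one floated later in the thread and exists in no released PostgreSQL.

```ini
# Hypothetical GUC from the WIP patch -- how long the startup process
# keeps retrying to obtain WAL after the archive runs out short of the
# configured recovery target.  0 (the proposed default) keeps today's
# behavior: fail immediately with "recovery ended before configured
# recovery target was reached".
recovery_target_retry_timeout = '5min'
```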
Attachments
On Wed, 2021-10-20 at 21:35 +0530, Bharath Rupireddy wrote:
> The FATAL error "recovery ended before configured recovery target
> was reached" introduced by commit at [1] in PG 14 is causing the
> standby to go down after having spent a good amount of time in
> recovery. There can be cases where the arrival of required WAL (for
> reaching recovery target) from the archive location to the standby
> may take time and meanwhile the standby failing with the FATAL error
> isn't good. Instead, how about we make the standby wait for a
> certain amount of time (with a GUC) so that it can keep looking for
> the required WAL.

How is archiving configured, and would it be possible to introduce
logic into the restore_command to handle slow-to-arrive WAL?

Regards,
	Jeff Davis
On Fri, Oct 22, 2021 at 5:54 AM Jeff Davis <pgsql@j-davis.com> wrote:
> [...]
> How is archiving configured, and would it be possible to introduce
> logic into the restore_command to handle slow-to-arrive WAL?

Thanks Jeff! If the suggestion is to have the wait-and-retry logic embedded in the user-written restore_command, IMHO that's not a good idea: the restore_command is external to core PG, while the FATAL error "recovery ended before configured recovery target was reached" is an internal thing. Having the retry logic (controlled by a GUC) within core, triggered when the startup process hits the end of recovery before the target, is a better approach and something core PG can offer. With this, the work the standby spent in recovery isn't wasted if the GUC is enabled with the right value. The optimal value someone can set is the average time it takes for the WAL to reach the archive location from the primary plus the time from the archive location to the standby. By default the new GUC is disabled (value 0), so only those who want the behavior need to set it.

Regards,
Bharath Rupireddy.
On Fri, 2021-10-22 at 15:34 +0530, Bharath Rupireddy wrote:
> If the suggestion is to have the wait and retry logic embedded into
> the user-written restore_command, IMHO, it's not a good idea as the
> restore_command is external to the core PG and the FATAL error
> "recovery ended before configured recovery target was reached" is an
> internal thing.

It seems likely that you'd want to tweak the exact behavior for the given system. For instance, if the files are making some progress and you can estimate that in two more minutes everything will be fine, you may be willing to wait those two minutes. But if no progress has happened since recovery began 15 minutes ago, you may want to fail immediately. All of this nuance would be better captured in a specialized script than in a generic timeout in the server code.

What do you want to do after the timeout happens? If you want to issue a WARNING instead of failing outright, perhaps that makes sense for exploratory PITR cases. That could be a simple boolean GUC without needing to introduce the timeout logic into the server.

I think it's an interesting point that it can be hard to choose a reasonable recovery target if the system is completely down. We could use some better tooling or metadata around the LSNs, XIDs, or timestamp ranges available in a pg_wal directory or an archive. Even better would be to see the available named restore points. This would make it easier to calculate how long recovery might take for a given restore point, or whether it's not going to work at all because there's not enough WAL.

Regards,
	Jeff Davis
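A specialized script of the kind suggested above might look like the following sketch (POSIX shell; ARCHIVE_DIR, TIMEOUT, and INTERVAL are illustrative placeholders, not PostgreSQL settings — the waiting policy is exactly the part an operator would tune or replace):

```shell
# retry_fetch FILE DEST: copy $ARCHIVE_DIR/FILE to DEST, retrying every
# $INTERVAL seconds for up to $TIMEOUT seconds, so that slow-to-arrive
# WAL doesn't immediately fail recovery.  A sketch only, under the
# assumption that the archive is a plain directory reachable via cp.
retry_fetch() {
    file=$1; dest=$2
    elapsed=0
    while [ "$elapsed" -le "${TIMEOUT:-120}" ]; do
        if [ -f "$ARCHIVE_DIR/$file" ]; then
            cp "$ARCHIVE_DIR/$file" "$dest"
            return $?            # segment arrived: hand it to recovery
        fi
        sleep "${INTERVAL:-5}"   # not there yet: wait and look again
        elapsed=$((elapsed + ${INTERVAL:-5}))
    done
    return 1                     # never arrived: normal restore failure
}
```

Saved as a script taking `%f %p`, it would be wired up as something like restore_command = 'retry_restore.sh %f %p', keeping all timeout policy outside the server.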
On Sat, Oct 23, 2021 at 1:46 AM Jeff Davis <pgsql@j-davis.com> wrote:
> What do you want to do after the timeout happens? If you want to issue
> a WARNING instead of failing outright, perhaps that makes sense for
> exploratory PITR cases. That could be a simple boolean GUC without
> needing to introduce the timeout logic into the server.

If you are suggesting giving the user more control over what should happen to the standby after the timeout, then two new GUCs, recovery_target_retry_timeout (int) and recovery_target_continue_after_timeout (bool), would really help users choose what they want. I'm not sure whether it is okay to add two new GUCs; let's hear what other hackers think about this.

> I think it's an interesting point that it can be hard to choose a
> reasonable recovery target if the system is completely down. We could
> use some better tooling or metadata around the lsns, xids or timestamp
> ranges available in a pg_wal directory or an archive. Even better would
> be to see the available named restore points. This would make it easier
> to calculate how long recovery might take for a given restore point, or
> whether it's not going to work at all because there's not enough WAL.

I think pg_waldump can help here with some exploratory analysis of the available WAL in the directory where the WAL files are present. Since it is an independent C program, it can run even when the server is down, and it can also run on the archive location.

Regards,
Bharath Rupireddy.
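As a sketch of that kind of exploratory analysis (the directory path is a placeholder, and the pg_waldump step is skipped when the binary isn't installed):

```shell
# wal_span DIR: print the oldest and newest WAL segment file names
# present in DIR, then -- if pg_waldump is available -- show the last
# few transaction-manager records of the newest segment.  Their commit
# timestamps bound what recovery_target_time can possibly reach.
wal_span() {
    dir=$1
    ls "$dir" | sort | sed -n '1p;$p'    # oldest and newest segment
    last=$(ls "$dir" | sort | tail -n 1)
    if command -v pg_waldump >/dev/null 2>&1; then
        pg_waldump -r Transaction "$dir/$last" | tail -n 5
    fi
}
```

For example, `wal_span /mnt/server/archivedir` would show at a glance whether the archive even contains WAL up to the intended target.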
On Sat, 2021-10-23 at 09:31 +0530, Bharath Rupireddy wrote:
> If you are suggesting ...

Your complaint seems to stem from commit dc788668, so the most direct answer would be to make that behavior configurable back to the old one, not to invent a new timeout behavior.

If I understand correctly, you are doing PITR from an archive, right? So would restore_command be a reasonable place for the timeout? And can you provide some approximate numbers to help me understand where the timeout would be helpful? E.g., you have W GB of WAL to replay, and restore would take X minutes, but some WAL is missing so you fail after X-Y minutes, yet with timeout Z everything would be great.

> I think pg_waldump can help here to do some exploratory analysis of
> the available WAL in the directory where the WAL files are present.
> Since it is an independent C program, it can run even when the server
> is down and also run on archive location.

Right, it's possible to do, but I think there's room for improvement so we don't always have to scan the WAL. I'm getting a bit off-topic from your proposal, though. I'll bring it up in another thread when my thoughts on this are clearer.

Regards,
	Jeff Davis
At Wed, 20 Oct 2021 21:35:44 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
> The FATAL error "recovery ended before configured recovery target was
> reached" introduced by commit at [1] in PG 14 is causing the standby
> to go down after having spent a good amount of time in recovery.
> [...]
> Instead, how about we make the standby wait for a certain amount of
> time (with a GUC) so that it can keep looking for the required WAL.

What you describe looks like starting a server in non-hot-standby mode, fetching only from the archive; the only difference is that it doesn't have a timeout. Doesn't that configuration meet your requirements?

Or, if the timeout matters, I agree with Jeff: retrying in restore_command looks fine.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
On Sat, Oct 23, 2021 at 1:46 AM Jeff Davis <pgsql@j-davis.com> wrote:
> What do you want to do after the timeout happens? If you want to issue
> a WARNING instead of failing outright, perhaps that makes sense for
> exploratory PITR cases. That could be a simple boolean GUC without
> needing to introduce the timeout logic into the server.

Thanks Jeff. I posted a patch in a separate thread [1] for a new GUC (WARN + promotion, or shutdown with FATAL error) for the case where the recovery target isn't reached.

[1] https://www.postgresql.org/message-id/flat/CALj2ACWR4iaph7AWCr5-V9dXqpf2p5B%3D3fTyvLfL8VD_E%2Bx0tA%40mail.gmail.com

Regards,
Bharath Rupireddy.