Allow users to choose what happens when recovery target is not reached

From: Bharath Rupireddy
Subject: Allow users to choose what happens when recovery target is not reached
Msg-id CALj2ACWR4iaph7AWCr5-V9dXqpf2p5B=3fTyvLfL8VD_E+x0tA@mail.gmail.com
Replies: Re: Allow users to choose what happens when recovery target is not reached  (Julien Rouhaud <rjuju123@gmail.com>)
List: pgsql-hackers
Hi,

Currently, the server shuts down with a FATAL error (added by commit
[1]) when the recovery target isn't reached. This can cause a server
availability problem, especially in disaster recovery (geo-restore)
scenarios where the primary is down and the user is doing a PITR on a
server in another region that failed to receive a few of the last WAL
files required to reach the recovery target. In this case, users might
prefer an available server to no server at all. Since commit [1],
there's no way to get that behavior.

There can be many reasons why the last few WAL files don't reach the
target server where the user is performing the PITR: the primary may
have gone down before archiving the last few WAL files to the archive
location; the archive command may fail for some reason; there may be
network latency from the primary to the archive location, or from the
archive location to the target server; the restore command on the
target server may fail; or users may have chosen a wrong or future
recovery target. If the PITR fails with a FATAL error, we may ask users
to restart the server, but imagine the wasted compute resources: if
1 TB of WAL files has to be replayed and just the last 16MB WAL file is
missing, everything has to be replayed from the beginning.
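To put the 1 TB example in numbers, here is a back-of-the-envelope sketch (not part of the patch). It assumes binary units (1 TB = 2^40 bytes) and the default 16MB WAL segment size, and counts the segment files involved and how much replay is repeated, in the scenario described above, when only the last segment is missing.

```python
# Back-of-the-envelope arithmetic for "1 TB of WAL, last 16MB segment missing".
# Assumptions: binary units (1 TB = 2**40 bytes) and the default 16MB segment size.
WAL_SEGMENT_BYTES = 16 * 2**20  # 16MB per WAL segment file
TOTAL_WAL_BYTES = 2**40         # 1 TB of WAL to replay


def segments_needed(total_bytes: int, segment_bytes: int) -> int:
    """Number of WAL segment files needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // segment_bytes)


total = segments_needed(TOTAL_WAL_BYTES, WAL_SEGMENT_BYTES)
# Every segment except the missing last one is replayed before the FATAL error,
# and, per the scenario above, all of them are replayed again after a restart.
replayed_before_failure = total - 1
wasted_fraction = replayed_before_failure / total

print(f"{total} segments, {replayed_before_failure} replayed before the FATAL error")
print(f"{wasted_fraction:.4%} of the replay work is repeated after a restart")
```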

Here's a proposal (and a patch) to add a GUC so that, when the recovery
target isn't reached, users can choose either to emit a warning and
promote, or to shut down with a FATAL error (the default). In practice,
users can choose to shut down with a FATAL error if strict consistency
is a necessity; otherwise, they can choose to promote if availability
is preferred. There is some discussion around this idea in [2].
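A minimal sketch of the decision the proposed GUC would control, modeled in Python rather than the patch's C code. The GUC's actual name and values are not stated here, so `RecoveryTargetNotReachedAction`, its `FATAL`/`PROMOTE` members, and `end_of_recovery()` are hypothetical names used only for illustration. When the target is reached, the existing `recovery_target_action` setting is assumed to keep deciding what happens.

```python
from enum import Enum


class RecoveryTargetNotReachedAction(Enum):
    """Hypothetical values for the proposed GUC (illustrative names only)."""
    FATAL = "fatal"      # default: keep the behavior introduced by commit [1]
    PROMOTE = "promote"  # log a WARNING and promote anyway


# The FATAL message the server emits today when the target isn't reached.
FATAL_MSG = "recovery ended before configured recovery target was reached"


def end_of_recovery(
    target_reached: bool,
    action: RecoveryTargetNotReachedAction = RecoveryTargetNotReachedAction.FATAL,
) -> tuple[str, str]:
    """Return (log level, outcome) at the end of archive recovery."""
    if target_reached:
        # Unchanged by the proposal: recovery_target_action (pause/promote/shutdown) applies.
        return ("LOG", "apply recovery_target_action")
    if action is RecoveryTargetNotReachedAction.PROMOTE:
        # Availability preferred: warn, then promote with the WAL replayed so far.
        return ("WARNING", "promote")
    # Strict consistency preferred (the default): refuse to come up.
    return ("FATAL", "shut down")


for choice in RecoveryTargetNotReachedAction:
    level, outcome = end_of_recovery(target_reached=False, action=choice)
    print(f"{choice.value}: {level}: {FATAL_MSG} -> {outcome}")
```

Keeping `FATAL` as the default preserves the behavior of commit [1] for anyone who does not set the new option; only users who explicitly prefer availability get the promote-with-warning path.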

Thoughts?

[1] - commit dc788668bb269b10a108e87d14fefd1b9301b793
Author: Peter Eisentraut <peter@eisentraut.org>
Date:   Wed Jan 29 15:43:32 2020 +0100

    Fail if recovery target is not reached

    Before, if a recovery target is configured, but the archive ended
    before the target was reached, recovery would end and the server would
    promote without further notice.  That was deemed to be pretty wrong.
    With this change, if the recovery target is not reached, it is a fatal
    error.

    Based-on-patch-by: Leif Gunnar Erlandsen <leif@lako.no>
    Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
    Discussion: https://www.postgresql.org/message-id/flat/993736dd3f1713ec1f63fc3b653839f5@lako.no

[2] - https://www.postgresql.org/message-id/b334d61396e6b0657a63dc38e16d429703fe9b96.camel%40j-davis.com

Regards,
Bharath Rupireddy.

Attachments
