Re: Allow users to choose what happens when recovery target is not reached

From: Euler Taveira
Subject: Re: Allow users to choose what happens when recovery target is not reached
Msg-id 42f7e161-cbcb-42d8-acc9-3049f2275982@www.fastmail.com
In reply to: Re: Allow users to choose what happens when recovery target is not reached (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List: pgsql-hackers
On Sat, Nov 13, 2021, at 10:15 AM, Bharath Rupireddy wrote:
> Firstly, the proposed patch adds no new behaviour as such, it just
> gives the ability that is existing today on v12 and below (prior to
> commit dc78866 which went into v13 and later).

It reintroduces an awkward behavior [1].

> I think performing PITR is the user's wish - whether the primary is
> available or not, it is completely the user's choice. The user might
> start the PITR, when the primary is available, thinking that it sends
> all the WAL files required for achieving recovery target. But imagine
> a disaster happens and the primary server crashes, say the recovery
> has replayed a huge bunch of WAL records (a TB may be), and the
> primary failed without sending the last one or few WAL files, should
> the PITR target server be failing this case after replaying a huge
> bunch of WAL records? The user might want the target server to be
> available instead of FATALly shutting down. This is the exact problem
> the proposed patch is trying to solve.

Are you archiving on the primary server? You are risking your customer's
business by suggesting such a setup. You should store the WAL files on your
backup server.

It seems your setup has a flaw. You set a recovery target but then accept an
outcome that is not what you initially asked for. If it is a real PITR, that is
awkward, as Peter [1] said. You could validate your recovery settings by
checking the timestamp of the last WAL file as a rough approximation of the
maximum recovery target time. The other option is to run pg_waldump to obtain
the last commit timestamp.
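
A minimal sketch of the pg_waldump check, in Python. It assumes commit records
appear in the dump as "desc: COMMIT <timestamp> <zone>" and that the target is
expressed in the same time zone as the dump. The function names are
hypothetical, and the comparison is only a necessary condition (a target later
than the last commit cannot be reached), not a guarantee.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Assumed shape of a commit record line, e.g. taken from
#   pg_waldump --rmgr=Transaction <startseg> <endseg>
# rmgr: Transaction ... desc: COMMIT 2021-11-13 10:15:00.123456 UTC
COMMIT_RE = re.compile(
    r"desc: COMMIT (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)?)"
)


def last_commit_timestamp(lines: Iterable[str]) -> Optional[datetime]:
    """Return the latest commit timestamp in the dump, or None if none found.

    The time zone abbreviation is ignored (assumption: the caller compares
    against a target given in the same zone).
    """
    latest = None
    for line in lines:
        match = COMMIT_RE.search(line)
        if match is None:
            continue
        text = match.group(1)
        fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in text else "%Y-%m-%d %H:%M:%S"
        stamp = datetime.strptime(text, fmt)
        if latest is None or stamp > latest:
            latest = stamp
    return latest


def target_may_be_reachable(lines: Iterable[str], target: datetime) -> bool:
    """Rough check: False means the available WAL ends before the target."""
    latest = last_commit_timestamp(lines)
    return latest is not None and latest >= target
```

Running this against the archived segments before starting recovery would flag
a target that lies beyond the last archived commit, instead of finding that
out from a FATAL message after a long replay.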

If you care about your customer's data, you won't use such an option. Otherwise,
I repeat Julien's question [2]: isn't it better to simply not specify a target
and let the recovery go as far as possible?

> As I said earlier, the behaviour is not too dangerous as it is not
> something new that the patch is proposing, it exists today in v12 and
> below. In fact, it gives a way out of a "dangerous situation" if the
> user ever gets stuck in it without wasting recovery cycles and compute
> resources, by quickly getting the database to be available (of course,
> the responsibility lies with the user to deal with the missing WAL
> files).

With your proposal, it seems the user is shooting in the dark. A FATAL message
means the user missed the target. Even then, the user can accept the situation,
remove the target parameters, and start the server again. I think promote or
even pause might lead to incorrect expectations (if the user doesn't carefully
inspect the log messages).

A disadvantage of this proposal: suppose you set it to 'promote' and start the
recovery, and the server gets promoted before reaching the target. While
inspecting your server configuration, you realize that you were pointing to the
wrong archive, or that the WAL files were not available in time (due to timing
issues). You have no option but to start from scratch.



--
Euler Taveira
