Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

Поиск
Список
Период
Сортировка
От Michael Paquier
Тема Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
Дата
Msg-id ZUxMM8xZtc_yYnGr@paquier.xyz
обсуждение исходный текст
Ответ на Re: Requiring recovery.signal or standby.signal when recovering with a backup_label  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: Requiring recovery.signal or standby.signal when recovering with a backup_label  (Michael Paquier <michael@paquier.xyz>)
Список pgsql-hackers
On Wed, Nov 08, 2023 at 01:16:58PM -0500, Robert Haas wrote:
> On Tue, Oct 31, 2023 at 7:39 PM Michael Paquier <michael@paquier.xyz> wrote:
>>>> As you're telling me, and I've considered that as an option as well,
>>>> perhaps we should just consider the presence of a backup_label file
>>>> with no .signal files as a synonym of crash recovery?  In the recovery
>>>> path, currently the essence of the problem is when we do
>>>> InArchiveRecovery=true, but ArchiveRecoveryRequested=false, meaning
>>>> that it should do archive recovery but we don't want it, and that does
>>>> not really make sense.  The rest of the code sort of implies that this
>>>> is not a suported combination.  So basically, my suggestion here, is
>>>> to just replay WAL up to the end of what's in your local pg_wal/ and
>>>> hope for the best, without TLI jumps, except that we'd do nothing.
>>>
>>> This sentence seems to be incomplete.
>>
>> I've re-read it, and it looks OK to me.
>
> Well, the sentence ends with "except that we'd do nothing" and I don't
> know what that means. It would make sense to me if it said "except
> that we'd do nothing about <whatever>" or "except that we'd do nothing
> instead of <something>" but as you've written it basically seems to
> boil down to "my suggestion is to replay WAL except do nothing" which
> makes no sense. If you replay WAL, you're not doing nothing.

Sure, sorry for the confusion.  By "we'd do nothing", I mean precirely
"to take no specific action related to archive recovery and recovery
parameters at the end of recovery", meaning that a combination of
backup_label with no signal file would be the same as crash recovery,
replaying WAL up to the end of what can be found in pg_wal/, and only
that.

>>> But I was not saying we should treat the case where we have a
>>> backup_label file like crash recovery. The real question here is why
>>> we don't treat it fully like archive recovery.
>>
>> Timeline jump at the end of recovery?  Archive recovery forces a TLI
>> jump by default at the end of redo if there's a signal file, and some
>> users may not want a TLI jump by default?
>
> Uggh. I don't know what to think about that. I bet some people do want
> that, but that makes it pretty easy to end up with multiple copies of
> the same cluster running on the same TLI, too, which is not a thing
> that you really want to have happen.

Andres has mentioned upthread that this is something he's been using
to quickly be able to clone a cluster.  I would not recommend doing
that, personally, but if that's useful in some cases, well, why not.

> At the end of the day, I'm coming around to the view that the biggest
> problem here is the documentation. Nobody can really know what's
> supposed to work right now because the documentation doesn't say which
> things you are and are not allowed to do and what results you should
> expect in each case. If it did, it would be easier to discuss possible
> behavior changes. Right now, it's hard to change any code at all,
> because there's no list of supported scenarios, so you can't tell
> whether a potential change affects a scenario that somebody thinks
> should work, or only cases that nobody can possibly care about. It's
> sort of possible to reason your way through that, to an extent, but
> it's pretty hard. The fact that I didn't know that starting from a
> backup with neither recovery.signal nor standby.signal was a thing
> that anybody did or cared about is good evidence of that.

That's one problem, not all of it, because the code takes extra
assumptions around that.

> I also feel like the terminology here sometimes obscures more than it
> illuminates. For instance, it seems like ArchiveRecoveryRequested
> really means "are any signal files present?" while InArchiveRecovery
> means "are we fetching WAL from outside pg_wal rather than using
> what's in pg_wal?". But these are not obvious from the names, and
> sometimes we have additional variables with overlapping meanings, like
> readSource, which indicates whether we're reading from pg_wal, the
> archive, or the walreceiver, and yet is probably not redundant with
> InArchiveRecovery. In any event, I think that we need to start with
> the question of what behavior(s) we want to expose to users, and then
> back into the question of what internal variables and states need to
> exist in order to support that behavior. We cannot start by deciding
> what variables we'd like to get rid of and then trying to justify the
> resulting behavior changes on the grounds that they simplify the code.
> Users aren't going to like that, hackers aren't going to like that,
> and the resulting behavior probably won't be anything great.

Note as well that InArchiveRecovery is set when there's a
backup_label, but that the code would check for the existence of a
restore_command only if a signal file exists.  That's strange, but if
people have been relying on this behavior, so be it.

At this stage, it looks pretty clear to me that there's no consensus
on what to do, and nobody's happy with the proposal of this thread, so
I am going to mark it as rejected.
--
Michael

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: [PATCHES] Post-special page storage TDE support
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: Requiring recovery.signal or standby.signal when recovering with a backup_label