recovery starting when backup_label exists, but not recovery.signal

From: Fujii Masao

Hi,

When backup_label exists, the startup process enters archive recovery mode
even if the recovery.signal file doesn't exist. In this case, the startup process
tries to retrieve WAL files using restore_command. Then, at the beginning
of archive recovery, the contents of backup_label are copied to pg_control
and the backup_label file is removed. This is presumably intentional behavior.

But I think the problem is that, if the server shuts down during that
archive recovery, restarting the server may cause recovery to fail
because neither backup_label nor recovery.signal exists, so the server
doesn't enter archive recovery mode. Is this intentional, too? It seems not.

So the problematic scenario is:

1. The server starts with backup_label, but without recovery.signal.
2. The startup process enters archive recovery mode because
    backup_label exists.
3. The contents of backup_label are copied to pg_control and
    backup_label is deleted.
4. The server shuts down.
5. The server is restarted. Neither backup_label nor recovery.signal exists.
6. The startup process performs only crash recovery because neither backup_label
    nor recovery.signal exists. Since it cannot retrieve WAL files from the
    archive, it may fail.
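
The scenario above can be modelled in a few lines of C. This is only an
illustrative sketch, not PostgreSQL source code: the DataDirState struct, the
RecoveryMode enum and both function names are hypothetical, and the model
tracks nothing but the presence of the two files discussed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the files the startup process checks in PGDATA. */
typedef struct DataDirState
{
    bool has_backup_label;     /* backup_label is present */
    bool has_recovery_signal;  /* recovery.signal is present */
} DataDirState;

typedef enum RecoveryMode
{
    CRASH_RECOVERY,
    ARCHIVE_RECOVERY
} RecoveryMode;

/* Steps 2 and 6: either file is enough to select archive recovery. */
RecoveryMode
choose_recovery_mode(const DataDirState *state)
{
    assert(state != NULL);
    if (state->has_backup_label || state->has_recovery_signal)
        return ARCHIVE_RECOVERY;
    return CRASH_RECOVERY;
}

/* Step 3: once its contents are in pg_control, backup_label goes away. */
void
move_backup_label_away(DataDirState *state)
{
    assert(state != NULL);
    state->has_backup_label = false;
}
```

Starting from a state with only backup_label present, the first call to
choose_recovery_mode() selects archive recovery; after
move_backup_label_away(), the same call selects crash recovery, which is
the restart case in steps 5 and 6.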

One idea to fix this issue is to make step #3 above record in pg_control
that backup_label existed. Then subsequent recovery should enter archive
recovery mode if pg_control indicates that, even if neither backup_label
nor recovery.signal exists. Thoughts?
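
A rough C sketch of that idea follows. Everything in it is an assumption made
for illustration: the ClusterState struct, the backup_label_seen flag (standing
in for a new pg_control field that does not exist today) and the function names
are all hypothetical, and clearing the flag at the end of recovery is one
possible choice rather than part of the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: two files in PGDATA plus one remembered flag. */
typedef struct ClusterState
{
    bool has_backup_label;     /* backup_label is present */
    bool has_recovery_signal;  /* recovery.signal is present */
    bool backup_label_seen;    /* hypothetical flag stored in pg_control */
} ClusterState;

/* Archive recovery is chosen if either file exists or the flag is set. */
bool
wants_archive_recovery(const ClusterState *cs)
{
    assert(cs != NULL);
    return cs->has_backup_label || cs->has_recovery_signal ||
           cs->backup_label_seen;
}

/* Step 3, extended: remember that backup_label existed before it goes away. */
void
absorb_backup_label(ClusterState *cs)
{
    assert(cs != NULL);
    if (cs->has_backup_label)
    {
        cs->backup_label_seen = true;
        cs->has_backup_label = false;
    }
}

/* Assumption: the flag is cleared once recovery has completed. */
void
finish_recovery(ClusterState *cs)
{
    assert(cs != NULL);
    cs->backup_label_seen = false;
}
```

With this model, a restart between absorb_backup_label() and finish_recovery()
still selects archive recovery, because the remembered flag survives the
removal of backup_label.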

Regards,

-- 
Fujii Masao



Re: recovery starting when backup_label exists, but not recovery.signal

From: David Steele

On 9/24/19 1:25 AM, Fujii Masao wrote:
> 
> When backup_label exists, the startup process enters archive recovery mode
> even if recovery.signal file doesn't exist. In this case, the startup process
> tries to retrieve WAL files by using restore_command. Then, at the beginning
> of the archive recovery, the contents of backup_label are copied to pg_control
> and backup_label file is removed. This would be an intentional behavior.

> But I think the problem is that, if the server shuts down during that
> archive recovery, the restart of the server may cause the recovery to fail
> because neither backup_label nor recovery.signal exist and the server
> doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
> 
> So the problematic scenario is;
> 
> 1. the server starts with backup_label, but not recovery.signal.
> 2. the startup process enters an archive recovery mode because
>     backup_label exists.
> 3. the contents of backup_label are copied to pg_control and
>     backup_label is deleted.

Do you mean deleted or renamed to backup_label.old?

> 4. the server shuts down..

This happens after the cluster has reached consistency?

> 5. the server is restarted. neither backup_label nor recovery.signal exist.
> 6. the startup process starts just crash recovery because neither backup_label
>     nor recovery.signal exist. Since it cannot retrieve WAL files from archival
>     area, it may fail.

I tried a few ways to reproduce this but was not successful without
manually removing WAL.  Probably I just needed a much larger set of WAL.

I assume you have a repro?  Can you give more details?

> One idea to fix this issue is to make the above step #3 remember that
> backup_label existed, in pg_control. Then we should make the subsequent
> recovery enter an archive recovery mode if pg_control indicates that
> even if neither backup_label nor recovery.signal exist. Thought?

That seems pretty invasive to me at this stage.  I'd like to reproduce
it and see if there are alternatives.

Also, are you sure this is a new behavior?  I've been finding that some
behaviors that have existed for a long time are suddenly more apparent
or easier to hit with the new mechanism.  Examples of that are in [1].

-- 
-David
david@pgmasters.net

[1]
https://www.postgresql.org/message-id/5e6537c7-d10e-6a67-4813-bbd7455cfaf5%40pgmasters.net



Re: recovery starting when backup_label exists, but not recovery.signal

From: Masahiko Sawada

On Fri, Sep 27, 2019 at 3:36 AM David Steele <david@pgmasters.net> wrote:
>
> On 9/24/19 1:25 AM, Fujii Masao wrote:
> >
> > When backup_label exists, the startup process enters archive recovery mode
> > even if recovery.signal file doesn't exist. In this case, the startup process
> > tries to retrieve WAL files by using restore_command. Then, at the beginning
> > of the archive recovery, the contents of backup_label are copied to pg_control
> > and backup_label file is removed. This would be an intentional behavior.
>
> > But I think the problem is that, if the server shuts down during that
> > archive recovery, the restart of the server may cause the recovery to fail
> > because neither backup_label nor recovery.signal exist and the server
> > doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
> >
> > So the problematic scenario is;
> >
> > 1. the server starts with backup_label, but not recovery.signal.
> > 2. the startup process enters an archive recovery mode because
> >     backup_label exists.
> > 3. the contents of backup_label are copied to pg_control and
> >     backup_label is deleted.
>
> Do you mean deleted or renamed to backup_label.old?
>
> > 4. the server shuts down..
>
> This happens after the cluster has reached consistency?
>
> > 5. the server is restarted. neither backup_label nor recovery.signal exist.
> > 6. the startup process starts just crash recovery because neither backup_label
> >     nor recovery.signal exist. Since it cannot retrieve WAL files from archival
> >     area, it may fail.
>
> I tried a few ways to reproduce this but was not successful without
> manually removing WAL.

Hmm, me too. I think that since we enter crash recovery at step #6, we
don't retrieve WAL files from the archive.

But I did reproduce the problem Fujii-san mentioned: restarting the
server during archive recovery leads to crash recovery instead of
resuming archive recovery. That differs from the behavior in version 11
and before, and I personally think it makes the behavior worse.

Regards,

--
Masahiko Sawada



Re: recovery starting when backup_label exists, but not recovery.signal

From: Fujii Masao

On Fri, Sep 27, 2019 at 3:36 AM David Steele <david@pgmasters.net> wrote:
>
> On 9/24/19 1:25 AM, Fujii Masao wrote:
> >
> > When backup_label exists, the startup process enters archive recovery mode
> > even if recovery.signal file doesn't exist. In this case, the startup process
> > tries to retrieve WAL files by using restore_command. Then, at the beginning
> > of the archive recovery, the contents of backup_label are copied to pg_control
> > and backup_label file is removed. This would be an intentional behavior.
>
> > But I think the problem is that, if the server shuts down during that
> > archive recovery, the restart of the server may cause the recovery to fail
> > because neither backup_label nor recovery.signal exist and the server
> > doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
> >
> > So the problematic scenario is;
> >
> > 1. the server starts with backup_label, but not recovery.signal.
> > 2. the startup process enters an archive recovery mode because
> >     backup_label exists.
> > 3. the contents of backup_label are copied to pg_control and
> >     backup_label is deleted.
>
> Do you mean deleted or renamed to backup_label.old?

Sorry for the confusing wording.
I meant the following code in StartupXLOG(), which renames backup_label to
backup_label.old.

    /*
     * If there was a backup label file, it's done its job and the info
     * has now been propagated into pg_control.  We must get rid of the
     * label file so that if we crash during recovery, we'll pick up at
     * the latest recovery restartpoint instead of going all the way back
     * to the backup start point.  It seems prudent though to just rename
     * the file out of the way rather than delete it completely.
     */
    if (haveBackupLabel)
    {
        unlink(BACKUP_LABEL_OLD);
        durable_rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, FATAL);
    }

> > 4. the server shuts down..
>
> This happens after the cluster has reached consistency?

You need to shut down the server before WAL replay finishes,
regardless of whether it has reached the consistent point.

> > 5. the server is restarted. neither backup_label nor recovery.signal exist.
> > 6. the startup process starts just crash recovery because neither backup_label
> >     nor recovery.signal exist. Since it cannot retrieve WAL files from archival
> >     area, it may fail.
>
> I tried a few ways to reproduce this but was not successful without
> manually removing WAL.  Probably I just needed a much larger set of WAL.
>
> I assume you have a repro?  Can you give more details?

What I did is:

1. Start the PostgreSQL server with WAL archiving enabled.
2. Take an online backup using pg_basebackup, for example:
     $ pg_basebackup -D backup
3. Execute many write queries to generate lots of WAL files. While they run,
    perform CHECKPOINT to remove some WAL files from the pg_wal directory.
    Repeat this until you confirm that there are many WAL files
    that have already been removed from pg_wal and exist only in the archive.
4. Shut down the server.
5. Remove PGDATA and restore it from the backup.
6. Set up restore_command.
7. (Forget to create recovery.signal.)
    That is, in this scenario, you want to recover the database up to
    the latest WAL records in the archive. So you need to start archive
    recovery by setting restore_command and creating recovery.signal.
    But the problem happens when you forget to create recovery.signal.
8. Start the PostgreSQL server.
9. Shut down the server while it's restoring archived WAL files and replaying
    them. At this point, you will notice that archive recovery starts
    even though recovery.signal doesn't exist. So archived WAL files
    are successfully restored at this step.
10. Restart the PostgreSQL server. Since neither backup_label nor recovery.signal
    exists, crash recovery starts and fails to restore the archived WAL files.
    So you fail to recover the database up to the latest WAL record in the
    archive. Recovery finishes at an early point.

> > One idea to fix this issue is to make the above step #3 remember that
> > backup_label existed, in pg_control. Then we should make the subsequent
> > recovery enter an archive recovery mode if pg_control indicates that
> > even if neither backup_label nor recovery.signal exist. Thought?
>
> That seems pretty invasive to me at this stage.  I'd like to reproduce
> it and see if there are alternatives.
>
> Also, are you sure this is a new behavior?

In v11 and before, if backup_label exists but recovery.conf does not,
the startup process doesn't enter archive recovery mode. It starts
crash recovery in that case. So the behavior is somewhat different
between versions.

Regards,

-- 
Fujii Masao



Re: recovery starting when backup_label exists, but not recovery.signal

From: Fujii Masao

On Fri, Sep 27, 2019 at 4:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Sep 27, 2019 at 3:36 AM David Steele <david@pgmasters.net> wrote:
> >
> > On 9/24/19 1:25 AM, Fujii Masao wrote:
> > >
> > > When backup_label exists, the startup process enters archive recovery mode
> > > even if recovery.signal file doesn't exist. In this case, the startup process
> > > tries to retrieve WAL files by using restore_command. Then, at the beginning
> > > of the archive recovery, the contents of backup_label are copied to pg_control
> > > and backup_label file is removed. This would be an intentional behavior.
> >
> > > But I think the problem is that, if the server shuts down during that
> > > archive recovery, the restart of the server may cause the recovery to fail
> > > because neither backup_label nor recovery.signal exist and the server
> > > doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
> > >
> > > So the problematic scenario is;
> > >
> > > 1. the server starts with backup_label, but not recovery.signal.
> > > 2. the startup process enters an archive recovery mode because
> > >     backup_label exists.
> > > 3. the contents of backup_label are copied to pg_control and
> > >     backup_label is deleted.
> >
> > Do you mean deleted or renamed to backup_label.old?
> >
> > > 4. the server shuts down..
> >
> > This happens after the cluster has reached consistency?
> >
> > > 5. the server is restarted. neither backup_label nor recovery.signal exist.
> > > 6. the startup process starts just crash recovery because neither backup_label
> > >     nor recovery.signal exist. Since it cannot retrieve WAL files from archival
> > >     area, it may fail.
> >
> > I tried a few ways to reproduce this but was not successful without
> > manually removing WAL.
>
> Hmm me too. I think that since we enter crash recovery at step #6 we
> don't retrieve WAL files from archival area.
>
> But I reproduced the problem Fujii-san mentioned that the restart of
> the server during archive recovery causes to the crash recovery
> instead of resuming the archive recovery.

Yes, it's strange and unexpected to start crash recovery
when restarting archive recovery. Archive recovery should
start again in that case, I think.

Regards,

-- 
Fujii Masao



Re: recovery starting when backup_label exists, but not recovery.signal

From: David Steele

On 9/27/19 4:41 AM, Fujii Masao wrote:
> On Fri, Sep 27, 2019 at 4:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>
>> On Fri, Sep 27, 2019 at 3:36 AM David Steele <david@pgmasters.net> wrote:
>>>
>>> On 9/24/19 1:25 AM, Fujii Masao wrote:
>>>>
>>>> When backup_label exists, the startup process enters archive recovery mode
>>>> even if recovery.signal file doesn't exist. In this case, the startup process
>>>> tries to retrieve WAL files by using restore_command. Then, at the beginning
>>>> of the archive recovery, the contents of backup_label are copied to pg_control
>>>> and backup_label file is removed. This would be an intentional behavior.
>>>
>>>> But I think the problem is that, if the server shuts down during that
>>>> archive recovery, the restart of the server may cause the recovery to fail
>>>> because neither backup_label nor recovery.signal exist and the server
>>>> doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
>>>>
>>>> So the problematic scenario is;
>>>>
>>>> 1. the server starts with backup_label, but not recovery.signal.
>>>> 2. the startup process enters an archive recovery mode because
>>>>     backup_label exists.
>>>> 3. the contents of backup_label are copied to pg_control and
>>>>     backup_label is deleted.
>>>
>>> Do you mean deleted or renamed to backup_label.old?
>>>
>>>> 4. the server shuts down..
>>>
>>> This happens after the cluster has reached consistency?
>>>
>>>> 5. the server is restarted. neither backup_label nor recovery.signal exist.
>>>> 6. the startup process starts just crash recovery because neither backup_label
>>>>     nor recovery.signal exist. Since it cannot retrieve WAL files from archival
>>>>     area, it may fail.
>>>
>>> I tried a few ways to reproduce this but was not successful without
>>> manually removing WAL.
>>
>> Hmm me too. I think that since we enter crash recovery at step #6 we
>> don't retrieve WAL files from archival area.
>>
>> But I reproduced the problem Fujii-san mentioned that the restart of
>> the server during archive recovery causes to the crash recovery
>> instead of resuming the archive recovery.
> 
> Yes, it's strange and unexpected to start crash recovery
> when restarting archive recovery. Archive recovery should
> start again in that case, I think.

+1

-- 
-David
david@pgmasters.net



Re: recovery starting when backup_label exists, but not recovery.signal

From: David Steele

On 9/27/19 4:34 AM, Fujii Masao wrote:
> On Fri, Sep 27, 2019 at 3:36 AM David Steele <david@pgmasters.net> wrote:
>>
>> On 9/24/19 1:25 AM, Fujii Masao wrote:
>>>
>>> When backup_label exists, the startup process enters archive recovery mode
>>> even if recovery.signal file doesn't exist. In this case, the startup process
>>> tries to retrieve WAL files by using restore_command. Then, at the beginning
>>> of the archive recovery, the contents of backup_label are copied to pg_control
>>> and backup_label file is removed. This would be an intentional behavior.
>>
>>> But I think the problem is that, if the server shuts down during that
>>> archive recovery, the restart of the server may cause the recovery to fail
>>> because neither backup_label nor recovery.signal exist and the server
>>> doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
>>>
>>> So the problematic scenario is;
>>>
>>> 1. the server starts with backup_label, but not recovery.signal.
>>> 2. the startup process enters an archive recovery mode because
>>>     backup_label exists.
>>> 3. the contents of backup_label are copied to pg_control and
>>>     backup_label is deleted.
>>
>> Do you mean deleted or renamed to backup_label.old?
> 
> Sorry for the confusing wording..
> I meant the following code that renames backup_label to .old, in StartupXLOG().

Right, that makes sense.

>>
>> I assume you have a repro?  Can you give more details?
> 
> What I did is:
> 
> 1. Start PostgreSQL server with WAL archiving enabled.
> 2. Take an online backup by using pg_basebackup, for example,
>      $ pg_basebackup -D backup
> 3. Execute many write SQL to generate lots of WAL files. During that execution,
>     perform CHECKPOINT to remove some WAL files from pg_wal directory.
>     You need to repeat these until you confirm that there are many WAL files
>     that have already been removed from pg_wal but exist only in archive area.
>  4. Shutdown the server.
>  5. Remove PGDATA and restore it from backup.
>  6. Set up restore_command.
>  7. (Forget to put recovery.signal)
>      That is, in this scenario, you want to recover the database up to
>      the latest WAL records in the archive area. So you need to start archive
>      recovery by setting restore_command and putting recovery.signal.
>      But the problem happens when you forget to put recovery.signal.
>  8. Start PostgreSQL server.
>  9. Shutdown the server while it's restoring archived WAL files and replaying
>      them. At this point, you will notice that the archive recovery starts
>      even though recovery.signal doesn't exist. So even archived WAL files
>      are successfully restored at this step.
>  10. Restart PostgreSQL server. Since neither backup_label or recovery.signal
>         exist, crash recovery starts and fail to restore the archived WAL files.
>        So you fail to recover the database up to the latest WAL record
>        in archive directory. The recovery will finish at early point.

Yes, I see it now.  I did not have enough WAL to make it work before, as
I suspected.

>>> One idea to fix this issue is to make the above step #3 remember that
>>> backup_label existed, in pg_control. Then we should make the subsequent
>>> recovery enter an archive recovery mode if pg_control indicates that
>>> even if neither backup_label nor recovery.signal exist. Thought?
>>
>> That seems pretty invasive to me at this stage.  I'd like to reproduce
>> it and see if there are alternatives.
>>
>> Also, are you sure this is a new behavior?
> 
> In v11 or before, if backup_label exists but not recovery.conf,
> the startup process doesn't enter an archive recovery mode. It starts
> crash recovery in that case. So the behavior is somewhat different
> between versions.

Agreed.  Since recovery options can be used in the presence of
backup_label *or* recovery.signal (or standby.signal for that matter)
this does represent a change in behavior.  And it doesn't appear to be a
beneficial change.

Regards,
-- 
-David
david@pgmasters.net



Re: recovery starting when backup_label exists, but not recovery.signal

From: Peter Eisentraut

On 2019-09-27 10:34, Fujii Masao wrote:
>> Also, are you sure this is a new behavior?
> In v11 or before, if backup_label exists but not recovery.conf,
> the startup process doesn't enter an archive recovery mode. It starts
> crash recovery in that case. So the behavior is somewhat different
> between versions.

Can you bisect this?  I have traced through xlog.c in both versions and
I don't see how this logic is any different in any obvious way.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: recovery starting when backup_label exists, but not recovery.signal

From: David Steele

Hi Peter,

On 9/27/19 3:35 PM, Peter Eisentraut wrote:
> On 2019-09-27 10:34, Fujii Masao wrote:
>>> Also, are you sure this is a new behavior?
>> In v11 or before, if backup_label exists but not recovery.conf,
>> the startup process doesn't enter an archive recovery mode. It starts
>> crash recovery in that case. So the behavior is somewhat different
>> between versions.
> 
> Can you bisect this?  I have traced through xlog.c in both versions and
> I don't see how this logic is any different in any obvious way.

What I've been seeing is that the underlying logic isn't different but
there are more ways to get into it.

Previously, there was no archive/targeted recovery without
recovery.conf, but now there are several ways to get to archive/targeted
recovery, i.e., making the recovery settings GUCs has bypassed controls
that previously had limited how they could be used and when.
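
One way to picture this is the simplified C model below, which captures only
where restore_command lives: in recovery.conf up to v11, and as an ordinary
GUC from v12 on. The function name and its parameters are hypothetical, and
the model ignores every other recovery setting.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified model: can restore_command take effect at startup?
 * Up to v11 the setting is read from recovery.conf, so it needs that file.
 * From v12 on it is a GUC, so it applies whenever it is set.
 */
bool
restore_command_in_effect(int major_version, bool has_recovery_conf,
                          bool restore_command_set)
{
    assert(major_version > 0);
    if (major_version <= 11)
        return has_recovery_conf && restore_command_set;
    return restore_command_set;
}
```

Under this model, a v12 server with restore_command set and no recovery file
at all can still fetch WAL from the archive once backup_label puts it into
archive recovery, whereas a v11 server without recovery.conf cannot.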

The issues on the other thread [1], at least, were all introduced in
2dedf4d9.

Regards,
-- 
-David
david@pgmasters.net

[1]
https://www.postgresql.org/message-id/flat/e445616d-023e-a268-8aa1-67b8b335340c%40pgmasters.net