Re: Error restoring from a base backup taken from standby

From: Fujii Masao
Subject: Re: Error restoring from a base backup taken from standby
Msg-id: CAHGQGwGr+U1xujAaQwDqOk0eXH3ZD6iE_JWDV5u-vxJT8ocX=g@mail.gmail.com
In response to: Error restoring from a base backup taken from standby  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List: pgsql-hackers
On Tue, Dec 18, 2012 at 2:39 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> (This is different from the other issue related to timeline switches I just
> posted about. There's no timeline switch involved in this one.)
>
> If you do "pg_basebackup -x" against a standby server, in some circumstances
> the backup fails to restore with error like this:
>
> C 2012-12-17 19:09:44.042 EET 7832 LOG:  database system was not properly
> shut down; automatic recovery in progress
> C 2012-12-17 19:09:44.091 EET 7832 LOG:  record with zero length at
> 0/1764F48
> C 2012-12-17 19:09:44.091 EET 7832 LOG:  redo is not required
> C 2012-12-17 19:09:44.091 EET 7832 FATAL:  WAL ends before end of online
> backup
> C 2012-12-17 19:09:44.091 EET 7832 HINT:  All WAL generated while online
> backup was taken must be available at recovery.
> C 2012-12-17 19:09:44.092 EET 7831 LOG:  startup process (PID 7832) exited
> with exit code 1
> C 2012-12-17 19:09:44.092 EET 7831 LOG:  aborting startup due to startup
> process failure
>
> I spotted this bug while reading the code, and it took me quite a while to
> actually construct a test case to reproduce the bug, so let me begin by
> discussing the code where the bug is. You get the above error, "WAL ends
> before end of online backup", when you reach the end of WAL before reaching
> the backupEndPoint stored in the control file, which originally comes from
> the backup_label file. backupEndPoint is only used in a base backup taken
> from a standby; in a base backup taken from the master, the end-of-backup
> WAL record is used instead to mark the end of backup. In the xlog redo loop,
> after replaying each record, we check if we've just reached backupEndPoint,
> and clear it from the control file if we have. Now the problem is, if there
> are no WAL records after the checkpoint redo point, we never even enter the
> redo loop, so backupEndPoint is not cleared even though it's reached
> immediately after reading the initial checkpoint record.

Good catch!

> To deal with the similar situation wrt. reaching consistency for hot standby
> purposes, we call CheckRecoveryConsistency() before the redo loop. The
> straightforward fix is to copy-paste the check for backupEndPoint to just
> before the redo loop, next to the CheckRecoveryConsistency() call. Even
> better, I think we should move the backupEndPoint check into
> CheckRecoveryConsistency(). It's already responsible for keeping track of
> whether minRecoveryPoint has been reached, so it seems like a good idea to
> do this check there as well.
>
> Attached is a patch for that (for 9.2), as well as a script I used to
> reproduce the bug.

The patch looks good to me.
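
For readers following along without the patch, the control flow described
above can be sketched as a toy model. This is not the attached patch and
not the real xlog.c code: ControlFileModel, check_backup_end() and
recover() are hypothetical names invented for illustration, and WAL
positions are reduced to plain integers. The only point it shows is that
when there are no records after the checkpoint, a check that lives solely
inside the redo loop never runs, whereas a check made before the loop
(as CheckRecoveryConsistency() is) clears backupEndPoint.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified WAL position; the real XLogRecPtr is more involved. */
typedef uint64_t XLogRecPtr;

/* Hypothetical stand-in for the relevant control file field. */
typedef struct
{
    XLogRecPtr backupEndPoint;  /* 0 means "no backup end pending" */
} ControlFileModel;

/* Clear backupEndPoint once replay has reached or passed it. */
static void
check_backup_end(ControlFileModel *cf, XLogRecPtr replayed_upto)
{
    if (cf->backupEndPoint != 0 && replayed_upto >= cf->backupEndPoint)
        cf->backupEndPoint = 0;
}

/*
 * Toy model of recovery. record_ends[] holds the end positions of the
 * WAL records that follow the checkpoint record (possibly none).
 *
 * Returns true if recovery may finish, false for the "WAL ends before
 * end of online backup" case.
 */
static bool
recover(ControlFileModel *cf, XLogRecPtr checkpoint_end,
        const XLogRecPtr *record_ends, size_t nrecords,
        bool check_before_loop)
{
    /* Proposed behaviour: also check right after reading the checkpoint. */
    if (check_before_loop)
        check_backup_end(cf, checkpoint_end);

    /* Redo loop: with nrecords == 0 the body never executes. */
    for (size_t i = 0; i < nrecords; i++)
        check_backup_end(cf, record_ends[i]);

    return cf->backupEndPoint == 0;
}
```

In this model, calling recover() with backupEndPoint equal to
checkpoint_end and nrecords == 0 returns false when check_before_loop is
false (the reported failure) and true when it is true.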

Regards,

-- 
Fujii Masao


