BUG #8543: Standby recovery use incorrect timeline to determine WAL length

Поиск
Список
Период
Сортировка
От alfonso.vicente@logos.com.uy
Тема BUG #8543: Standby recovery use incorrect timeline to determine WAL length
Дата
Msg-id E1VYGWB-00056v-Dj@wrigleys.postgresql.org
обсуждение исходный текст
Ответы Re: BUG #8543: Standby recovery use incorrect timeline to determine WAL length  (Michael Paquier <michael.paquier@gmail.com>)
Список pgsql-bugs
The following bug has been logged on the website:

Bug reference:      8543
Logged by:          Alfonso Vicente
Email address:      alfonso.vicente@logos.com.uy
PostgreSQL version: 9.1.3
Operating system:   OpenSUSE 11
Description:

Environment: PostgreSQL 9.1.3 / OpenSUSE 11


The use case is a switchover and switchback test, using a standby cluster
with Streaming Replication. We have 2 datacenters: main-site and
backup-site


The cluster in main-site has timeline ID 1, and we perform the switchover to
backup-site without any trouble. When we execute the pg_ctl promote, the
cluster sets the new timeline ID 2 (OK).


But... we don't stop the cluster in main-site. So, the cluster in main-site
continue to archive logs in /archlogs directory, and archives the WAL
00000001000000FE00000002 of 16MB.


Immediately, we perform a backup in the backup-site, to prepare for
switchback. The pg_stop_backup() force a WAL switch, and generate a WAL
00000002000000FE00000002 of 2MB. Then, the archive_command script perform
the log shipping to the main-site and store the 00000002000000FE00000002 in
the same /archlogs directory


When we execute the pg_ctl promote in the main-site, the postgresql.log show
the following error:


2013-10-19 22:39:59 UYST   LOG:  database system was shut down in recovery
at 2013-10-19 21:47:07 UYST
sh: /home/postgres/9.1/backups/archlogs/00000002.history.gz: No existe el
fichero o el directorio
2013-10-19 22:39:59 UYST   LOG:  starting archive recovery
2013-10-19 22:39:59 UYST   FATAL:  archive file "00000002000000FE00000002"
has wrong size: 2293760 instead of 16777216
2013-10-19 22:39:59 UYST   LOG:  startup process (PID 10636) exited with
exit code 1
2013-10-19 22:39:59 UYST   LOG:  aborting startup due to startup process
failure


The WAL 00000002000000FE00000002 is of 2293760 bytes, this is not an error.
Why PostgreSQL think WAL 00000002000000FE00000002 should be of 16MB ? ... I
think the cause can be the WAL 00000001000000FE00000002 (timeline ID 1)
stored in the /archlogs directory


If we drop all the timeline ID 1 archived WALs, the failback executes
succesfully


There is an easy workaround, but please, confirm if this is a bug. The
documentation says: "The default behavior of recovery is to recover along
the same timeline that was current when the base backup was taken". The base
backup in backup-site was timeline ID 2


Thanks in advance

В списке pgsql-bugs по дате отправления:

Предыдущее
От: warren@ninefold.com
Дата:
Сообщение: BUG #8541: PGXS build fails due to 'include common/fe_memutils.h' which is not present
Следующее
От: bentzi.mor@hp.com
Дата:
Сообщение: BUG #8544: High nfs getattr