Postgres WAL Recovery Fails... And Then Works...

Поиск
Список
Период
Сортировка
От	Phil Monroe
Тема	Postgres WAL Recovery Fails... And Then Works...
Дата	14 января 2013 г. 19:50:22
Msg-id	50F0CB3E.4080708@identified.com обсуждение исходный текст
Ответы	Re: Postgres WAL Recovery Fails... And Then Works... Re: Postgres WAL Recovery Fails... And Then Works... Re: Postgres WAL Recovery Fails... And Then Works...
Список	pgsql-admin
Дерево обсуждения
Hi Everyone,

So we had to failover and do a full base backup to get our slave database back online and ran into a interesting scenario. After copying the data directory, setting up the recovery.conf, and starting the slave database, the database crashes while replaying xlogs. However, trying to start the database again, the database is able to replay xlogs farther than it initially got, but ultimately ended up failing out again. After starting the DB a third time, PostgreSQL replays even further and catches up to the master to start streaming replication. Is this common and or acceptable?

The error messages that cause PostgreSQL to stop replaying xlogs:

ERROR #1:
2013-01-12 00:17:17 UTC::@:[61150]: WARNING:  page 2387476480 of relation base/16408/18967399 does not exist
 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
 2013-01-12 00:17:17 UTC::@:[61150]: PANIC:  WAL contains references to invalid pages
 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
 2013-01-12 00:17:18 UTC::@:[61121]: LOG:  startup process (PID 61150) was terminated by signal 6: Aborted 

ERROR #2:
2013-01-12 01:02:17 UTC::@:[36831]: WARNING:  page 2387607552 of relation base/16408/18967399 does not exist
 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
 2013-01-12 01:02:17 UTC::@:[36831]: PANIC:  WAL contains references to invalid pages
 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
 2013-01-12 01:02:20 UTC::@:[36796]: LOG:  startup process (PID 36831) was terminated by signal 6: Aborted


Fortunately, these errors only pertain to indexes, which can be rebuilt. But is this a sign that the data directory on the slave is corrupt?

Here is a full timeline of the logs highlighting the issues. :

1. Data Directory Copy Finishes.
2. Recovery.conf Setup

3. Start #1:

2013-01-12 00:14:23 UTC::@:[61150]: LOG:  database system was shut down in recovery at 2013-01-12 00:14:06 UTC
2013-01-12 00:14:23 UTC:[local]:[unknown]@[unknown]:[61151]: LOG:  incomplete startup packet
2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61152]: FATAL:  the database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  entering standby mode
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file "0000000900008E45000000B8" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file "0000000900008E450000008B" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  could not open directory "pg_snapshots": No such file or directory
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  redo starts at 8E45/8B173180
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  consistent recovery state reached at 8E45/8B174840
2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61175]: FATAL:  the database system is starting up
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file "0000000900008E450000008C" from archive
2013-01-12 00:14:24 UTC::@:[61150]: LOG:  restored log file "0000000900008E450000008D" from archive

*SNIP*

2013-01-12 00:17:17 UTC::@:[61150]: LOG:  restored log file "0000000900008E4800000066" from archive
2013-01-12 00:17:17 UTC::@:[61150]: LOG:  restored log file "0000000900008E4800000067" from archive
2013-01-12 00:17:17 UTC::@:[61150]: WARNING:  page 2387476480 of relation base/16408/18967399 does not exist
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:17 UTC::@:[61150]: PANIC:  WAL contains references to invalid pages
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  startup process (PID 61150) was terminated by signal 6: Aborted
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  terminating any other active server processes


4. PostgreSQL shuts down... 
5. Debugging logs enabled in postgresql.conf.


6. Start #2: 

2013-01-12 00:51:48 UTC::@:[36831]: LOG:  database system was interrupted while in recovery at log time 2013-01-11 18:05:31 UTC
2013-01-12 00:51:48 UTC::@:[36831]: HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2013-01-12 00:51:48 UTC:[local]:[unknown]@[unknown]:[36832]: LOG:  incomplete startup packet
2013-01-12 00:51:48 UTC:[local]:postgres@postgres:[36833]: FATAL:  the database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  entering standby mode
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  restored log file "0000000900008E45000000B8" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  checkpoint record is at 8E45/B80AF650
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  redo record is at 8E45/8B173180; shutdown FALSE
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  next transaction ID: 0/552803703; next OID: 24427698
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  next MultiXactId: 109027; next MultiXactOffset: 2442921
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  oldest unfrozen transaction ID: 3104202601, in database 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  transaction ID wrap limit is 956718952, limited by database with OID 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  resetting unlogged relations: cleanup 1 init 0
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  could not open directory "pg_snapshots": No such file or directory
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  initializing for hot standby
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  restored log file "0000000900008E450000008B" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  redo starts at 8E45/8B173180
2013-01-12 00:51:49 UTC::@:[36831]: LOG:  restored log file "0000000900008E450000008C" from archive

*SNIP*

2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000062" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000063" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000064" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000065" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000066" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000067" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  consistent recovery state reached at 8E48/67AC4E28
2013-01-12 00:54:46 UTC::@:[36796]: LOG:  database system is ready to accept read only connections
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000068" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000069" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E480000006A" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E480000006B" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E480000006C" from archive

*SNIP*

2013-01-12 01:02:16 UTC::@:[36831]: LOG:  restored log file "0000000900008E4F00000079" from archive
2013-01-12 01:02:16 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: LOG:  restored log file "0000000900008E4F0000007A" from archive
2013-01-12 01:02:17 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: WARNING:  page 2387607552 of relation base/16408/18967399 does not exist
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:17 UTC::@:[36831]: PANIC:  WAL contains references to invalid pages
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  startup process (PID 36831) was terminated by signal 6: Aborted
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  terminating any other active server processes


7. PostgreSQL shuts down...  


8. Start #3:

2013-01-12 01:04:04 UTC::@:[61742]: LOG:  database system was interrupted while in recovery at log time 2013-01-11 19:50:31 UTC
2013-01-12 01:04:04 UTC::@:[61742]: HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2013-01-12 01:04:04 UTC:[local]:[unknown]@[unknown]:[61743]: LOG:  incomplete startup packet
2013-01-12 01:04:04 UTC:[local]:postgres@postgres:[61744]: FATAL:  the database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  entering standby mode
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  restored log file "0000000900008E4A00000039" from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  checkpoint record is at 8E4A/39CD4BA0
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  redo record is at 8E4A/19F0D210; shutdown FALSE
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  next transaction ID: 0/552859005; next OID: 24427698
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  next MultiXactId: 109221; next MultiXactOffset: 2443321
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  oldest unfrozen transaction ID: 3104202601, in database 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  transaction ID wrap limit is 956718952, limited by database with OID 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  resetting unlogged relations: cleanup 1 init 0
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  could not open directory "pg_snapshots": No such file or directory
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  initializing for hot standby
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  restored log file "0000000900008E4A00000019" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  redo starts at 8E4A/19F0D210
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  restored log file "0000000900008E4A0000001A" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG:  got WAL segment from archive

*SNIP*

2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F00000077" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F00000078" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F00000079" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007A" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  consistent recovery state reached at 8E4F/7A22BD08
2013-01-12 01:09:38 UTC::@:[61729]: LOG:  database system is ready to accept read only connections
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007B" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007C" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007D" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:39 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007E" from archive


*SNIP*

2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000D9" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DA" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DB" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DC" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DD" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG:  got WAL segment from archive
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  unexpected pageaddr 8E4A/1000000 in log file 36435, segment 222, offset 0
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[19504]: LOG:  streaming replication successfully connected to primary
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=1 file=base/16408/22873432 time=2.538 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=2 file=base/16408/18967506 time=12.054 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=3 file=base/16408/18967506_fsm time=0.095 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=4 file=base/16408/22873244 time=0.144 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=5 file=base/16408/22892823 time=0.087 msec

9. Slave DB connected to streaming replication with Master DB and all seems fine.


Any help would be appreciated. Thanks!


-- 
Phil Monroe
Stuff  |  Identified
185 Berry Street Suite 6100, San Francisco, CA 94107 
m 614 832 9495
e phil@identified.com
w identified.com
tw @philmonroe
e phil@philmonroe.com
w philmonroe.com


 
Вложения

В списке pgsql-admin по дате отправления:
Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Postgres WAL Recovery Fails... And Then Works...

Вложения