Postgres WAL Recovery Fails... And Then Works...

From:
Phil Monroe
Date:
Hi Everyone,

So we had to fail over and do a full base backup to get our slave database back online, and we ran into an interesting scenario. After copying the data directory, setting up the recovery.conf, and starting the slave database, the database crashes while replaying xlogs. However, when we start the database again, it replays xlogs farther than it initially got but ultimately fails out again. After starting the DB a third time, PostgreSQL replays even further and catches up to the master to start streaming replication. Is this common and/or acceptable?

The error messages that cause PostgreSQL to stop replaying xlogs:

ERROR #1:
2013-01-12 00:17:17 UTC::@:[61150]: WARNING:  page 2387476480 of relation base/16408/18967399 does not exist
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:17 UTC::@:[61150]: PANIC:  WAL contains references to invalid pages
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  startup process (PID 61150) was terminated by signal 6: Aborted


ERROR #2:
2013-01-12 01:02:17 UTC::@:[36831]: WARNING:  page 2387607552 of relation base/16408/18967399 does not exist
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:17 UTC::@:[36831]: PANIC:  WAL contains references to invalid pages
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  startup process (PID 36831) was terminated by signal 6: Aborted



Fortunately, these errors only pertain to indexes, which can be rebuilt. But is this a sign that the data directory on the slave is corrupt?
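(The 1663/16408/NNNNNNNN triples in the CONTEXT lines are tablespace/database/relfilenode, so the affected indexes can be identified and rebuilt later. A minimal sketch; <dbname> is a placeholder for the database with OID 16408, and REINDEX has to wait until the node is writable, i.e. on the master or after promotion:)

    # Map the relfilenodes from the CONTEXT lines back to relation names:
    psql -d <dbname> -c "
      SELECT n.nspname, c.relname, c.relkind
        FROM pg_class c
        JOIN pg_namespace n ON n.oid = c.relnamespace
       WHERE c.relfilenode IN (22892841, 22892842);"

    # If they are indexes (relkind = 'i'), rebuild them once writable:
    #   psql -d <dbname> -c 'REINDEX INDEX <schema>.<index_name>;'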

Here is a full timeline of the logs highlighting the issues:

1. Data Directory Copy Finishes.
2. Recovery.conf Setup (a sketch of a possible recovery.conf follows the timeline below)

3. Start #1:

2013-01-12 00:14:23 UTC::@:[61150]: LOG:  database system was shut down in recovery at 2013-01-12 00:14:06 UTC
2013-01-12 00:14:23 UTC:[local]:[unknown]@[unknown]:[61151]: LOG:  incomplete startup packet
2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61152]: FATAL:  the database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  entering standby mode
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file "0000000900008E45000000B8" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file "0000000900008E450000008B" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  could not open directory "pg_snapshots": No such file or directory
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  redo starts at 8E45/8B173180
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  consistent recovery state reached at 8E45/8B174840
2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61175]: FATAL:  the database system is starting up
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file "0000000900008E450000008C" from archive
2013-01-12 00:14:24 UTC::@:[61150]: LOG:  restored log file "0000000900008E450000008D" from archive

*SNIP*

2013-01-12 00:17:17 UTC::@:[61150]: LOG:  restored log file "0000000900008E4800000066" from archive
2013-01-12 00:17:17 UTC::@:[61150]: LOG:  restored log file "0000000900008E4800000067" from archive
2013-01-12 00:17:17 UTC::@:[61150]: WARNING:  page 2387476480 of relation base/16408/18967399 does not exist
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:17 UTC::@:[61150]: PANIC:  WAL contains references to invalid pages
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  startup process (PID 61150) was terminated by signal 6: Aborted
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  terminating any other active server processes


4. PostgreSQL shuts down...
5. Debugging logs enabled in postgresql.conf.


6. Start #2:

2013-01-12 00:51:48 UTC::@:[36831]: LOG:  database system was interrupted while in recovery at log time 2013-01-11 18:05:31 UTC
2013-01-12 00:51:48 UTC::@:[36831]: HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2013-01-12 00:51:48 UTC:[local]:[unknown]@[unknown]:[36832]: LOG:  incomplete startup packet
2013-01-12 00:51:48 UTC:[local]:postgres@postgres:[36833]: FATAL:  the database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  entering standby mode
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  restored log file "0000000900008E45000000B8" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  checkpoint record is at 8E45/B80AF650
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  redo record is at 8E45/8B173180; shutdown FALSE
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  next transaction ID: 0/552803703; next OID: 24427698
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  next MultiXactId: 109027; next MultiXactOffset: 2442921
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  oldest unfrozen transaction ID: 3104202601, in database 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  transaction ID wrap limit is 956718952, limited by database with OID 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  resetting unlogged relations: cleanup 1 init 0
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  could not open directory "pg_snapshots": No such file or directory
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  initializing for hot standby
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  restored log file "0000000900008E450000008B" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  redo starts at 8E45/8B173180
2013-01-12 00:51:49 UTC::@:[36831]: LOG:  restored log file "0000000900008E450000008C" from archive

*SNIP*

2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000062" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000063" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000064" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000065" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000066" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000067" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  consistent recovery state reached at 8E48/67AC4E28
2013-01-12 00:54:46 UTC::@:[36796]: LOG:  database system is ready to accept read only connections
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000068" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000069" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E480000006A" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E480000006B" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E480000006C" from archive

*SNIP*

2013-01-12 01:02:16 UTC::@:[36831]: LOG:  restored log file "0000000900008E4F00000079" from archive
2013-01-12 01:02:16 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: LOG:  restored log file "0000000900008E4F0000007A" from archive
2013-01-12 01:02:17 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: WARNING:  page 2387607552 of relation base/16408/18967399 does not exist
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:17 UTC::@:[36831]: PANIC:  WAL contains references to invalid pages
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  startup process (PID 36831) was terminated by signal 6: Aborted
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  terminating any other active server processes


7. PostgreSQL shuts down... 


8. Start #3:

2013-01-12 01:04:04 UTC::@:[61742]: LOG:  database system was interrupted while in recovery at log time 2013-01-11 19:50:31 UTC
2013-01-12 01:04:04 UTC::@:[61742]: HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2013-01-12 01:04:04 UTC:[local]:[unknown]@[unknown]:[61743]: LOG:  incomplete startup packet
2013-01-12 01:04:04 UTC:[local]:postgres@postgres:[61744]: FATAL:  the database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  entering standby mode
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  restored log file "0000000900008E4A00000039" from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  checkpoint record is at 8E4A/39CD4BA0
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  redo record is at 8E4A/19F0D210; shutdown FALSE
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  next transaction ID: 0/552859005; next OID: 24427698
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  next MultiXactId: 109221; next MultiXactOffset: 2443321
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  oldest unfrozen transaction ID: 3104202601, in database 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  transaction ID wrap limit is 956718952, limited by database with OID 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  resetting unlogged relations: cleanup 1 init 0
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  could not open directory "pg_snapshots": No such file or directory
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  initializing for hot standby
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  restored log file "0000000900008E4A00000019" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  redo starts at 8E4A/19F0D210
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  restored log file "0000000900008E4A0000001A" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG:  got WAL segment from archive

*SNIP*

2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F00000077" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F00000078" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F00000079" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007A" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  consistent recovery state reached at 8E4F/7A22BD08
2013-01-12 01:09:38 UTC::@:[61729]: LOG:  database system is ready to accept read only connections
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007B" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007C" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007D" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:39 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007E" from archive


*SNIP*

2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000D9" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DA" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DB" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DC" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DD" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG:  got WAL segment from archive
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  unexpected pageaddr 8E4A/1000000 in log file 36435, segment 222, offset 0
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[19504]: LOG:  streaming replication successfully connected to primary
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=1 file=base/16408/22873432 time=2.538 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=2 file=base/16408/18967506 time=12.054 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=3 file=base/16408/18967506_fsm time=0.095 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=4 file=base/16408/22873244 time=0.144 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=5 file=base/16408/22892823 time=0.087 msec

9. Slave DB connected to streaming replication with Master DB and all seems fine.
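
For illustration, a recovery.conf along these lines would produce the behaviour seen in the logs above (segments pulled from /mnt/db/wals and decompressed with unpigz, then streaming replication once the archive runs dry). The exact restore_command and connection settings were not posted, so everything below is a placeholder:

    # Illustrative recovery.conf on the slave (values are placeholders):
    cat > /var/lib/postgresql/9.2/main/recovery.conf <<'EOF'
    standby_mode = 'on'
    # Decompress archived segments with unpigz, matching the
    # "unpigz: /mnt/db/wals/... does not exist -- skipping" lines above.
    restore_command = 'unpigz -c /mnt/db/wals/%f > %p'
    # Switch to streaming replication once archived WAL runs out.
    primary_conninfo = 'host=MASTER-HOST port=5432 user=REPLICATION-USER'
    EOF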


Any help would be appreciated. Thanks!


--

Phil Monroe

Stuff  |  Identified

185 Berry Street Suite 6100, San Francisco, CA 94107 



 

Re: Postgres WAL Recovery Fails... And Then Works...

From:
Lonni J Friedman
Date:
Your errors look somewhat similar to a problem I reported last week
(no replies thus far):
http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=tS6EsMHXXRRoByov3zn9pU4249w@mail.gmail.com

Except in my case no number of restarts helped.  You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?

Also, which version of postgres are you using?

On Fri, Jan 11, 2013 at 6:32 PM, Phil Monroe <phil@identified.com> wrote:
>
> Hi Everyone,
>
> *SNIP*


Re: Postgres WAL Recovery Fails... And Then Works...

From:
KONDO Mitsumasa
Date:
Hi,

This may be PG 9.2.x on an HA cluster (the pgsql RA on Pacemaker).

This is a known bug. When the HA cluster starts PostgreSQL, it performs crash
recovery with a recovery.conf in place, but in PG 9.2 recovery driven by
recovery.conf expects a database cluster that has not crashed.

Please see the following mailing list threads:
[HACKERS] [BUG?] lag of minRecoveryPont in archive recovery
[BUGS] PITR potentially broken in 9.2

This bug is fixed in the latest development code (the next release, 9.2.3).
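
(To see the running version and the minimum recovery point tracked in the control file on a standby, something like the following works; the paths are the Debian/Ubuntu defaults mentioned elsewhere in this thread:)

    # Exact server version:
    psql -At -c 'SHOW server_version;'

    # Minimum recovery point in the control file (run as the postgres user):
    /usr/lib/postgresql/9.2/bin/pg_controldata /var/lib/postgresql/9.2/main \
        | grep -i 'minimum recovery'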

(2013/01/15 5:02), Lonni J Friedman wrote:
> Your errors look somewhat similar to a problem I reported last week
> (no replies thus far):
> http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=tS6EsMHXXRRoByov3zn9pU4249w@mail.gmail.com
>
> *SNIP*
--
NTT OSS Center
Mitsumasa KONDO



Re: Postgres WAL Recovery Fails... And Then Works...

From:
Albe Laurenz
Date:
Phil Monroe wrote:
> So we had to failover and do a full base backup to get our slave database back online and ran into a
> interesting scenario. After copying the data directory, setting up the recovery.conf, and starting the
> slave database, the database crashes while replaying xlogs. However, trying to start the database
> again, the database is able to replay xlogs farther than it initially got, but ultimately ended up
> failing out again. After starting the DB a third time, PostgreSQL replays even further and catches up
> to the master to start streaming replication. Is this common and or acceptable?

Certainly not acceptable.

Just checking: When you were "copying the data directory",
did you have online backup mode enabled?
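
(On 9.2 an easy way to confirm that, while the copy is running, is to ask the master whether an exclusive online backup is in progress:)

    # Returns "t" between pg_start_backup() and pg_stop_backup():
    psql -At -c 'SELECT pg_is_in_backup();'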

Yours,
Laurenz Albe


Re: Postgres WAL Recovery Fails... And Then Works...

From:
Heikki Linnakangas
Date:
On 12.01.2013 04:32, Phil Monroe wrote:
> Hi Everyone,
>
> So we had to failover and do a full base backup to get our slave database back
> online and ran into a interesting scenario. After copying the data directory,
> setting up the recovery.conf, and starting the slave database, the database
> crashes while replaying xlogs. However, trying to start the database again, the
> database is able to replay xlogs farther than it initially got, but ultimately
> ended up failing out again. After starting the DB a third time, PostgreSQL
> replays even further and catches up to the master to start streaming
> replication. Is this common and or acceptable?

How did you perform the base backup? Did you use pg_basebackup? Or if
you did a filesystem-level copy, did you use pg_start/stop_backup
correctly? Did you take the base backup from the master server, or from
another slave?

This looks similar to the bug discussed here:
http://www.postgresql.org/message-id/CAMkU=1wpvYJVEDo6Qvq4QbosZ+AV6BMVCf+XVCG=mJqFRjQ8Pg@mail.gmail.com.
That was fixed in 9.2.2, so if you're using 9.2.1 or 9.2.0, try upgrading.
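
(For an Ubuntu 12.04 install like the one described elsewhere in the thread, a minor upgrade is a package update plus a cluster restart. A sketch, assuming the postgresql-9.2 packages come from the apt.postgresql.org/PGDG repository:)

    # Confirm the running version first:
    psql -At -c 'SELECT version();'

    # Minor upgrade in place (no dump/restore needed), then restart:
    sudo apt-get update
    sudo apt-get install --only-upgrade postgresql-9.2
    sudo pg_ctlcluster 9.2 main restart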

- Heikki


Re: Postgres WAL Recovery Fails... And Then Works...

From:
Phil Monroe
Date:
Sorry, my initial response got blocked since I replied with the logs quoted again.


> Also, which version of postgres are you using?

PostgreSQL 9.2.1 on Ubuntu 12.04.


> Except in my case no number of restarts helped.  You didn't say, were
> you explicitly copying $PGDATA or using some other mechanism to
> migrate the data elsewhere?

We have a very large database (~5TB), so we use a script to do parallel rsyncs to copy the data directory (https://gist.github.com/4477190/#file-pmrcp-rb); the whole copy ended up taking ~3.5 hours. So yes, we did a physical copy of $PGDATA (which is located at /var/lib/postgresql/9.2/main/ on both machines). We followed this process (a condensed shell sketch of steps 2-4 follows the list):

1. Master archives WAL files to Backup Host.
2. Execute on Master: psql -c "select pg_start_backup('DATE-slave-restore')"
3. Execute on Master: RCP='rsync -cav --inplace -e rsh' EXCLUDE='pg_xlog' pmrcp /var/lib/postgresql/9.2/main/ prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log
4. Execute on Master: psql -c "select pg_stop_backup()"
5. On Slave, set up recovery.conf to read the WAL archive on the Backup Host.
6. Execute on Slave: pg_ctlcluster 9.2 main start (as described in the initial email)
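
In case it helps, steps 2-4 boil down to roughly the following (pmrcp is the parallel-rsync wrapper from the gist above; host names, the backup label, and the log path are just what we happened to use, and error handling is omitted):

    # Step 2: put the master into online backup mode.
    psql -c "SELECT pg_start_backup('DATE-slave-restore');"

    # Step 3: parallel rsync of the data directory, excluding pg_xlog.
    RCP='rsync -cav --inplace -e rsh' EXCLUDE='pg_xlog' \
        pmrcp /var/lib/postgresql/9.2/main/ \
              prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log

    # Step 4: leave online backup mode; this waits until the required WAL
    # has been archived.
    psql -c "SELECT pg_stop_backup();"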


Best,
Phil


Re: Postgres WAL Recovery Fails... And Then Works...

From:
Phil Monroe
Date:

Also, which version of postgres are you using?

PostgreSQL 9.2.1 on Ubuntu 12.04


Except in my case no number of restarts helped.  You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?


So we have a very large database (~5TB), so we use a script to do parallel rsyncs to copy the data directory (https://gist.github.com/4477190/#file-pmrcp-rb). The whole copy process ended up taking ~3.5 hours. So we did a physical copy of $PGDATA (which is located at /var/lib/postgresql/9.2/main/ on both machines.). We followed the following process to do this:

1. Master archives WAL files to Backup Host.
2. Execute on Master: psql -c "select pg_start_backup('DATE-slave-restore')"
3. Execute on Master: RCP='rsync -cav --inplace -e rsh' EXCLUDE='pg_xlog' pmrcp /var/lib/postgresql/9.2/main/ prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log
4. Execute on Master: psql -c "select pg_stop_backup()"
5. On Slave, setup recovery.conf to read WAL archive on Backup Host
6. Execute on Slave:  pg_ctlcluster 9.2 main start (as described in initial email)


Best,
Phil



Lonni J Friedman wrote:

Your errors look somewhat similar to a problem I reported last week
(no replies thus far):
http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=tS6EsMHXXRRoByov3zn9pU4249w@mail.gmail.com

Except in my case no number of restarts helped.  You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?

Also, which version of postgres are you using?

On Fri, Jan 11, 2013 at 6:32 PM, Phil Monroe<phil@identified.com>  wrote:

*SNIP*

Re: Postgres WAL Recovery Fails... And Then Works...

From
Phil Monroe
Date:
Sorry, my initial response got blocked since I replied with the logs quoted again.


Also, which version of postgres are you using?


PostgreSQL 9.2.1 on Ubuntu 12.04



Except in my case no number of restarts helped.  You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?



We have a very large database (~5 TB), so we use a script to run parallel rsyncs of the data directory (https://gist.github.com/4477190/#file-pmrcp-rb); the whole copy ended up taking ~3.5 hours. So yes, it is a physical copy of $PGDATA (located at /var/lib/postgresql/9.2/main/ on both machines). The process we followed (a single-stream rsync equivalent of step 3 is sketched after the list):

1. Master archives WAL files to Backup Host.
2. Execute on Master: psql -c "select pg_start_backup('DATE-slave-restore')"
3. Execute on Master: RCP='rsync -cav --inplace -e rsh' EXCLUDE='pg_xlog' pmrcp /var/lib/postgresql/9.2/main/ prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log
4. Execute on Master: psql -c "select pg_stop_backup()"
5. On Slave, set up recovery.conf to read the WAL archive on the Backup Host
6. Execute on Slave:  pg_ctlcluster 9.2 main start (as described in initial email)
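
For anyone without the pmrcp script: it just fans the copy out over several parallel rsync workers, so step 3 is roughly equivalent to the single-stream command below. The flags come from the RCP/EXCLUDE values above; translating EXCLUDE into --exclude is my assumption about what pmrcp does with it.

# Single-stream equivalent of step 3 (no parallelism).
# pg_xlog is skipped because the standby gets its WAL from the
# archive and from streaming replication instead.
rsync -cav --inplace -e rsh --exclude=pg_xlog \
    /var/lib/postgresql/9.2/main/ \
    prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log

The -c flag compares file checksums rather than just timestamps and sizes, and --inplace updates files in place instead of writing temporary copies, which is presumably why it was chosen for a ~5 TB copy.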


Best,
Phil