Postgres WAL Recovery Fails... And Then Works...

From:
Phil Monroe
Date:
Hi Everyone,

So we had to fail over and do a full base backup to get our slave database back online, and we ran into an interesting scenario. After copying the data directory, setting up the recovery.conf, and starting the slave database, the database crashes while replaying xlogs. However, when we start the database again, it replays xlogs farther than it initially got but ultimately fails out again. After starting the DB a third time, PostgreSQL replays even further and catches up to the master to start streaming replication. Is this common and/or acceptable?

The error messages that cause PostgreSQL to stop replaying xlogs:

ERROR #1:
2013-01-12 00:17:17 UTC::@:[61150]: WARNING:  page 2387476480 of relation base/16408/18967399 does not exist
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:17 UTC::@:[61150]: PANIC:  WAL contains references to invalid pages
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  startup process (PID 61150) was terminated by signal 6: Aborted


ERROR #2:
2013-01-12 01:02:17 UTC::@:[36831]: WARNING:  page 2387607552 of relation base/16408/18967399 does not exist
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:17 UTC::@:[36831]: PANIC:  WAL contains references to invalid pages
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  startup process (PID 36831) was terminated by signal 6: Aborted



Fortunately, these errors only pertain to indexes, which can be rebuilt. But is this a sign that the data directory on the slave is corrupt?
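(The 1663/16408/NNNNNNNN triples in the CONTEXT lines are tablespace/database/relfilenode, so the affected indexes can be identified and rebuilt later. A minimal sketch; <dbname> is a placeholder for the database with OID 16408, and REINDEX has to wait until the node is writable, i.e. on the master or after promotion:)

    # Map the relfilenodes from the CONTEXT lines back to relation names:
    psql -d <dbname> -c "
      SELECT n.nspname, c.relname, c.relkind
        FROM pg_class c
        JOIN pg_namespace n ON n.oid = c.relnamespace
       WHERE c.relfilenode IN (22892841, 22892842);"

    # If they are indexes (relkind = 'i'), rebuild them once writable:
    #   psql -d <dbname> -c 'REINDEX INDEX <schema>.<index_name>;'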

Here is a full timeline of the logs highlighting the issues:

1. Data Directory Copy Finishes.
2. Recovery.conf Setup (a sketch of a possible recovery.conf follows the timeline below)

3. Start #1:

2013-01-12 00:14:23 UTC::@:[61150]: LOG:  database system was shut down in recovery at 2013-01-12 00:14:06 UTC
2013-01-12 00:14:23 UTC:[local]:[unknown]@[unknown]:[61151]: LOG:  incomplete startup packet
2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61152]: FATAL:  the database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  entering standby mode
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file "0000000900008E45000000B8" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file "0000000900008E450000008B" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  could not open directory "pg_snapshots": No such file or directory
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  redo starts at 8E45/8B173180
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  consistent recovery state reached at 8E45/8B174840
2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61175]: FATAL:  the database system is starting up
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file "0000000900008E450000008C" from archive
2013-01-12 00:14:24 UTC::@:[61150]: LOG:  restored log file "0000000900008E450000008D" from archive

*SNIP*

2013-01-12 00:17:17 UTC::@:[61150]: LOG:  restored log file "0000000900008E4800000066" from archive
2013-01-12 00:17:17 UTC::@:[61150]: LOG:  restored log file "0000000900008E4800000067" from archive
2013-01-12 00:17:17 UTC::@:[61150]: WARNING:  page 2387476480 of relation base/16408/18967399 does not exist
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:17 UTC::@:[61150]: PANIC:  WAL contains references to invalid pages
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  startup process (PID 61150) was terminated by signal 6: Aborted
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  terminating any other active server processes


4. PostgreSQL shuts down...
5. Debugging logs enabled in postgresql.conf.


6. Start #2:

2013-01-12 00:51:48 UTC::@:[36831]: LOG:  database system was interrupted while in recovery at log time 2013-01-11 18:05:31 UTC
2013-01-12 00:51:48 UTC::@:[36831]: HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2013-01-12 00:51:48 UTC:[local]:[unknown]@[unknown]:[36832]: LOG:  incomplete startup packet
2013-01-12 00:51:48 UTC:[local]:postgres@postgres:[36833]: FATAL:  the database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  entering standby mode
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  restored log file "0000000900008E45000000B8" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  checkpoint record is at 8E45/B80AF650
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  redo record is at 8E45/8B173180; shutdown FALSE
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  next transaction ID: 0/552803703; next OID: 24427698
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  next MultiXactId: 109027; next MultiXactOffset: 2442921
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  oldest unfrozen transaction ID: 3104202601, in database 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  transaction ID wrap limit is 956718952, limited by database with OID 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  resetting unlogged relations: cleanup 1 init 0
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  could not open directory "pg_snapshots": No such file or directory
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  initializing for hot standby
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  restored log file "0000000900008E450000008B" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  redo starts at 8E45/8B173180
2013-01-12 00:51:49 UTC::@:[36831]: LOG:  restored log file "0000000900008E450000008C" from archive

*SNIP*

2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000062" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000063" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000064" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000065" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000066" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000067" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  consistent recovery state reached at 8E48/67AC4E28
2013-01-12 00:54:46 UTC::@:[36796]: LOG:  database system is ready to accept read only connections
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000068" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E4800000069" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E480000006A" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E480000006B" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file "0000000900008E480000006C" from archive

*SNIP*

2013-01-12 01:02:16 UTC::@:[36831]: LOG:  restored log file "0000000900008E4F00000079" from archive
2013-01-12 01:02:16 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: LOG:  restored log file "0000000900008E4F0000007A" from archive
2013-01-12 01:02:17 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: WARNING:  page 2387607552 of relation base/16408/18967399 does not exist
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:17 UTC::@:[36831]: PANIC:  WAL contains references to invalid pages
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  startup process (PID 36831) was terminated by signal 6: Aborted
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  terminating any other active server processes


7. PostgreSQL shuts down... 


8. Start #3:

2013-01-12 01:04:04 UTC::@:[61742]: LOG:  database system was interrupted while in recovery at log time 2013-01-11 19:50:31 UTC
2013-01-12 01:04:04 UTC::@:[61742]: HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2013-01-12 01:04:04 UTC:[local]:[unknown]@[unknown]:[61743]: LOG:  incomplete startup packet
2013-01-12 01:04:04 UTC:[local]:postgres@postgres:[61744]: FATAL:  the database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  entering standby mode
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  restored log file "0000000900008E4A00000039" from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  checkpoint record is at 8E4A/39CD4BA0
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  redo record is at 8E4A/19F0D210; shutdown FALSE
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  next transaction ID: 0/552859005; next OID: 24427698
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  next MultiXactId: 109221; next MultiXactOffset: 2443321
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  oldest unfrozen transaction ID: 3104202601, in database 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  transaction ID wrap limit is 956718952, limited by database with OID 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  resetting unlogged relations: cleanup 1 init 0
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  could not open directory "pg_snapshots": No such file or directory
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  initializing for hot standby
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  restored log file "0000000900008E4A00000019" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  redo starts at 8E4A/19F0D210
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  restored log file "0000000900008E4A0000001A" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG:  got WAL segment from archive

*SNIP*

2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F00000077" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F00000078" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F00000079" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007A" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  consistent recovery state reached at 8E4F/7A22BD08
2013-01-12 01:09:38 UTC::@:[61729]: LOG:  database system is ready to accept read only connections
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007B" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007C" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007D" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:39 UTC::@:[61742]: LOG:  restored log file "0000000900008E4F0000007E" from archive


*SNIP*

2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000D9" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DA" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DB" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DC" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  restored log file "0000000900008E53000000DD" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG:  got WAL segment from archive
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  unexpected pageaddr 8E4A/1000000 in log file 36435, segment 222, offset 0
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[19504]: LOG:  streaming replication successfully connected to primary
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=1 file=base/16408/22873432 time=2.538 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=2 file=base/16408/18967506 time=12.054 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=3 file=base/16408/18967506_fsm time=0.095 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=4 file=base/16408/22873244 time=0.144 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=5 file=base/16408/22892823 time=0.087 msec

9. Slave DB connected to streaming replication with Master DB and all seems fine.
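
For illustration, a recovery.conf along these lines would produce the behaviour seen in the logs above (segments pulled from /mnt/db/wals and decompressed with unpigz, then streaming replication once the archive runs dry). The exact restore_command and connection settings were not posted, so everything below is a placeholder:

    # Illustrative recovery.conf on the slave (values are placeholders):
    cat > /var/lib/postgresql/9.2/main/recovery.conf <<'EOF'
    standby_mode = 'on'
    # Decompress archived segments with unpigz, matching the
    # "unpigz: /mnt/db/wals/... does not exist -- skipping" lines above.
    restore_command = 'unpigz -c /mnt/db/wals/%f > %p'
    # Switch to streaming replication once archived WAL runs out.
    primary_conninfo = 'host=MASTER-HOST port=5432 user=REPLICATION-USER'
    EOF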


Any help would be appreciated. Thanks!


--

Phil Monroe

Stuff  |  Identified

185 Berry Street Suite 6100, San Francisco, CA 94107 



 

Re: Postgres WAL Recovery Fails... And Then Works...

From:
Lonni J Friedman
Date:
Your errors look somewhat similar to a problem I reported last week
(no replies thus far):
http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=tS6EsMHXXRRoByov3zn9pU4249w@mail.gmail.com

Except in my case no number of restarts helped.  You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?

Also, which version of postgres are you using?

On Fri, Jan 11, 2013 at 6:32 PM, Phil Monroe <phil@identified.com> wrote:
>
> Hi Everyone,
>
> *SNIP*


Re: Postgres WAL Recovery Fails... And Then Works...

From:
KONDO Mitsumasa
Date:
Hi,

This may be PG 9.2.x on an HA cluster (the pgsql RA on Pacemaker).

This is a known bug. When the HA cluster starts PostgreSQL, it performs crash
recovery with a recovery.conf in place, but in PG 9.2 recovery driven by
recovery.conf expects a database cluster that has not crashed.

Please see the following mailing list threads:
[HACKERS] [BUG?] lag of minRecoveryPont in archive recovery
[BUGS] PITR potentially broken in 9.2

This bug is fixed in the latest development code (the next release, 9.2.3).
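
(To see the running version and the minimum recovery point tracked in the control file on a standby, something like the following works; the paths are the Debian/Ubuntu defaults mentioned elsewhere in this thread:)

    # Exact server version:
    psql -At -c 'SHOW server_version;'

    # Minimum recovery point in the control file (run as the postgres user):
    /usr/lib/postgresql/9.2/bin/pg_controldata /var/lib/postgresql/9.2/main \
        | grep -i 'minimum recovery'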

(2013/01/15 5:02), Lonni J Friedman wrote:
> Your errors look somewhat similar to a problem I reported last week
> (no replies thus far):
> http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=tS6EsMHXXRRoByov3zn9pU4249w@mail.gmail.com
>
> *SNIP*
--
NTT OSS Center
Mitsumasa KONDO



Re: Postgres WAL Recovery Fails... And Then Works...

From:
Albe Laurenz
Date:
Phil Monroe wrote:
> So we had to failover and do a full base backup to get our slave database back online and ran into a
> interesting scenario. After copying the data directory, setting up the recovery.conf, and starting the
> slave database, the database crashes while replaying xlogs. However, trying to start the database
> again, the database is able to replay xlogs farther than it initially got, but ultimately ended up
> failing out again. After starting the DB a third time, PostgreSQL replays even further and catches up
> to the master to start streaming replication. Is this common and or acceptable?

Certainly not acceptable.

Just checking: When you were "copying the data directory",
did you have online backup mode enabled?
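
(On 9.2 an easy way to confirm that, while the copy is running, is to ask the master whether an exclusive online backup is in progress:)

    # Returns "t" between pg_start_backup() and pg_stop_backup():
    psql -At -c 'SELECT pg_is_in_backup();'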

Yours,
Laurenz Albe


Re: Postgres WAL Recovery Fails... And Then Works...

From:
Heikki Linnakangas
Date:
On 12.01.2013 04:32, Phil Monroe wrote:
> Hi Everyone,
>
> So we had to failover and do a full base backup to get our slave database back
> online and ran into a interesting scenario. After copying the data directory,
> setting up the recovery.conf, and starting the slave database, the database
> crashes while replaying xlogs. However, trying to start the database again, the
> database is able to replay xlogs farther than it initially got, but ultimately
> ended up failing out again. After starting the DB a third time, PostgreSQL
> replays even further and catches up to the master to start streaming
> replication. Is this common and or acceptable?

How did you perform the base backup? Did you use pg_basebackup? Or if
you did a filesystem-level copy, did you use pg_start/stop_backup
correctly? Did you take the base backup from the master server, or from
another slave?

This looks similar to the bug discussed here:
http://www.postgresql.org/message-id/CAMkU=1wpvYJVEDo6Qvq4QbosZ+AV6BMVCf+XVCG=mJqFRjQ8Pg@mail.gmail.com.
That was fixed in 9.2.2, so if you're using 9.2.1 or 9.2.0, try upgrading.
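
(For an Ubuntu 12.04 install like the one described elsewhere in the thread, a minor upgrade is a package update plus a cluster restart. A sketch, assuming the postgresql-9.2 packages come from the apt.postgresql.org/PGDG repository:)

    # Confirm the running version first:
    psql -At -c 'SELECT version();'

    # Minor upgrade in place (no dump/restore needed), then restart:
    sudo apt-get update
    sudo apt-get install --only-upgrade postgresql-9.2
    sudo pg_ctlcluster 9.2 main restart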

- Heikki


Re: Postgres WAL Recovery Fails... And Then Works...

From:
Phil Monroe
Date:
Sorry, my initial response got blocked since I replied with the logs quoted again.


> Also, which version of postgres are you using?

PostgreSQL 9.2.1 on Ubuntu 12.04.


> Except in my case no number of restarts helped.  You didn't say, were
> you explicitly copying $PGDATA or using some other mechanism to
> migrate the data elsewhere?

We have a very large database (~5TB), so we use a script to do parallel rsyncs to copy the data directory (https://gist.github.com/4477190/#file-pmrcp-rb); the whole copy ended up taking ~3.5 hours. So yes, we did a physical copy of $PGDATA (which is located at /var/lib/postgresql/9.2/main/ on both machines). We followed this process (a condensed shell sketch of steps 2-4 follows the list):

1. Master archives WAL files to Backup Host.
2. Execute on Master: psql -c "select pg_start_backup('DATE-slave-restore')"
3. Execute on Master: RCP='rsync -cav --inplace -e rsh' EXCLUDE='pg_xlog' pmrcp /var/lib/postgresql/9.2/main/ prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log
4. Execute on Master: psql -c "select pg_stop_backup()"
5. On Slave, set up recovery.conf to read the WAL archive on the Backup Host.
6. Execute on Slave: pg_ctlcluster 9.2 main start (as described in the initial email)
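
In case it helps, steps 2-4 boil down to roughly the following (pmrcp is the parallel-rsync wrapper from the gist above; host names, the backup label, and the log path are just what we happened to use, and error handling is omitted):

    # Step 2: put the master into online backup mode.
    psql -c "SELECT pg_start_backup('DATE-slave-restore');"

    # Step 3: parallel rsync of the data directory, excluding pg_xlog.
    RCP='rsync -cav --inplace -e rsh' EXCLUDE='pg_xlog' \
        pmrcp /var/lib/postgresql/9.2/main/ \
              prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log

    # Step 4: leave online backup mode; this waits until the required WAL
    # has been archived.
    psql -c "SELECT pg_stop_backup();"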


Best,
Phil


Re: Postgres WAL Recovery Fails... And Then Works...

From:
Phil Monroe
Date:

Also, which version of postgres are you using?

PostgreSQL 9.2.1 on Ubuntu 12.04


Except in my case no number of restarts helped.  You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?


So we have a very large database (~5TB), so we use a script to do parallel rsyncs to copy the data directory (https://gist.github.com/4477190/#file-pmrcp-rb). The whole copy process ended up taking ~3.5 hours. So we did a physical copy of $PGDATA (which is located at /var/lib/postgresql/9.2/main/ on both machines.). We followed the following process to do this:

1. Master archives WAL files to Backup Host.
2. Execute on Master: psql -c "select pg_start_backup('DATE-slave-restore')"
3. Execute on Master: RCP='rsync -cav --inplace -e rsh' EXCLUDE='pg_xlog' pmrcp /var/lib/postgresql/9.2/main/ prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log
4. Execute on Master: psql -c "select pg_stop_backup()"
5. On Slave, setup recovery.conf to read WAL archive on Backup Host
6. Execute on Slave:  pg_ctlcluster 9.2 main start (as described in initial email)


Best,
Phil



Lonni J Friedman wrote:

Your errors look somewhat similar to a problem I reported last week
(no replies thus far):
http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=tS6EsMHXXRRoByov3zn9pU4249w@mail.gmail.com

Except in my case no number of restarts helped.  You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?

Also, which version of postgres are you using?

On Fri, Jan 11, 2013 at 6:32 PM, Phil Monroe<phil@identified.com>  wrote:

*SNIP*

Re: Postgres WAL Recovery Fails... And Then Works...

From
Phil Monroe
Date:
Sorry, my initial response got blocked since I replied with the logs quoted again.


Also, which version of postgres are you using?


PostgreSQL 9.2.1 on Ubuntu 12.04



Except in my case no number of restarts helped.  You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?



We have a very large database (~5 TB), so we use a script to run parallel rsyncs of the data directory (https://gist.github.com/4477190/#file-pmrcp-rb); the whole copy ended up taking ~3.5 hours. So yes, it is a physical copy of $PGDATA (located at /var/lib/postgresql/9.2/main/ on both machines). The process we followed (a single-stream rsync equivalent of step 3 is sketched after the list):

1. Master archives WAL files to Backup Host.
2. Execute on Master: psql -c "select pg_start_backup('DATE-slave-restore')"
3. Execute on Master: RCP='rsync -cav --inplace -e rsh' EXCLUDE='pg_xlog' pmrcp /var/lib/postgresql/9.2/main/ prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log
4. Execute on Master: psql -c "select pg_stop_backup()"
5. On Slave, set up recovery.conf to read the WAL archive on the Backup Host
6. Execute on Slave:  pg_ctlcluster 9.2 main start (as described in initial email)
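
For anyone without the pmrcp script: it just fans the copy out over several parallel rsync workers, so step 3 is roughly equivalent to the single-stream command below. The flags come from the RCP/EXCLUDE values above; translating EXCLUDE into --exclude is my assumption about what pmrcp does with it.

# Single-stream equivalent of step 3 (no parallelism).
# pg_xlog is skipped because the standby gets its WAL from the
# archive and from streaming replication instead.
rsync -cav --inplace -e rsh --exclude=pg_xlog \
    /var/lib/postgresql/9.2/main/ \
    prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log

The -c flag compares file checksums rather than just timestamps and sizes, and --inplace updates files in place instead of writing temporary copies, which is presumably why it was chosen for a ~5 TB copy.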


Best,
Phil