Re: Postgres WAL Recovery Fails... And Then Works...

From: Phil Monroe
Subject: Re: Postgres WAL Recovery Fails... And Then Works...
Date:
Msg-id: 50F49885.8070905@identified.com
In reply to: Re: Postgres WAL Recovery Fails... And Then Works... (Lonni J Friedman <netllama@gmail.com>)
List: pgsql-admin

> Also, which version of postgres are you using?

PostgreSQL 9.2.1 on Ubuntu 12.04


> Except in my case no number of restarts helped.  You didn't say, were
> you explicitly copying $PGDATA or using some other mechanism to
> migrate the data elsewhere?


We have a very large database (~5TB), so we use a script that runs parallel rsyncs to copy the data directory (https://gist.github.com/4477190/#file-pmrcp-rb); the whole copy took ~3.5 hours. So yes, we did a physical copy of $PGDATA (located at /var/lib/postgresql/9.2/main/ on both machines). We followed this process:

1. Master archives WAL files to Backup Host.
2. Execute on Master: psql -c "select pg_start_backup('DATE-slave-restore')"
3. Execute on Master: RCP='rsync -cav --inplace -e rsh' EXCLUDE='pg_xlog' pmrcp /var/lib/postgresql/9.2/main/ prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log
4. Execute on Master: psql -c "select pg_stop_backup()"
5. On Slave, set up recovery.conf to read the WAL archive on the Backup Host (a sketch of the file follows below)
6. Execute on Slave:  pg_ctlcluster 9.2 main start (as described in initial email)
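For reference, the recovery.conf in step 5 looked roughly like the sketch below. This is a reconstruction rather than the exact file: the primary hostname and replication user are placeholders, and the restore_command is inferred from the "unpigz: /mnt/db/wals/..." lines in the logs, which suggest the WAL archive lives under /mnt/db/wals and is pigz-compressed.

# recovery.conf on the Slave (PostgreSQL 9.2) -- minimal sketch
standby_mode = 'on'
# host and user are placeholders, not our real settings:
primary_conninfo = 'host=MASTER-HOST port=5432 user=replication'
# Inferred from the unpigz log lines; the exact decompression
# command is an assumption:
restore_command = 'unpigz -c /mnt/db/wals/%f > "%p"'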


Best,
Phil



Lonni J Friedman wrote:

Your errors look somewhat similar to a problem I reported last week
(no replies thus far):
http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=tS6EsMHXXRRoByov3zn9pU4249w@mail.gmail.com

Except in my case no number of restarts helped.  You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?

Also, which version of postgres are you using?

On Fri, Jan 11, 2013 at 6:32 PM, Phil Monroe <phil@identified.com> wrote:

Hi Everyone,

So we had to fail over and do a full base backup to get our slave database
back online, and ran into an interesting scenario. After copying the data
directory, setting up the recovery.conf, and starting the slave database,
the database crashed while replaying xlogs. On the second start, the
database replayed xlogs farther than it initially got, but ultimately
failed out again. After starting the DB a third time, PostgreSQL replayed
even further and caught up to the master to start streaming replication.
Is this common and/or acceptable?

The error messages that cause PostgreSQL to stop replaying xlogs:

ERROR #1:
2013-01-12 00:17:17 UTC::@:[61150]: WARNING:  page 2387476480 of relation
base/16408/18967399 does not exist
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:17 UTC::@:[61150]: PANIC:  WAL contains references to
invalid pages
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  startup process (PID 61150) was
terminated by signal 6: Aborted

ERROR #2:
2013-01-12 01:02:17 UTC::@:[36831]: WARNING:  page 2387607552 of relation
base/16408/18967399 does not exist
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:17 UTC::@:[36831]: PANIC:  WAL contains references to
invalid pages
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  startup process (PID 36831) was
terminated by signal 6: Aborted


Fortunately, these errors only pertain to indexes, which can be rebuilt.
But is this a sign that the data directory on the slave is corrupt?
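For anyone hitting the same errors: the triplets in the CONTEXT lines are
tablespace/database/relfilenode, so 1663/16408/22892842 is relfilenode
22892842 in database 16408 in the default tablespace. A sketch of the
catalog lookup to turn that into an index name, plus the rebuild (the
index name is whatever the query returns, and REINDEX has to run on a
node that accepts writes, i.e. the master or a promoted slave):

-- Run in database 16408 to resolve the relfilenode from the error.
SELECT n.nspname, c.relname
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relfilenode = 22892842;

-- Then, on a writable node (schema/index name below is hypothetical):
REINDEX INDEX some_schema.some_index;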

Here is a full timeline of the logs highlighting the issues:

1. Data Directory Copy Finishes.
2. Recovery.conf Setup

3. Start #1:

2013-01-12 00:14:23 UTC::@:[61150]: LOG:  database system was shut down in
recovery at 2013-01-12 00:14:06 UTC
2013-01-12 00:14:23 UTC:[local]:[unknown]@[unknown]:[61151]: LOG:
incomplete startup packet
2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61152]: FATAL:  the
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  entering standby mode
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file
"0000000900008E45000000B8" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file
"0000000900008E450000008B" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  could not open directory
"pg_snapshots": No such file or directory
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  redo starts at 8E45/8B173180
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  consistent recovery state
reached at 8E45/8B174840
2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61175]: FATAL:  the
database system is starting up
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file
"0000000900008E450000008C" from archive
2013-01-12 00:14:24 UTC::@:[61150]: LOG:  restored log file
"0000000900008E450000008D" from archive

*SNIP*

2013-01-12 00:17:17 UTC::@:[61150]: LOG:  restored log file
"0000000900008E4800000066" from archive
2013-01-12 00:17:17 UTC::@:[61150]: LOG:  restored log file
"0000000900008E4800000067" from archive
2013-01-12 00:17:17 UTC::@:[61150]: WARNING:  page 2387476480 of relation
base/16408/18967399 does not exist
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:17 UTC::@:[61150]: PANIC:  WAL contains references to
invalid pages
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  startup process (PID 61150) was
terminated by signal 6: Aborted
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  terminating any other active
server processes


4. PostgreSQL shuts down...
5. Debugging logs enabled in postgresql.conf.


6. Start #2:

2013-01-12 00:51:48 UTC::@:[36831]: LOG:  database system was interrupted
while in recovery at log time 2013-01-11 18:05:31 UTC
2013-01-12 00:51:48 UTC::@:[36831]: HINT:  If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.
2013-01-12 00:51:48 UTC:[local]:[unknown]@[unknown]:[36832]: LOG:
incomplete startup packet
2013-01-12 00:51:48 UTC:[local]:postgres@postgres:[36833]: FATAL:  the
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  entering standby mode
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  restored log file
"0000000900008E45000000B8" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  checkpoint record is at
8E45/B80AF650
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  redo record is at
8E45/8B173180; shutdown FALSE
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  next transaction ID:
0/552803703; next OID: 24427698
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  next MultiXactId: 109027; next
MultiXactOffset: 2442921
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  oldest unfrozen transaction
ID: 3104202601, in database 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  transaction ID wrap limit is
956718952, limited by database with OID 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  resetting unlogged relations:
cleanup 1 init 0
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  could not open directory
"pg_snapshots": No such file or directory
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  initializing for hot standby
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  restored log file
"0000000900008E450000008B" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  redo starts at 8E45/8B173180
2013-01-12 00:51:49 UTC::@:[36831]: LOG:  restored log file
"0000000900008E450000008C" from archive

*SNIP*

2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000062" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000063" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000064" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000065" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000066" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000067" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  consistent recovery state
reached at 8E48/67AC4E28
2013-01-12 00:54:46 UTC::@:[36796]: LOG:  database system is ready to
accept read only connections
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000068" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000069" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file
"0000000900008E480000006A" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file
"0000000900008E480000006B" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file
"0000000900008E480000006C" from archive

*SNIP*

2013-01-12 01:02:16 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4F00000079" from archive
2013-01-12 01:02:16 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4F0000007A" from archive
2013-01-12 01:02:17 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: WARNING:  page 2387607552 of relation
base/16408/18967399 does not exist
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:17 UTC::@:[36831]: PANIC:  WAL contains references to
invalid pages
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  startup process (PID 36831) was
terminated by signal 6: Aborted
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  terminating any other active
server processes


7. PostgreSQL shuts down...


8. Start #3:

2013-01-12 01:04:04 UTC::@:[61742]: LOG:  database system was interrupted
while in recovery at log time 2013-01-11 19:50:31 UTC
2013-01-12 01:04:04 UTC::@:[61742]: HINT:  If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.
2013-01-12 01:04:04 UTC:[local]:[unknown]@[unknown]:[61743]: LOG:
incomplete startup packet
2013-01-12 01:04:04 UTC:[local]:postgres@postgres:[61744]: FATAL:  the
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  entering standby mode
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4A00000039" from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  checkpoint record is at
8E4A/39CD4BA0
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  redo record is at
8E4A/19F0D210; shutdown FALSE
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  next transaction ID:
0/552859005; next OID: 24427698
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  next MultiXactId: 109221; next
MultiXactOffset: 2443321
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  oldest unfrozen transaction
ID: 3104202601, in database 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  transaction ID wrap limit is
956718952, limited by database with OID 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  resetting unlogged relations:
cleanup 1 init 0
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  could not open directory
"pg_snapshots": No such file or directory
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  initializing for hot standby
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4A00000019" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  redo starts at 8E4A/19F0D210
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4A0000001A" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG:  got WAL segment from archive

*SNIP*

2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F00000077" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F00000078" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F00000079" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F0000007A" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  consistent recovery state
reached at 8E4F/7A22BD08
2013-01-12 01:09:38 UTC::@:[61729]: LOG:  database system is ready to
accept read only connections
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F0000007B" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F0000007C" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F0000007D" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:39 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F0000007E" from archive


*SNIP*

2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file
"0000000900008E53000000D9" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file
"0000000900008E53000000DA" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file
"0000000900008E53000000DB" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  restored log file
"0000000900008E53000000DC" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  restored log file
"0000000900008E53000000DD" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG:  got WAL segment from archive
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  unexpected pageaddr 8E4A/1000000
in log file 36435, segment 222, offset 0
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[19504]: LOG:  streaming replication
successfully connected to primary
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=1
file=base/16408/22873432 time=2.538 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=2
file=base/16408/18967506 time=12.054 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=3
file=base/16408/18967506_fsm time=0.095 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=4
file=base/16408/22873244 time=0.144 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=5
file=base/16408/22892823 time=0.087 msec

9. Slave DB connected to streaming replication with Master DB and all
seems fine.
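A quick way to double-check that "all seems fine" really means caught up
is to compare the receive/replay positions on the slave and look at
pg_stat_replication on the master (function and column names as of 9.2):

-- On the Slave:
SELECT pg_last_xlog_receive_location(),
       pg_last_xlog_replay_location(),
       pg_last_xact_replay_timestamp();

-- On the Master:
SELECT client_addr, state, sent_location, replay_location
  FROM pg_stat_replication;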


Any help would be appreciated. Thanks!
