Re: Postgres WAL Recovery Fails... And Then Works...

From: Phil Monroe
Subject: Re: Postgres WAL Recovery Fails... And Then Works...
Date:
Msg-id: 50F49885.8070905@identified.com
In reply to: Re: Postgres WAL Recovery Fails... And Then Works... (Lonni J Friedman <netllama@gmail.com>)
List: pgsql-admin

> Also, which version of postgres are you using?

PostgreSQL 9.2.1 on Ubuntu 12.04


> Except in my case no number of restarts helped.  You didn't say, were
> you explicitly copying $PGDATA or using some other mechanism to
> migrate the data elsewhere?


We have a very large database (~5TB), so we use a script that runs parallel rsyncs to copy the data directory (https://gist.github.com/4477190/#file-pmrcp-rb); the whole copy took ~3.5 hours. So yes, we did a physical copy of $PGDATA (located at /var/lib/postgresql/9.2/main/ on both machines). We followed this process:

1. Master archives WAL files to Backup Host.
2. Execute on Master: psql -c "select pg_start_backup('DATE-slave-restore')"
3. Execute on Master: RCP='rsync -cav --inplace -e rsh' EXCLUDE='pg_xlog' pmrcp /var/lib/postgresql/9.2/main/ prd-db-01:/var/lib/postgresql/9.2/main/ > /tmp/backup.log
4. Execute on Master: psql -c "select pg_stop_backup()"
5. On Slave, set up recovery.conf to read the WAL archive on the Backup Host (a sketch of the file follows below)
6. Execute on Slave:  pg_ctlcluster 9.2 main start (as described in initial email)
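For reference, the recovery.conf in step 5 looked roughly like the sketch below. This is a reconstruction rather than the exact file: the primary hostname and replication user are placeholders, and the restore_command is inferred from the "unpigz: /mnt/db/wals/..." lines in the logs, which suggest the WAL archive lives under /mnt/db/wals and is pigz-compressed.

# recovery.conf on the Slave (PostgreSQL 9.2) -- minimal sketch
standby_mode = 'on'
# host and user are placeholders, not our real settings:
primary_conninfo = 'host=MASTER-HOST port=5432 user=replication'
# Inferred from the unpigz log lines; the exact decompression
# command is an assumption:
restore_command = 'unpigz -c /mnt/db/wals/%f > "%p"'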


Best,
Phil



Lonni J Friedman wrote:

Your errors look somewhat similar to a problem I reported last week
(no replies thus far):
http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=tS6EsMHXXRRoByov3zn9pU4249w@mail.gmail.com

Except in my case no number of restarts helped.  You didn't say, were
you explicitly copying $PGDATA or using some other mechanism to
migrate the data elsewhere?

Also, which version of postgres are you using?

On Fri, Jan 11, 2013 at 6:32 PM, Phil Monroe <phil@identified.com> wrote:

Hi Everyone,

So we had to fail over and do a full base backup to get our slave database
back online, and ran into an interesting scenario. After copying the data
directory, setting up the recovery.conf, and starting the slave database,
the database crashed while replaying xlogs. On the second start, the
database replayed xlogs farther than it initially got, but ultimately
failed out again. After starting the DB a third time, PostgreSQL replayed
even further and caught up to the master to start streaming replication.
Is this common and/or acceptable?

The error messages that cause PostgreSQL to stop replaying xlogs:

ERROR #1:
2013-01-12 00:17:17 UTC::@:[61150]: WARNING:  page 2387476480 of relation
base/16408/18967399 does not exist
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:17 UTC::@:[61150]: PANIC:  WAL contains references to
invalid pages
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  startup process (PID 61150) was
terminated by signal 6: Aborted

ERROR #2:
2013-01-12 01:02:17 UTC::@:[36831]: WARNING:  page 2387607552 of relation
base/16408/18967399 does not exist
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:17 UTC::@:[36831]: PANIC:  WAL contains references to
invalid pages
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  startup process (PID 36831) was
terminated by signal 6: Aborted


Fortunately, these errors only pertain to indexes, which can be rebuilt.
But is this a sign that the data directory on the slave is corrupt?
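For anyone hitting the same errors: the triplets in the CONTEXT lines are
tablespace/database/relfilenode, so 1663/16408/22892842 is relfilenode
22892842 in database 16408 in the default tablespace. A sketch of the
catalog lookup to turn that into an index name, plus the rebuild (the
index name is whatever the query returns, and REINDEX has to run on a
node that accepts writes, i.e. the master or a promoted slave):

-- Run in database 16408 to resolve the relfilenode from the error.
SELECT n.nspname, c.relname
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.relfilenode = 22892842;

-- Then, on a writable node (schema/index name below is hypothetical):
REINDEX INDEX some_schema.some_index;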

Here is a full timeline of the logs highlighting the issues:

1. Data Directory Copy Finishes.
2. Recovery.conf Setup

3. Start #1:

2013-01-12 00:14:23 UTC::@:[61150]: LOG:  database system was shut down in
recovery at 2013-01-12 00:14:06 UTC
2013-01-12 00:14:23 UTC:[local]:[unknown]@[unknown]:[61151]: LOG:
incomplete startup packet
2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61152]: FATAL:  the
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  entering standby mode
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file
"0000000900008E45000000B8" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file
"0000000900008E450000008B" from archive
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  could not open directory
"pg_snapshots": No such file or directory
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  redo starts at 8E45/8B173180
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  consistent recovery state
reached at 8E45/8B174840
2013-01-12 00:14:23 UTC:[local]:postgres@postgres:[61175]: FATAL:  the
database system is starting up
2013-01-12 00:14:23 UTC::@:[61150]: LOG:  restored log file
"0000000900008E450000008C" from archive
2013-01-12 00:14:24 UTC::@:[61150]: LOG:  restored log file
"0000000900008E450000008D" from archive

*SNIP*

2013-01-12 00:17:17 UTC::@:[61150]: LOG:  restored log file
"0000000900008E4800000066" from archive
2013-01-12 00:17:17 UTC::@:[61150]: LOG:  restored log file
"0000000900008E4800000067" from archive
2013-01-12 00:17:17 UTC::@:[61150]: WARNING:  page 2387476480 of relation
base/16408/18967399 does not exist
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:17 UTC::@:[61150]: PANIC:  WAL contains references to
invalid pages
2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT:  xlog redo delete: index
1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  startup process (PID 61150) was
terminated by signal 6: Aborted
2013-01-12 00:17:18 UTC::@:[61121]: LOG:  terminating any other active
server processes


4. PostgreSQL shuts down...
5. Debugging logs enabled in postgresql.conf.


6. Start #2:

2013-01-12 00:51:48 UTC::@:[36831]: LOG:  database system was interrupted
while in recovery at log time 2013-01-11 18:05:31 UTC
2013-01-12 00:51:48 UTC::@:[36831]: HINT:  If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.
2013-01-12 00:51:48 UTC:[local]:[unknown]@[unknown]:[36832]: LOG:
incomplete startup packet
2013-01-12 00:51:48 UTC:[local]:postgres@postgres:[36833]: FATAL:  the
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  entering standby mode
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  restored log file
"0000000900008E45000000B8" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  checkpoint record is at
8E45/B80AF650
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  redo record is at
8E45/8B173180; shutdown FALSE
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  next transaction ID:
0/552803703; next OID: 24427698
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  next MultiXactId: 109027; next
MultiXactOffset: 2442921
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  oldest unfrozen transaction
ID: 3104202601, in database 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  transaction ID wrap limit is
956718952, limited by database with OID 16408
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  resetting unlogged relations:
cleanup 1 init 0
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  could not open directory
"pg_snapshots": No such file or directory
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  initializing for hot standby
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  restored log file
"0000000900008E450000008B" from archive
2013-01-12 00:51:48 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:51:48 UTC::@:[36831]: LOG:  redo starts at 8E45/8B173180
2013-01-12 00:51:49 UTC::@:[36831]: LOG:  restored log file
"0000000900008E450000008C" from archive

*SNIP*

2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000062" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000063" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:45 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000064" from archive
2013-01-12 00:54:45 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000065" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000066" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000067" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  consistent recovery state
reached at 8E48/67AC4E28
2013-01-12 00:54:46 UTC::@:[36796]: LOG:  database system is ready to
accept read only connections
2013-01-12 00:54:46 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000068" from archive
2013-01-12 00:54:46 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4800000069" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file
"0000000900008E480000006A" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file
"0000000900008E480000006B" from archive
2013-01-12 00:54:47 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 00:54:47 UTC::@:[36831]: LOG:  restored log file
"0000000900008E480000006C" from archive

*SNIP*

2013-01-12 01:02:16 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4F00000079" from archive
2013-01-12 01:02:16 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: LOG:  restored log file
"0000000900008E4F0000007A" from archive
2013-01-12 01:02:17 UTC::@:[36831]: DEBUG:  got WAL segment from archive
2013-01-12 01:02:17 UTC::@:[36831]: WARNING:  page 2387607552 of relation
base/16408/18967399 does not exist
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:17 UTC::@:[36831]: PANIC:  WAL contains references to
invalid pages
2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT:  xlog redo delete: index
1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  startup process (PID 36831) was
terminated by signal 6: Aborted
2013-01-12 01:02:20 UTC::@:[36796]: LOG:  terminating any other active
server processes


7. PostgreSQL shuts down...


8. Start #3:

2013-01-12 01:04:04 UTC::@:[61742]: LOG:  database system was interrupted
while in recovery at log time 2013-01-11 19:50:31 UTC
2013-01-12 01:04:04 UTC::@:[61742]: HINT:  If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.
2013-01-12 01:04:04 UTC:[local]:[unknown]@[unknown]:[61743]: LOG:
incomplete startup packet
2013-01-12 01:04:04 UTC:[local]:postgres@postgres:[61744]: FATAL:  the
database system is starting up
unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  entering standby mode
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4A00000039" from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  checkpoint record is at
8E4A/39CD4BA0
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  redo record is at
8E4A/19F0D210; shutdown FALSE
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  next transaction ID:
0/552859005; next OID: 24427698
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  next MultiXactId: 109221; next
MultiXactOffset: 2443321
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  oldest unfrozen transaction
ID: 3104202601, in database 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  transaction ID wrap limit is
956718952, limited by database with OID 16408
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  resetting unlogged relations:
cleanup 1 init 0
2013-01-12 01:04:04 UTC::@:[61742]: LOG:  could not open directory
"pg_snapshots": No such file or directory
2013-01-12 01:04:04 UTC::@:[61742]: DEBUG:  initializing for hot standby
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4A00000019" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  redo starts at 8E4A/19F0D210
2013-01-12 01:04:05 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4A0000001A" from archive
2013-01-12 01:04:05 UTC::@:[61742]: DEBUG:  got WAL segment from archive

*SNIP*

2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F00000077" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F00000078" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:37 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F00000079" from archive
2013-01-12 01:09:37 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F0000007A" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  consistent recovery state
reached at 8E4F/7A22BD08
2013-01-12 01:09:38 UTC::@:[61729]: LOG:  database system is ready to
accept read only connections
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F0000007B" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F0000007C" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:38 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F0000007D" from archive
2013-01-12 01:09:38 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:09:39 UTC::@:[61742]: LOG:  restored log file
"0000000900008E4F0000007E" from archive


*SNIP*

2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file
"0000000900008E53000000D9" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file
"0000000900008E53000000DA" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:29 UTC::@:[61742]: LOG:  restored log file
"0000000900008E53000000DB" from archive
2013-01-12 01:14:29 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  restored log file
"0000000900008E53000000DC" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG:  got WAL segment from archive
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  restored log file
"0000000900008E53000000DD" from archive
2013-01-12 01:14:30 UTC::@:[61742]: DEBUG:  got WAL segment from archive
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[61742]: LOG:  unexpected pageaddr 8E4A/1000000
in log file 36435, segment 222, offset 0
unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
2013-01-12 01:14:30 UTC::@:[19504]: LOG:  streaming replication
successfully connected to primary
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=1
file=base/16408/22873432 time=2.538 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=2
file=base/16408/18967506 time=12.054 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=3
file=base/16408/18967506_fsm time=0.095 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=4
file=base/16408/22873244 time=0.144 msec
2013-01-12 01:18:08 UTC::@:[61772]: DEBUG:  checkpoint sync: number=5
file=base/16408/22892823 time=0.087 msec

9. Slave DB connected to streaming replication with Master DB and all
seems fine.
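A quick way to double-check that "all seems fine" really means caught up
is to compare the receive/replay positions on the slave and look at
pg_stat_replication on the master (function and column names as of 9.2):

-- On the Slave:
SELECT pg_last_xlog_receive_location(),
       pg_last_xlog_replay_location(),
       pg_last_xact_replay_timestamp();

-- On the Master:
SELECT client_addr, state, sent_location, replay_location
  FROM pg_stat_replication;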


Any help would be appreciated. Thanks!
