Discussion: BUG #4854: Problems with replaying WAL files on Warm Standby
The following bug has been logged online:

Bug reference:      4854
Logged by:          Keith Pierno
Email address:      kpierno@lulu.com
PostgreSQL version: 8.3.6
Operating system:   Red Hat Enterprise Linux AS release 4 (Nahant Update 6) PPC64
Description:        Problems with replaying WAL files on Warm Standby
Details:

We experienced a failure of our primary database server. After fixing the
hardware issue I attempted to turn the old primary into a warm standby
slave. I used a hotbackup from the new primary server and restored the
files to the slave. I then removed all references to old WAL files and
used rsync to copy the appropriate WAL files from our archive location.
During the log playback I received the following errors:

2009-06-15 13:25:11 EDT <::> LOG:  restored log file "00000004000007460000007F" from archive
2009-06-15 13:25:11 EDT <::> LOG:  restored log file "000000040000074600000080" from archive
2009-06-15 13:25:11 EDT <::> LOG:  restored log file "000000040000074600000081" from archive
2009-06-15 13:25:11 EDT <::> LOG:  restored log file "000000040000074600000082" from archive
2009-06-15 13:25:12 EDT <::> LOG:  restored log file "000000040000074600000083" from archive
2009-06-15 13:25:12 EDT <::> LOG:  restored log file "000000040000074600000084" from archive
2009-06-15 13:25:12 EDT <::> PANIC:  unexpected timeline ID 6 (after 4) in checkpoint record
2009-06-15 13:25:12 EDT <::> CONTEXT:  xlog redo checkpoint: redo 746/84FFF010; tli 6; xid 0/50577196; oid 31395572; multi 6542; offset 13124; shutdown
2009-06-15 13:25:13 EDT <::> LOG:  startup process (PID 27836) was terminated by signal 6: Aborted
2009-06-15 13:25:13 EDT <::> LOG:  aborting startup due to startup process failure

This backup was created at 0205 EDT and the WAL in question was from 0743 EDT.
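For readers reconstructing this setup: a warm standby on 8.3 is typically
driven by a restore_command in recovery.conf on the standby, along the
lines of the sketch below. The trigger-file path and the /pgwalarchives
archive location are illustrative assumptions; the thread does not show
the actual configuration.

    # recovery.conf on the standby -- a minimal sketch, not Keith's actual file.
    # pg_standby (a contrib program in 8.3) polls the archive directory and
    # waits until the requested segment appears; the trigger path is assumed.
    restore_command = 'pg_standby -t /tmp/pgsql.trigger /pgwalarchives %f %p'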
"Keith Pierno" <kpierno@lulu.com> writes: > 2009-06-15 13:25:12 EDT <::> PANIC: unexpected timeline ID 6 (after 4) in > checkpoint record Hmm. It's complaining because it didn't find timeline 6 mentioned in the timeline history file it read (if any). Maybe you forgot to archive or restore the NNNNNNNN.history file? regards, tom lane
Keith Pierno <kpierno@lulu.com> writes:
> Thanks for the prompt response. If such a file is needed for recovery
> it was never created by postgres. The current archiving process uses
> rsync to archive the WAL files to a shared archive area. In the past
> and on my other cluster we do not see .history files on the primary
> server and have been able to recover without them.

The .history file would have been created on the slave at the time it
became new master (and started its own timeline).

			regards, tom lane
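In a pg_standby-style setup, the promotion Tom describes is triggered by
creating the trigger file: recovery then ends, the server selects a new
timeline ID, and it writes the corresponding .history file into pg_xlog,
from where it should be archived like any other WAL file. A sketch,
assuming the trigger path from the earlier example:

    # on the standby being promoted -- creating the trigger file tells
    # pg_standby to stop waiting for more WAL; recovery finishes, a new
    # timeline is chosen, and pg_xlog/NNNNNNNN.history is written
    touch /tmp/pgsql.trigger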
Thanks for the prompt response. If such a file is needed for recovery it
was never created by postgres. The current archiving process uses rsync to
archive the WAL files to a shared archive area. In the past and on my
other cluster we do not see .history files on the primary server and have
been able to recover without them.

If it helps I can send a copy of the WAL files and my postgres.conf files.

Keith

Tom Lane wrote:
> "Keith Pierno" <kpierno@lulu.com> writes:
>> 2009-06-15 13:25:12 EDT <::> PANIC: unexpected timeline ID 6 (after 4) in
>> checkpoint record
>
> Hmm.  It's complaining because it didn't find timeline 6 mentioned in the
> timeline history file it read (if any).  Maybe you forgot to archive or
> restore the NNNNNNNN.history file?
>
> 			regards, tom lane
I was able to find the file, which was from the date of the failure
(Tuesday June 9th). My question is why the backup from this morning would
all of a sudden require a file from the 9th even though 5 hours of logs
were able to be applied?

I ended up doing another hotbackup and restoring the database from this
afternoon, and this seems to be working fine. I am just worried that
another WAL file will throw the same error and I will be without a warm
standby.

Thanks again,
Keith

Tom Lane wrote:
> Keith Pierno <kpierno@lulu.com> writes:
>> Thanks for the prompt response. If such a file is needed for recovery
>> it was never created by postgres. The current archiving process uses
>> rsync to archive the WAL files to a shared archive area. In the past
>> and on my other cluster we do not see .history files on the primary
>> server and have been able to recover without them.
>
> The .history file would have been created on the slave at the time it
> became new master (and started its own timeline).
>
> 			regards, tom lane
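For reference, the "hotbackup" procedure described here is, on 8.3, the
pg_start_backup/pg_stop_backup file-system copy. A minimal sketch, run on
the new primary; the data-directory path and the backup label are
assumptions, not details from the thread:

    # on db01sec (the new primary); paths and label are illustrative
    psql -U postgres -c "SELECT pg_start_backup('rebuild-db01pri');"
    rsync -a --exclude=pg_xlog /var/lib/pgsql/data/ db01pri:/var/lib/pgsql/data/
    psql -U postgres -c "SELECT pg_stop_backup();"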
Keith Pierno <kpierno@lulu.com> writes:
> I was able to find the file, which was from the date of the failure
> (Tuesday June 9th). My question is why the backup from this morning
> would all of a sudden require a file from the 9th even though 5 hours
> of logs were able to be applied?

It was apparently busy applying the logs that were from the old primary.
When it got up to the point where the failover occurred, kaboom :-(.

It would have had trouble in any case there because the subsequent log
files are presumably numbered 00000006... instead of 00000004...
Were you getting those shipped over properly from the new primary?

			regards, tom lane
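To unpack Tom's point: a WAL segment name is three 8-hex-digit fields, and
the first field is the timeline ID, so segments written after a promotion
to timeline 6 would start with 00000006. A quick illustrative decode in
the shell:

    # 24-character WAL segment name = <timeline><log><segment>, 8 hex digits each
    # 000000040000074600000084 -> timeline 4, log 0x746, segment 0x84
    name=000000040000074600000084
    echo "tli=${name:0:8} log=${name:8:8} seg=${name:16:8}"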
The backup used was from well after the failover time, which is why I was
concerned. Interestingly enough the logs are still all prefixed with
00000004... That just makes this problem extremely bizarre.

Current log listing:

[postgres@db01pri pgwalarchives]# ls -ltr
total 82028
-rw------- 1 postgres postgres      250 Jun 15 02:05 0000000400000745000000DB.00F290D8.backup
-rw------- 1 postgres postgres      250 Jun 15 13:28 000000040000074700000065.00D48908.backup
-rw------- 1 postgres postgres 16777216 Jun 15 15:58 0000000400000747000000E4
-rw------- 1 postgres postgres 16777216 Jun 15 15:59 0000000400000747000000E5
-rw------- 1 postgres postgres 16777216 Jun 15 16:02 0000000400000747000000E6
-rw------- 1 postgres postgres 16777216 Jun 15 16:04 0000000400000747000000E7
-rw------- 1 postgres postgres 16777216 Jun 15 16:07 0000000400000747000000E8
[postgres@db01pri pgwalarchives]#

Keith

Tom Lane wrote:
> Keith Pierno <kpierno@lulu.com> writes:
>> I was able to find the file, which was from the date of the failure
>> (Tuesday June 9th). My question is why the backup from this morning
>> would all of a sudden require a file from the 9th even though 5 hours
>> of logs were able to be applied?
>
> It was apparently busy applying the logs that were from the old primary.
> When it got up to the point where the failover occurred, kaboom :-(.
>
> It would have had trouble in any case there because the subsequent log
> files are presumably numbered 00000006... instead of 00000004...
> Were you getting those shipped over properly from the new primary?
>
> 			regards, tom lane
Keith Pierno <kpierno@lulu.com> writes:
> The backup used was from well after the failover time which is why I
> was concerned. Interestingly enough the logs are still all prefixed
> with 00000004... That just makes this problem extremely bizarre.

Hmm, that *is* weird.  It seems like the new primary must have reverted
its decision to go from timeline 4 to timeline 6.  (Which in itself is a
bit odd; why not timeline 5?)

Can you give us an exact sequence of events on the slave server/new
primary around the time of the failover?  Also, what was in the .history
file when you found it, and are there any other history files?

			regards, tom lane
The timeline for the events, all dates MM/DD/YYYY:

   06/09/2009 1310 EDT - Hardware fault on primary database server db01pri
   06/09/2009 1325 EDT - Failover to warm standby db01sec
   06/12/2009 1615 EDT - db01pri server fixed and OS booted
   06/15/2009 1115 EDT - Started recovery of hotbackup from 06/15/2009 0205 EDT from db01sec onto db01pri
   06/15/2009 1320 EDT - Attempted to start postgres on db01pri in warm standby mode
   06/15/2009 1325 EDT - Failure to apply WAL log errors with "unexpected timeline ID"
   06/15/2009 1340 EDT - Started a new hotbackup on db01sec
   06/15/2009 1545 EDT - Started recovery of hotbackup from 06/15/2009 1340 to db01pri
   06/15/2009 1430 EDT - db01pri recovered and running in warm standby

Here is the contents of the pg_xlog directory and the 00000004.history file:

[postgres@db01pri ~]$ cat 00000004.history
1   0000000100000736000000A1   before transaction 0 at 1999-12-31 19:00:00-05

[postgres@db01pri ~]$ ls -l
total 98468
-rw------- 1 postgres postgres       74 Jul 10  2008 00000002.history
-rw------- 1 postgres postgres       74 Jun  9 13:29 00000003.history
-rw------- 1 postgres postgres 16777216 Jun 16 08:45 0000000400000749000000C9
-rw------- 1 postgres postgres 16777216 Jun 16 08:46 0000000400000749000000CA
-rw------- 1 postgres postgres 16777216 Jun 16 08:47 0000000400000749000000CB
-rw------- 1 postgres postgres       74 Jun  9 13:33 00000004.history
drwxr-xr-x 2 postgres postgres    32768 Jun 16 08:46 archive_status
-rw------- 1 postgres postgres 16777216 Jun  9 13:45 xlogtemp.17243
-rw------- 1 postgres postgres 16777216 Jun  9 13:45 xlogtemp.17244
-rw------- 1 postgres postgres 16777216 Jun  9 13:52 xlogtemp.17397
[postgres@db01pri ~]$

Thanks again,
Keith

Tom Lane wrote:
> Keith Pierno <kpierno@lulu.com> writes:
>> The backup used was from well after the failover time which is why I
>> was concerned. Interestingly enough the logs are still all prefixed
>> with 00000004... That just makes this problem extremely bizarre.
>
> Hmm, that *is* weird.  It seems like the new primary must have reverted
> its decision to go from timeline 4 to timeline 6.  (Which in itself is a
> bit odd; why not timeline 5?)
>
> Can you give us an exact sequence of events on the slave server/new
> primary around the time of the failover?  Also, what was in the .history
> file when you found it, and are there any other history files?
>
> 			regards, tom lane
Keith Pierno <kpierno@lulu.com> writes:
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
> <html>
> <head>

BTW, please do not post all-HTML mail to this list.

> The timeline for the events all dates MM/DD/YYYY
>
> 06/09/2009 1310 EDT - Hardware fault on primary database server db01pri
> 06/09/2009 1325 EDT - Failover to warm standby db01sec

Hmm, how exactly was the failover managed?

> [postgres@db01pri ~]$ ls -l
> total 98468
> -rw------- 1 postgres postgres       74 Jul 10  2008 00000002.history
> -rw------- 1 postgres postgres       74 Jun  9 13:29 00000003.history
> -rw------- 1 postgres postgres 16777216 Jun 16 08:45 0000000400000749000000C9
> -rw------- 1 postgres postgres 16777216 Jun 16 08:46 0000000400000749000000CA
> -rw------- 1 postgres postgres 16777216 Jun 16 08:47 0000000400000749000000CB
> -rw------- 1 postgres postgres       74 Jun  9 13:33 00000004.history
> drwxr-xr-x 2 postgres postgres    32768 Jun 16 08:46 archive_status
> -rw------- 1 postgres postgres 16777216 Jun  9 13:45 xlogtemp.17243
> -rw------- 1 postgres postgres 16777216 Jun  9 13:45 xlogtemp.17244
> -rw------- 1 postgres postgres 16777216 Jun  9 13:52 xlogtemp.17397

The xlogtemp files are a bit suspicious; they shouldn't be there.  The
timestamps suggest that something odd was going on right after the
failover.  Do you have the slave's postmaster log from that time interval,
and if so, does it show anything out of the ordinary?

			regards, tom lane