unable to fail over to warm standby server
From: Mason Hale
Subject: unable to fail over to warm standby server
Date:
Msg-id: 1e85dd391001271523i2348d739y998bdd814b6a50c0@mail.gmail.com
Responses: Re: unable to fail over to warm standby server
List: pgsql-bugs
Hello --

We are using PostgreSQL 8.3.8 with a Warm Standby (PITR) setup. Recently we experienced a failure on our primary database server, and when we attempted to fail over to the standby server it would not come up. This configuration has been tested previously (we've successfully transferred from primary to secondary at least 3 times in the past).

Our system is up and running with the Warm Standby setup reconfigured and running again. I'm reporting this issue as a bug to describe the circumstances of this failed recovery, in the event it does indicate a bug in this very critical feature of PostgreSQL.

Both servers are identically configured as follows:
- Dual QuadCore Intel Xeon-Harpertown 5430 @ 2.66GHz
- Redhat EL 5.1-64
- 64GB RAM
- 250GB - 2-disk RAID 0 for / partition
- 1.1TB - 8-disk RAID 10 configuration for /data
- 72GB - 2-disk RAID 0 for /pg_xlog

It is also probably worth mentioning that Simon Riggs (copied) consulted on the configuration of our warm standby setup, thus we are fairly confident it was set up correctly. ;-)

Archive configuration on the primary server (prod-db-1) was:

    archive_mode = on
    archive_command = 'rsync -az --rsh=ssh %p prod-db-2.vpn:wal_archive/%f </dev/null'
    archive_timeout = 120

Restore configuration (in recovery.conf) on the secondary server (prod-db-2) was:

    restore_command = 'pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive %f %p %r 2>>pg_standby.log'
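For reference, the failover procedure this configuration assumes is roughly the following (a sketch of the intended steps, not a verbatim runbook). The detail that turns out to matter below is that pg_standby must be able to unlink the trigger file, so the file should be created as the 'postgres' user; /tmp normally has the sticky bit set, so a root-owned trigger file there cannot be removed by postgres:

    # on prod-db-2, as the postgres user (not via sudo/root)
    touch /tmp/pgsql.trigger.5432

    # pg_standby should notice the trigger file, stop waiting for further WAL,
    # and remove the file; the startup process then finishes recovery and
    # renames recovery.conf to recovery.done
    tail -f /data/pgdata-8.3/pg_log/postgresql-*.log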
Timeline of events:

1.) At 20:58 UTC -- received an alert that the primary server was not responding to pings.

FYI, the last lines in the postgresql.log file on the primary server (before it crashed) were:

    2010-01-18 20:40:56 UTC (10.16.74.194)LOG: unexpected EOF on client connection
    2010-01-18 20:40:59 UTC ()LOG: checkpoint starting: time
    2010-01-18 20:45:56 UTC (10.16.74.200)LOG: unexpected EOF on client connection
    2010-01-18 20:50:57 UTC (10.12.58.142)LOG: unexpected EOF on client connection
    2010-01-18 20:51:45 UTC ()LOG: checkpoint complete: wrote 45066 buffers (8.6%); 0 transaction log file(s) added, 0 removed, 44 recycled; write=455.622 s, sync=190.549 s, total=646.443 s

- A sysadmin confirmed the primary was offline and started the manual process to transition to the standby server, while another sysadmin worked simultaneously on trying to bring the primary server back online.

2.) Running `ps aufx` on the standby (prod-db-2) included this output:

    postgres 16323 0.0 0.1 4578680 98180 ? S 2009 0:00 /usr/local/pgsql-8.3.8/bin/postgres -D /data/pgdata-8.3
    postgres 16324 0.0 0.0 71884 716 ? Ss 2009 0:09 \_ postgres: logger process
    postgres 16325 3.4 6.5 4581480 4284404 ? Ss 2009 3902:25 \_ postgres: startup process waiting for 0000000200003C82000000D8
    postgres 9164 0.0 0.0 63772 996 ? S 20:56 0:00 \_ sh -c pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive 0000000200003C82000000D
    postgres 9165 0.0 0.0 19168 616 ? S 20:56 0:00 \_ pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive 0000000200003C82000000D8

Notice the standby server is waiting on WAL file 0000000200003C82000000D8.

3.) Running `ls -l /home/postgres/wal_archive` showed that the most recent files transferred were:

    -rw------- 1 postgres postgres 16777216 Jan 18 20:55 0000000200003C82000000D5
    -rw------- 1 postgres postgres 16777216 Jan 18 20:55 0000000200003C82000000D6
    -rw------- 1 postgres postgres 16777216 Jan 18 20:56 0000000200003C82000000D7

4.) The sysadmin ran the following command to make the secondary server exit recovery mode and come online:

    sudo touch /tmp/pgsql.trigger.5432

NOTE: The use of 'sudo' here is a potential cause of, or contributor to, the failure, because the file pgsql.trigger.5432 will be owned by root, yet the postmaster process runs as user 'postgres'. However, a permission issue is inconsistent with what we saw next.

5.) After a few seconds the postgres process stopped but did not come back up. The last few lines of the postgresql.log file were:

    2010-01-18 20:54:28 UTC ()LOG: restored log file "0000000200003C82000000D4" from archive
    2010-01-18 20:55:03 UTC ()LOG: restored log file "0000000200003C82000000D5" from archive
    2010-01-18 20:55:37 UTC ()LOG: restored log file "0000000200003C82000000D6" from archive
    2010-01-18 20:56:20 UTC ()LOG: restored log file "0000000200003C82000000D7" from archive
    2010-01-18 21:08:31 UTC ()FATAL: could not restore file "0000000200003C82000000D8" from archive: return code 65280
    2010-01-18 21:08:31 UTC ()LOG: startup process (PID 16325) exited with exit code 1
    2010-01-18 21:08:31 UTC ()LOG: aborting startup due to startup process failure

Looking in the /home/postgres/wal_archive directory again confirmed that it did not contain any file named 0000000200003C82000000D8. This was the most mysterious part of the puzzle, as the database apparently aborted with a FATAL error when trying to restore a file that didn't exist. We've searched Google and postgresql.org to determine what return code 65280 means, but have come up empty.
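One note on that mysterious number, in case it helps anyone else searching for it: 65280 appears to be the raw wait status returned by system() for the restore_command, not an exit code in its own right. The command's exit code sits in the high byte, which is easy to check from a shell (a quick sketch; the low 7 bits would hold a terminating signal number):

    $ printf 'exit status = %d, terminating signal = %d\n' $((65280 >> 8)) $((65280 & 127))
    exit status = 255, terminating signal = 0

So the restore_command exited with status 255 rather than being killed by a signal -- presumably pg_standby's error exit after it failed to remove the root-owned trigger file, which would be consistent with the pg_standby.log excerpt further down. If we are reading the 8.3 recovery code correctly, an exit status above 125 is treated as a failure of the command itself rather than "file not available", which would explain why the startup process gave up with a FATAL error instead of simply ending recovery.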
6.) Sysadmin attempted to restart the server by hand using `pg_ctl -D /data/pgdata-8.3/ start` -- this command returned a "server starting" message, but the server never appeared in the process list.

7.) The contents of the newly created postgresql.log were:

    tail postgresql-2010-01-18_210946.log
    2010-01-18 21:09:46 UTC ()LOG: database system was interrupted while in recovery at log time 2010-01-18 20:10:59 UTC
    2010-01-18 21:09:46 UTC ()HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
    2010-01-18 21:09:46 UTC ()LOG: starting archive recovery
    2010-01-18 21:09:46 UTC ()LOG: restore_command = 'pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive %f %p %r 2>>pg_standby.log'
    2010-01-18 21:09:46 UTC ()FATAL: could not restore file "00000002.history" from archive: return code 65280
    2010-01-18 21:09:46 UTC ()LOG: startup process (PID 9284) exited with exit code 1
    2010-01-18 21:09:46 UTC ()LOG: aborting startup due to startup process failure

8.) Sysadmin noticed the 'recovery.conf' file was still in the db root directory (it had not been renamed recovery.done). The recovery.conf file was renamed to recovery.conf.old and `pg_ctl -D /data/pgdata-8.3/ start` was attempted again.

9.) The server did not come up (again). This time the contents of the new postgresql.log file were:

    [postgres@prod-db-2 pg_log]$ tail -n 100 postgresql-2010-01-18_211132.log
    2010-01-18 21:11:32 UTC ()LOG: database system was interrupted while in recovery at log time 2010-01-18 20:10:59 UTC
    2010-01-18 21:11:32 UTC ()HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
    2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C82000000A3" (log file 15490, segment 163): No such file or directory
    2010-01-18 21:11:32 UTC ()LOG: invalid primary checkpoint record
    2010-01-18 21:11:32 UTC ()LOG: could not open file "pg_xlog/0000000200003C8200000049" (log file 15490, segment 73): No such file or directory
    2010-01-18 21:11:32 UTC ()LOG: invalid secondary checkpoint record
    2010-01-18 21:11:32 UTC ()PANIC: could not locate a valid checkpoint record
    2010-01-18 21:11:32 UTC ()LOG: startup process (PID 9328) was terminated by signal 6: Aborted
    2010-01-18 21:11:32 UTC ()LOG: aborting startup due to startup process failure
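For anyone trying to interpret step 9: the "invalid primary/secondary checkpoint record" messages come from ordinary crash recovery reading the two checkpoint pointers stored in pg_control (since recovery.conf had been renamed, the server was no longer doing archive recovery). On a warm standby those pointers typically land in segments that pg_standby only ever restored transiently from the archive and that were never kept under pg_xlog, so their absence does not necessarily mean the data files themselves are corrupted. A possible way to check this (a sketch only, using the binary path and data directory shown above):

    # as postgres on prod-db-2: which checkpoint records does pg_control reference?
    /usr/local/pgsql-8.3.8/bin/pg_controldata /data/pgdata-8.3 | grep -i 'checkpoint location'

    # compare against the segments actually present locally and in the archive
    ls /data/pgdata-8.3/pg_xlog
    ls /home/postgres/wal_archive | tail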
10.) At this point a lot of time was spent on the #postgresql IRC channel looking for help diagnosing this problem. Before the issue was resolved, the primary server was brought back online. Soon after, we decided to create a new base backup on the standby server and aborted trying to recover.

FYI: The last few lines of pg_standby.log were:

    [postgres@prod-db-2 pgdata-8.3]$ tail -n 300 pg_standby.log
    trigger file found
    cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
    cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
    cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
    cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
    cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
    cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
    cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such file or directory
    trigger file found
    ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permittedtrigger file found
    ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted

This file was not looked at until after the attempt to recover was aborted. Clearly the permissions on /tmp/pgsql.trigger.5432 were a problem, but we don't see how that would explain the error messages, which seem to indicate that data on the standby server was corrupted.

If you need additional information, we have saved console logs and log files from both the primary and secondary servers.

Thanks for reading this far, and for any help in tracking down the cause of this unexpected failure.

cheers,

- Mason