Обсуждение: WAL Log Shipping - Warm Standby not working under 8.3.7
We had WAL Log shipping (warm standby) working fine under 8.1.4 but under 8.3.7 we can’t get the slave to come up properly. Nothing has changed in our process with regard to start_backup, rsync, stop_backup, bring up the warm standby server in continuous recovery mode, but the failover DB won’t start with the following error:
<2010-01-07 10:54:23 MST>LOG: received immediate shutdown request
/mxl/var/pgsql/data/ha_copy.sh: line 103: 13976 Quit sleep 5
File /mxl/var/pgsql/data/stopslave found. Aborting Process.
<2010-01-07 10:54:28 MST>LOG: could not open file "pg_xlog/00000001000000F6000000E9" (log file 246, segment 233): No such file or directory
<2010-01-07 10:54:28 MST>LOG: redo done at F6/E8FFE378
<2010-01-07 10:54:28 MST>LOG: last completed transaction was at log time 2009-12-28 10:18:04.893307-07
Waiting for log: 00000001000000F6000000E8
<2010-01-07 11:24:49 MST>FATAL: could not restore file "00000001000000F6000000E8" from archive: return code 15
The log file in reference is very old and is not on the Master PG server in pg_xlogs and the “last completed transaction” can’t be right either. Is this a bug or it is something we are doing wrong?
Thanks,
Keaton
psql (PostgreSQL) 8.3.7
contains support for command-line editing
RHEL 5 64 Bit
Linux ourservername 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
<2010-01-07 10:54:23 MST>LOG: received immediate shutdown request
/mxl/var/pgsql/data/ha_copy.sh: line 103: 13976 Quit sleep 5
File /mxl/var/pgsql/data/stopslave found. Aborting Process.
<2010-01-07 10:54:28 MST>LOG: could not open file "pg_xlog/00000001000000F6000000E9" (log file 246, segment 233): No such file or directory
<2010-01-07 10:54:28 MST>LOG: redo done at F6/E8FFE378
<2010-01-07 10:54:28 MST>LOG: last completed transaction was at log time 2009-12-28 10:18:04.893307-07
Waiting for log: 00000001000000F6000000E8
<2010-01-07 11:24:49 MST>FATAL: could not restore file "00000001000000F6000000E8" from archive: return code 15
The log file in reference is very old and is not on the Master PG server in pg_xlogs and the “last completed transaction” can’t be right either. Is this a bug or it is something we are doing wrong?
Thanks,
Keaton
psql (PostgreSQL) 8.3.7
contains support for command-line editing
RHEL 5 64 Bit
Linux ourservername 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
I did find some references to a fix of last-completed transaction time and I looked in the postgresql-bugs archive, but I’m not having any luck confirming that this is a problem in 8.3.7 and an upgrade to 8.3.9 would fix the issue.
postgresql 8.3.7 .... Fix incorrect logging of last-completed-transaction time during PITR ..... Last transaction end time is now logged at end of recovery and at each logged restart point (Simon) ...
On 1/7/10 12:53 PM, "Keaton Adams" <kadams@mxlogic.com> wrote:
postgresql 8.3.7 .... Fix incorrect logging of last-completed-transaction time during PITR ..... Last transaction end time is now logged at end of recovery and at each logged restart point (Simon) ...
On 1/7/10 12:53 PM, "Keaton Adams" <kadams@mxlogic.com> wrote:
We had WAL Log shipping (warm standby) working fine under 8.1.4 but under 8.3.7 we can’t get the slave to come up properly. Nothing has changed in our process with regard to start_backup, rsync, stop_backup, bring up the warm standby server in continuous recovery mode, but the failover DB won’t start with the following error:
<2010-01-07 10:54:23 MST>LOG: received immediate shutdown request
/mxl/var/pgsql/data/ha_copy.sh: line 103: 13976 Quit sleep 5
File /mxl/var/pgsql/data/stopslave found. Aborting Process.
<2010-01-07 10:54:28 MST>LOG: could not open file "pg_xlog/00000001000000F6000000E9" (log file 246, segment 233): No such file or directory
<2010-01-07 10:54:28 MST>LOG: redo done at F6/E8FFE378
<2010-01-07 10:54:28 MST>LOG: last completed transaction was at log time 2009-12-28 10:18:04.893307-07
Waiting for log: 00000001000000F6000000E8
<2010-01-07 11:24:49 MST>FATAL: could not restore file "00000001000000F6000000E8" from archive: return code 15
The log file in reference is very old and is not on the Master PG server in pg_xlogs and the “last completed transaction” can’t be right either. Is this a bug or it is something we are doing wrong?
Thanks,
Keaton
psql (PostgreSQL) 8.3.7
contains support for command-line editing
RHEL 5 64 Bit
Linux ourservername 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
OK,
So what am I doing wrong here?
Installed PG 8.3.7 on Slave machine
Restored from last evening's backup from the master DB to make the rsync across the network finish sooner.
Shut down the PG instance on the slave machine
Ran a script that does the following:
select pg_start_backup('Master_Backup');
rsync -rvlpogtz ${masterdb}/* ${slave_dbuser}@${slave_host}:${slavedb}
select pg_stop_backup();
ssh ${slave_dbuser}@${slave_host} rm ${slavedb}/postmaster.pid 2>/dev/null
ssh ${slave_dbuser}@${slave_host} rm ${slave_backup_path}/0* 2>/dev/null
ssh ${slave_dbuser}@${slave_host} ${PSQL_BIN}/pg_ctl -D ${slavedb} -l logfile start
When the slave PG database attempts to come up in recovery mode, it aborts because it is looking for a log file that is extremely old and does not exist on the master DB server. I believe the Master PG instance was restarted on 12/28/09 and has been running ever since. Is there a way to reset the “last completed transaction” on a DB? Why is PG looking so far back for a WAL log to begin recovery when so much has been done since the 28th including daily backups?
<2010-01-07 10:54:23 MST>LOG: received immediate shutdown request
/mxl/var/pgsql/data/ha_copy.sh: line 103: 13976 Quit sleep 5
File /mxl/var/pgsql/data/stopslave found. Aborting Process.
<2010-01-07 10:54:28 MST>LOG: could not open file "pg_xlog/00000001000000F6000000E9" (log file 246, segment 233): N
o such file or directory
<2010-01-07 10:54:28 MST>LOG: redo done at F6/E8FFE378
<2010-01-07 10:54:28 MST>LOG: last completed transaction was at log time 2009-12-28 10:18:04.893307-07
Waiting for log: 00000001000000F6000000E8
<2010-01-07 11:24:49 MST>FATAL: could not restore file "00000001000000F6000000E8" from archive: return code 15
Again, nothing was changed with the scripts or the replication process and this worked just fine under 8.1.4.
Thanks!
On 1/8/10 8:10 AM, "Keaton Adams" <kadams@mxlogic.com> wrote:
So what am I doing wrong here?
Installed PG 8.3.7 on Slave machine
Restored from last evening's backup from the master DB to make the rsync across the network finish sooner.
Shut down the PG instance on the slave machine
Ran a script that does the following:
select pg_start_backup('Master_Backup');
rsync -rvlpogtz ${masterdb}/* ${slave_dbuser}@${slave_host}:${slavedb}
select pg_stop_backup();
ssh ${slave_dbuser}@${slave_host} rm ${slavedb}/postmaster.pid 2>/dev/null
ssh ${slave_dbuser}@${slave_host} rm ${slave_backup_path}/0* 2>/dev/null
ssh ${slave_dbuser}@${slave_host} ${PSQL_BIN}/pg_ctl -D ${slavedb} -l logfile start
When the slave PG database attempts to come up in recovery mode, it aborts because it is looking for a log file that is extremely old and does not exist on the master DB server. I believe the Master PG instance was restarted on 12/28/09 and has been running ever since. Is there a way to reset the “last completed transaction” on a DB? Why is PG looking so far back for a WAL log to begin recovery when so much has been done since the 28th including daily backups?
<2010-01-07 10:54:23 MST>LOG: received immediate shutdown request
/mxl/var/pgsql/data/ha_copy.sh: line 103: 13976 Quit sleep 5
File /mxl/var/pgsql/data/stopslave found. Aborting Process.
<2010-01-07 10:54:28 MST>LOG: could not open file "pg_xlog/00000001000000F6000000E9" (log file 246, segment 233): N
o such file or directory
<2010-01-07 10:54:28 MST>LOG: redo done at F6/E8FFE378
<2010-01-07 10:54:28 MST>LOG: last completed transaction was at log time 2009-12-28 10:18:04.893307-07
Waiting for log: 00000001000000F6000000E8
<2010-01-07 11:24:49 MST>FATAL: could not restore file "00000001000000F6000000E8" from archive: return code 15
Again, nothing was changed with the scripts or the replication process and this worked just fine under 8.1.4.
Thanks!
On 1/8/10 8:10 AM, "Keaton Adams" <kadams@mxlogic.com> wrote:
I did find some references to a fix of last-completed transaction time and I looked in the postgresql-bugs archive, but I’m not having any luck confirming that this is a problem in 8.3.7 and an upgrade to 8.3.9 would fix the issue.
postgresql 8.3.7 .... Fix incorrect logging of last-completed-transaction time during PITR ..... Last transaction end time is now logged at end of recovery and at each logged restart point (Simon) ...
On 1/7/10 12:53 PM, "Keaton Adams" <kadams@mxlogic.com> wrote:We had WAL Log shipping (warm standby) working fine under 8.1.4 but under 8.3.7 we can’t get the slave to come up properly. Nothing has changed in our process with regard to start_backup, rsync, stop_backup, bring up the warm standby server in continuous recovery mode, but the failover DB won’t start with the following error:
<2010-01-07 10:54:23 MST>LOG: received immediate shutdown request
/mxl/var/pgsql/data/ha_copy.sh: line 103: 13976 Quit sleep 5
File /mxl/var/pgsql/data/stopslave found. Aborting Process.
<2010-01-07 10:54:28 MST>LOG: could not open file "pg_xlog/00000001000000F6000000E9" (log file 246, segment 233): No such file or directory
<2010-01-07 10:54:28 MST>LOG: redo done at F6/E8FFE378
<2010-01-07 10:54:28 MST>LOG: last completed transaction was at log time 2009-12-28 10:18:04.893307-07
Waiting for log: 00000001000000F6000000E8
<2010-01-07 11:24:49 MST>FATAL: could not restore file "00000001000000F6000000E8" from archive: return code 15
The log file in reference is very old and is not on the Master PG server in pg_xlogs and the “last completed transaction” can’t be right either. Is this a bug or it is something we are doing wrong?
Thanks,
Keaton
psql (PostgreSQL) 8.3.7
contains support for command-line editing
RHEL 5 64 Bit
Linux ourservername 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
On Jan 8, 2010, at 9:50 AM, Keaton Adams wrote: > OK, > > So what am I doing wrong here? > > Installed PG 8.3.7 on Slave machine > > Restored from last evening's backup from the master DB to make the rsync across the network finish sooner. > > Shut down the PG instance on the slave machine > > Ran a script that does the following: > > select pg_start_backup('Master_Backup'); > rsync -rvlpogtz ${masterdb}/* ${slave_dbuser}@${slave_host}:${slavedb} > select pg_stop_backup(); > ssh ${slave_dbuser}@${slave_host} rm ${slavedb}/postmaster.pid 2>/dev/null > ssh ${slave_dbuser}@${slave_host} rm ${slave_backup_path}/0* 2>/dev/null > ssh ${slave_dbuser}@${slave_host} ${PSQL_BIN}/pg_ctl -D ${slavedb} -l logfile start Is ${slave_backup_path} your archive directory? Why are you deleting all of you archives there? Erik Jones, Database Administrator Engine Yard Support, Scalability, Reliability 866.518.9273 x 260 Location: US/Pacific IRC: mage2k
On Jan 8, 2010, at 4:50 PM, Erik Jones wrote: > > On Jan 8, 2010, at 9:50 AM, Keaton Adams wrote: > >> OK, >> >> So what am I doing wrong here? >> >> Installed PG 8.3.7 on Slave machine >> >> Restored from last evening's backup from the master DB to make the rsync across the network finish sooner. >> >> Shut down the PG instance on the slave machine >> >> Ran a script that does the following: >> >> select pg_start_backup('Master_Backup'); >> rsync -rvlpogtz ${masterdb}/* ${slave_dbuser}@${slave_host}:${slavedb} >> select pg_stop_backup(); >> ssh ${slave_dbuser}@${slave_host} rm ${slavedb}/postmaster.pid 2>/dev/null >> ssh ${slave_dbuser}@${slave_host} rm ${slave_backup_path}/0* 2>/dev/null >> ssh ${slave_dbuser}@${slave_host} ${PSQL_BIN}/pg_ctl -D ${slavedb} -l logfile start > > Is ${slave_backup_path} your archive directory? Why are you deleting all of you archives there? Also, what are the contents of your recovery.conf file? Are you using pg_standby? The typical setup is to clear <datadir>/pg_xlogon your standby and use pg_standby to recovery files directly from your archive directory? Erik Jones, Database Administrator Engine Yard Support, Scalability, Reliability 866.518.9273 x 260 Location: US/Pacific IRC: mage2k
We are still on PG 8.3.7 and I don’t see pg_standby. It looks like it was added with 8.3.8. I will put in a request to update to 8.3.9 as a part of our next software upgrade (internal product we write) but for now I need to try to figure out why this isn’t working under 8.3.7. I did manage to bring up replication under 8.3.7 in the test environment just fine, so the question is now why the standby PG instance won’t come up in recovery mode and why it is looking so far back for the next WAL log file.
-K
On 1/8/10 5:55 PM, "Erik Jones" <ejones@engineyard.com> wrote:
-K
On 1/8/10 5:55 PM, "Erik Jones" <ejones@engineyard.com> wrote:
On Jan 8, 2010, at 4:50 PM, Erik Jones wrote:
>
> On Jan 8, 2010, at 9:50 AM, Keaton Adams wrote:
>
>> OK,
>>
>> So what am I doing wrong here?
>>
>> Installed PG 8.3.7 on Slave machine
>>
>> Restored from last evening's backup from the master DB to make the rsync across the network finish sooner.
>>
>> Shut down the PG instance on the slave machine
>>
>> Ran a script that does the following:
>>
>> select pg_start_backup('Master_Backup');
>> rsync -rvlpogtz ${masterdb}/* ${slave_dbuser}@${slave_host}:${slavedb}
>> select pg_stop_backup();
>> ssh ${slave_dbuser}@${slave_host} rm ${slavedb}/postmaster.pid 2>/dev/null
>> ssh ${slave_dbuser}@${slave_host} rm ${slave_backup_path}/0* 2>/dev/null
>> ssh ${slave_dbuser}@${slave_host} ${PSQL_BIN}/pg_ctl -D ${slavedb} -l logfile start
>
> Is ${slave_backup_path} your archive directory? Why are you deleting all of you archives there?
Also, what are the contents of your recovery.conf file? Are you using pg_standby? The typical setup is to clear <datadir>/pg_xlog on your standby and use pg_standby to recovery files directly from your archive directory?
Erik Jones, Database Administrator
Engine Yard
Support, Scalability, Reliability
866.518.9273 x 260
Location: US/Pacific
IRC: mage2k
On Jan 9, 2010, at 7:24 AM, Keaton Adams wrote: > Restored from last evening's backup from the master DB to make the rsync across the network finish sooner. > >> > >> Shut down the PG instance on the slave machine > >> > >> Ran a script that does the following: > >> > >> select pg_start_backup('Master_Backup'); > >> rsync -rvlpogtz ${masterdb}/* ${slave_dbuser}@${slave_host}:${slavedb} > >> select pg_stop_backup(); > >> ssh ${slave_dbuser}@${slave_host} rm ${slavedb}/postmaster.pid 2>/dev/null > >> ssh ${slave_dbuser}@${slave_host} rm ${slave_backup_path}/0* 2>/dev/null > >> ssh ${slave_dbuser}@${slave_host} ${PSQL_BIN}/pg_ctl -D ${slavedb} -l logfile start > > > > Is ${slave_backup_path} your archive directory? Why are you deleting all of you archives there? > > Also, what are the contents of your recovery.conf file? Are you using pg_standby? The typical setup is to clear <datadir>/pg_xlogon your standby and use pg_standby to recovery files directly from your archive directory? > We are still on PG 8.3.7 and I don’t see pg_standby. It looks like it was added with 8.3.8. I will put in a request toupdate to 8.3.9 as a part of our next software upgrade (internal product we write) but for now I need to try to figureout why this isn’t working under 8.3.7. I did manage to bring up replication under 8.3.7 in the test environment justfine, so the question is now why the standby PG instance won’t come up in recovery mode and why it is looking so farback for the next WAL log file. > > -K Keaton, Please, don't top post (I've fixed it in this reply), it makes it difficult for others to join the thread and know what'scurrently going on. Anyways, with regards to pg_standby, it should be included in all versions >= 8.3.0. Regardless, it will work with anythingversion since at least 8.2 so you can just grab it from a source install and build it against your built source treeor use your package manager of choice to get it. Regardless: > >> ssh ${slave_dbuser}@${slave_host} rm ${slave_backup_path}/0* 2>/dev/null You are doing that *after* you run pg_start_backup() so you're deleting WAL files that are needed for the standby -- it needsall WAL files generated from the start_backup() call on. Erik Jones, Database Administrator Engine Yard Support, Scalability, Reliability 866.518.9273 x 260 Location: US/Pacific IRC: mage2k