Обсуждение: After upgrade to 9.3, streaming replication fails to start

Поиск
Список
Период
Сортировка

After upgrade to 9.3, streaming replication fails to start

От
Jeff Ross
Дата:
Hi all,

I recently installed 9.3 into my work servers.

With 9.2 I had streaming replication working.  Primary server is on
nirvana, standby server is on dukkha.

Upgrade on nirvana went just fine--no problems there.  Initial
installation on dukkha went fine as well, but streaming replication
fails to start with the error "PANIC:  could not locate a valid
checkpoint record".

Here's what I'm doing to start streaming replication:

On the standby:

_postgresql@dukkha:/var/postgresql $ initdb -D /var/postgresql/data.93.5432
The files belonging to this database system will be owned by user
"_postgresql".
This user must also own the server process.

The database cluster will be initialized with locale "C".
The default database encoding has accordingly been set to "SQL_ASCII".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /var/postgresql/data.93.5432 ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
creating configuration files ... ok
creating template1 database in /var/postgresql/data.93.5432/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... not supported on this platform
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

     postgres -D /var/postgresql/data.93.5432
or
     pg_ctl -D /var/postgresql/data.93.5432 -l logfile start

_postgresql@dukkha:/var/postgresql $ cp pg_hba.conf data.93.5432/
_postgresql@dukkha:/var/postgresql $ echo "include
'/var/postgresql/my_postgresql.conf'" >> data.93.5432/postgresql.conf
_postgresql@dukkha:/var/postgresql $ tail -1 data.93.5432/postgresql.conf
include '/var/postgresql/my_postgresql.conf'
_postgresql@dukkha:/var/postgresql $ cat my_postgresql.conf
#my settings
listen_addresses = 'localhost,10.31.16.3'               # what IP
address(es) to listen on;
unix_socket_directories = '/var/postgresql/'            # (change
requires restart)
log_line_prefix = '<%u%%%d> '                   # special values:
log_timezone = 'America/Denver'
timezone = 'America/Denver'
log_statement = 'all'                   # none, ddl, mod, all
archive_mode = on
wal_level = hot_standby                 # minimal, archive, or hot_standby
archive_command = 'cd .'
default_statistics_target = 50 # pgtune wizard 2009-12-19
maintenance_work_mem = 120MB # pgtune wizard 2009-12-19
constraint_exclusion = on # pgtune wizard 2009-12-19
checkpoint_completion_target = 0.9 # pgtune wizard 2009-12-19
effective_cache_size = 1408MB # pgtune wizard 2009-12-19
work_mem = 12MB # pgtune wizard 2009-12-19
wal_buffers = 8MB # pgtune wizard 2009-12-19
checkpoint_segments = 16 # pgtune wizard 2009-12-19
shared_buffers = 480MB # pgtune wizard 2009-12-19
max_connections = 100 # pgtune wizard 2009-12-19
hot_standby = on

_postgresql@dukkha:/var/postgresql $ cat recovery.conf
standby_mode = 'on'
primary_conninfo = 'host=nirvana.internal port=5432 user=wykids_replication'
restore_command = 'cp /wal/%f %p'
archive_cleanup_command = 'pg_archivecleanup /wal %r'

Switch to the primary:

_postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh
#!/bin/sh
backup_label=wykids_`date +%Y-%m-%d`
#remove any existing wal files on the standby
ssh dukkha.internal rm -rf /wal/*
#stop the standby server if it is running
ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
psql -c "select pg_start_backup('$backup_label');" template1
rsync \
         --copy-links \
         --delete \
         --exclude=backup_label \
         --exclude=postgresql.conf \
         --exclude=recovery.done \
         -e ssh -avz /var/postgresql/data.93.5432/ \
         dukkha.internal:/var/postgresql/data.93.5432/
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/*
ssh dukkha.internal rm -f
/var/postgresql/data.93.5432/pg_xlog/archive_status/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_log/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/postmaster.pid
ssh dukkha.internal ln -s /var/postgresql/recovery.conf
/var/postgresql/data.93.5432/recovery.conf
psql -c "select pg_stop_backup();" template1
ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432


_postgresql@nirvana:/var/postgresql $ sh -x start_hot_standby.sh
+ date +%Y-%m-%d
+ backup_label=wykids_2013-11-06
+ ssh dukkha.internal rm -rf /wal/*
+ ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
+ rsync -e ssh /wal/ dukkha.internal:/wal/
skipping directory .
+ psql -c select pg_start_backup('wykids_2013-11-06'); template1
  pg_start_backup
-----------------
  0/3D000024
(1 row)

+ rsync --copy-links --delete --exclude=backup_label
--exclude=postgresql.conf --exclude=recovery.done -e ssh -avz
/var/postgresql/data.93.5432/ dukkha.internal:/var/postgresql/data.93.5432/
sending incremental file list
./
(snip rsync output)

sent 239,677,033 bytes  received 225,961 bytes  1,139,681.68 bytes/sec
total size is 1,426,541,526  speedup is 5.95
+ ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/*
rm: /var/postgresql/data.93.5432/pg_xlog/archive_status: is a directory
+ ssh dukkha.internal rm -f
/var/postgresql/data.93.5432/pg_xlog/archive_status/*
+ ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_log/*
+ ssh dukkha.internal rm -f /var/postgresql/data.93.5432/postmaster.pid
+ ssh dukkha.internal ln -s /var/postgresql/recovery.conf
/var/postgresql/data.93.5432/recovery.conf
+ psql -c select pg_stop_backup(); template1
NOTICE:  pg_stop_backup complete, all required WAL segments have been
archived
  pg_stop_backup
----------------
  0/3D0000DC
(1 row)

+ ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432

The standby server fails to start with the following:

2013-11-06 07:50:23.436056500 <%> LOG:  database system was interrupted;
last known up at 2013-11-06 07:47:07 MST
2013-11-06 07:50:23.495849500 <%> LOG:  entering standby mode
2013-11-06 07:50:23.495854500 <%> LOG:  invalid primary checkpoint record
2013-11-06 07:50:23.495858500 <%> LOG:  invalid secondary checkpoint record
2013-11-06 07:50:23.495862500 <%> PANIC:  could not locate a valid
checkpoint record

I'm sure I'm missing something simple here, but I'm not seeing what it is.

Thanks!

Jeff Ross


Re: After upgrade to 9.3, streaming replication fails to start

От
Jeff Janes
Дата:
On Wed, Nov 6, 2013 at 9:40 AM, Jeff Ross <jross@wykids.org> wrote:


_postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh
#!/bin/sh
backup_label=wykids_`date +%Y-%m-%d`
#remove any existing wal files on the standby
ssh dukkha.internal rm -rf /wal/*
#stop the standby server if it is running
ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
psql -c "select pg_start_backup('$backup_label');" template1
rsync \
        --copy-links \
        --delete \
        --exclude=backup_label \


Excluding backup_label is exactly the wrong thing to do.  The only reason backup_label is created in the first place is so that it can be copied to the replica, where it is needed.  It's existence on the master is a nuisance.

 
        --exclude=postgresql.conf \
        --exclude=recovery.done \
        -e ssh -avz /var/postgresql/data.93.5432/ \
        dukkha.internal:/var/postgresql/data.93.5432/
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/archive_status/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_log/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/postmaster.pid
ssh dukkha.internal ln -s /var/postgresql/recovery.conf /var/postgresql/data.93.5432/recovery.conf
psql -c "select pg_stop_backup();" template1
ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432


_postgresql@nirvana:/var/postgresql $ sh -x start_hot_standby.sh
+ date +%Y-%m-%d
+ backup_label=wykids_2013-11-06
+ ssh dukkha.internal rm -rf /wal/*
+ ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
+ rsync -e ssh /wal/ dukkha.internal:/wal/
skipping directory .


Where is the above rsync coming from?  It doesn't seem to be in the shell script you showed.

Anyway, I think you need to copy the wal over after you call pg_stop_backup, not before you call pg_start_backup.

Cheers,

Jeff

Re: After upgrade to 9.3, streaming replication fails to start

От
Jeff Ross
Дата:

On 11/6/13, 11:32 AM, Jeff Janes wrote:
On Wed, Nov 6, 2013 at 9:40 AM, Jeff Ross <jross@wykids.org> wrote:


_postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh
#!/bin/sh
backup_label=wykids_`date +%Y-%m-%d`
#remove any existing wal files on the standby
ssh dukkha.internal rm -rf /wal/*
#stop the standby server if it is running
ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
psql -c "select pg_start_backup('$backup_label');" template1
rsync \
        --copy-links \
        --delete \
        --exclude=backup_label \


Excluding backup_label is exactly the wrong thing to do.  The only reason backup_label is created in the first place is so that it can be copied to the replica, where it is needed.  It's existence on the master is a nuisance.

 
        --exclude=postgresql.conf \
        --exclude=recovery.done \
        -e ssh -avz /var/postgresql/data.93.5432/ \
        dukkha.internal:/var/postgresql/data.93.5432/
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/archive_status/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_log/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/postmaster.pid
ssh dukkha.internal ln -s /var/postgresql/recovery.conf /var/postgresql/data.93.5432/recovery.conf
psql -c "select pg_stop_backup();" template1
ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432


_postgresql@nirvana:/var/postgresql $ sh -x start_hot_standby.sh
+ date +%Y-%m-%d
+ backup_label=wykids_2013-11-06
+ ssh dukkha.internal rm -rf /wal/*
+ ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
+ rsync -e ssh /wal/ dukkha.internal:/wal/
skipping directory .


Where is the above rsync coming from?  It doesn't seem to be in the shell script you showed.

Anyway, I think you need to copy the wal over after you call pg_stop_backup, not before you call pg_start_backup.

Cheers,

Jeff

Hi Jeff,

Thanks for the reply.  Oops, I copied one of the many changes to the script, but not the one with the rsync to copy /wal from the primary to the standby.

I should have mentioned that wal archiving is setup and working from the primary to the standby.  It saves wal both on the locally on the primary and remotesly on the standby.

I moved the rsync line to copy wal from primary to secondary after pg_stop_backup but I'm still getting the same panic on the standby.

Here's the real, honest version of the script I use to start the hot standby:

_postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh
#!/bin/sh
backup_label=wykids_`date +%Y-%m-%d`
#remove any existing wal files on the secondary
ssh dukkha.internal "rm -rf /wal/*"
ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
psql -c "select pg_start_backup('$backup_label');" template1
rsync \
        --copy-links \
        --delete \
        --exclude=backup_label \
        --exclude=postgresql.conf \
        --exclude=recovery.done \
        -e ssh -avz /var/postgresql/data.93.5432/ \
        dukkha.internal:/var/postgresql/data.93.5432/
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/archive_status/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_log/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/postmaster.pid"
ssh dukkha.internal "ln -s /var/postgresql/recovery.conf /var/postgresql/data.93.5432/recovery.conf"
psql -c "select pg_stop_backup();" template1
rsync -e ssh -avz /wal/ dukkha.internal:/wal/
ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432

Here are the logs on the standby after running the above:

2013-11-06 11:56:30.792461500 <%> LOG:  database system was interrupted; last known up at 2013-11-06 11:52:22 MST
2013-11-06 11:56:30.800685500 <%> LOG:  entering standby mode
2013-11-06 11:56:30.800891500 <%> LOG:  invalid primary checkpoint record
2013-11-06 11:56:30.800930500 <%> LOG:  invalid secondary checkpoint record
2013-11-06 11:56:30.801004500 <%> PANIC:  could not locate a valid checkpoint record

Jeff

Re: After upgrade to 9.3, streaming replication fails to start--SOLVED

От
Jeff Ross
Дата:
On 11/6/13, 12:26 PM, Jeff Ross wrote:

On 11/6/13, 11:32 AM, Jeff Janes wrote:
On Wed, Nov 6, 2013 at 9:40 AM, Jeff Ross <jross@wykids.org> wrote:


_postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh
#!/bin/sh
backup_label=wykids_`date +%Y-%m-%d`
#remove any existing wal files on the standby
ssh dukkha.internal rm -rf /wal/*
#stop the standby server if it is running
ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
psql -c "select pg_start_backup('$backup_label');" template1
rsync \
        --copy-links \
        --delete \
        --exclude=backup_label \


Excluding backup_label is exactly the wrong thing to do.  The only reason backup_label is created in the first place is so that it can be copied to the replica, where it is needed.  It's existence on the master is a nuisance.

 
        --exclude=postgresql.conf \
        --exclude=recovery.done \
        -e ssh -avz /var/postgresql/data.93.5432/ \
        dukkha.internal:/var/postgresql/data.93.5432/
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_xlog/archive_status/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/pg_log/*
ssh dukkha.internal rm -f /var/postgresql/data.93.5432/postmaster.pid
ssh dukkha.internal ln -s /var/postgresql/recovery.conf /var/postgresql/data.93.5432/recovery.conf
psql -c "select pg_stop_backup();" template1
ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432


_postgresql@nirvana:/var/postgresql $ sh -x start_hot_standby.sh
+ date +%Y-%m-%d
+ backup_label=wykids_2013-11-06
+ ssh dukkha.internal rm -rf /wal/*
+ ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
+ rsync -e ssh /wal/ dukkha.internal:/wal/
skipping directory .


Where is the above rsync coming from?  It doesn't seem to be in the shell script you showed.

Anyway, I think you need to copy the wal over after you call pg_stop_backup, not before you call pg_start_backup.

Cheers,

Jeff

Hi Jeff,

Thanks for the reply.  Oops, I copied one of the many changes to the script, but not the one with the rsync to copy /wal from the primary to the standby.

I should have mentioned that wal archiving is setup and working from the primary to the standby.  It saves wal both on the locally on the primary and remotesly on the standby.

I moved the rsync line to copy wal from primary to secondary after pg_stop_backup but I'm still getting the same panic on the standby.

Here's the real, honest version of the script I use to start the hot standby:

_postgresql@nirvana:/var/postgresql $ cat start_hot_standby.sh
#!/bin/sh
backup_label=wykids_`date +%Y-%m-%d`
#remove any existing wal files on the secondary
ssh dukkha.internal "rm -rf /wal/*"
ssh dukkha.internal sudo /usr/local/bin/svc -d /service/postgresql.5432
psql -c "select pg_start_backup('$backup_label');" template1
rsync \
        --copy-links \
        --delete \
        --exclude=backup_label \
        --exclude=postgresql.conf \
        --exclude=recovery.done \
        -e ssh -avz /var/postgresql/data.93.5432/ \
        dukkha.internal:/var/postgresql/data.93.5432/
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_xlog/archive_status/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/pg_log/*"
ssh dukkha.internal "rm -f /var/postgresql/data.93.5432/postmaster.pid"
ssh dukkha.internal "ln -s /var/postgresql/recovery.conf /var/postgresql/data.93.5432/recovery.conf"
psql -c "select pg_stop_backup();" template1
rsync -e ssh -avz /wal/ dukkha.internal:/wal/
ssh dukkha.internal sudo /usr/local/bin/svc -u /service/postgresql.5432

Here are the logs on the standby after running the above:

2013-11-06 11:56:30.792461500 <%> LOG:  database system was interrupted; last known up at 2013-11-06 11:52:22 MST
2013-11-06 11:56:30.800685500 <%> LOG:  entering standby mode
2013-11-06 11:56:30.800891500 <%> LOG:  invalid primary checkpoint record
2013-11-06 11:56:30.800930500 <%> LOG:  invalid secondary checkpoint record
2013-11-06 11:56:30.801004500 <%> PANIC:  could not locate a valid checkpoint record

Jeff
My apologies to Jeff--I'd missed his in-line comment above that I should *not* exclude the backup label from the rsync of the primary to the standby.  As soon as I removed that exclusion and with his other suggested change that I should copy the /wal from the primary to the standby after pg_stop_backup, streaming replication started on the standby exactly as it should.

Logs from the standby:

2013-11-07 09:21:15.273712500 <%> LOG:  database system was interrupted; last known up at 2013-11-07 09:16:05 MST
2013-11-07 09:21:15.286834500 <%> LOG:  entering standby mode
2013-11-07 09:21:16.873654500 <%> LOG:  restored log file "000000010000000000000050" from archive
2013-11-07 09:21:16.936355500 <%> LOG:  redo starts at 0/50000024
2013-11-07 09:21:17.129718500 <%> LOG:  consistent recovery state reached at 0/50036D6C
2013-11-07 09:21:17.131933500 <%> LOG:  database system is ready to accept read only connections
2013-11-07 09:21:17.136856500 cp: /wal/000000010000000000000051: No such file or directory
2013-11-07 09:21:17.194811500 <%> LOG:  started streaming WAL from primary at 0/51000000 on timeline 1

Jeff Ross