Re: Replication failure, slave requesting old segments
From: Adrian Klaver
Subject: Re: Replication failure, slave requesting old segments
Date:
Msg-id: d5e0ad50-801d-0cf8-e9d6-7e0f7e5f6d68@aklaver.com
In reply to: Replication failure, slave requesting old segments ("Phil Endecott" <spam_from_pgsql_lists@chezphil.org>)
Responses: Re: Replication failure, slave requesting old segments ("Phil Endecott" <spam_from_pgsql_lists@chezphil.org>)
List: pgsql-general
On 08/11/2018 08:18 AM, Phil Endecott wrote:
> Dear Experts,
>
> I recently set up replication for the first time. It seemed to be
> working OK in my initial tests, but didn't cope when the slave was
> down for a longer period. This is all with the Debian stable
> packages of PostgreSQL 9.6. My replication setup involves a third
> server, "backup", to and from which WAL files are copied using scp
> (and essentially never deleted), plus streaming replication in
> "hot standby" mode.
>
> On the master, I have:
>
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>                    scp %p backup:backup/postgresql/archivedir/%f'
>
> On the slave I have:
>
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
> hot_standby = on
>
> This all seemed to work OK until the slave was shut down for a few days.
> (During this time the master was also down for a shorter period.)
> When the slave came back up it successfully copied several WAL files
> from the backup server until it reached one that hasn't been created yet:
>
> 2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
> 2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
> 2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
> 2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
> 2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>
> At this point there was a temporary problem due to a firewall rule that
> blocked the replication connection to the master:
>
> 2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> The slave then entered a loop, retrying to scp the next file from the backup
> archive (and failing) and connecting to the master for streaming replication
> (and also failing). That is as expected, except for the "invalid magic number":
>
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> Note that it's complaining about an invalid magic number in log segment D1, which is
> the one that has just failed to scp. That looked suspicious to me and I remembered
> notes in the docs about ensuring that the archive and/or restore commands return an
> error exit status when they fail; I checked that and yes scp does exit(1) when the
> requested file doesn't exist:
>
> $ scp backup:/egheriugherg /tmp/a
> scp: /egheriugherg: No such file or directory
> $ echo $?
> 1
>
> Anyway, the slave continued in its retrying loop as expected, except that each time
> it re-fetched the previous segment, D0; this seems a bit peculiar:
>
> 2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> (Note the "invalid magic number" was only reported the first time.)
>
> Eventually I fixed the networking issue that had prevented the replication connection
> to master and it could start streaming:
>
> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> So here is the real problem which worries me: it seems to ask the master for
> segment D0, which is one that it did successfully restore from the backup earlier
> on, and the master - not unexpectedly - reports that it has already been removed.
>
> I tried restarting the slave:
>
> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
> 2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> Note that again it failed to get D1 from the backup, as expected, and then
> reports "invalid magic number" in that file, and again tries to start
> streaming from a segment that the master doesn't have.
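For what it's worth, while the standby is looping like this you can see where
recovery actually stands by querying it directly. A minimal check, assuming you
can connect to the hot standby (these are the 9.6 function names; they became
pg_last_wal_* in later releases):

    -- Run on the hot standby.
    SELECT pg_last_xlog_receive_location() AS last_received,
           pg_last_xlog_replay_location()  AS last_replayed;

The replay location in your restart log (7/D0FFF088) is still inside segment D0,
which lines up with the walreceiver asking the master to start streaming at
7/D0000000.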
> Eventually I decided to try and cause the D1 file to exist in the backup:
>
> db=> create table waste_space as select * from large_table;
>
> That caused four new WAL segments to be copied to the backup, and restored
> by the slave:
>
> 2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
> 2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
> 2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
> 2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
> 2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
> 2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1
>
> Note the "unexpected pageaddr" message. Despite that, the streaming started
> and everything was working correctly. Hurray!
>
> I remain concerned, however, that something bad happened and I'd like to
> understand it. Can anyone help?

Looks like the master recycled the WALs while the slave could not connect. You
might want to take a look at replication slots to keep WALs from being recycled:

https://www.postgresql.org/docs/9.6/static/warm-standby.html#STREAMING-REPLICATION-SLOTS

with the heads-up that slots do not limit the space taken up by WAL.
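A minimal sketch of what that would look like on 9.6 (the slot name 'standby1'
is just a placeholder):

    -- On the master: create a physical replication slot so that WAL the
    -- standby still needs is retained instead of being recycled.
    SELECT pg_create_physical_replication_slot('standby1');

    -- The standby then names the slot in recovery.conf:
    --   primary_slot_name = 'standby1'

    -- Since a slot belonging to a long-offline standby retains WAL
    -- indefinitely, keep an eye on how much the master is holding on to:
    SELECT slot_name, active,
           pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)
             AS retained_bytes
    FROM pg_replication_slots;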
The "unexpected pageaddr" is probably for the same reason, per comments in
xlogreader.c:

"
/*
 * Check that the address on the page agrees with what we expected.
 * This check typically fails when an old WAL segment is recycled,
 * and hasn't yet been overwritten with new data yet.
 */
...
report_invalid_record(state,
                      "unexpected pageaddr %X/%X in log segment %s, offset %u",
                      (uint32) (hdr->xlp_pageaddr >> 32),
                      (uint32) hdr->xlp_pageaddr,
                      fname, offset);
"

> Thanks,
>
> Phil.

--
Adrian Klaver
adrian.klaver@aklaver.com