Re: Replication failure, slave requesting old segments

From: Adrian Klaver
Subject: Re: Replication failure, slave requesting old segments
Msg-id: d5e0ad50-801d-0cf8-e9d6-7e0f7e5f6d68@aklaver.com
In response to: Replication failure, slave requesting old segments  ("Phil Endecott" <spam_from_pgsql_lists@chezphil.org>)
Responses: Re: Replication failure, slave requesting old segments  ("Phil Endecott" <spam_from_pgsql_lists@chezphil.org>)
List: pgsql-general
On 08/11/2018 08:18 AM, Phil Endecott wrote:
> Dear Experts,
> 
> I recently set up replication for the first time.  It seemed to be
> working OK in my initial tests, but didn't cope when the slave was
> down for a longer period.  This is all with the Debian stable
> packages of PostgreSQL 9.6.  My replication setup involves a third
> server, "backup", to and from which WAL files are copied using scp
> (and essentially never deleted), plus streaming replication in
> "hot standby" mode.
> 
> On the master, I have:
> 
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>                     scp %p backup:backup/postgresql/archivedir/%f'
> 
> On the slave I have:
> 
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
> 
> hot_standby = on
> 
> 
> This all seemed to work OK until the slave was shut down for a few days.
> (During this time the master was also down for a shorter period.)
> When the slave came back up it successfully copied several WAL files
> from the backup server until it reached one that hasn't been created yet:
> 
> 2018-08-10 22:37:38.322 UTC [615] LOG:  restored log file "0000000100000007000000CC" from archive
> 2018-08-10 22:38:02.011 UTC [615] LOG:  restored log file "0000000100000007000000CD" from archive
> 2018-08-10 22:38:29.802 UTC [615] LOG:  restored log file "0000000100000007000000CE" from archive
> 2018-08-10 22:38:55.973 UTC [615] LOG:  restored log file "0000000100000007000000CF" from archive
> 2018-08-10 22:39:24.964 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 
> At this point there was a temporary problem due to a firewall rule that
> blocked the replication connection to the master:
> 
> 2018-08-10 22:41:35.865 UTC [1289] FATAL:  could not connect to the primary server: SSL connection has been closed unexpectedly
> 
> The slave then entered a loop, retrying to scp the next file from the backup
> archive (and failing) and connecting to the master for streaming replication
> (and also failing).  That is as expected, except for the "invalid magic number":
> 
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:41:37.018 UTC [615] LOG:  invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-10 22:43:46.936 UTC [1445] FATAL:  could not connect to the primary server: SSL connection has been closed unexpectedly
> 
> Note that it's complaining about an invalid magic number in log segment D1, which is
> the one that has just failed to scp.  That looked suspicious to me and I remembered
> notes in the docs about ensuring that the archive and/or restore commands return an
> error exit status when they fail; I checked that and yes scp does exit(1) when the
> requested file doesn't exist:
> 
> $ scp backup:/egheriugherg /tmp/a
> scp: /egheriugherg: No such file or directory
> $ echo $?
> 1
> 
> Anyway, the slave continued in its retrying loop as expected, except that each time
> it re-fetched the previous segment, D0; this seems a bit peculiar:
> 
> 2018-08-10 22:44:17.796 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:46:28.727 UTC [1628] FATAL:  could not connect to the primary server: SSL connection has been closed unexpectedly
> 
> (Note the "invalid magic number" was only reported the first time.)
> 
> Eventually I fixed the networking issue that had prevented the replication connection
> to master and it could start streaming:
> 
> 2018-08-11 00:05:50.364 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG:  started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000007000000D0 has already been removed
> 
> So here is the real problem which worries me: it seems to ask the master for
> segment D0, which is one that it did successfully restore from the backup earlier
> on, and the master - not unexpectedly - reports that it has already been removed.
> 
> I tried restarting the slave:
> 
> 2018-08-11 00:12:15.536 UTC [7954] LOG:  restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:12:15.797 UTC [7954] LOG:  redo starts at 7/D0F956C0
> 2018-08-11 00:12:16.068 UTC [7954] LOG:  consistent recovery state reached at 7/D0FFF088
> 2018-08-11 00:12:16.068 UTC [7953] LOG:  database system is ready to accept read only connections
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:12:17.057 UTC [7954] LOG:  invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-11 00:12:17.242 UTC [8034] LOG:  started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:12:17.242 UTC [8034] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000007000000D0 has already been removed
> 
> Note that again it failed to get D1 from the backup, as expected, and then
> reports "invalid magic number" in that file, and again tries to start
> streaming from a segment that the master doesn't have.
> 
> Eventually I decided to try and cause the D1 file to exist in the backup:
> 
> db=> create table waste_space as select * from large_table;
> 
> That caused four new WAL segments to be copied to the backup, and restored
> by the slave:
> 
> 2018-08-11 00:55:49.741 UTC [7954] LOG:  restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:56:12.304 UTC [7954] LOG:  restored log file "0000000100000007000000D1" from archive
> 2018-08-11 00:56:35.481 UTC [7954] LOG:  restored log file "0000000100000007000000D2" from archive
> 2018-08-11 00:56:57.443 UTC [7954] LOG:  restored log file "0000000100000007000000D3" from archive
> 2018-08-11 00:57:21.723 UTC [7954] LOG:  restored log file "0000000100000007000000D4" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
> 2018-08-11 00:57:22.915 UTC [7954] LOG:  unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
> 2018-08-11 00:57:23.114 UTC [12348] LOG:  started streaming WAL from primary at 7/D5000000 on timeline 1
> 
> Note the "unexpected pageaddr" message.  Despite that, the streaming started
> and everything was working correctly.  Hurray!
> 
> I remain concerned, however, that something bad happened and I'd like to
> understand it.  Can anyone help?

Looks like the master recycled the WAL segments while the slave could not connect.

You might want to take a look at replication slots to keep WAL segments from being recycled:

https://www.postgresql.org/docs/9.6/static/warm-standby.html#STREAMING-REPLICATION-SLOTS

with the heads-up that slots do not limit the disk space taken up by retained WAL.
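
If you do go the slot route, the rough idea is to create a physical slot on the master and point the standby at it; the slot name below is just an example:

-- On the master (9.6):
SELECT pg_create_physical_replication_slot('standby1');

-- Then in recovery.conf on the standby, alongside primary_conninfo:
--   primary_slot_name = 'standby1'

The master will then keep the WAL segments the standby still needs until it has confirmed receiving them, even across a long outage.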


The "unexpected pageaddr" probably is for same reason, per comments in 
xlogreader.c:

"
/*
  * Check that the address on the page agrees with what we expected.
  * This check typically fails when an old WAL segment is recycled,
  * and hasn't yet been overwritten with new data yet.
  */

...

report_invalid_record(state,
                       "unexpected pageaddr %X/%X in log segment %s, 
offset %u",
(uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
                                              fname,
                                              offset);
"

> 
> 
> Thanks,
> 
> Phil.
> 
> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com

