Обсуждение: BUG #15151: Error with wal replay after planned manual switchover.

Поиск
Список
Период
Сортировка

BUG #15151: Error with wal replay after planned manual switchover.

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      15151
Logged by:          Maxim Boguk
Email address:      mb@dataegret.com
PostgreSQL version: 9.6.8
Operating system:   Linux
Description:

Hi,

Today after planned manual switch-over (via remove all load from master,
checkpoint master, wait replica in sync, stop master, stop replica, switch
master and replica roles without promote), I got very strange error on new
replica (log_min_messages=debug5):

2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  database
system was shut down in recovery at 2018-04-12 04:23:26 MSK
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  entering
standby mode
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:28 MSK 32060 rohh.app@hh from 192.168.1.2 [vxid: txid:0] []
FATAL:  the database system is starting up
2018-04-12 04:23:28 MSK 32061 rohh.app@hh from [local] [vxid: txid:0] []
FATAL:  the database system is starting up
2018-04-12 04:23:28 MSK 32063 [unknown]@[unknown] from [local] [vxid:
txid:0] [] LOG:  incomplete startup packet
2018-04-12 04:23:28 MSK 32064 postgres@postgres from [local] [vxid: txid:0]
[] FATAL:  the database system is starting up
2018-04-12 04:23:28 MSK 32062 @ from  [vxid: txid:0] [] LOG:  started
streaming WAL from primary at C7BB/ED000000 on timeline 1
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:30 MSK 32062 @ from  [vxid: txid:0] [] FATAL:  terminating
walreceiver process due to administrator command
2018-04-12 04:23:30 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:30 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68

and so on.
Start with log_min_messages=debug5 - doesn't add any useful information.

Restart of new slave doesn't help.
Stop new slave, manual clean xlog directory on slave and start slave -
doesn't help (e.g. it doesn't look like the single wal file was damaged
during network transfer).

Server offline at this moment and open for experiments for few days.

pg_xlogdump show no errors in wal files:
postgres@db-hh3:~$ /usr/lib/postgresql/9.6/bin/pg_xlogdump
--start=C7BB/EDC42C68
/var/lib/postgresql/9.6/main/pg_xlog/000000010000C7BB000000ED | less
first record is after C7BB/EDC42C68, at C7BB/EDC433B8, skipping over 1872
bytes
rmgr: XLOG        len (rec/tot):     51/  2556, tx:          0, lsn:
C7BB/EDC433B8, prev C7BB/EDC42928, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326257 FPW
rmgr: XLOG        len (rec/tot):     51/  3140, tx:          0, lsn:
C7BB/EDC43DB8, prev C7BB/EDC433B8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 32094 FPW
rmgr: XLOG        len (rec/tot):     51/  2595, tx:          0, lsn:
C7BB/EDC44A18, prev C7BB/EDC43DB8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326250 FPW
rmgr: XLOG        len (rec/tot):     51/  3130, tx:          0, lsn:
C7BB/EDC45440, prev C7BB/EDC44A18, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 15989 FPW
rmgr: XLOG        len (rec/tot):     51/  2590, tx:          0, lsn:
C7BB/EDC46098, prev C7BB/EDC45440, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326298 FPW


Few records before C7BB/EDC42C68:


rmgr: XLOG        len (rec/tot):     51/  2520, tx:          0, lsn:
C7BB/EDC41F38, prev C7BB/EDC41348, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326155 FPW
rmgr: XLOG        len (rec/tot):     51/  2703, tx:          0, lsn:
C7BB/EDC42928, prev C7BB/EDC41F38, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326262 FPW

rmgr: XLOG        len (rec/tot):     51/  2556, tx:          0, lsn:
C7BB/EDC433B8, prev C7BB/EDC42928, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326257 FPW
rmgr: XLOG        len (rec/tot):     51/  3140, tx:          0, lsn:
C7BB/EDC43DB8, prev C7BB/EDC433B8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 32094 FPW
rmgr: XLOG        len (rec/tot):     51/  2595, tx:          0, lsn:
C7BB/EDC44A18, prev C7BB/EDC43DB8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326250 FPW

pg_xlogdump show no visible to me anomalies in 000000010000C7BB000000ED or
000000010000C7BB000000EC files.

Any additional information I'll try provide, but my experience with internal
wal structure outside of using pg_xlogdump output very limited.


Re: BUG #15151: Error with wal replay after planned manualswitchover.

От
Michael Paquier
Дата:
On Thu, Apr 12, 2018 at 12:20:09PM +0000, PG Bug reporting form wrote:
> Today after planned manual switch-over (via remove all load from master,
> checkpoint master, wait replica in sync, stop master, stop replica, switch
> master and replica roles without promote), I got very strange error on new
> replica (log_min_messages=debug5):
>
> 2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  database
> system was shut down in recovery at 2018-04-12 04:23:26 MSK
> 2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  entering
> standby mode
> 2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
> resource manager ID 211 at C7BB/EDC42C68

C7BB/EDC42C68 is not a valid WAL record, still the message can happen if
you face the end of WAL data as the XLOG reader has not received yet all
the data it wants.  The error messages which validate the format of a
record's header can be triggered randomly, as when a WAL segment is
recycled by a checkpoint its data is not zero'ed for performance
reasons so there is random data in the segment range.  All the sanity
checks are here to be careful about that, as well as per-record CRCs.

After seeing the end of the WAL stream, as this is a standby this
fetches into a mode where it waits for WAL data using different
sources: the local pg_xlog, a full segment using restore_command or WAL
streaming.  And...

> 2018-04-12 04:23:28 MSK 32060 rohh.app@hh from 192.168.1.2 [vxid: txid:0] []
> FATAL:  the database system is starting up
> 2018-04-12 04:23:28 MSK 32061 rohh.app@hh from [local] [vxid: txid:0] []
> FATAL:  the database system is starting up
> 2018-04-12 04:23:28 MSK 32063 [unknown]@[unknown] from [local] [vxid:
> txid:0] [] LOG:  incomplete startup packet
> 2018-04-12 04:23:28 MSK 32064 postgres@postgres from [local] [vxid: txid:0]
> [] FATAL:  the database system is starting up
> 2018-04-12 04:23:28 MSK 32062 @ from  [vxid: txid:0] [] LOG:  started
> streaming WAL from primary at C7BB/ED000000 on timeline 1

When streaming begins, it fetches data from the beginning of the
previous segment, and would try to fetch its data completely, so once
the next record is fully present, then replay would be allowed to
resume.

> 2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
> resource manager ID 211 at C7BB/EDC42C68
> 2018-04-12 04:23:30 MSK 32062 @ from  [vxid: txid:0] [] FATAL:  terminating
> walreceiver process due to administrator command

What's that actually?

> Restart of new slave doesn't help.
> Stop new slave, manual clean xlog directory on slave and start slave -
> doesn't help (e.g. it doesn't look like the single wal file was damaged
> during network transfer).

Hm.  What happens if you copy 000000010000C7BB000000ED to the standby's
pg_xlog.  Have you checked if the segment used on the standby is able to
move after record C7BB/EDC42C68?  Perhaps it is not able to move on
because you are pointing to an incorrect primary so the stream cannot
resume?

> pg_xlogdump show no errors in wal files:
>
> <snip>
>
> rmgr: XLOG        len (rec/tot):     51/  2520, tx:          0, lsn:
> C7BB/EDC41F38, prev C7BB/EDC41348, desc: FPI_FOR_HINT , blkref #0: rel
> 1663/16400/18105 blk 326155 FPW
> rmgr: XLOG        len (rec/tot):     51/  2703, tx:          0, lsn:
> C7BB/EDC42928, prev C7BB/EDC41F38, desc: FPI_FOR_HINT , blkref #0: rel
> 1663/16400/18105 blk 326262 FPW
>
> rmgr: XLOG        len (rec/tot):     51/  2556, tx:          0, lsn:
> C7BB/EDC433B8, prev C7BB/EDC42928, desc: FPI_FOR_HINT , blkref #0: rel
> 1663/16400/18105 blk 326257 FPW
> rmgr: XLOG        len (rec/tot):     51/  3140, tx:          0, lsn:
> C7BB/EDC43DB8, prev C7BB/EDC433B8, desc: FPI_FOR_HINT , blkref #0: rel
> 1663/16400/150046445 blk 32094 FPW
> rmgr: XLOG        len (rec/tot):     51/  2595, tx:          0, lsn:
> C7BB/EDC44A18, prev C7BB/EDC43DB8, desc: FPI_FOR_HINT , blkref #0: rel
> 1663/16400/18105 blk 326250 FPW
>
> pg_xlogdump show no visible to me anomalies in 000000010000C7BB000000ED or
> 000000010000C7BB000000EC files.

Indeed that looks clean.

> Any additional information I'll try provide, but my experience with internal
> wal structure outside of using pg_xlogdump output very limited.

It looks like the stream is not able to resume because the replay cannot
move on.  Perhaps you should check if the standby is fetching its data
via streaming from the correct source.
--
Michael

Вложения