BUG #15151: Error with wal replay after planned manual switchover.

From: PG Bug reporting form
Subject: BUG #15151: Error with wal replay after planned manual switchover.
Msg-id: 152353560977.31233.11406243320876395727@wrigleys.postgresql.org
Replies: Re: BUG #15151: Error with wal replay after planned manual switchover. (Michael Paquier <michael@paquier.xyz>)
List: pgsql-bugs
The following bug has been logged on the website:

Bug reference:      15151
Logged by:          Maxim Boguk
Email address:      mb@dataegret.com
PostgreSQL version: 9.6.8
Operating system:   Linux
Description:

Hi,

Today, after a planned manual switchover (remove all load from the master,
checkpoint the master, wait for the replica to be in sync, stop the master,
stop the replica, swap the master and replica roles without promote), I got a
very strange error on the new replica (log_min_messages=debug5):

2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  database
system was shut down in recovery at 2018-04-12 04:23:26 MSK
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  entering
standby mode
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:28 MSK 32060 rohh.app@hh from 192.168.1.2 [vxid: txid:0] []
FATAL:  the database system is starting up
2018-04-12 04:23:28 MSK 32061 rohh.app@hh from [local] [vxid: txid:0] []
FATAL:  the database system is starting up
2018-04-12 04:23:28 MSK 32063 [unknown]@[unknown] from [local] [vxid:
txid:0] [] LOG:  incomplete startup packet
2018-04-12 04:23:28 MSK 32064 postgres@postgres from [local] [vxid: txid:0]
[] FATAL:  the database system is starting up
2018-04-12 04:23:28 MSK 32062 @ from  [vxid: txid:0] [] LOG:  started
streaming WAL from primary at C7BB/ED000000 on timeline 1
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:30 MSK 32062 @ from  [vxid: txid:0] [] FATAL:  terminating
walreceiver process due to administrator command
2018-04-12 04:23:30 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:30 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68

and so on.
Starting with log_min_messages=debug5 does not add any useful information.
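
To confirm that every failure points at the same position, the log can be scanned for the repeated message. The following is a minimal sketch, not part of the original report: the function name count_invalid_rmgr and the SAMPLE excerpt are illustrative, and the pattern tolerates the line wrap inside the message.

```python
import re
from collections import Counter

# Matches "invalid resource manager ID <n> at <LSN>", allowing the message
# to be broken across lines by mail wrapping.
PATTERN = re.compile(
    r"invalid\s+resource\s+manager\s+ID\s+(\d+)\s+at\s+([0-9A-F]+/[0-9A-F]+)"
)

def count_invalid_rmgr(log_text):
    """Count occurrences of each (resource manager ID, LSN) pair in the log."""
    return Counter(
        (int(m.group(1)), m.group(2)) for m in PATTERN.finditer(log_text)
    )

# Excerpt copied from the log above (illustrative sample only).
SAMPLE = """\
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:28 MSK 32062 @ from  [vxid: txid:0] [] LOG:  started
streaming WAL from primary at C7BB/ED000000 on timeline 1
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
"""

for (rmgr_id, lsn), hits in count_invalid_rmgr(SAMPLE).items():
    print(rmgr_id, lsn, hits)
```

A single (ID, LSN) pair in the result means the startup process keeps failing at the same location on every retry.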

Restarting the new slave doesn't help.
Stopping the new slave, manually cleaning the xlog directory on the slave and
starting the slave again doesn't help either (i.e. it doesn't look like a
single WAL file was damaged during network transfer).

The server is offline at the moment and available for experiments for a few days.

pg_xlogdump shows no errors in the WAL files:
postgres@db-hh3:~$ /usr/lib/postgresql/9.6/bin/pg_xlogdump --start=C7BB/EDC42C68 /var/lib/postgresql/9.6/main/pg_xlog/000000010000C7BB000000ED | less
first record is after C7BB/EDC42C68, at C7BB/EDC433B8, skipping over 1872
bytes
rmgr: XLOG        len (rec/tot):     51/  2556, tx:          0, lsn:
C7BB/EDC433B8, prev C7BB/EDC42928, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326257 FPW
rmgr: XLOG        len (rec/tot):     51/  3140, tx:          0, lsn:
C7BB/EDC43DB8, prev C7BB/EDC433B8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 32094 FPW
rmgr: XLOG        len (rec/tot):     51/  2595, tx:          0, lsn:
C7BB/EDC44A18, prev C7BB/EDC43DB8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326250 FPW
rmgr: XLOG        len (rec/tot):     51/  3130, tx:          0, lsn:
C7BB/EDC45440, prev C7BB/EDC44A18, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 15989 FPW
rmgr: XLOG        len (rec/tot):     51/  2590, tx:          0, lsn:
C7BB/EDC46098, prev C7BB/EDC45440, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326298 FPW
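
The relation between the failing LSN, the segment file passed to pg_xlogdump and the "skipping over 1872 bytes" message can be checked with a little arithmetic. This is a minimal sketch that assumes the default 16 MB WAL segment size (the report does not state it); parse_lsn and wal_file_name are illustrative helper names, not PostgreSQL functions.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments

def parse_lsn(text):
    """Convert an LSN such as 'C7BB/EDC42C68' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def wal_file_name(timeline, lsn, segment_size=WAL_SEGMENT_SIZE):
    """Name of the WAL segment file that contains the given LSN."""
    segments_per_xlogid = 0x100000000 // segment_size
    segno = lsn // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )

failing = parse_lsn("C7BB/EDC42C68")       # LSN from the error message
first_record = parse_lsn("C7BB/EDC433B8")  # first record pg_xlogdump found

print(wal_file_name(1, failing))  # 000000010000C7BB000000ED
print(first_record - failing)     # 1872
```

Under that assumption the failing LSN falls in the same segment file that was dumped, and the distance to the first record found matches the number of bytes pg_xlogdump reports skipping.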


A few records before C7BB/EDC42C68:


rmgr: XLOG        len (rec/tot):     51/  2520, tx:          0, lsn:
C7BB/EDC41F38, prev C7BB/EDC41348, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326155 FPW
rmgr: XLOG        len (rec/tot):     51/  2703, tx:          0, lsn:
C7BB/EDC42928, prev C7BB/EDC41F38, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326262 FPW

rmgr: XLOG        len (rec/tot):     51/  2556, tx:          0, lsn:
C7BB/EDC433B8, prev C7BB/EDC42928, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326257 FPW
rmgr: XLOG        len (rec/tot):     51/  3140, tx:          0, lsn:
C7BB/EDC43DB8, prev C7BB/EDC433B8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 32094 FPW
rmgr: XLOG        len (rec/tot):     51/  2595, tx:          0, lsn:
C7BB/EDC44A18, prev C7BB/EDC43DB8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326250 FPW
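
The record list above can also be checked for continuity: each record's total length should lead to the start of the next one. The sketch below does that and then looks up which record covers C7BB/EDC42C68. It rests on assumptions not stated in the report: 8192-byte WAL pages, 16 MB segments, a 24-byte header on ordinary pages, a 40-byte header on the first page of a segment, and 8-byte record alignment. All function names are illustrative.

```python
XLOG_BLCKSZ = 8192                   # assumption: default WAL page size
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size
SHORT_PAGE_HEADER = 24               # assumption: ordinary page header size
LONG_PAGE_HEADER = 40                # assumption: first page of a segment

def parse_lsn(text):
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def format_lsn(lsn):
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)

def skip_page_header(pos):
    """If pos sits on a page boundary, step over that page's header."""
    if pos % XLOG_BLCKSZ == 0:
        pos += LONG_PAGE_HEADER if pos % WAL_SEGMENT_SIZE == 0 else SHORT_PAGE_HEADER
    return pos

def next_record_start(lsn, total_len):
    """Where the record after (lsn, total_len) should begin."""
    pos, remaining = lsn, total_len
    while remaining > 0:
        pos = skip_page_header(pos)
        room = XLOG_BLCKSZ - pos % XLOG_BLCKSZ  # bytes left on this page
        step = min(room, remaining)
        pos += step
        remaining -= step
    pos = (pos + 7) & ~7  # round up to an 8-byte boundary
    return skip_page_header(pos)

def record_containing(records, lsn):
    """Return the start LSN of the record whose byte range covers lsn."""
    for start_text, total_len in records:
        start = parse_lsn(start_text)
        if start <= lsn < next_record_start(start, total_len):
            return start_text
    return None

# (lsn, total length) pairs copied from the pg_xlogdump output above.
RECORDS = [
    ("C7BB/EDC41F38", 2520), ("C7BB/EDC42928", 2703),
    ("C7BB/EDC433B8", 2556), ("C7BB/EDC43DB8", 3140),
    ("C7BB/EDC44A18", 2595), ("C7BB/EDC45440", 3130),
]

for (start, length), (following, _) in zip(RECORDS, RECORDS[1:]):
    print(start, "->", format_lsn(next_record_start(parse_lsn(start), length)),
          "listed next:", following)

print(record_containing(RECORDS, parse_lsn("C7BB/EDC42C68")))
```

With these assumptions the chain is continuous, and C7BB/EDC42C68 falls 832 bytes into the record that starts at C7BB/EDC42928 rather than on a record boundary. That is only an observation from the arithmetic and agrees with pg_xlogdump skipping 1872 bytes to reach the next record; it does not by itself explain why replay starts at that position.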

pg_xlogdump shows no anomalies visible to me in the 000000010000C7BB000000ED
or 000000010000C7BB000000EC files.

I'll try to provide any additional information needed, but my experience with
the internal WAL structure beyond reading pg_xlogdump output is very limited.

