Re: Standby corruption after master is restarted

Поиск
Список
Период
Сортировка
От Emre Hasegeli
Тема Re: Standby corruption after master is restarted
Дата
Msg-id CAE2gYzxnF1wwUg_nysXLKg9CMjz7SHCDdv=7xOShaA-_=uLT+A@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Standby corruption after master is restarted  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: Standby corruption after master is restarted  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список pgsql-bugs
> In the initial report (from August 2018) you shared pg_xlogdump output,
> showing that the corrupted WAL record is an FPI_FOR_HINT right after
> CHECKPOINT_SHUTDOWN. Was it the same case this time?

No, this time the xlogdump on master looks like this:

> rmgr: Heap2       len (rec/tot):    110/   110, tx:          0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN
remxid0, blkref #0: rel 1663/16385/141744 blk 331540 
> rmgr: Heap2       len (rec/tot):    100/   100, tx:          0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN
remxid0, blkref #0: rel 1663/16385/141744 blk 331541 
> rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 1DFB/D6BDDFF8, prev 1DFB/D6BDDF90, desc:
CHECKPOINT_SHUTDOWNredo 1DFB/D6BDDFF8; tli 5; prev tli 5; fpw true; xid 2/3522301709; oid 141900; multi 3434; offset
7985;oldest xid 3329390243 in DB 16385; oldest multi 1 in DB 16385; oldest/newest commit timestamp xid: 0/0; oldest
runningxid 0; shutdown 
> rmgr: Heap        len (rec/tot):     56/  2511, tx: 3522301709, lsn: 1DFB/D6BDE080, prev 1DFB/D6BDDFF8, desc: INSERT
off93, blkref #0: rel 1663/40796/40826 blk 0 FPW 
> rmgr: Transaction len (rec/tot):     34/    34, tx: 3522301709, lsn: 1DFB/D6BDEA50, prev 1DFB/D6BDE080, desc: COMMIT
2018-03-2814:01:36.800811 UTC 
> rmgr: Heap        len (rec/tot):     56/  1257, tx: 3522301710, lsn: 1DFB/D6BDEA78, prev 1DFB/D6BDEA50, desc: INSERT
off41, blkref #0: rel 1663/40796/40820 blk 345119 FPW 
> rmgr: Btree       len (rec/tot):     55/  1857, tx: 3522301710, lsn: 1DFB/D6BDEF68, prev 1DFB/D6BDEA78, desc:
INSERT_LEAFoff 125, blkref #0: rel 1663/40796/40839 blk 107903 FPW 
> rmgr: Transaction len (rec/tot):     34/    34, tx: 3522301710, lsn: 1DFB/D6BDF6B0, prev 1DFB/D6BDEF68, desc: COMMIT
2018-03-2814:01:37.361010 UTC 

> BTW which versions are we talking about? I see the initial report
> mentioned catversion 201608131, this one mentions 201510051, so I'm
> guessing 9.6 and 9.5. Which minor versions?

The first incident was on 9.6.3, the second 9.5.7.

> Is the master under load (accepting writes) before shutdown?

Yes.  Especially on the second incident it was being written heavily.
The master gets up to 10k INSERT transactions per second before being
killed.

> How was it restarted, actually? I see you're mentioning OOM killer, so I
> guess "kill -9". What about the first report - was it the same case, or
> was it restarted "nicely" using pg_ctl?

On the first incident the master was being restarted as part of
upgrade from 9.6.3 to 9.6.4.  It should have been nicely done by the
Debian upgrade scripts.

> Could the replica receive the WAL in some other way - say, from a WAL
> archive? What archive/restore commands you use?

We have repmgr + barman setup:

> archive_command = 'test $(psql -Atc "select failed_count from pg_stat_archiver") -gt 10 || rsync --quiet
--contimeout=2--timeout=2 %p BARMAN_HOST::barman/' 
> restore_command = 'barman-wal-restore --gzip BARMAN_HOST BARMAN_CLUSTER %f %p'


В списке pgsql-bugs по дате отправления:

Предыдущее
От: Aman Bhadoriya
Дата:
Сообщение: installation error
Следующее
От: Emre Hasegeli
Дата:
Сообщение: Re: Standby corruption after master is restarted