Re: Standby corruption after master is restarted

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: Standby corruption after master is restarted
Дата
Msg-id be30f068-1876-a1dd-287b-7e6c68e7ed0b@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: Standby corruption after master is restarted  (Emre Hasegeli <emre@hasegeli.com>)
Ответы Re: Standby corruption after master is restarted  (Emre Hasegeli <emre@hasegeli.com>)
Список pgsql-bugs

On 04/16/2018 10:55 AM, Emre Hasegeli wrote:
>> In the initial report (from August 2018) you shared pg_xlogdump output,
>> showing that the corrupted WAL record is an FPI_FOR_HINT right after
>> CHECKPOINT_SHUTDOWN. Was it the same case this time?
> 
> No, this time the xlogdump on master looks like this:
> 
>> rmgr: Heap2       len (rec/tot):    110/   110, tx:          0, lsn: 1DFB/D6BDDF20, prev 1DFB/D6BDDEE0, desc: CLEAN
remxid0, blkref #0: rel 1663/16385/141744 blk 331540
 
>> rmgr: Heap2       len (rec/tot):    100/   100, tx:          0, lsn: 1DFB/D6BDDF90, prev 1DFB/D6BDDF20, desc: CLEAN
remxid0, blkref #0: rel 1663/16385/141744 blk 331541
 
>> rmgr: XLOG        len (rec/tot):    106/   106, tx:          0, lsn: 1DFB/D6BDDFF8, prev 1DFB/D6BDDF90, desc:
CHECKPOINT_SHUTDOWNredo 1DFB/D6BDDFF8; tli 5; prev tli 5; fpw true; xid 2/3522301709; oid 141900; multi 3434; offset
7985;oldest xid 3329390243 in DB 16385; oldest multi 1 in DB 16385; oldest/newest commit timestamp xid: 0/0; oldest
runningxid 0; shutdown
 
>> rmgr: Heap        len (rec/tot):     56/  2511, tx: 3522301709, lsn: 1DFB/D6BDE080, prev 1DFB/D6BDDFF8, desc: INSERT
off93, blkref #0: rel 1663/40796/40826 blk 0 FPW
 
>> rmgr: Transaction len (rec/tot):     34/    34, tx: 3522301709, lsn: 1DFB/D6BDEA50, prev 1DFB/D6BDE080, desc: COMMIT
2018-03-2814:01:36.800811 UTC
 
>> rmgr: Heap        len (rec/tot):     56/  1257, tx: 3522301710, lsn: 1DFB/D6BDEA78, prev 1DFB/D6BDEA50, desc: INSERT
off41, blkref #0: rel 1663/40796/40820 blk 345119 FPW
 
>> rmgr: Btree       len (rec/tot):     55/  1857, tx: 3522301710, lsn: 1DFB/D6BDEF68, prev 1DFB/D6BDEA78, desc:
INSERT_LEAFoff 125, blkref #0: rel 1663/40796/40839 blk 107903 FPW
 
>> rmgr: Transaction len (rec/tot):     34/    34, tx: 3522301710, lsn: 1DFB/D6BDF6B0, prev 1DFB/D6BDEF68, desc: COMMIT
2018-03-2814:01:37.361010 UTC
 
> 

OK, and on the standby? I suppose it breaks right after the
CHECKPOINT_SHUTDOWN record, right?

>> BTW which versions are we talking about? I see the initial report
>> mentioned catversion 201608131, this one mentions 201510051, so I'm
>> guessing 9.6 and 9.5. Which minor versions?
> 
> The first incident was on 9.6.3, the second 9.5.7.
> 
>> Is the master under load (accepting writes) before shutdown?
> 
> Yes.  Especially on the second incident it was being written heavily.
> The master gets up to 10k INSERT transactions per second before being
> killed.
> 
>> How was it restarted, actually? I see you're mentioning OOM killer, so I
>> guess "kill -9". What about the first report - was it the same case, or
>> was it restarted "nicely" using pg_ctl?
> 
> On the first incident the master was being restarted as part of
> upgrade from 9.6.3 to 9.6.4.  It should have been nicely done by the
> Debian upgrade scripts.
> 

Hmmm. I'm not sure what exactly debian upgrade script (and IIRC it
depends on packages). I wonder if it does fast/smart/immediate.

>> Could the replica receive the WAL in some other way - say, from a WAL
>> archive? What archive/restore commands you use?
> 
> We have repmgr + barman setup:
> 
>> archive_command = 'test $(psql -Atc "select failed_count from pg_stat_archiver") -gt 10 || rsync --quiet
--contimeout=2--timeout=2 %p BARMAN_HOST::barman/'
 
>> restore_command = 'barman-wal-restore --gzip BARMAN_HOST BARMAN_CLUSTER %f %p'

And can you determine how the broken WAL got to the standby? Was it by
directly fetching from the primary, or from the barman archive? Is it
broken in the archive too?

I think one hypothesis is that the standby starts fetching a WAL segment
from the primary, and while doing that it gets recycled and overwritten
(so the standby gets bogus somewhere half-way through). That should not
be possible, of course, but perhaps there's something broken right after
a restart? A race condition or something like that?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


В списке pgsql-bugs по дате отправления:

Предыдущее
От: Emre Hasegeli
Дата:
Сообщение: Re: Standby corruption after master is restarted
Следующее
От: Emre Hasegeli
Дата:
Сообщение: Re: Standby corruption after master is restarted