Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From: Venkata B Nagothi
Subject: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date:
Msg-id: CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg@mail.gmail.com
In reply to: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses: Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?
List: pgsql-bugs

On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello. I added pgsql-hackers.

This occurs also on git master and back to 9.4.

At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson <jdnelson@dyn.com> wrote in <CACJqAM1ydcZcd5DoCp+y5hkWto1ZeGW+Mj8UK7avqctbGJO8Bw@mail.gmail.com>
> On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnelson@dyn.com> wrote:
> > First, the postgresql configuration differs only minimally from the stock
> > config:
> >
> > Assume wal_keep_segments = 0.
> > Assume the use of physical replication slots.
> > Assume one master, one standby.
> >
> > Lastly, we have observed the behavior "in the wild" at least twice and in
> > the lab a dozen or so times.
> >
> > EXAMPLE #1 (logs are from the replica):
> >
> > user=,db=,app=,client= DEBUG:  creating and filling new WAL file
> > user=,db=,app=,client= DEBUG:  done creating and filling new WAL file
> > user=,db=,app=,client= DEBUG:  sending write 6/8B000000 flush 6/8A000000
> > apply 5/748425A0
> > user=,db=,app=,client= DEBUG:  sending write 6/8B000000 flush 6/8B000000
> > apply 5/74843020
> > <control-c here>
> > user=,db=,app=,client= DEBUG:  postmaster received signal 2
> > user=,db=,app=,client= LOG:  received fast shutdown request
> > user=,db=,app=,client= LOG:  aborting any active transactions
> >
> > And, upon restart:
> >
> > user=,db=,app=,client= LOG:  restartpoint starting: xlog
> > user=,db=,app=,client= DEBUG:  updated min recovery point to 6/67002390 on
> > timeline 1
> > user=,db=,app=,client= DEBUG:  performing replication slot checkpoint
> > user=,db=,app=,client= DEBUG:  updated min recovery point to 6/671768C0 on
> > timeline 1
> > user=,db=,app=,client= CONTEXT:  writing block 589 of relation
> > base/13294/16501
> > user=,db=,app=,client= LOG:  invalid magic number 0000 in log segment
> > 00000001000000060000008B, offset 0
> > user=,db=,app=,client= DEBUG:  switched WAL source from archive to stream
> > after failure
> > user=,db=,app=,client= LOG:  started streaming WAL from primary at
> > 6/8A000000 on timeline 1
> > user=,db=,app=,client= FATAL:  could not receive data from WAL stream:
> > ERROR:  requested WAL segment 00000001000000060000008A has already been
> > removed

I managed to reproduce this. A small tweak (the first patch) makes the
standby kill itself as soon as walreceiver sees a contrecord at the
beginning of a segment.

- M(aster): Create a master cluster with wal_keep_segments = 0
            (default) and log_min_messages = debug2
- M: Create a physical repslot.
- S(tandby): Setup a standby database.
- S: Edit recovery.conf to use the replication slot above then
     start it.
- S: touch /tmp/hoge
- M: Run pgbench ...
- S: After a while, the standby stops.
  > LOG:  #################### STOP THE SERVER

- M: Stop pgbench.
- M: Do 'checkpoint;' twice.
- S: rm /tmp/hoge
- S: Fails to catch up with the following error.

  > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000002B has already been removed
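
For reference, the slot setup described in the steps above amounts to roughly the following; the slot name and connection values are only placeholders, not the ones actually used:

    -- on the master: create the physical replication slot
    SELECT pg_create_physical_replication_slot('repslot1');

    # on the standby, in recovery.conf: stream via that slot
    standby_mode = 'on'
    primary_conninfo = 'host=master port=5432 user=replicator'
    primary_slot_name = 'repslot1'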


I have been testing and reviewing the latest patch, "0001-Fix-a-bug-of-physical-replication-slot.patch", and I think I need some more clarification on this.

Before applying the patch, I tried reproducing the above error with the following steps (see the sketch after the error output below):

- I had a master->standby pair in streaming replication
- Took a backup of the master
   - with a low max_wal_size and wal_keep_segments = 0
- Configured the standby with recovery.conf
- Created a replication slot on the master
- Configured the replication slot on the standby and started the standby
- I got the error below:
  
   >> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at 0/F2000140: wanted 24, got 0
   >> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from primary at 0/F2000000 on timeline 1
   >> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000000000000F2 has already been removed
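
Roughly, the setup above corresponds to something like the following sketch; the sizes, hostnames and paths are only illustrative:

    # on the master, postgresql.conf (encourages early removal of old segments)
    max_wal_size = 64MB
    wal_keep_segments = 0

    # base backup used to build the standby
    pg_basebackup -h master -D /data/standby -X stream -P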

I noticed that the file "0000000100000000000000F2" had been removed from the master. This can be reproduced easily, and it occurs irrespective of whether replication slots are configured.

As long as the file "0000000100000000000000F2" is available on the master, the standby continues to stream WAL without any issues.
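
One way to cross-check this on the master is to compare the segment named in the error with the one the slot still pins via its restart_lsn, e.g. (pg_walfile_name is the current name of the function; on 9.x it is pg_xlogfile_name):

    SELECT slot_name, active, restart_lsn,
           pg_walfile_name(restart_lsn) AS needed_segment
      FROM pg_replication_slots;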

Some more details:

Contents of the file "0000000100000000000000F2" on the standby, before pg_stop_backup():

rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000028, prev 0/F1000098, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000060, prev 0/F2000028, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 0/F2000098, prev 0/F2000060, desc: CHECKPOINT_ONLINE redo 0/F2000060; tli 1; prev tli 1; fpw true; xid 0:638; oid 16487; multi 1; offset 0; oldest xid 544 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 638; online
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000108, prev 0/F2000098, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
pg_waldump: FATAL:  error in WAL record at 0/F2000108: invalid record length at 0/F2000140: wanted 24, got 0

Contents of the same file on the master, after pg_stop_backup():

rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000028, prev 0/F1000098, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000060, prev 0/F2000028, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 0/F2000098, prev 0/F2000060, desc: CHECKPOINT_ONLINE redo 0/F2000060; tli 1; prev tli 1; fpw true; xid 0:638; oid 16487; multi 1; offset 0; oldest xid 544 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 638; online
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000108, prev 0/F2000098, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: Heap2       len (rec/tot):      8/  7735, tx:          0, lsn: 0/F2000140, prev 0/F2000108, desc: CLEAN remxid 620, blkref #0: rel 1663/13179/2619 blk 2 FPW
rmgr: Heap2       len (rec/tot):      8/  6863, tx:          0, lsn: 0/F2001F78, prev 0/F2000140, desc: CLEAN remxid 620, blkref #0: rel 1663/13179/2840 blk 0 FPW
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2003A60, prev 0/F2001F78, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2003A98, prev 0/F2003A60, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 0/F2003AD0, prev 0/F2003A98, desc: CHECKPOINT_ONLINE redo 0/F2003A98; tli 1; prev tli 1; fpw true; xid 0:638; oid 16487; multi 1; offset 0; oldest xid 544 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 638; online
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2003B40, prev 0/F2003AD0, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG        len (rec/tot):      8/    34, tx:          0, lsn: 0/F2003B78, prev 0/F2003B40, desc: BACKUP_END 0/F2000060
rmgr: XLOG        len (rec/tot):      0/    24, tx:          0, lsn: 0/F2003BA0, prev 0/F2003B78, desc: SWITCH
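
For completeness, dumps like the ones above come from pointing pg_waldump at the segment file (the tool is named pg_xlogdump on 9.x; the path below is illustrative):

    pg_waldump /path/to/pg_wal/0000000100000000000000F2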

If the scenario I created to reproduce the error is correct, then applying the patch does not make a difference.

I think I need help building a specific test case that will reproduce the reported bug related to physical replication slots.

I will continue to review the patch and will follow up once I have any further comments.

Regards,
Venkata B N

Database Consultant

