Bug in Physical Replication Slots (at least 9.5)?
From | Jonathon Nelson
---|---
Subject | Bug in Physical Replication Slots (at least 9.5)?
Date |
Msg-id | CACJqAM3xVz0JY1XFDKPP+JoJAjoGx=GNuOAshEDWCext7BFvCQ@mail.gmail.com
Replies | Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)? (Jonathon Nelson <jdnelson@dyn.com>)
List | pgsql-bugs
We think we have discovered a bug in the physical replication slots
functionality in PostgreSQL 9.5. We've seen the behavior across operating
systems (CentOS-7 and openSUSE LEAP 42.1), filesystems (ext4 and xfs), and
versions (9.5.3 and 9.5.4). All were on x86_64.

We notice that if we stop and then re-start the *standby*, upon restart it
will - sometimes - request a WAL file that the master no longer has.

First, the postgresql configuration differs only minimally from the stock
config:

- Assume wal_keep_segments = 0.
- Assume the use of physical replication slots.
- Assume one master, one standby.

Lastly, we have observed the behavior "in the wild" at least twice and in
the lab a dozen or so times.

EXAMPLE #1 (logs are from the replica):

    user=,db=,app=,client= DEBUG: creating and filling new WAL file
    user=,db=,app=,client= DEBUG: done creating and filling new WAL file
    user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8A000000 apply 5/748425A0
    user=,db=,app=,client= DEBUG: sending write 6/8B000000 flush 6/8B000000 apply 5/74843020

<control-c here>

    user=,db=,app=,client= DEBUG: postmaster received signal 2
    user=,db=,app=,client= LOG: received fast shutdown request
    user=,db=,app=,client= LOG: aborting any active transactions

And, upon restart:

    user=,db=,app=,client= LOG: restartpoint starting: xlog
    user=,db=,app=,client= DEBUG: updated min recovery point to 6/67002390 on timeline 1
    user=,db=,app=,client= DEBUG: performing replication slot checkpoint
    user=,db=,app=,client= DEBUG: updated min recovery point to 6/671768C0 on timeline 1
    user=,db=,app=,client= CONTEXT: writing block 589 of relation base/13294/16501
    user=,db=,app=,client= LOG: invalid magic number 0000 in log segment 00000001000000060000008B, offset 0
    user=,db=,app=,client= DEBUG: switched WAL source from archive to stream after failure
    user=,db=,app=,client= LOG:
    started streaming WAL from primary at 6/8A000000 on timeline 1
    user=,db=,app=,client= FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000060000008A has already been removed

A physical analysis shows that the WAL file 00000001000000060000008B is
100% zeroes (ASCII NUL). The results of querying pg_replication_slots show
a restart_lsn that matches …6/8B. pg_controldata shows values like:

    Minimum recovery ending location: 6/8Axxxxxx

How can the master show a position that is greater than the minimum
recovery ending location?

EXAMPLE #2:

    Minimum recovery ending location: 19DD/73FFFFE0

Log segment 00000001000019DD00000073 was not available. The restart LSN
was 19DD/74000000. The last few lines from pg_xlogdump
00000001000019DD00000073:

    rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn: 19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref #0: rel 1663/16403/150017028 blk 1832
    rmgr: Btree len (rec/tot): 2/ 64, tx: 77257, lsn: 19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref #0: rel 1663/16403/150017028 blk 11709

If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the
offset of the first byte of the last record in this file, and the record
length is 64 bytes, which places the first byte of the next record at
offset 16777184 (0xFFFFE0) (logical position 0x73FFFFE0: this agrees with
pg_controldata). However, there are only 32 bytes of file left:

    0x73FFFFA0 - 0x73000000 + 64 => 16777184
    16777216 - 16777184 => 32

Which means that the next record is in the WAL file
00000001000019DD00000074.

A possible theory:

Let us assume PG has applied 100% of the data in a given WAL file, and
let's assume (as in this case) that the WAL file is
00000001000019DD00000073. When it starts up again, it uses the control
data to start and say "The next record is at 19DD/0x73FFFFE0", which it
truncates to 0x73000000.
However, PG has *also* already told the master that it has fully
received, written, and flushed all of the data for that WAL file, so the
master has 0x74000000 as the start position (and has consequently removed
the WAL file for 0x73). The relationship between pg_controldata and
pg_replication_slots.restart_lsn seems to be very slightly (but
importantly) at odds.

Could it be this part of the code? From
src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming (as of
a0aa358ca603d8189fe4be72f614cf7cf363d81a):

    /*
     * We always start at the beginning of the segment. That prevents a broken
     * segment (i.e., with no records in the first half of a segment) from
     * being created by XLOG streaming, which might cause trouble later on if
     * the segment is e.g archived.
     */
    if (recptr % XLogSegSize != 0)
        recptr -= recptr % XLogSegSize;

We start up with 19DD/0x73FFFFE0 (but there would not be enough room in
that segment for any more records, so logically we'd have to go to
19DD/0x74000000). When we start WAL receiving, we truncate 0x73FFFFE0 to
0x73000000, which the master has already removed (and - technically - we
don't actually need?).

--
Jon Nelson
Dyn / Principal Software Engineer