Physical slot restart_lsn advances incorrectly requiring restore from archive

From: Hsu, John
Subject: Physical slot restart_lsn advances incorrectly requiring restore from archive
Msg-id: 77734732-44A4-4209-8C2F-3AF36C9D4D18@amazon.com
Responses: Re: Physical slot restart_lsn advances incorrectly requiring restore from archive (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List: pgsql-hackers
Hi hackers,
 
We believe we’re seeing a problem with how a physical slot’s restart_lsn is advanced, leading to replicas needing to
restore from archive in order for replication to resume.
 
The logs below are from reproductions against 10.13. I’m still working on reproducing it for 12.3.
 
1. A WAL write spans two WAL segments.
2. The write to the first WAL segment completes, but the write to the second segment does not.
3. The write to the first WAL segment is acknowledged as flushed by the Postgres replica.
4. The primary restarts before the write to the second segment completes, meaning the complete WAL record was never written.
5. Crash recovery finishes at a record before the incomplete WAL write.
6. The replica now starts the slot at the next WAL segment, since the previous WAL segment was already acknowledged as flushed (see the sketch below).
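For context, the slot behavior we're leaning on here is essentially "whatever the standby last reported as flushed becomes the new restart_lsn", which then gets persisted at the next checkpoint. A minimal illustrative sketch of that flow (hypothetical names, not the actual walsender code):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical stand-in for the persistent part of a physical slot. */
typedef struct
{
    XLogRecPtr  restart_lsn;    /* oldest WAL the standby may still need */
} PhysicalSlotSketch;

/*
 * Illustrative only: each standby reply's flush position is taken as the
 * new restart_lsn, even when it sits exactly on a segment boundary whose
 * segment the primary has not durably written yet.  The new value is then
 * written out by the next checkpoint-time slot save.
 */
static bool
confirm_flush(PhysicalSlotSketch *slot, XLogRecPtr flushed_lsn)
{
    if (slot->restart_lsn == flushed_lsn)
        return false;           /* nothing new reported */

    slot->restart_lsn = flushed_lsn;
    return true;                /* caller marks the slot dirty for the next save */
}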
 
Primary crashes because it ran out of space:
2020-07-10 01:23:31.399 UTC:10.15.4.83(56430):replication_user@[unknown]:[4554]:DEBUG: write 0/2C000000 flush 0/2BFD2000 apply 0/216ACCD0
 
2020-07-10 01:23:31.401 UTC:10.15.4.83(56430):replication_user@[unknown]:[4554]:DEBUG: write 0/2C000000 flush 0/2C000000 apply 0/216AE728
 
2020-07-10 01:23:31.504 UTC::@:[4548]:DEBUG: creating and filling new WAL file
2020-07-10 01:23:31.511 UTC::@:[4548]:PANIC: could not write to file "pg_wal/xlogtemp.4548": No space left on device
2020-07-10 01:23:31.518 UTC::@:[4543]:DEBUG: reaping dead processes
 
Crash recovery begins:
2020-07-10 01:23:36.074 UTC::@:[8677]:DEBUG: checkpoint record is at 0/2B51B030
2020-07-10 01:23:36.074 UTC::@:[8677]:DEBUG: redo record is at 0/2A65AE08; shutdown FALSE
..
2020-07-10 01:23:36.076 UTC::@:[8677]:DEBUG: starting up replication slots
2020-07-10 01:23:36.076 UTC::@:[8677]:DEBUG: restoring replication slot from "pg_replslot/physical_slot_1/state"
2020-07-10 01:23:36.078 UTC::@:[8677]:LOG: restart_lsn for cp slot physical_slot_1: 0/2BF12000 (extra debug logs I added)
2020-07-10 01:23:36.081 UTC::@:[8677]:LOG: redo starts at 0/2A65AE08
...
2020-07-10 01:23:36.325 UTC::@:[8677]:LOG: redo done at 0/2BFFFFB0
...
2020-07-10 01:23:36.330 UTC::@:[8677]:LOG: checkpoint starting: end-of-recovery immediate
2020-07-10 01:23:36.332 UTC::@:[8677]:DEBUG: performing replication slot checkpoint
...
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=13 file=base/13934/2662 time=0.001 msec
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=14 file=base/13934/2663 time=0.001 msec
2020-07-10 01:23:36.380 UTC::@:[8677]:DEBUG: checkpoint sync: number=15 file=base/13934/24586 time=0.001 msec
2020-07-10 01:23:36.385 UTC::@:[8677]:LOG: could not signal for checkpoint: checkpointer is not running
2020-07-10 01:23:36.385 UTC::@:[8677]:DEBUG: creating and filling new WAL file
2020-07-10 01:23:36.397 UTC::@:[8677]:PANIC: could not write to file "pg_wal/xlogtemp.8677": No space left on device
 
The primary runs out of space during crash recovery. Space is freed up afterwards and crash recovery begins again.
 
2020-07-10 01:32:45.804 UTC::@:[16329]:DEBUG: checkpoint record is at 0/2B51B030
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: redo record is at 0/2A65AE08; shutdown FALSE
...
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: starting up replication slots
2020-07-10 01:32:45.805 UTC::@:[16329]:DEBUG: restoring replication slot from "pg_replslot/physical_slot_1/state"
2020-07-10 01:32:45.806 UTC::@:[16329]:LOG: restart_lsn for cp slot physical_slot_1: 0/2BF12000
...
2020-07-10 01:32:45.809 UTC::@:[16329]:LOG: redo starts at 0/2A65AE08
2020-07-10 01:32:46.043 UTC::@:[16329]:DEBUG: could not open file "pg_wal/00000001000000000000002C": No such file or directory
2020-07-10 01:32:46.043 UTC::@:[16329]:LOG: redo done at 0/2BFFFFB0
 
Redo finishes at 0/2BFFFFB0 even though the flush we received from the replica is already at 0/2C000000.
 
This is problematic because the replica reconnects to the slot telling it to start past the new redo point.
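To make the segment math concrete: assuming the default 16 MB wal_segment_size, the acknowledged flush and the end of redo fall into different segments. A standalone sketch of the arithmetic (an illustration, not PostgreSQL's XLByteToSeg macro itself):

#include <stdint.h>
#include <stdio.h>

/* Assumes the default 16 MB WAL segment size. */
#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)

/* Which WAL segment a given LSN falls into. */
static uint64_t
lsn_to_segno(uint64_t lsn)
{
    return lsn / WAL_SEGMENT_SIZE;
}

int
main(void)
{
    uint64_t    redo_end  = 0x2BFFFFB0;     /* "redo done at 0/2BFFFFB0" */
    uint64_t    ack_flush = 0x2C000000;     /* flush acknowledged by the replica */

    /* Prints 2B and 2C: the acknowledged flush sits exactly on the boundary
     * of a segment the primary never durably wrote. */
    printf("redo end  -> segment %llX\n", (unsigned long long) lsn_to_segno(redo_end));
    printf("ack flush -> segment %llX\n", (unsigned long long) lsn_to_segno(ack_flush));
    return 0;
}

Once the slot's restart_lsn crosses into segment ...2C, nothing keeps ...2B around on the primary, even though the replica's last valid record still ends inside ...2B.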
 
2020-07-10 01:32:50.641 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: received replication command: START_REPLICATION SLOT "physical_slot_1" 0/2C000000 TIMELINE 1
 
2020-07-10 01:32:50.641 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: "walreceiver" has now caught up with upstream server
 
2020-07-10 01:32:50.774 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: write 0/2C000B80 flush 0/2BFFFFF0 apply 0/2BFFFFF0
 
2020-07-10 01:32:50.775 UTC:10.15.4.83(56698):replication_user@[unknown]:[16572]:DEBUG: write 0/2C000B80 flush 0/2C000B80 apply 0/2BFFFFF0
 
 
This leads to a mismatch between the end of segment 0/2B on the primary and what was streamed to the replica.
 
Replica logs:
2020-07-10 01:32:50.671 UTC::@:[24899]:LOG: started streaming WAL from primary at 0/2C000000 on timeline 1
...
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: could not restore file "00000001000000000000002C" from archive: child process exited with exit code 1
 
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: invalid contrecord length 90 at 0/2BFFFFF0
2020-07-10 01:39:32.251 UTC::@:[11703]:DEBUG: switched WAL source from archive to stream after failure
 
Now the physical slot has advanced past segment 0/2B, which is what the replica actually needs.
 
postgres=> select * from pg_replication_slots;
    slot_name    | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 physical_slot_1 |        | physical  |        |          | f         | f      |            |      |              | 0/2C000B80  |
(1 row)
 
 
Segment 0/2B can now rotate out on the primary, requiring a restore from archive in order for replication to resume.
 
The attached patch (against 10) attempts to address this by keeping track of the first flushLsn in the current segNo,
and waiting until we receive one past it before updating. This prevents the WAL from rotating out on the primary, so a
reboot of the replica fixes things instead of needing a restore from archive.
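Roughly, the rule the patch tries to enforce looks like the sketch below; the names (first_flush_in_segno and friends) are illustrative, not the actual field names, and the real patch hooks into the existing slot code rather than standing alone:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)  /* assume 16 MB */

/* Illustrative per-slot bookkeeping; field names are made up for this sketch. */
typedef struct
{
    XLogRecPtr  restart_lsn;            /* persisted slot position */
    uint64_t    restart_segno;          /* segment we are currently tracking */
    XLogRecPtr  first_flush_in_segno;   /* first flush LSN seen for that segment */
} PhysicalSlotSketch;

/*
 * Sketch of the proposed rule: remember the first flush LSN reported for a
 * segment, and only advance restart_lsn once the standby reports a flush
 * beyond that first value.  A single flush that merely lands on the next
 * segment boundary right before a crash therefore cannot, on its own, drag
 * restart_lsn into a segment the primary never completed.
 */
static bool
maybe_advance_restart_lsn(PhysicalSlotSketch *slot, XLogRecPtr flushed_lsn)
{
    uint64_t    flush_segno = flushed_lsn / WAL_SEGMENT_SIZE;

    if (flush_segno != slot->restart_segno)
    {
        /* First flush seen for a new segment: note it, but don't advance yet
         * (compare the "lsn is not in restartSegNo, update to match" messages
         * in the patched logs below). */
        slot->restart_segno = flush_segno;
        slot->first_flush_in_segno = flushed_lsn;
        return false;
    }

    if (flushed_lsn <= slot->first_flush_in_segno)
        return false;           /* nothing past the first flush yet; keep waiting */

    slot->restart_lsn = flushed_lsn;
    return true;                /* safe to persist at the next checkpoint */
}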
 
 
With the patch:
 
Primary goes into crash recovery and we avoid updating the restart_lsn of the slot:
 
2020-07-10 18:50:12.686 UTC::@:[6160]:LOG: redo starts at 0/2D417108
2020-07-10 18:50:12.965 UTC::@:[6160]:DEBUG: could not open file "pg_wal/00000001000000000000002F": No such file or directory
2020-07-10 18:50:12.965 UTC::@:[6160]:LOG: redo done at 0/2EFFFF90
...
2020-07-10 18:59:32.987 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:DEBUG: received replication command: START_REPLICATION SLOT "physical_slot_2" 0/2F000000 TIMELINE 1
 
2020-07-10 18:59:33.937 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:DEBUG: write 0/2F020000 flush 0/2EFFFFD0 apply 0/2EFFFFD0
 
2020-07-10 18:59:33.937 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:LOG: lsn is not in restartSegNo, update to match
 
2020-07-10 18:59:33.938 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:DEBUG: write 0/2F020000 flush 0/2F020000 apply 0/2EFFFFD0
 
2020-07-10 18:59:33.938 UTC:10.15.0.240(9056):replication_user@[unknown]:[19623]:LOG: lsn is not in restartSegNo, update to match
 
 
Replica logs:
2020-07-10 18:59:54.040 UTC::@:[12873]:DEBUG: could not restore file "00000001000000000000002F" from archive: child process exited with exit code 1
 
2020-07-10 18:59:54.040 UTC::@:[12873]:DEBUG: invalid contrecord length 58 at 0/2EFFFFD0
 
 
Since the flushLSN hasn't advanced past the first one in the restartSegNo, it doesn't get updated in future
checkpoints.
 
postgres=> select * from pg_replication_slots;
    slot_name    | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
-----------------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
 physical_slot_2 |        | physical  |        |          | f         | f      |            |      |              | 0/2DFF8000  |
(1 row)
 
 
Rebooting the replica allows replication to resume from the slot and restart_lsn advances normally.

 Thanks,

 John H

