Re: "invalid contrecord" error on replica
From: Adrien Nayrat
Subject: Re: "invalid contrecord" error on replica
Msg-id: 1f1da31b-6b39-55e5-a9b0-487068d5b225@anayrat.info
In reply to: "invalid contrecord" error on replica (JOULAUD François <Francois.JOULAUD@radiofrance.com>)
Responses:
  Re: "invalid contrecord" error on replica (Laurenz Albe <laurenz.albe@cybertec.at>)
  Re: "invalid contrecord" error on replica (Adrien Nayrat <adrien.nayrat@anayrat.info>)
  Re: "invalid contrecord" error on replica (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List: pgsql-general
Hello,

I encountered a similar issue with PostgreSQL 13.

TL;DR: the secondary did not receive a WAL record (CHECKPOINT_SHUTDOWN); its local copy of the WAL therefore ended mid-record, and it failed when it tried to replay past that point.

For a personal project I have a primary and a secondary with streaming replication and a replication slot. I updated pgbackrest only on the primary side, which broke archiving because of the version mismatch between primary and secondary. WAL accumulated and filled the filesystem until the primary crashed. So I updated pgbackrest on the secondary, archiving caught up, and I did not look further at the secondary; I thought it would catch up too. Unfortunately not:

FATAL: could not connect to the primary server: FATAL: the database system is in recovery mode
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFBE8
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFBE8
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFBE8
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFC20
[...]
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFC20
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
FATAL: could not receive data from WAL stream : ERROR: requested starting point AA/A1000000 is ahead of the WAL flush position of this server AA/A0FFFC20
LOG: started streaming WAL from primary at AA/A1000000 on timeline 1
LOG: invalid contrecord length 1523 at AA/A0FFFB70
FATAL: terminating walreceiver process due to administrator command
LOG: invalid contrecord length 1523 at AA/A0FFFB70
LOG: invalid contrecord length 1523 at AA/A0FFFB70
[ previous line keeps repeating ]

And several hours later:

FATAL: terminating walreceiver process due to administrator command

There were no other messages around, just the same "invalid contrecord length 1523 at AA/A0FFFB70" and, from time to time, "terminating walreceiver process due to administrator command".

During all this time, the primary kept WAL files because the replication slot was inactive, until the filesystem was full again. Then I decided to restart both instances (after removing some useless files to free up space), hoping the secondary would be able to resume streaming replication.
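For readers decoding these messages: an LSN like AA/A1000000 is just a 64-bit byte position in the WAL written as two 32-bit hex halves, so the gap between the requested start and the standby's flush position can be computed directly. A minimal sketch (not PostgreSQL code, just the arithmetic):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN string ('hi/lo' in hex) to a 64-bit byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# The walreceiver wanted to start 1048 bytes past what the standby had flushed:
gap = lsn_to_int("AA/A1000000") - lsn_to_int("AA/A0FFFBE8")
print(gap)  # 1048
```

So the standby was stuck just short of a segment boundary: AA/A1000000 is the first byte of segment A1.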
Unfortunately, the secondary asked for an old WAL file which had already been recycled on the primary:

LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: shutting down
LOG: database system is shut down
LOG: starting PostgreSQL 13.2 (Debian 13.2-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
LOG: listening on IPv4 address "127.0.0.1", port 5433
LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5433"
LOG: database system was shut down in recovery at 2021-05-02 20:23:49 CEST
LOG: entering standby mode
LOG: redo starts at AA/9F29F7C8
LOG: consistent recovery state reached at AA/A0FFFB70
LOG: invalid contrecord length 1523 at AA/A0FFFB70
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at AA/A0000000 on timeline 1
FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000AA000000A0 has already been removed

(Notice that it now started replication at AA/A0000000, whereas before the restart it started at AA/A1000000.)

I dug a little. On the primary, the restart_lsn of the replication slot was AA/A1004B78. pg_controldata on the secondary reported:

Latest checkpoint location: AA/9FC68B48
Latest checkpoint's REDO location: AA/9F29F7C8
Latest checkpoint's REDO WAL file: 00000001000000AA0000009F

I understood why it asked for a WAL segment older than its restart_lsn: it had previously reached AA/A1004B78, but after my pg_ctl restart it started replaying from the "REDO location" AA/9F29F7C8, reached AA/A0FFFB70, then asked the primary for the A0 segment, and boom. But why did it stop at AA/A0FFFB70 when the restart_lsn is AA/A1004B78?
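The segment name the standby requested follows directly from the LSN it stopped at. With the default 16 MB wal_segment_size, the file name is the timeline ID followed by the segment number split into two hex fields; a sketch of the mapping (assuming the default segment size, mirroring what pg_walfile_name() returns):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # default wal_segment_size

def wal_file_name(tli: int, lsn: str) -> str:
    """Map an LSN ('hi/lo' hex) to its WAL segment file name for timeline tli."""
    hi, lo = lsn.split("/")
    pos = (int(hi, 16) << 32) | int(lo, 16)
    seg = pos // WAL_SEG_SIZE
    # 256 segments of 16 MB per 4 GB "logical" xlog file
    return f"{tli:08X}{seg // 0x100:08X}{seg % 0x100:08X}"

print(wal_file_name(1, "AA/A0FFFB70"))  # 00000001000000AA000000A0
```

That is exactly the segment in the "has already been removed" error, and the same formula maps the REDO location AA/9F29F7C8 to 00000001000000AA0000009F as pg_controldata shows.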
I looked at 00000001000000AA000000A0 on the secondary; it ends with:

rmgr: Transaction len (rec/tot): 34/ 34, tx: 55621934, lsn: AA/A0FFFAF8, prev AA/A0FFE7F8, desc: COMMIT 2021-05-01 17:25:51.949124 CEST
rmgr: Heap len (rec/tot): 76/ 76, tx: 55621935, lsn: AA/A0FFFB20, prev AA/A0FFFAF8, desc: HOT_UPDATE off 83 xmax 55621935 flags 0x20 ; new off 19 xmax 0, blkref #0: rel 1663/167087/167363 blk 2662

Fortunately I archive WAL, and on the primary the end of this segment is:

rmgr: Transaction len (rec/tot): 34/ 34, tx: 55621934, lsn: AA/A0FFFAF8, prev AA/A0FFE7F8, desc: COMMIT 2021-05-01 17:25:51.949124 CEST
rmgr: Heap len (rec/tot): 76/ 76, tx: 55621935, lsn: AA/A0FFFB20, prev AA/A0FFFAF8, desc: HOT_UPDATE off 83 xmax 55621935 flags 0x20 ; new off 19 xmax 0, blkref #0: rel 1663/167087/167363 blk 2662
rmgr: XLOG len (rec/tot): 114/ 114, tx: 0, lsn: AA/A0FFFB70, prev AA/A0FFFB20, desc: CHECKPOINT_SHUTDOWN redo AA/A0FFFB70; tli 1; prev tli 1; fpw true; xid 0:55621936; oid 1401408; multi 401; offset 802; oldest xid 478 in DB 13397; oldest multi 1 in DB 13397; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: AA/A0FFFBE8, prev AA/A0FFFB70, desc: RUNNING_XACTS nextXid 55621936 latestCompletedXid 55621935 oldestRunningXid 55621936

I also dumped 00000001000000AA000000A1 on the secondary, and it contains all the records up to AA/A1004018.

It is really weird; I don't understand how the secondary can be missing the last two records of segment A0. It seems it did not receive the CHECKPOINT_SHUTDOWN record? Any idea?

Thanks!

--
Adrien NAYRAT
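The arithmetic is consistent with a truncation exactly at a record boundary: the last record the secondary has (the HOT_UPDATE at AA/A0FFFB20, 76 bytes total) ends, after alignment padding, exactly where the missing CHECKPOINT_SHUTDOWN record begins and where "invalid contrecord" is reported. A quick check (assuming the usual 8-byte MAXALIGN on x86_64):

```python
def maxalign(pos: int, align: int = 8) -> int:
    """Round pos up to the next multiple of align, like PostgreSQL's MAXALIGN."""
    return (pos + align - 1) & ~(align - 1)

hot_update_start = 0xA0FFFB20  # lsn AA/A0FFFB20, rec/tot 76 bytes
next_record = maxalign(hot_update_start + 76)
print(hex(next_record))  # 0xa0fffb70 -> AA/A0FFFB70, the CHECKPOINT_SHUTDOWN LSN
```

The same computation from AA/A0FFFB70 plus the 114-byte CHECKPOINT_SHUTDOWN record gives AA/A0FFFBE8, the RUNNING_XACTS LSN that the primary kept reporting as the standby's flush position.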