BUG: Former primary node might stuck when started as a standby
От | Alexander Lakhin |
---|---|
Тема | BUG: Former primary node might stuck when started as a standby |
Дата | |
Msg-id | b0102688-6d6c-c86a-db79-e0e91d245b1a@gmail.com обсуждение исходный текст |
Ответы |
Re: BUG: Former primary node might stuck when started as a standby
(Aleksander Alekseev <aleksander@timescale.com>)
|
Список | pgsql-hackers |
Hello hackers, [ reporting this bug here due to limitations of the bug reporting form ] When a node, that acted as a primary, becomes a standby as in the following script: [ ... some WAL-logged activity ... ] $primary->teardown_node; $standby->promote; $primary->enable_streaming($standby); $primary->start; it might not go online, due to the error: new timeline N forked off current database system timeline M before current recovery point X/X A complete TAP test is attached. I put it in src/test/recovery/t, run as follows: for i in `seq 100`; do echo "iteration $i"; timeout 60 make -s check -C src/test/recovery PROVE_TESTS="t/099*" || break; done and get: ... iteration 7 # +++ tap check in src/test/recovery +++ t/099_change_roles.pl .. ok All tests successful. Files=1, Tests=20, 14 wallclock secs ( 0.01 usr 0.00 sys + 4.20 cusr 4.75 csys = 8.96 CPU) Result: PASS iteration 8 # +++ tap check in src/test/recovery +++ t/099_change_roles.pl .. 9/? make: *** [Makefile:23: check] Terminated With wal_debug enabled (and log_min_messages=DEBUG2, log_statement=all), I see the following in the _node1.log: 2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG: INSERT @ 0/304DBF0: - Transaction/COMMIT: 2024-01-18 15:21:02.258739+00 2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1'); 2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG: xlog flush request 0/304DBF0; write 0/0; flush 0/0 2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1'); 2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG: write 0/304DBF0 flush 0/304DB78 apply 0/304DB78 reply_time 2024-01-18 15:21:02.2588+00 2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG: write 0/304DBF0 flush 0/304DBF0 apply 0/304DB78 reply_time 2024-01-18 15:21:02.258809+00 2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG: write 0/304DBF0 flush 0/304DBF0 apply 0/304DBF0 reply_time 2024-01-18 15:21:02.258864+00 2024-01-18 15:21:02.259 UTC [663563] DEBUG: server process (PID 663701) exited with exit code 0 2024-01-18 15:21:02.260 UTC [663563] DEBUG: forked new backend, pid=663704 socket=8 2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: statement: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity'); 2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC40: - Heap/INSERT: off: 12, flags: 0x00 2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity'); 2024-01-18 15:21:02.261 UTC [663563] DEBUG: postmaster received shutdown request signal 2024-01-18 15:21:02.261 UTC [663563] LOG: received immediate shutdown request 2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC68: - Transaction/COMMIT: 2024-01-18 15:21:02.261828+00 2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity'); 2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: xlog flush request 0/304DC68; write 0/0; flush 0/0 2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity'); ... 2024-01-18 15:21:02.262 UTC [663563] LOG: database system is shut down ... 2024-01-18 15:21:02.474 UTC [663810] LOG: starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit ... 2024-01-18 15:21:02.478 UTC [663816] LOG: REDO @ 0/304DBC8; LSN 0/304DBF0: prev 0/304DB78; xid 898; len 8 - Transaction/COMMIT: 2024-01-18 15:21:02.258739+00 2024-01-18 15:21:02.478 UTC [663816] LOG: REDO @ 0/304DBF0; LSN 0/304DC40: prev 0/304DBC8; xid 899; len 3; blkref #0: rel 1663/5/16384, blk 1 - Heap/INSERT: off: 12, flags: 0x00 2024-01-18 15:21:02.478 UTC [663816] LOG: REDO @ 0/304DC40; LSN 0/304DC68: prev 0/304DBF0; xid 899; len 8 - Transaction/COMMIT: 2024-01-18 15:21:02.261828+00 ... 2024-01-18 15:21:02.481 UTC [663819] LOG: fetching timeline history file for timeline 20 from primary server 2024-01-18 15:21:02.481 UTC [663819] LOG: started streaming WAL from primary at 0/3000000 on timeline 19 ... 2024-01-18 15:21:02.481 UTC [663819] DETAIL: End of WAL reached on timeline 19 at 0/304DBF0. ... 2024-01-18 15:21:02.481 UTC [663816] LOG: new timeline 20 forked off current database system timeline 19 before current recovery point 0/304DC68 In this case, node1 wrote to it's WAL record 0/304DC68, but sent to node2 only record 0/304DBF0, then node2, being promoted to primary, forked a next timeline from it, but when node1 was started as a standby, it first replayed 0/304DC68 from WAL, and then could not switch to the new timeline starting from the previous position. Reproduced on REL_12_STABLE .. master. Best regards, Alexander
Вложения
В списке pgsql-hackers по дате отправления: