Discussion: BUG: Former primary node might get stuck when started as a standby
Hello hackers,

[ reporting this bug here due to limitations of the bug reporting form ]

When a node that acted as a primary becomes a standby, as in the following script:

```
[ ... some WAL-logged activity ... ]
$primary->teardown_node;
$standby->promote;
$primary->enable_streaming($standby);
$primary->start;
```

it might not go online, due to the error:
new timeline N forked off current database system timeline M before current recovery point X/X

A complete TAP test is attached. I put it in src/test/recovery/t and run it as follows:

```
for i in `seq 100`; do echo "iteration $i"; timeout 60 make -s check -C src/test/recovery PROVE_TESTS="t/099*" || break; done
```

and get:

```
...
iteration 7
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. ok
All tests successful.
Files=1, Tests=20, 14 wallclock secs ( 0.01 usr 0.00 sys + 4.20 cusr 4.75 csys = 8.96 CPU)
Result: PASS
iteration 8
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. 9/? make: *** [Makefile:23: check] Terminated
```

With wal_debug enabled (and log_min_messages=DEBUG2, log_statement=all), I see the following in _node1.log:

```
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG: INSERT @ 0/304DBF0: - Transaction/COMMIT: 2024-01-18 15:21:02.258739+00
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG: xlog flush request 0/304DBF0; write 0/0; flush 0/0
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG: write 0/304DBF0 flush 0/304DB78 apply 0/304DB78 reply_time 2024-01-18 15:21:02.2588+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG: write 0/304DBF0 flush 0/304DBF0 apply 0/304DB78 reply_time 2024-01-18 15:21:02.258809+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG: write 0/304DBF0 flush 0/304DBF0 apply 0/304DBF0 reply_time 2024-01-18 15:21:02.258864+00
2024-01-18 15:21:02.259 UTC [663563] DEBUG: server process (PID 663701) exited with exit code 0
2024-01-18 15:21:02.260 UTC [663563] DEBUG: forked new backend, pid=663704 socket=8
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: statement: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC40: - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663563] DEBUG: postmaster received shutdown request signal
2024-01-18 15:21:02.261 UTC [663563] LOG: received immediate shutdown request
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC68: - Transaction/COMMIT: 2024-01-18 15:21:02.261828+00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: xlog flush request 0/304DC68; write 0/0; flush 0/0
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
...
2024-01-18 15:21:02.262 UTC [663563] LOG: database system is shut down
...
2024-01-18 15:21:02.474 UTC [663810] LOG: starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
...
2024-01-18 15:21:02.478 UTC [663816] LOG: REDO @ 0/304DBC8; LSN 0/304DBF0: prev 0/304DB78; xid 898; len 8 - Transaction/COMMIT: 2024-01-18 15:21:02.258739+00
2024-01-18 15:21:02.478 UTC [663816] LOG: REDO @ 0/304DBF0; LSN 0/304DC40: prev 0/304DBC8; xid 899; len 3; blkref #0: rel 1663/5/16384, blk 1 - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.478 UTC [663816] LOG: REDO @ 0/304DC40; LSN 0/304DC68: prev 0/304DBF0; xid 899; len 8 - Transaction/COMMIT: 2024-01-18 15:21:02.261828+00
...
2024-01-18 15:21:02.481 UTC [663819] LOG: fetching timeline history file for timeline 20 from primary server
2024-01-18 15:21:02.481 UTC [663819] LOG: started streaming WAL from primary at 0/3000000 on timeline 19
...
2024-01-18 15:21:02.481 UTC [663819] DETAIL: End of WAL reached on timeline 19 at 0/304DBF0.
...
2024-01-18 15:21:02.481 UTC [663816] LOG: new timeline 20 forked off current database system timeline 19 before current recovery point 0/304DC68
```

In this case, node1 wrote record 0/304DC68 to its WAL, but sent to node2 only up to record 0/304DBF0; then node2, being promoted to primary, forked the next timeline from there, but when node1 was started as a standby, it first replayed 0/304DC68 from its own WAL, and then could not switch to the new timeline starting from the earlier position.

Reproduced on REL_12_STABLE .. master.

Best regards,
Alexander
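[ Editorial note: the check that node1 runs into can be illustrated with a toy model. The Python below is an invented sketch, not PostgreSQL's actual code; function and variable names are made up. A standby can switch to the primary's new timeline only if the fork point is at or beyond what the standby has already replayed. ]

```python
# Toy model of the check behind "new timeline N forked off current
# database system timeline M before current recovery point X/X".
# LSNs are represented as plain integers here; real LSNs print as
# hi/lo hex pairs such as 0/304DC68.

def lsn(s: str) -> int:
    """Parse an 'X/Y' LSN string into a comparable integer."""
    hi, lo = s.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def can_follow_new_timeline(fork_lsn: int, recovery_point: int) -> bool:
    # The standby may follow the new timeline only if it has not yet
    # replayed past the point where that timeline forked.
    return fork_lsn >= recovery_point

# Values from the log above: node2 forked timeline 20 at 0/304DBF0,
# but node1 had already replayed its own WAL up to 0/304DC68.
fork_point = lsn("0/304DBF0")
recovery_point = lsn("0/304DC68")
print(can_follow_new_timeline(fork_point, recovery_point))  # False -> node1 is stuck
```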
Attachments
Hi,

> it might not go online, due to the error:
> new timeline N forked off current database system timeline M before current recovery point X/X
> [...]
> In this case, node1 wrote record 0/304DC68 to its WAL, but sent to node2
> only record 0/304DBF0, then node2, being promoted to primary, forked a next
> timeline from it, but when node1 was started as a standby, it first
> replayed 0/304DC68 from WAL, and then could not switch to the new timeline
> starting from the previous position.

Unless I'm missing something, this is just the right behavior of the system. node1 has no way of knowing the history of node1/node2/nodeN promotion. It sees that it has more data and/or an inconsistent timeline with another node, and refuses to proceed until the DBA intervenes. What else can node1 do, drop the data? That's not how things are done in Postgres :) What if this is very important data and node2 was promoted mistakenly, either manually or by a buggy script?

It's been a while since I seriously played with replication, but if memory serves, a proper way to switch node1 to replica mode would be to use pg_rewind on it first.

--
Best regards,
Aleksander Alekseev
Hi Aleksander,

19.01.2024 14:45, Aleksander Alekseev wrote:
>> it might not go online, due to the error:
>> new timeline N forked off current database system timeline M before current recovery point X/X
>> [...]
>> In this case, node1 wrote record 0/304DC68 to its WAL, but sent to node2
>> only record 0/304DBF0, then node2, being promoted to primary, forked a next
>> timeline from it, but when node1 was started as a standby, it first
>> replayed 0/304DC68 from WAL, and then could not switch to the new timeline
>> starting from the previous position.
> Unless I'm missing something, this is just the right behavior of the system.

Thank you for the answer!

> node1 has no way of knowing the history of node1/node2/nodeN
> promotion. It sees that it has more data and/or inconsistent timeline
> with another node and refuses to process further until DBA will
> intervene.

But node1 knows that it's a standby now and that it's expected to get all the WAL records from the primary, doesn't it? Maybe it could redo as few records as possible from its own WAL before requesting records from the authoritative source... Is it supposed to be more performance-efficient (not on the first restart, but on later ones)?

> What else can node1 do, drop the data? That's not how
> things are done in Postgres :)

In case no other options exist (this behavior is really correct and the only possible one), maybe the server should just stop? Can the DBA intervene somehow to make the server proceed without stopping it?

> It's been a while since I seriously played with replication, but if
> memory serves, a proper way to switch node1 to a replica mode would be
> to use pg_rewind on it first.

Perhaps that's true generally, but as we can see, without the extra records replayed, this scenario works just fine. Moreover, existing tests rely on it, e.g., 009_twophase.pl or 012_subtransactions.pl (in fact, my research of the issue was initiated per a test failure).

Best regards,
Alexander
Hi,

> But node1 knows that it's a standby now and that it's expected to get all the
> WAL records from the primary, doesn't it?

Yes, but node1 doesn't know whether it always was a standby or not. What if node1 was always a standby, node2 was a primary, then node2 died and node3 is the new primary? If node1 sees an inconsistency in the WAL records, it should report it and stop doing anything, since it doesn't have all the information needed to resolve the inconsistencies in all the possible cases. Only the DBA has this information.

>> It's been a while since I seriously played with replication, but if
>> memory serves, a proper way to switch node1 to a replica mode would be
>> to use pg_rewind on it first.
>
> Perhaps that's true generally, but as we can see, without the extra
> records replayed, this scenario works just fine. Moreover, existing tests
> rely on it, e.g., 009_twophase.pl or 012_subtransactions.pl (in fact, my
> research of the issue was initiated per a test failure).

I suggest focusing on particular flaky tests then and how to fix them.

--
Best regards,
Aleksander Alekseev
Hi Aleksander,

[ I'm writing this off-list to minimize noise, but we can continue the discussion in -hackers, if you wish ]

22.01.2024 14:00, Aleksander Alekseev wrote:
> Hi,
>
>> But node1 knows that it's a standby now and that it's expected to get all the
>> WAL records from the primary, doesn't it?
> Yes, but node1 doesn't know if it always was a standby or not. What if
> node1 was always a standby, node2 was a primary, then node2 died and
> node3 is a new primary.

Excuse me, but I still can't understand what could go wrong in this case. Let's suppose node1 has WAL with the following contents before start:
CPLOC | TL1R1 | TL1R2 | TL1R3 |
while node2's WAL contains:
TL1R1 | TL2R1 | TL2R2 | ...
where CPLOC is a checkpoint location and TLxRy is record y on timeline x. I assume that requesting all WAL records from node2 without redoing local records should be the right thing. And even in the situation you propose:
CPLOC | TL2R5 | TL2R6 | TL2R7 |
while node3's WAL contains:
TL2R5 | TL3R1 | TL3R2 | ...
I see no issue with applying records from node3...

> If node1 sees inconsistency in the WAL
> records, it should report it and stop doing anything, since it doesn't
> has all the information needed to resolve the inconsistencies in all
> the possible cases. Only DBA has this information.

I still wonder what can be considered an inconsistency in this situation. Doesn't exactly the redo of all the local WAL records create the inconsistency here? For me, it's a question of an authoritative source, and if we had such a source, we should trust its records only.

Or, in other words, what if the record TL1R3, which node1 wrote to its WAL but didn't send to node2, happened to have an incorrect checksum (due to a partial write, for example)? If I understand correctly, node1 will just stop redoing WAL at that position, receive all the following records from node2, and move forward without reporting the inconsistency (an extra WAL record).

Best regards,
Alexander
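[ Editorial note: the partial-write scenario in the last paragraph can be sketched with a toy model. The Python below is invented for illustration and is not PostgreSQL code: crash recovery replays local WAL only up to the first record that fails validation, then streams the rest from the primary. ]

```python
# Toy illustration (invented names): replay local WAL until the first
# record that fails validation, e.g. a bad checksum from a partial write.

def replay_local_wal(records):
    """records: list of (lsn, is_valid) pairs. Returns last replayed LSN."""
    last = None
    for lsn, is_valid in records:
        if not is_valid:
            break          # bad checksum: stop redo here, stream the rest
        last = lsn         # redo this record
    return last

# If the extra tail record (TL1R3 above) happens to be damaged, node1
# stops short of it and streams from the promoted node without any conflict:
print(replay_local_wal([(1, True), (2, True), (3, False)]))  # 2
# If the same record was flushed intact, node1 replays past the fork
# point and later refuses the new timeline:
print(replay_local_wal([(1, True), (2, True), (3, True)]))   # 3
```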
Hello,
22.01.2024 13:00, Aleksander Alekseev wrote:
>> Perhaps that's true generally, but as we can see, without the extra records replayed, this scenario works just fine. Moreover, existing tests rely on it, e.g., 009_twophase.pl or 012_subtransactions.pl (in fact, my research of the issue was initiated per a test failure).
> I suggest focusing on particular flaky tests then and how to fix them.
For the record: skink produced the same failure with 009_twophase.pl
yesterday [1]:
366/366 postgresql:recovery / recovery/009_twophase TIMEOUT 15000.31s killed by signal 15 SIGTERM
[15:11:30.591](0.807s) ok 11 - Restore of prepared transaction on promoted standby
### Enabling streaming replication for node "london"
### Starting node "london"
# Running: pg_ctl --wait --pgdata /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_london_data/pgdata --log /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/log/009_twophase_london.log --options --cluster-name=london start
waiting for server to start....... done
server started
# Postmaster PID for node "london" is 4181556
death by signal at /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
# Postmaster PID for node "london" is 4181556
### Stopping node "london" using mode immediate
# Running: pg_ctl --pgdata /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_london_data/pgdata --mode immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "london"
# Postmaster PID for node "paris" is 4147391
### Stopping node "paris" using mode immediate
# Running: pg_ctl --pgdata /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_paris_data/pgdata --mode immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "paris"
[19:19:46.999](14896.408s) # Tests were run but no plan was declared and done_testing() was not seen.
[19:19:46.999](0.000s) # Looks like your test exited with 4 just after 11.
pgsql.build/testrun/recovery/009_twophase/log/009_twophase_london.log
2026-02-04 15:11:35.675 CET [4181556][postmaster][:0] LOG: starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-15.2.0, 64-bit
2026-02-04 15:11:35.692 CET [4181556][postmaster][:0] LOG: listening on Unix socket "/tmp/IzuUID03TM/.s.PGSQL.22911"
2026-02-04 15:11:35.904 CET [4183900][startup][:0] LOG: database system was interrupted; last known up at 2026-02-04 15:11:14 CET
2026-02-04 15:11:36.017 CET [4183900][startup][:0] LOG: entering standby mode
2026-02-04 15:11:36.019 CET [4183900][startup][:0] LOG: database system was not properly shut down; automatic recovery in progress
2026-02-04 15:11:36.104 CET [4183900][startup][38/0:0] LOG: redo starts at 0/03021600
2026-02-04 15:11:36.206 CET [4183900][startup][38/0:0] LOG: invalid record length at 0/030220B8: expected at least 24, got 0
2026-02-04 15:11:36.209 CET [4183900][startup][38/0:0] LOG: consistent recovery state reached at 0/030220B8
2026-02-04 15:11:36.210 CET [4181556][postmaster][:0] LOG: database system is ready to accept read-only connections
2026-02-04 15:11:36.919 CET [4184142][walreceiver][:0] LOG: fetching timeline history file for timeline 2 from primary server
2026-02-04 15:11:36.997 CET [4184142][walreceiver][:0] LOG: started streaming WAL from primary at 0/03000000 on timeline 1
2026-02-04 15:11:37.076 CET [4184142][walreceiver][:0] LOG: replication terminated by primary server
2026-02-04 15:11:37.076 CET [4184142][walreceiver][:0] DETAIL: End of WAL reached on timeline 1 at 0/03022070.
2026-02-04 15:11:37.098 CET [4183900][startup][38/0:0] LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/030220B8
2026-02-04 15:11:37.103 CET [4184142][walreceiver][:0] LOG: restarted WAL streaming at 0/03000000 on timeline 1
2026-02-04 15:11:37.107 CET [4184142][walreceiver][:0] LOG: replication terminated by primary server
...
[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2026-02-04%2013%3A36%3A40&stg=recovery-check
Best regards,
Alexander
RE: BUG: Former primary node might get stuck when started as a standby
From: "Hayato Kuroda (Fujitsu)"
Dear Alexander, Aleksander
Is it enough that we can wait till all WALs are replayed before the promotion?
Like attached.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments
Dear Kuroda-san,
09.02.2026 04:38, Hayato Kuroda (Fujitsu) wrote:
> Dear Alexander, Aleksander
> Is it enough that we can wait till all WALs are replayed before the promotion?
> Like attached.
Thank you for paying attention to this!
I tried to add
$primary->wait_for_replay_catchup($standby);
in my test (see attached), but that doesn't help -- it still fails as below:
iteration 61
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. 1/? make: *** [Makefile:28: check] Terminated
src/test/recovery/tmp_check/log/099_change_roles_node1.log
2026-02-12 20:30:42.953 EET [2716149] LOG: new timeline 4 forked off current database system timeline 3 before current recovery point 0/3025B48
Best regards,
Alexander
Attachments
RE: BUG: Former primary node might get stuck when started as a standby
From: "Hayato Kuroda (Fujitsu)"
Dear Alexander,
I checked your test and reproduced the issue with it.
Was it possible that an INSERT happened in between wait_for_replay_catchup and
teardown_node? In this case we may not ensure that WAL records generated in that
time window reached the standby, right?
Similar stuff won't happen in 009_twophase.pl because it does not have the background activities.
Obtained log around here is shown in [1].
[1]
```
2026-02-13 10:34:56.751 JST client backend[521764] 099_change_roles.pl LOG: statement: INSERT INTO t VALUES (1000 * 1 + 2812, 'background activity');
2026-02-13 10:34:56.753 JST client backend[521762] 099_change_roles.pl LOG: statement: SELECT '0/030ED4F8' <= replay_lsn AND state = 'streaming'    <- from wait_for_replay_catchup
    FROM pg_catalog.pg_stat_replication
    WHERE application_name IN ('node2', 'walreceiver')
2026-02-13 10:34:56.762 JST client backend[521767] 099_change_roles.pl LOG: statement: INSERT INTO t VALUES (1000 * 1 + 2813, 'background activity');    <- INSERT from background activities
2026-02-13 10:34:56.764 JST client backend[521768] 099_change_roles.pl LOG: statement: INSERT INTO t VALUES (1000 * 2 + 2629, 'background activity');
2026-02-13 10:34:56.765 JST postmaster[521622] LOG: received immediate shutdown request    <- from teardown_node
2026-02-13 10:34:56.767 JST postmaster[521622] LOG: database system is shut down
2026-02-13 10:34:56.996 JST postmaster[521844] LOG: starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-11.5.0, 64-bit
....
2026-02-13 10:34:57.124 JST walreceiver[521864] LOG: replication terminated by primary server
2026-02-13 10:34:57.124 JST walreceiver[521864] DETAIL: End of WAL reached on timeline 153 at 0/030ED6D8.
2026-02-13 10:34:57.124 JST startup[521855] LOG: new timeline 154 forked off current database system timeline 153 before current recovery point 0/030ED750
2026-02-13 10:34:57.124 JST walreceiver[521864] LOG: restarted WAL streaming at 0/03000000 on timeline 153
```
Best regards,
Hayato Kuroda
FUJITSU LIMITED
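[ Editorial note: the race described in this message can be reduced to a few lines of toy Python. All names below are invented; the real mechanics involve the walsender and walreceiver. The catchup wait targets a snapshot of the flush LSN, while background sessions keep generating WAL until the immediate shutdown. ]

```python
# Toy model of the race between wait_for_replay_catchup and teardown_node.

class Primary:
    def __init__(self):
        self.flush_lsn = 0      # WAL written locally
        self.sent_lsn = 0       # WAL shipped to the standby

    def insert(self):
        self.flush_lsn += 1     # background activity generates WAL

    def replicate(self):
        self.sent_lsn = self.flush_lsn

p = Primary()
p.insert()
target = p.flush_lsn            # wait_for_replay_catchup targets this LSN
p.replicate()                   # the standby catches up to `target` ...
p.insert()                      # ... but a background INSERT sneaks in
# teardown_node (immediate shutdown) happens now: nothing more is sent.
print(p.flush_lsn > p.sent_lsn)  # True: the standby never saw the last record
```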
Dear Kuroda-san,
13.02.2026 04:03, Hayato Kuroda (Fujitsu) wrote:
> Dear Alexander,
> I checked your test and reproduced the issue with it.
> Was it possible that an INSERT happened in between wait_for_replay_catchup and teardown_node? In this case we may not ensure that WAL records generated in that time window reached the standby, right?
> Similar stuff won't happen in 009_twophase.pl because it does not have the background activities.
From my old records, 009_twophase.pl failed exactly due to background (namely, bgwriter's) activity.
I modified bgwriter.c to reproduce the failure easier:
```
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -67,7 +67,7 @@ int BgWriterDelay = 200;
  * Interval in which standby snapshots are logged into the WAL stream, in
  * milliseconds.
  */
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 1
 /*
  * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
@@ -306,7 +306,7 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
  */
 rc = WaitLatch(MyLatch,
                WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-               BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+               1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
 /*
  * If no latch event and BgBufferSync says nothing's happening, extend
@@ -339,6 +339,5 @@
         StrategyNotifyBgWriter(-1);
     }
-    prev_hibernate = can_hibernate;
 }
```
multiplied the test to increase the probability of the failure:

```
for i in {1..20}; do cp -r src/test/recovery/ src/test/recovery_$i/; sed "s|src/test/recovery|src/test/recovery_$i|" -i src/test/recovery_$i/Makefile; done
```

and executed it in a loop:

```
for i in {1..100}; do echo "ITERATION $i"; parallel --halt now,fail=1 -j20 --linebuffer --tag PROVE_TESTS="t/009*" NO_TEMP_INSTALL=1 timeout 60 make check -s -C src/test/recovery_{} ::: `seq 20` || break; done
```
It failed for me on iterations 27, 4, 22 as below:
ITERATION 22
...
18 t/009_twophase.pl .. ok
18 All tests successful.
18 Files=1, Tests=30, 12 wallclock secs ( 0.01 usr 0.01 sys + 0.26 cusr 0.58 csys = 0.86 CPU)
18 Result: PASS
5 make: *** wait: No child processes. Stop.
5 make: *** Waiting for unfinished jobs....
5 make: *** wait: No child processes. Stop.
parallel: This job failed:
PROVE_TESTS=t/009* NO_TEMP_INSTALL=1 timeout 60 make check -s -C src/test/recovery_5
src/test/recovery_5/tmp_check/log/009_twophase_london.log contains:
2026-02-13 21:03:28.248 EET [3987222] LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/3029190
...
(Without "timeout 60", the test just hangs — we can see the same in [1],
the test was killed with SIGTERM after 15000 seconds...)
[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2026-02-04%2013%3A36%3A40&stg=recovery-check
Best regards,
Alexander
RE: BUG: Former primary node might get stuck when started as a standby
From: "Hayato Kuroda (Fujitsu)"
Dear Alexander,

> From my old records, 009_twophase.pl failed exactly due to background
> (namely, bgwriter's) activity.

Okay, so I think there are two reasons why the test could fail.

1) The old primary shut down before all changes were replicated. This can be avoided by adding wait_for_replay_catchup() before the tearing down.
2) bgwriter on the old primary generated the RUNNING_XACTS record and the node shut down before sending it.

... and you mentioned case 2), right? I recalled that an injection point "skip-log-running-xacts" can be used to suppress generating the WAL record; see 035_standby_logical_decoding.pl. My idea is to attach the injection point before the switchover and avoid adding the record. Attached patch implements the idea. How do you feel?

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments
Dear Kuroda-san,
16.02.2026 05:10, Hayato Kuroda (Fujitsu) wrote:
>> From my old records, 009_twophase.pl failed exactly due to background (namely, bgwriter's) activity.
> Okay, so I think there are two reasons why the test could fail.
> 1) The old primary shut down before all changes were replicated. This can be avoided by adding wait_for_replay_catchup() before the tearing down.
> 2) bgwriter on the old primary generated the RUNNING_XACTS record and the node shut down before sending it.
> ... and you mentioned case 2), right? I recalled that an injection point "skip-log-running-xacts" can be used to suppress generating the WAL record; see 035_standby_logical_decoding.pl. My idea is to attach the injection point before the switchover and avoid adding the record. Attached patch implements the idea. How do you feel?
Unfortunately, the testing procedure I shared above still produces failures
with the patched 009_twophase.pl. With
wal_debug = on
in /tmp/temp.config, I can see:
...
ITERATION 3
...
11 t/009_twophase.pl .. ok
11 All tests successful.
11 Files=1, Tests=30, 10 wallclock secs ( 0.02 usr 0.00 sys + 0.27 cusr 0.63 csys = 0.92 CPU)
11 Result: PASS
20 make: *** [Makefile:28: check] Terminated
parallel: This job failed:
TEMP_CONFIG=/tmp/temp.config PROVE_TESTS=t/009* NO_TEMP_INSTALL=1 timeout 60 make check -s -C src/test/recovery_20
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG: statement: PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG: INSERT @ 0/030227F8: - Transaction/PREPARE: gid xact_009_10: 2026-02-17 07:06:44.313023+02; subxacts: 790
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT: PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG: xlog flush request 0/030227F8; write 0/00000000; flush 0/00000000
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT: PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET background writer[754333] LOG: INSERT @ 0/03022838: - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790
2026-02-17 07:06:44.322 EET client backend[754949] 009_twophase.pl LOG: statement: SELECT pg_current_wal_flush_lsn()
2026-02-17 07:06:44.330 EET client backend[754974] 009_twophase.pl LOG: statement: SELECT '0/030227F8' <= replay_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('paris', 'walreceiver')
2026-02-17 07:06:44.337 EET postmaster[754306] LOG: received immediate shutdown request
2026-02-17 07:06:44.339 EET postmaster[754306] LOG: database system is shut down
...
2026-02-17 07:06:44.566 EET startup[755326] LOG: REDO @ 0/03022618; LSN 0/03022650: prev 0/030225E0; xid 0; len 24 - Standby/RUNNING_XACTS: nextXid 789 latestCompletedXid 788 oldestRunningXid 789
2026-02-17 07:06:44.566 EET startup[755326] LOG: REDO @ 0/03022650; LSN 0/030226A0: prev 0/03022618; xid 789; len 3; blkref #0: rel 1663/5/16384, blk 0 - Heap/INSERT: off: 21, flags: 0x00
2026-02-17 07:06:44.566 EET startup[755326] LOG: REDO @ 0/030226A0; LSN 0/030226F0: prev 0/03022650; xid 790; len 3; blkref #0: rel 1663/5/16384, blk 0 - Heap/INSERT: off: 22, flags: 0x00
2026-02-17 07:06:44.566 EET startup[755326] LOG: REDO @ 0/030226F0; LSN 0/030227F8: prev 0/030226A0; xid 789; len 232 - Transaction/PREPARE: gid xact_009_10: 2026-02-17 07:06:44.313023+02; subxacts: 790
2026-02-17 07:06:44.566 EET startup[755326] LOG: REDO @ 0/030227F8; LSN 0/03022838: prev 0/030226F0; xid 0; len 32 - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790
2026-02-17 07:06:44.566 EET startup[755326] LOG: invalid record length at 0/03022838: expected at least 24, got 0
2026-02-17 07:06:44.566 EET startup[755326] LOG: consistent recovery state reached at 0/03022838
2026-02-17 07:06:44.566 EET postmaster[755283] LOG: database system is ready to accept read-only connections
2026-02-17 07:06:44.569 EET walreceiver[755338] LOG: fetching timeline history file for timeline 2 from primary server
2026-02-17 07:06:44.569 EET walreceiver[755338] LOG: started streaming WAL from primary at 0/03000000 on timeline 1
2026-02-17 07:06:44.569 EET walreceiver[755338] LOG: replication terminated by primary server
2026-02-17 07:06:44.569 EET walreceiver[755338] DETAIL: End of WAL reached on timeline 1 at 0/030227F8.
2026-02-17 07:06:44.570 EET startup[755326] LOG: new timeline 2 forked off current database system timeline 1 before current recovery point 0/03022838
...
Regarding the approach with "skip-log-running-xacts", I wonder if there is
a strong guarantee that no other WAL write can happen in the same window
(e.g., from autovacuum)?
Best regards,
Alexander
RE: BUG: Former primary node might get stuck when started as a standby
From: "Hayato Kuroda (Fujitsu)"
Dear Alexander,
> Unfortunately, the testing procedure I shared above still produces failures
> with the patched 009_twophase.pl.
Hmm, I ran the test for hours, but I could not reproduce the failure. But let me analyze
based on your log.
> With wal_debug = on
> in /tmp/temp.config, I can see:
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG: statement: PREPARE TRANSACTION 'xact_009_10';
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG: INSERT @ 0/030227F8: - Transaction/PREPARE: gid xact_009_10: 2026-02-17 07:06:44.313023+02; subxacts: 790
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT: PREPARE TRANSACTION 'xact_009_10';
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG: xlog flush request 0/030227F8; write 0/00000000; flush 0/00000000
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT: PREPARE TRANSACTION 'xact_009_10';
> 2026-02-17 07:06:44.313 EET background writer[754333] LOG: INSERT @ 0/03022838: - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790
> 2026-02-17 07:06:44.322 EET client backend[754949] 009_twophase.pl LOG: statement: SELECT pg_current_wal_flush_lsn()
> 2026-02-17 07:06:44.330 EET client backend[754974] 009_twophase.pl LOG: statement: SELECT '0/030227F8' <= replay_lsn AND state = 'streaming'
>     FROM pg_catalog.pg_stat_replication
>     WHERE application_name IN ('paris', 'walreceiver')
> 2026-02-17 07:06:44.337 EET postmaster[754306] LOG: received immediate shutdown request
> 2026-02-17 07:06:44.339 EET postmaster[754306] LOG: database system is shut down
I have little experience with reading the wal_debug output, but the background writer seems to
generate the RUNNING_XACTS record. It's different from my expectation. To confirm,
did you really enable the injection point? For now 009_twophase can work without
`-Dinjection_points=true`, but it should be set to avoid random failures.
Another possibility is that bgwriter had already passed `IS_INJECTION_POINT_ATTACHED("skip-log-running-xacts")`
before the injection point was attached. In this case the background writer would
generate the RUNNING_XACTS record at the bad timing. One possible workaround
is to attach the injection point a bit earlier; it does not solve the root cause
though. Attached patch applies the band-aid.
> Regarding the approach with "skip-log-running-xacts", I wonder if there is
> a strong guarantee that no other WAL write can happen in the same window
> (e.g., from autovacuum)?
Good point, autovacuum and checkpoints should also be avoided. IIUC that can be done
by setting GUCs.
Best regards,
Hayato Kuroda
FUJITSU LIMITED
Attachments
Dear Kuroda-san,
19.02.2026 05:50, Hayato Kuroda (Fujitsu) wrote:
> Dear Alexander,
>
>> Unfortunately, the testing procedure I shared above still produces failures
>> with the patched 009_twophase.pl.
> Hmm, I ran the test for hours, but I could not reproduce the failure. But let me analyze
> based on your log.
Please look at the attached self-contained script. It works for me (it failed
on iterations 6, 12, 2 right now, on my workstation with a Ryzen 7900X) --
probably you could adjust the number of parallel jobs to reproduce it on your
hardware.
> I have little experience with reading the wal_debug output, but the background writer seems to
> generate the RUNNING_XACTS record. It's different from my expectation. To confirm,
> did you really enable the injection point? For now 009_twophase can work without
> `-Dinjection_points=true`, but it should be set to avoid random failures.
I think it failed before the injection point was attached. My log contains:
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT: PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG: xlog flush request 0/030227F8; write 0/00000000; flush 0/00000000
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT: PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET background writer[754333] LOG: INSERT @ 0/03022838: - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790
As far as I can see, it corresponds to this place in the test:
SAVEPOINT s1;
INSERT INTO t_009_tbl VALUES (22, 'issued to ${cur_primary_name}');
PREPARE TRANSACTION 'xact_009_10';");
+$cur_primary->wait_for_replay_catchup($cur_standby);
$cur_primary->teardown_node;
$cur_standby->promote;
And as we found out before, wait_for_replay_catchup() before teardown
doesn't help... I can't say for sure, but from my experiments, the test
didn't fail with $cur_primary->stop instead of $cur_primary->teardown_node.
Best regards,
Alexander
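[ Editorial note: the last observation, that $cur_primary->stop works where $cur_primary->teardown_node fails, is consistent with the shutdown modes: a clean (fast) shutdown lets walsenders drain all pending WAL, including the shutdown checkpoint record, before the server exits, while an immediate shutdown does not. A toy Python contrast, with invented names: ]

```python
# Toy contrast of the two shutdown paths seen in the thread.

class Node:
    def __init__(self):
        self.flush_lsn = 5      # WAL on disk
        self.sent_lsn = 3       # WAL already streamed to the standby

    def stop(self, mode="fast"):
        if mode != "immediate":
            # clean shutdown: write the shutdown checkpoint record,
            # then let walsenders drain everything before exiting
            self.flush_lsn += 1
            self.sent_lsn = self.flush_lsn
        # immediate: exit right away, leaving sent_lsn where it was

n1, n2 = Node(), Node()
n1.stop(mode="fast")         # like $cur_primary->stop
n2.stop(mode="immediate")    # like $cur_primary->teardown_node
print(n1.sent_lsn == n1.flush_lsn)  # True: the standby has everything
print(n2.sent_lsn < n2.flush_lsn)   # True: the standby is missing WAL
```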