BUG: Former primary node might stuck when started as a standby

From: Alexander Lakhin
Date:
Hello hackers,

[ reporting this bug here due to limitations of the bug reporting form ]

When a node that acted as a primary becomes a standby, as in the
following script:
[ ... some WAL-logged activity ... ]
$primary->teardown_node;
$standby->promote;

$primary->enable_streaming($standby);
$primary->start;

it might not go online, due to the error:
new timeline N forked off current database system timeline M before current recovery point X/X

A complete TAP test is attached.
I put it in src/test/recovery/t, run it as follows:
for i in `seq 100`; do echo "iteration $i"; timeout 60 make -s check -C src/test/recovery PROVE_TESTS="t/099*" || break; done
and get:
...
iteration 7
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. ok
All tests successful.
Files=1, Tests=20, 14 wallclock secs ( 0.01 usr  0.00 sys +  4.20 cusr  4.75 csys =  8.96 CPU)
Result: PASS
iteration 8
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. 9/? make: *** [Makefile:23: check] Terminated

With wal_debug enabled (and log_min_messages=DEBUG2, log_statement=all),
I see the following in the _node1.log:
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG: INSERT @ 0/304DBF0:  - Transaction/COMMIT: 2024-01-18 15:21:02.258739+00
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG:  xlog flush request 0/304DBF0; write 0/0; flush 0/0
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DB78 apply 0/304DB78 reply_time 2024-01-18 15:21:02.2588+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DBF0 apply 0/304DB78 reply_time 2024-01-18 15:21:02.258809+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DBF0 apply 0/304DBF0 reply_time 2024-01-18 15:21:02.258864+00
2024-01-18 15:21:02.259 UTC [663563] DEBUG:  server process (PID 663701) exited with exit code 0
2024-01-18 15:21:02.260 UTC [663563] DEBUG:  forked new backend, pid=663704 socket=8
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: statement: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC40:  - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663563] DEBUG:  postmaster received shutdown request signal
2024-01-18 15:21:02.261 UTC [663563] LOG:  received immediate shutdown request
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC68:  - Transaction/COMMIT: 2024-01-18 15:21:02.261828+00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG:  xlog flush request 0/304DC68; write 0/0; flush 0/0
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
...
2024-01-18 15:21:02.262 UTC [663563] LOG:  database system is shut down
...
2024-01-18 15:21:02.474 UTC [663810] LOG:  starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
...
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DBC8; LSN 0/304DBF0: prev 0/304DB78; xid 898; len 8 - Transaction/COMMIT: 2024-01-18 15:21:02.258739+00
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DBF0; LSN 0/304DC40: prev 0/304DBC8; xid 899; len 3; blkref #0: rel 1663/5/16384, blk 1 - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DC40; LSN 0/304DC68: prev 0/304DBF0; xid 899; len 8 - Transaction/COMMIT: 2024-01-18 15:21:02.261828+00
...
2024-01-18 15:21:02.481 UTC [663819] LOG:  fetching timeline history file for timeline 20 from primary server
2024-01-18 15:21:02.481 UTC [663819] LOG:  started streaming WAL from primary at 0/3000000 on timeline 19
...
2024-01-18 15:21:02.481 UTC [663819] DETAIL:  End of WAL reached on timeline 19 at 0/304DBF0.
...
2024-01-18 15:21:02.481 UTC [663816] LOG:  new timeline 20 forked off current database system timeline 19 before current recovery point 0/304DC68

In this case, node1 wrote record 0/304DC68 to its WAL but sent node2 only
records up to 0/304DBF0; node2, being promoted to primary, forked the next
timeline from there. When node1 was later started as a standby, it first
replayed 0/304DC68 from its local WAL and then could not switch to the new
timeline, which starts from the earlier position.
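For reference, the failing check is essentially a comparison of 64-bit WAL
positions; a minimal shell sketch of that comparison, using the two LSNs from
the log above (the helper function is mine, not PostgreSQL code):

```shell
# Convert an LSN of the form "hi/lo" (two hex halves) into one 64-bit integer.
lsn_to_int() {
  hi=${1%%/*}
  lo=${1##*/}
  echo $(( 0x$hi * 4294967296 + 0x$lo ))   # hi << 32 | lo
}

fork_point=$(lsn_to_int 0/304DBF0)      # where node2 forked timeline 20
recovery_point=$(lsn_to_int 0/304DC68)  # what node1 had already replayed

# node1's local replay went past the fork point, hence the error:
if [ "$recovery_point" -gt "$fork_point" ]; then
  echo "new timeline forked before current recovery point"
fi
```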

Reproduced on REL_12_STABLE .. master.

Best regards,
Alexander
Attachments

Re: BUG: Former primary node might stuck when started as a standby

From: Aleksander Alekseev
Date:
Hi,

> it might not go online, due to the error:
> new timeline N forked off current database system timeline M before current recovery point X/X
> [...]
> In this case, node1 wrote to its WAL record 0/304DC68, but sent to node2
> only record 0/304DBF0, then node2, being promoted to primary, forked a next
> timeline from it, but when node1 was started as a standby, it first
> replayed 0/304DC68 from WAL, and then could not switch to the new timeline
> starting from the previous position.

Unless I'm missing something, this is just the right behavior of the system.

node1 has no way of knowing the history of node1/node2/nodeN
promotion. It sees that it has more data and/or an inconsistent timeline
with another node, and it refuses to proceed until the DBA intervenes.
What else can node1 do, drop the data? That's not how things are done in
Postgres :) What if this is very important data and node2 was promoted by
mistake, either manually or by a buggy script?

It's been a while since I seriously played with replication, but if
memory serves, the proper way to switch node1 into replica mode would be
to run pg_rewind on it first.
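For completeness, a sketch of that procedure (the data directory path and
connection string are placeholders, not taken from the thread; pg_rewind
requires wal_log_hints=on or data checksums on the target, and a cleanly
shut-down target cluster):

```shell
# Hypothetical pg_rewind invocation to resynchronize the old primary (node1)
# with the promoted standby (node2) before restarting it as a standby.
OLD_PRIMARY_DATA=/path/to/node1/data                  # placeholder
NEW_PRIMARY_CONN='host=node2 port=5432 user=postgres' # placeholder

# Build the command; --target-pgdata, --source-server and --progress are
# real pg_rewind options.
cmd="pg_rewind --target-pgdata=$OLD_PRIMARY_DATA --source-server='$NEW_PRIMARY_CONN' --progress"
echo "$cmd"

# After a successful rewind: create standby.signal, set primary_conninfo,
# and start node1; it then follows node2's new timeline.
```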

-- 
Best regards,
Aleksander Alekseev



Re: BUG: Former primary node might stuck when started as a standby

From: Alexander Lakhin
Date:
Hi Aleksander,

19.01.2024 14:45, Aleksander Alekseev wrote:
>
>> it might not go online, due to the error:
>> new timeline N forked off current database system timeline M before current recovery point X/X
>> [...]
>> In this case, node1 wrote to its WAL record 0/304DC68, but sent to node2
>> only record 0/304DBF0, then node2, being promoted to primary, forked a next
>> timeline from it, but when node1 was started as a standby, it first
>> replayed 0/304DC68 from WAL, and then could not switch to the new timeline
>> starting from the previous position.
> Unless I'm missing something, this is just the right behavior of the system.

Thank you for the answer!

> node1 has no way of knowing the history of node1/node2/nodeN
> promotion. It sees that it has more data and/or inconsistent timeline
> with another node and refuses to proceed until the DBA intervenes.

But node1 knows that it's a standby now and is expected to get all the
WAL records from the primary, doesn't it?
Maybe it could redo as few records as possible from its own WAL before
requesting records from the authoritative source...
Or is redoing local WAL first supposed to be more performance-efficient
(not on the first restart, but on later ones)?

>   What else can node1 do, drop the data? That's not how
> things are done in Postgres :)

In case no other options exist (this behavior really is correct and the
only one possible), maybe the server should just stop?
Can the DBA intervene somehow to make the server proceed without stopping it?

> It's been a while since I seriously played with replication, but if
> memory serves, a proper way to switch node1 to a replica mode would be
> to use pg_rewind on it first.

Perhaps that's true in general, but as we can see, this scenario works
just fine without the extra records replayed. Moreover, existing tests
rely on it, e.g., 009_twophase.pl or 012_subtransactions.pl (in fact, my
research into the issue was initiated by a test failure).

Best regards,
Alexander



Re: BUG: Former primary node might stuck when started as a standby

From: Aleksander Alekseev
Date:
Hi,

> But node1 knows that it's a standby now and it's expected to get all the
> WAL records from the primary, doesn't it?

Yes, but node1 doesn't know whether it has always been a standby or not.
What if node1 was always a standby, node2 was a primary, and then node2
died and node3 is the new primary? If node1 sees an inconsistency in the
WAL records, it should report it and stop doing anything, since it doesn't
have all the information needed to resolve the inconsistencies in all
possible cases. Only the DBA has this information.

> > It's been a while since I seriously played with replication, but if
> > memory serves, a proper way to switch node1 to a replica mode would be
> > to use pg_rewind on it first.
>
> Perhaps that's true generally, but as we can see, without the extra
> records replayed, this scenario works just fine. Moreover, existing tests
> rely on it, e.g., 009_twophase.pl or 012_subtransactions.pl (in fact, my
> research of the issue was initiated per a test failure).

I suggest focusing on particular flaky tests then and how to fix them.

-- 
Best regards,
Aleksander Alekseev



Re: BUG: Former primary node might stuck when started as a standby

From: Alexander Lakhin
Date:
Hi Aleksander,

[ I'm writing this off-list to minimize noise, but we can continue the discussion in -hackers, if you wish ]

22.01.2024 14:00, Aleksander Alekseev wrote:

> Hi,
>
>> But node1 knows that it's a standby now and it's expected to get all the
>> WAL records from the primary, doesn't it?
> Yes, but node1 doesn't know if it always was a standby or not. What if
> node1 was always a standby, node2 was a primary, then node2 died and
> node3 is a new primary.

Excuse me, but I still can't understand what could go wrong in this case.
Let's suppose node1 has WAL with the following contents before start:
CPLOC | TL1R1 | TL1R2 | TL1R3 |

while node2's WAL contains:
TL1R1 | TL2R1 | TL2R2 | ...

where CPLOC -- a checkpoint location, TLxRy -- a record y on a timeline x.

I assume that requesting all WAL records from node2 without redoing the
local ones would be the right thing to do.

And even in the situation you propose:
CPLOC | TL2R5 | TL2R6 | TL2R7 |

while node3's WAL contains:
TL2R5 | TL3R1 | TL3R2 | ...

I see no issue with applying records from node3...

>   If node1 sees inconsistency  in the WAL
> records, it should report it and stop doing anything, since it doesn't
> has all the information needed to resolve the inconsistencies in all
> the possible cases. Only DBA has this information.

I still wonder what can be considered an inconsistency in this situation.
Doesn't exactly this redo of all the local WAL records create the
inconsistency here?
For me, it's a question of an authoritative source, and if we had such a
source, we should trust only its records.

Or, in other words, what if the record TL1R3, which node1 wrote to its
WAL but didn't send to node2, happened to have an incorrect checksum (due
to a partial write, for example)?
If I understand correctly, node1 would just stop redoing WAL at that
position, receive all the following records from node2, and move forward
without reporting the inconsistency (the extra WAL record).

Best regards,
Alexander



Re: BUG: Former primary node might stuck when started as a standby

From: Alexander Lakhin
Date:
Hello,

22.01.2024 13:00, Aleksander Alekseev wrote:
>> Perhaps that's true generally, but as we can see, without the extra
>> records replayed, this scenario works just fine. Moreover, existing tests
>> rely on it, e.g., 009_twophase.pl or 012_subtransactions.pl (in fact, my
>> research of the issue was initiated per a test failure).
> I suggest focusing on particular flaky tests then and how to fix them.

For the record: skink produced the same failure with 009_twophase.pl
yesterday [1]:

366/366 postgresql:recovery / recovery/009_twophase                                     TIMEOUT        15000.31s   killed by signal 15 SIGTERM

[15:11:30.591](0.807s) ok 11 - Restore of prepared transaction on promoted standby
### Enabling streaming replication for node "london"
### Starting node "london"
# Running: pg_ctl --wait --pgdata /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_london_data/pgdata --log /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/log/009_twophase_london.log --options --cluster-name=london start
waiting for server to start....... done
server started
# Postmaster PID for node "london" is 4181556
death by signal at /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
# Postmaster PID for node "london" is 4181556
### Stopping node "london" using mode immediate
# Running: pg_ctl --pgdata /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_london_data/pgdata --mode immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "london"
# Postmaster PID for node "paris" is 4147391
### Stopping node "paris" using mode immediate
# Running: pg_ctl --pgdata /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_paris_data/pgdata --mode immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "paris"
[19:19:46.999](14896.408s) # Tests were run but no plan was declared and done_testing() was not seen.
[19:19:46.999](0.000s) # Looks like your test exited with 4 just after 11.

pgsql.build/testrun/recovery/009_twophase/log/009_twophase_london.log
2026-02-04 15:11:35.675 CET [4181556][postmaster][:0] LOG:  starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-15.2.0, 64-bit
2026-02-04 15:11:35.692 CET [4181556][postmaster][:0] LOG:  listening on Unix socket "/tmp/IzuUID03TM/.s.PGSQL.22911"
2026-02-04 15:11:35.904 CET [4183900][startup][:0] LOG:  database system was interrupted; last known up at 2026-02-04 15:11:14 CET
2026-02-04 15:11:36.017 CET [4183900][startup][:0] LOG:  entering standby mode
2026-02-04 15:11:36.019 CET [4183900][startup][:0] LOG:  database system was not properly shut down; automatic recovery in progress
2026-02-04 15:11:36.104 CET [4183900][startup][38/0:0] LOG:  redo starts at 0/03021600
2026-02-04 15:11:36.206 CET [4183900][startup][38/0:0] LOG:  invalid record length at 0/030220B8: expected at least 24, got 0
2026-02-04 15:11:36.209 CET [4183900][startup][38/0:0] LOG:  consistent recovery state reached at 0/030220B8
2026-02-04 15:11:36.210 CET [4181556][postmaster][:0] LOG:  database system is ready to accept read-only connections
2026-02-04 15:11:36.919 CET [4184142][walreceiver][:0] LOG:  fetching timeline history file for timeline 2 from primary server
2026-02-04 15:11:36.997 CET [4184142][walreceiver][:0] LOG:  started streaming WAL from primary at 0/03000000 on timeline 1
2026-02-04 15:11:37.076 CET [4184142][walreceiver][:0] LOG:  replication terminated by primary server
2026-02-04 15:11:37.076 CET [4184142][walreceiver][:0] DETAIL:  End of WAL reached on timeline 1 at 0/03022070.
2026-02-04 15:11:37.098 CET [4183900][startup][38/0:0] LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/030220B8
2026-02-04 15:11:37.103 CET [4184142][walreceiver][:0] LOG:  restarted WAL streaming at 0/03000000 on timeline 1
2026-02-04 15:11:37.107 CET [4184142][walreceiver][:0] LOG:  replication terminated by primary server
...

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2026-02-04%2013%3A36%3A40&stg=recovery-check

Best regards,
Alexander

RE: BUG: Former primary node might stuck when started as a standby

From: "Hayato Kuroda (Fujitsu)"
Date:

Dear Alexander, Aleksander,

Is it enough that we can wait till all WALs are replayed before the promotion?
Like attached.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments

Re: BUG: Former primary node might stuck when started as a standby

From: Alexander Lakhin
Date:
Dear Kuroda-san,

09.02.2026 04:38, Hayato Kuroda (Fujitsu) wrote:

> Dear Alexander, Aleksander,
>
> Is it enough that we can wait till all WALs are replayed before the promotion?
> Like attached.



Thank you for paying attention to this!

I tried to add
$primary->wait_for_replay_catchup($standby);
in my test (see attached), but that doesn't help -- it still fails as below:
iteration 61
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. 1/? make: *** [Makefile:28: check] Terminated

src/test/recovery/tmp_check/log/099_change_roles_node1.log
2026-02-12 20:30:42.953 EET [2716149] LOG:  new timeline 4 forked off current database system timeline 3 before current recovery point 0/3025B48

Best regards,
Alexander
Attachments

RE: BUG: Former primary node might stuck when started as a standby

From: "Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander,

I checked your test and reproduced the issue with it.
Is it possible that an INSERT happened in between wait_for_replay_catchup and
teardown_node? In that case we cannot ensure that WAL records generated in
that time window reached the standby, right?
Similar stuff won't happen in 009_twophase.pl because it does not have the
background activities.
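The suspected window is a plain check-then-act race; a toy shell model of it
(LSNs reduced to integers, all names mine, not PostgreSQL code):

```shell
# Toy model of the race: wait_for_replay_catchup() verifies the standby has
# replayed everything written so far, but background activity can write more
# WAL between that check and the immediate shutdown in teardown_node.
wal=10        # last position written on the primary (toy integer, not a real LSN)
replayed=10   # last position the standby has replayed

# Step 1: wait_for_replay_catchup passes, because the standby is caught up.
[ "$replayed" -ge "$wal" ] && echo "caught up at $wal"

# Step 2: a background INSERT (or bgwriter activity) writes more WAL.
wal=$((wal + 1))

# Step 3: teardown_node does an immediate shutdown; the new record stays
# local to the old primary and is never streamed to the standby.
[ "$wal" -gt "$replayed" ] && echo "unreplicated WAL: $((wal - replayed)) record(s)"
```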

The log obtained around this point is shown in [1].

[1]
```
2026-02-13 10:34:56.751 JST client backend[521764] 099_change_roles.pl LOG:  statement: INSERT INTO t VALUES (1000 * 1 + 2812, 'background activity');
2026-02-13 10:34:56.753 JST client backend[521762] 099_change_roles.pl LOG:  statement: SELECT '0/030ED4F8' <= replay_lsn AND state = 'streaming'        * from wait_for_replay_catchup
                 FROM pg_catalog.pg_stat_replication
                 WHERE application_name IN ('node2', 'walreceiver')
2026-02-13 10:34:56.762 JST client backend[521767] 099_change_roles.pl LOG:  statement: INSERT INTO t VALUES (1000 * 1 + 2813, 'background activity');        * do INSERT from background activities
2026-02-13 10:34:56.764 JST client backend[521768] 099_change_roles.pl LOG:  statement: INSERT INTO t VALUES (1000 * 2 + 2629, 'background activity');
2026-02-13 10:34:56.765 JST postmaster[521622] LOG:  received immediate shutdown request        * from teardown_node
2026-02-13 10:34:56.767 JST postmaster[521622] LOG:  database system is shut down
2026-02-13 10:34:56.996 JST postmaster[521844] LOG:  starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-11.5.0, 64-bit
....
2026-02-13 10:34:57.124 JST walreceiver[521864] LOG:  replication terminated by primary server
2026-02-13 10:34:57.124 JST walreceiver[521864] DETAIL:  End of WAL reached on timeline 153 at 0/030ED6D8.
2026-02-13 10:34:57.124 JST startup[521855] LOG:  new timeline 154 forked off current database system timeline 153 before current recovery point 0/030ED750
2026-02-13 10:34:57.124 JST walreceiver[521864] LOG:  restarted WAL streaming at 0/03000000 on timeline 153
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: BUG: Former primary node might stuck when started as a standby

From: Alexander Lakhin
Date:
Dear Kuroda-san,

13.02.2026 04:03, Hayato Kuroda (Fujitsu) wrote:
> Dear Alexander,
>
> I checked your test and reproduced the issue with it.
> Was it possible that INSERT happened in-between wait_for_replay_catchup and
> teardown_node? In this case we may not ensure WAL records generated in the
> time window were reached, right?
> Similar stuff won't happen in 009_twophase.pl because it does not have the
> bg activities.

From my old records, 009_twophase.pl failed exactly due to background
(namely, bgwriter's) activity.

I modified bgwriter.c to reproduce the failure easier:
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -67,7 +67,7 @@ int            BgWriterDelay = 200;
  * Interval in which standby snapshots are logged into the WAL stream, in
  * milliseconds.
  */
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 1
 
 /*
  * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
@@ -306,7 +306,7 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
          */
         rc = WaitLatch(MyLatch,
                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-                       BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+                       1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
 
         /*
          * If no latch event and BgBufferSync says nothing's happening, extend
@@ -339,6 +339,5 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
             StrategyNotifyBgWriter(-1);
         }
 
-        prev_hibernate = can_hibernate;
     }
 }

multiplied the test to increase probability of the failure:
for i in {1..20}; do cp -r src/test/recovery/ src/test/recovery_$i/; sed "s|src/test/recovery|src/test/recovery_$i|" -i src/test/recovery_$i/Makefile; done

and executed it in a loop:
for i in {1..100}; do echo "ITERATION $i"; parallel --halt now,fail=1 -j20 --linebuffer --tag PROVE_TESTS="t/009*" NO_TEMP_INSTALL=1 timeout 60 make check -s -C src/test/recovery_{} ::: `seq 20` || break; done

It failed for me on iterations 27, 4, 22 as below:
ITERATION 22
...
18      t/009_twophase.pl .. ok
18      All tests successful.
18      Files=1, Tests=30, 12 wallclock secs ( 0.01 usr  0.01 sys +  0.26 cusr  0.58 csys =  0.86 CPU)
18      Result: PASS
5       make: *** wait: No child processes.  Stop.
5       make: *** Waiting for unfinished jobs....
5       make: *** wait: No child processes.  Stop.
parallel: This job failed:
PROVE_TESTS=t/009* NO_TEMP_INSTALL=1 timeout 60 make check -s -C src/test/recovery_5

src/test/recovery_5/tmp_check/log/009_twophase_london.log contains:
2026-02-13 21:03:28.248 EET [3987222] LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/3029190
...

(Without "timeout 60", the test just hangs — we can see the same in [1],
the test was killed with SIGTERM after 15000 seconds...)

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2026-02-04%2013%3A36%3A40&stg=recovery-check

Best regards,
Alexander

RE: BUG: Former primary node might stuck when started as a standby

From: "Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander,

> From my old records, 009_twophase.pl failed exactly due to background (
> namely, bgwriter's) activity.

Okay, so I think there are two reasons why the test could fail:

1) The old primary shut down before all changes were replicated. This can be
avoided by adding wait_for_replay_catchup() before the teardown.
2) The bgwriter on the old primary generated a RUNNING_XACTS record and the
node shut down before sending it.

... and you meant case 2), right? I recalled that the injection point
"skip-log-running-xacts" can be used to suppress generating that WAL record,
see 035_standby_logical_decoding.pl. My idea is to attach the injection point
before the switchover and thereby avoid adding the record.
The attached patch implements the idea.

How do you feel?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachments

Re: BUG: Former primary node might stuck when started as a standby

From: Alexander Lakhin
Date:
Dear Kuroda-san,

16.02.2026 05:10, Hayato Kuroda (Fujitsu) wrote:
>> From my old records, 009_twophase.pl failed exactly due to background
>> (namely, bgwriter's) activity.
> Okay, so I think there are two reasons why the test could fail:
>
> 1) The old primary shut down before all changes were replicated. This can be
> avoided by adding wait_for_replay_catchup() before the teardown.
> 2) The bgwriter on the old primary generated a RUNNING_XACTS record and the
> node shut down before sending it.
>
> ... and you meant case 2), right? I recalled that the injection point
> "skip-log-running-xacts" can be used to suppress generating that WAL record,
> see 035_standby_logical_decoding.pl. My idea is to attach the injection point
> before the switchover and thereby avoid adding the record.
> The attached patch implements the idea.
>
> How do you feel?

Unfortunately, the testing procedure I shared above still produces failures
with the patched 009_twophase.pl. With
wal_debug = on
in /tmp/temp.config, I can see:
...
ITERATION 3
...
11      t/009_twophase.pl .. ok
11      All tests successful.
11      Files=1, Tests=30, 10 wallclock secs ( 0.02 usr  0.00 sys +  0.27 cusr  0.63 csys =  0.92 CPU)
11      Result: PASS
20      make: *** [Makefile:28: check] Terminated
parallel: This job failed:
TEMP_CONFIG=/tmp/temp.config PROVE_TESTS=t/009* NO_TEMP_INSTALL=1 timeout 60 make check -s -C src/test/recovery_20

2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  statement: PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  INSERT @ 0/030227F8:  - Transaction/PREPARE: gid xact_009_10: 2026-02-17 07:06:44.313023+02; subxacts: 790
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  xlog flush request 0/030227F8; write 0/00000000; flush 0/00000000
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET background writer[754333] LOG:  INSERT @ 0/03022838:  - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790
2026-02-17 07:06:44.322 EET client backend[754949] 009_twophase.pl LOG:  statement: SELECT pg_current_wal_flush_lsn()
2026-02-17 07:06:44.330 EET client backend[754974] 009_twophase.pl LOG:  statement: SELECT '0/030227F8' <= replay_lsn AND state = 'streaming'
             FROM pg_catalog.pg_stat_replication
             WHERE application_name IN ('paris', 'walreceiver')
2026-02-17 07:06:44.337 EET postmaster[754306] LOG:  received immediate shutdown request
2026-02-17 07:06:44.339 EET postmaster[754306] LOG:  database system is shut down
...
2026-02-17 07:06:44.566 EET startup[755326] LOG:  REDO @ 0/03022618; LSN 0/03022650: prev 0/030225E0; xid 0; len 24 - Standby/RUNNING_XACTS: nextXid 789 latestCompletedXid 788 oldestRunningXid 789
2026-02-17 07:06:44.566 EET startup[755326] LOG:  REDO @ 0/03022650; LSN 0/030226A0: prev 0/03022618; xid 789; len 3; blkref #0: rel 1663/5/16384, blk 0 - Heap/INSERT: off: 21, flags: 0x00
2026-02-17 07:06:44.566 EET startup[755326] LOG:  REDO @ 0/030226A0; LSN 0/030226F0: prev 0/03022650; xid 790; len 3; blkref #0: rel 1663/5/16384, blk 0 - Heap/INSERT: off: 22, flags: 0x00
2026-02-17 07:06:44.566 EET startup[755326] LOG:  REDO @ 0/030226F0; LSN 0/030227F8: prev 0/030226A0; xid 789; len 232 - Transaction/PREPARE: gid xact_009_10: 2026-02-17 07:06:44.313023+02; subxacts: 790
2026-02-17 07:06:44.566 EET startup[755326] LOG:  REDO @ 0/030227F8; LSN 0/03022838: prev 0/030226F0; xid 0; len 32 - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790
2026-02-17 07:06:44.566 EET startup[755326] LOG:  invalid record length at 0/03022838: expected at least 24, got 0
2026-02-17 07:06:44.566 EET startup[755326] LOG:  consistent recovery state reached at 0/03022838
2026-02-17 07:06:44.566 EET postmaster[755283] LOG:  database system is ready to accept read-only connections
2026-02-17 07:06:44.569 EET walreceiver[755338] LOG:  fetching timeline history file for timeline 2 from primary server
2026-02-17 07:06:44.569 EET walreceiver[755338] LOG:  started streaming WAL from primary at 0/03000000 on timeline 1
2026-02-17 07:06:44.569 EET walreceiver[755338] LOG:  replication terminated by primary server
2026-02-17 07:06:44.569 EET walreceiver[755338] DETAIL:  End of WAL reached on timeline 1 at 0/030227F8.
2026-02-17 07:06:44.570 EET startup[755326] LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/03022838
...

Regarding the approach with "skip-log-running-xacts", I wonder if there is
a strong guarantee that no other WAL write can happen in the same window
(e.g., from autovacuum)?

Best regards,
Alexander

RE: BUG: Former primary node might stuck when started as a standby

From: "Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander,

> Unfortunately, the testing procedure I shared above still produces failures
> with the patched 009_twophase.pl.

Hmm, I ran the test for hours, but I could not reproduce the failure. But let
me analyze based on your log.

> With wal_debug = on
> in /tmp/temp.config, I can see:

> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  statement: PREPARE TRANSACTION 'xact_009_10';
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  INSERT @ 0/030227F8:  - Transaction/PREPARE: gid xact_009_10: 2026-02-17 07:06:44.313023+02; subxacts: 790
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  xlog flush request 0/030227F8; write 0/00000000; flush 0/00000000
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
> 2026-02-17 07:06:44.313 EET background writer[754333] LOG:  INSERT @ 0/03022838:  - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790
> 2026-02-17 07:06:44.322 EET client backend[754949] 009_twophase.pl LOG:  statement: SELECT pg_current_wal_flush_lsn()
> 2026-02-17 07:06:44.330 EET client backend[754974] 009_twophase.pl LOG:  statement: SELECT '0/030227F8' <= replay_lsn AND state = 'streaming'
>             FROM pg_catalog.pg_stat_replication
>             WHERE application_name IN ('paris', 'walreceiver')
> 2026-02-17 07:06:44.337 EET postmaster[754306] LOG:  received immediate shutdown request
> 2026-02-17 07:06:44.339 EET postmaster[754306] LOG:  database system is shut down

I have little experience reading wal_debug output, but the background writer
seems to have generated the RUNNING_XACTS record. That's different from my
expectation. To confirm, did you really enable injection points? For now
009_twophase can work without `-Dinjection_points=true`, but it should be set
to avoid random failures.

Another possibility is that the bgwriter had already passed
`IS_INJECTION_POINT_ATTACHED("skip-log-running-xacts")` before the injection
point was attached. In that case the background writer would generate the
RUNNING_XACTS record at the bad timing. One possible workaround is to attach
the injection point a bit earlier; it does not solve the root cause, though.
The attached patch applies this band-aid.

> Regarding the approach with "skip-log-running-xacts", I wonder if there is
> a strong guarantee that no other WAL write can happen in the same window
> (e.g., from autovacuum)?

Good point, autovacuum and checkpoints should also be avoided. IIUC that can
be done by setting GUCs.
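A sketch of what that could look like, appended to the old primary's
postgresql.conf before the switchover (the GUC names are real; the chosen
values, and whether this closes the window completely, are assumptions on my
part):

```shell
# Quiesce background WAL sources so that, ideally, only the test's own
# backends write WAL during the switchover window.
PGDATA=$(mktemp -d)   # stands in for the node's real data directory

cat >> "$PGDATA/postgresql.conf" <<'EOF'
autovacuum = off          # no autovacuum/autoanalyze WAL
checkpoint_timeout = 1h   # keep the checkpointer quiet for the test's duration
EOF
```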

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments

Re: BUG: Former primary node might stuck when started as a standby

From
Alexander Lakhin
Date:
Dear Kuroda-san,

19.02.2026 05:50, Hayato Kuroda (Fujitsu) wrote:
> Dear Alexander,
>
>> Unfortunately, the testing procedure I shared above still produces failures
>> with the patched 009_twophase.pl.
> Hmm, I ran the test for hours, but I could not reproduce the failure. Let me
> analyze based on your log.

Please look at the attached self-contained script. It works for me (it failed
on iterations 6, 12, and 2 just now, on my workstation with a Ryzen 7900X) --
you could probably adjust the number of parallel jobs to reproduce it on your
hardware.

> I have little experience reading wal_debug output, but the background writer
> seems to generate the RUNNING_XACTS record, which differs from my expectation.
> To confirm: did you actually enable injection points? For now 009_twophase can
> run without `-Dinjection_points=true`, but it should be set to avoid random failures.

I think it failed before the injection point was attached. My log contains:
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  xlog flush request 0/030227F8; write 0/00000000; flush 0/00000000
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET background writer[754333] LOG:  INSERT @ 0/03022838:  - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790

As far as I can see, it corresponds to this place in the test:
      SAVEPOINT s1;
      INSERT INTO t_009_tbl VALUES (22, 'issued to ${cur_primary_name}');
      PREPARE TRANSACTION 'xact_009_10';");
+$cur_primary->wait_for_replay_catchup($cur_standby);
  $cur_primary->teardown_node;
  $cur_standby->promote;

And as we found out before, wait_for_replay_catchup() before teardown
doesn't help... I can't say for sure, but from my experiments, the test
didn't fail with $cur_primary->stop instead of $cur_primary->teardown_node.

Best regards,
Alexander
Attachments

Re: BUG: Former primary node might stuck when started as a standby

From
Alexander Lakhin
Date:
Dear Kuroda-san,

[ back to -hackers after short off-list discussion ]

20.02.2026 04:00, Alexander Lakhin wrote:
>
> Please look at the attached self-contained script. It works for me (failed
> on iterations 6, 12, 2 right now, on my workstation with Ryzen 7900X) --
> probably you could adjust number of parallel jobs to reproduce it on your
> hardware.
>

As it turned out, the v2 patch works as expected, but the test may still fail
when the build is configured without injection points. Maybe this test (and
similar ones) should be skipped in that case; not sure...

Best regards,
Alexander



Re: BUG: Former primary node might stuck when started as a standby

From
Michael Paquier
Date:
On Mon, Mar 02, 2026 at 09:00:00AM +0200, Alexander Lakhin wrote:
> As it turned out, the v2 patch works as expected, but the test may still fail
> when the build is configured without injection points. Maybe this test (and
> similar ones) should be skipped in that case; not sure...

Exactly, I don't see what else we can do here except skip the
sequences of the test that we know may fail if injection points are
not enabled in the build.

Looking at v2, the patch ought to comment the reason *why* these tests
are skipped and *why* an injection point is used.  The reader should
have more details than just a small hint about a set of "random
failures".  I'd suggest to document that once with the first injection
point attached, and have the other blocks hold comments telling to
look at the first block.
--
Michael

Attachments

RE: BUG: Former primary node might stuck when started as a standby

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander, Michael,

> On Mon, Mar 02, 2026 at 09:00:00AM +0200, Alexander Lakhin wrote:
> > As it turned out, the v2 patch works as expected, but the test may still fail
> > when the build is configured without injection points. Maybe this test (and
> > similar ones) should be skipped in that case; not sure...
>
> Exactly, I don't see what else we can do here except skip the
> sequences of the test that we know may fail if injection points are
> not enabled in the build.

I had a concern that some BF animals have not enabled injection points yet,
so coverage might decrease for them. But fixing it that way is OK with me.

> Looking at v2, the patch ought to comment the reason *why* these tests
> are skipped and *why* an injection point is used.  The reader should
> have more details than just a small hint about a set of "random
> failures".  I'd suggest to document that once with the first injection
> point attached, and have the other blocks hold comments telling to
> look at the first block.

I preferred to add descriptions at the place checking enable_injection_points.
See the updated version.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachments

Re: BUG: Former primary node might stuck when started as a standby

From
Michael Paquier
Date:
On Tue, Mar 03, 2026 at 04:02:53AM +0000, Hayato Kuroda (Fujitsu) wrote:
> I had a concern that some BF animals have not enabled injection points yet,
> so coverage might decrease for them. But fixing it that way is OK with me.

Requiring injection points to be enabled so as we have a strict
control over the standby snapshot records does not strike me as a bad
requirement in itself.  Most of the animals use the switch these days.
It's a bit sad if this is not entirely stable in pre-v16 branches, but
a stable post-v17 behavior would always be better than an unstable
behavior everywhere.

> I preferred to add descriptions at the place checking enable_injection_points.
> See the updated version.

+    autovacuum = off
+    checkpoint_timeout = 1h

Why do we need these?  An explanation seems in order in the shape of a
commit, or these should be removed.

Is there a different trick than the one posted at [1] to check the
stability of the proposal?  I am wondering if I am missing something,
or if that's all.  Alexander?

[1]: https://www.postgresql.org/message-id/e1cf52d2-c344-4dfd-a918-e5f20ff04fa2@gmail.com
--
Michael

Attachments

Re: BUG: Former primary node might stuck when started as a standby

From
Alexander Lakhin
Date:
Hello Michael,

03.03.2026 08:13, Michael Paquier wrote:
> +	autovacuum = off
> +	checkpoint_timeout = 1h
>
> Why do we need these?  An explanation seems in order in the shape of a
> commit, or these should be removed.
>
> Is there a different trick than the one posted at [1] to check the
> stability of the proposal?  I am wondering if I am missing something,
> or if that's all.  Alexander?
>
> [1]: https://www.postgresql.org/message-id/e1cf52d2-c344-4dfd-a918-e5f20ff04fa2@gmail.com

There was also a separate test (independent of bgwriter, autovacuum,
checkpointer,...) at the top of the thread: [1].

FWIW, 004_timeline_switch also failed in the past month in the same way [2].
I've just tested 012_subtransactions.pl with the script I proposed at [3],
and it passed 100 iterations (because it performs $node_primary->stop;
before $node_standby->promote;, I suppose).

[1] https://www.postgresql.org/message-id/b0102688-6d6c-c86a-db79-e0e91d245b1a%40gmail.com
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2026-02-07%2006%3A33%3A20
[3] https://www.postgresql.org/message-id/045cab6f-4738-417e-b551-01adba44d6c3%40gmail.com

Best regards,
Alexander

RE: BUG: Former primary node might stuck when started as a standby

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Michael,

> Requiring injection points to be enabled so as we have a strict
> control over the standby snapshot records does not strike me as a bad
> requirement in itself.  Most of the animals use the switch these days.
> It's a bit sad if this is not entirely stable in pre-v16 branches, but
> a stable post-v17 behavior would always be better than an unstable
> behavior everywhere.

OK, thanks for the clarification. I will create patches for the back branches
after this is stabilized on HEAD.
(The current one might apply as-is; not sure.)

> +    autovacuum = off
> +    checkpoint_timeout = 1h
>
> Why do we need these?  An explanation seems in order in the shape of a
> commit, or these should be removed.

I think they are needed to stabilize the test. If the checkpointer and/or
autovacuum generate WAL during the switchover, the test can fail for the same
reason as with bgwriter's activity. I added comments about this.

> Is there a different trick than the one posted at [1] to check the
> stability of the proposal?  I am wondering if I am missing something,
> or if that's all.  Alexander?
>
> [1]: https://www.postgresql.org/message-id/e1cf52d2-c344-4dfd-a918-e5f20ff04fa2@gmail.com

To clarify, I'm not aware of one.

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachments

RE: BUG: Former primary node might stuck when started as a standby

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander,


> FWIW, 004_timeline_switch also failed in the past month in the same way [2].

Thanks for the info. I can provide a patch for that after the issue in
009_twophase.pl is fixed. For clarity, we could fork a new thread for it.

> I've just tested 012_subtransactions.pl with the script I proposed at [3]
> and it passed 100 iterations (because it performs $node_primary->stop;
> before $node_standby->promote;, I suppose).

I guess so. cluster::stop runs the `pg_ctl stop -m fast` command; in that
mode the walsender waits until there is nothing left to send, see
WalSndLoop(). Do let me know if you have observed a similar failure here.
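
To make the failure mode concrete, here is a toy model (plain Perl, not PostgreSQL code; the LSNs are taken from the wal_debug output earlier in the thread) of the check behind "new timeline N forked off current database system timeline M before current recovery point X/X":

```perl
use strict;
use warnings;

# The restarted old primary can follow the promoted standby's timeline only
# if that timeline forked at or after the old primary's own recovery point.
sub can_follow_new_timeline
{
	my ($fork_lsn, $local_recovery_lsn) = @_;
	return $fork_lsn >= $local_recovery_lsn;
}

# teardown_node case: the standby promoted at 0/304DB78, but the old primary
# had already flushed up to 0/304DBF0 (e.g. a late RUNNING_XACTS record the
# standby never received), so rejoining fails with the error above.
print can_follow_new_timeline(0x304DB78, 0x304DBF0) ? "ok\n" : "forked too early\n";

# Clean fast shutdown: the walsender drained everything, so both ends agree.
print can_follow_new_timeline(0x304DBF0, 0x304DBF0) ? "ok\n" : "forked too early\n";
```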

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: BUG: Former primary node might stuck when started as a standby

From
Michael Paquier
Date:
On Tue, Mar 03, 2026 at 09:17:16AM +0000, Hayato Kuroda (Fujitsu) wrote:
> Thanks for the info. So I can provide the patch after the issue for 009_twophase.pl
> is fixed. For better understanding we may be able to fork new
> thread.

Regarding your posted v4, I am actually not convinced that there is a
need for injection points and disabling standby snapshots, for the
three sequences of tests proposed.

While the first wait_for_replay_catchup() can be useful before the
teardown_node() of the primary in the "Check that prepared
transactions can be committed on promoted standby" sequence, it still
has a limited impact.  It looks like we could have other parasite
records as well, depending on how slowly the primary is stopped?  I
think that we should switch to a plain stop() of the primary, the test
wants to check that prepared transactions can be committed on a
standby.  Stopping the primary abruptly does not matter for this
sequence.

For the second wait_for_replay_catchup(), after the PREPARE of
xact_009_11: I may be missing something, but how does it change
things?  A plain stop() of the primary means that the standby would have
received all the WAL records from the primary on disk in its pg_wal,
no?  Upon restart, it should replay everything it finds in pg_wal/.  I
don't see a change required here.

For the third wait_for_replay_catchup(), after the PREPARE of
xact_009_12, same dance.  The primary is cleanly stopped first.  All
the WAL records of the primary should have been flushed to the
standby.

As a whole, it looks like we should just switch the teardown() call to
a stop() call in the first test with xact_009_10, backpatch it, and
call it a day.  No need for injection points and no need for GUC
tweaks.  I have not looked at 004_timeline_switch yet.
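
Concretely, the suggested change could be sketched as follows (the surrounding test context is assumed, not quoted from an actual patch):

```perl
# Before: teardown_node() kills the primary abruptly, so the standby can
# promote without having received the primary's last WAL records, and the
# old primary may later fail the timeline check when restarted as a standby.
#   $cur_primary->teardown_node;

# After: a clean fast shutdown; the walsender drains all remaining WAL
# first, so the promoted standby forks its timeline at the very end of the
# old primary's WAL.
$cur_primary->stop;
$cur_standby->promote;
```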

> I guess so. cluster::stop does the `pg_ctl stop -m fast` command. In this case
> the walsender waits till there are nothing to be sent, see WalSndLoop().
> Do let me know if you have observed the similar failure here.

Exactly.  Doing a clean stop of the primary offers a strong guarantee
here.  We are sure that the standby will have received all the records
from the primary.  Timeline forking is an impossible thing in
012_subtransactions.pl based on how the switchover from the primary to
the standby happens.  I don't see a need for tweaking this test at
all.  Or perhaps you did see a failure of some kind in this test,
Alexander?
--
Michael

Attachments

Re: BUG: Former primary node might stuck when started as a standby

From
Michael Paquier
Date:
On Wed, Mar 04, 2026 at 02:31:29PM +0900, Michael Paquier wrote:
> As a whole, it looks like we should just switch the teardown() call to
> a stop() call in the first test with xact_009_10, backpatch it, and
> call it a day.  No need for injection points and no need for GUC
> tweaks.

With a little bit more patience, I have reproduced the same failure as
Alexander using the bgwriter trick, -DWAL_DEBUG and his reproducer
script with parallel runs of the 009 recovery test.  The attached
patch is also proving to work.  The failure happens at the 2nd~3rd
iteration without the fix, and the tests are able to last more than 50
iterations with the fix.

As far as I can see by scanning the history of the test, this is a
copy-pasto coming from 30820982b295 where the tests were initially
introduced, where teardown_node() was copied across the test
sequences.  As we want to check that a promoted standby is able to
commit the 2PC transactions issued on the primary, a plain stop() will
equally work.

I'll push this fix shortly, taking care of one instability.  Nice
investigation on this one, Alexander, by the way.
--
Michael

Attachments

Re: BUG: Former primary node might stuck when started as a standby

From
Michael Paquier
Date:
On Wed, Mar 04, 2026 at 03:29:35PM +0900, Michael Paquier wrote:
> I'll push this fix shortly, taking care of one instability.  Nice
> investigation on this one, Alexander, by the way.

With the patch attached, for reference..
--
Michael

Attachments

RE: BUG: Former primary node might stuck when started as a standby

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Michael,

Thanks for the smarter solution. I had mistakenly assumed that teardown_node
was used everywhere by necessity. Your analysis looks completely correct.

Best regards,
Hayato Kuroda
FUJITSU LIMITED




Re: BUG: Former primary node might stuck when started as a standby

From
Alexander Lakhin
Date:
Hello Michael,

04.03.2026 07:31, Michael Paquier wrote:
>> I guess so. cluster::stop runs the `pg_ctl stop -m fast` command; in that
>> mode the walsender waits until there is nothing left to send, see
>> WalSndLoop(). Do let me know if you have observed a similar failure here.
> Exactly.  Doing a clean stop of the primary offers a strong guarantee
> here.  We are sure that the standby will have received all the records
> from the primary.  Timeline forking is an impossible thing in
> 012_subtransactions.pl based on how the switchover from the primary to
> the standby happens.  I don't see a need for tweaking this test at
> all.  Or perhaps you did see a failure of some kind in this test,
> Alexander?


Yes, 012_subtransactions doesn't fail with an aggressive bgwriter, as I noted
before. I mentioned it exactly to show that stop does matter here. But if
we recognize teardown_node as risky in this context, maybe it would make
sense to also review the other tests in recovery/. I already wrote about
004_timeline_switch, but probably there are more. E.g., 028_pitr_timelines
(I haven't tested it intensively yet) does:
$node_primary->stop('immediate');

# Promote the standby, and switch WAL so that it archives a WAL segment
# that contains all the INSERTs, on a new timeline.
$node_standby->promote;

Best regards,
Alexander

Re: BUG: Former primary node might stuck when started as a standby

From
Michael Paquier
Date:
On Wed, Mar 04, 2026 at 10:00:00AM +0200, Alexander Lakhin wrote:
> Yes, 012_subtransactions doesn't fail with an aggressive bgwriter, as I noted
> before. I mentioned it exactly to show that stop does matter here. But if
> we recognize teardown_node as risky in this context, maybe it would make
> sense to also review the other tests in recovery/. I already wrote about
> 004_timeline_switch, but probably there are more. E.g., 028_pitr_timelines
> (I haven't tested it intensively yet) does:
> $node_primary->stop('immediate');
>
> # Promote the standby, and switch WAL so that it archives a WAL segment
> # that contains all the INSERTs, on a new timeline.
> $node_standby->promote;

I think that your take about 004 is actually right, looking at it more
closely.  By tearing down the primary, it could be possible that
standby_2 receives more records than standby_1.  Then, when we try to
reconnect standby_2 to the promoted standby_1, the TLI could fork, in
theory.  The fix would be the same: by switching to stop(), we'd make
sure that both standby_1 and standby_2 have received all the records
from the primary.  We can also remove the wait_for_catchup() before
the primary is stopped, this offers no protection for standby_2
receiving more records from the primary than standby_1.

It is not surprising that this failure with a three-node scenario is
much harder to reproduce.  I have run the same loop as for 009, but things
are super stable even after 50-ish iterations.  By reading the code,
I agree that the failure is possible to reach in theory, though.  Some
hardcoded sleeps would do the trick (make bgwriter aggressive, patch
the checkpointer so that we do not send the last standby snapshot
records to standby_2, only to standby_1, etc.).

Did you find any buildfarm failures involving 028?  I cannot get
excited about changing tests where nothing has happened, and this test
looks OK since we don't do a switchover there.  For 004, we have at least
one failure recorded, based on what you said.  That fact is sufficient
for me to fix things for 004.
--
Michael

Attachments

Re: BUG: Former primary node might stuck when started as a standby

From
Alexander Lakhin
Date:
04.03.2026 10:23, Michael Paquier wrote:
> Did you find any buildfarm failures involving 028?  I cannot get
> excited in changing tests where nothing has happened, and this test
> looks OK as we don't do a switchover.  For 004, we have at least one
> failure recorded based on what you said.  That's a fact sufficient for
> me to fix things, for 004.

No, I didn't. I agree that eliminating already-known failures makes more
sense. Anyway, we'll have this thread to explain why such failures can
happen.

Best regards,
Alexander



Re: BUG: Former primary node might stuck when started as a standby

From
Michael Paquier
Date:
On Wed, Mar 04, 2026 at 11:00:00AM +0200, Alexander Lakhin wrote:
> No, I didn't. I agree that eliminating already-known failures makes more
> sense. Anyway, we'll have this thread to explain why such failures can
> happen.

Fixed this second instability now with the same trick as for 009, down to
v14, as there is no strong requirement for an immediate stop of the
primary to check that a standby is able to do a TLI jump when
reconnecting to the promoted standby.
--
Michael

Attachments

RE: BUG: Former primary node might stuck when started as a standby

From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Michael, Alexander,

Thanks for pushing the fixes.
I checked the buildfarm animals and could not find failures related to this
(configure failures might mask them, though). We can keep an eye on it and
come back if the tests are still not robust.
Thanks!

Best regards,
Hayato Kuroda
FUJITSU LIMITED