Thread: BUG: Former primary node might stuck when started as a standby


BUG: Former primary node might stuck when started as a standby

From:
Alexander Lakhin
Date:
Hello hackers,

[ reporting this bug here due to limitations of the bug reporting form ]

When a node that acted as a primary becomes a standby, as in the
following script:
[ ... some WAL-logged activity ... ]
$primary->teardown_node;
$standby->promote;

$primary->enable_streaming($standby);
$primary->start;

it might not go online due to the error:
new timeline N forked off current database system timeline M before current recovery point X/X

A complete TAP test is attached.
I put it in src/test/recovery/t and run it as follows:
for i in `seq 100`; do echo "iteration $i"; timeout 60 make -s check -C src/test/recovery PROVE_TESTS="t/099*" || break; done
and get:
...
iteration 7
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. ok
All tests successful.
Files=1, Tests=20, 14 wallclock secs ( 0.01 usr  0.00 sys +  4.20 cusr  4.75 csys =  8.96 CPU)
Result: PASS
iteration 8
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. 9/? make: *** [Makefile:23: check] Terminated

With wal_debug enabled (and log_min_messages=DEBUG2, log_statement=all),
I see the following in the _node1.log:
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG: INSERT @ 0/304DBF0:  - Transaction/COMMIT: 2024-01-18 15:21:02.258739+00
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl LOG:  xlog flush request 0/304DBF0; write 0/0; flush 0/0
2024-01-18 15:21:02.258 UTC [663701] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (10, 'inserted on node1');
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DB78 apply 0/304DB78 reply_time 2024-01-18 15:21:02.2588+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DBF0 apply 0/304DB78 reply_time 2024-01-18 15:21:02.258809+00
2024-01-18 15:21:02.258 UTC [663671] node2 DEBUG:  write 0/304DBF0 flush 0/304DBF0 apply 0/304DBF0 reply_time 2024-01-18 15:21:02.258864+00
2024-01-18 15:21:02.259 UTC [663563] DEBUG:  server process (PID 663701) exited with exit code 0
2024-01-18 15:21:02.260 UTC [663563] DEBUG:  forked new backend, pid=663704 socket=8
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: statement: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC40:  - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663563] DEBUG:  postmaster received shutdown request signal
2024-01-18 15:21:02.261 UTC [663563] LOG:  received immediate shutdown request
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG: INSERT @ 0/304DC68:  - Transaction/COMMIT: 2024-01-18 15:21:02.261828+00
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl LOG:  xlog flush request 0/304DC68; write 0/0; flush 0/0
2024-01-18 15:21:02.261 UTC [663704] 099_change_roles.pl STATEMENT: INSERT INTO t VALUES (1000 * 1 + 608, 'background activity');
...
2024-01-18 15:21:02.262 UTC [663563] LOG:  database system is shut down
...
2024-01-18 15:21:02.474 UTC [663810] LOG:  starting PostgreSQL 16.1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
...
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DBC8; LSN 0/304DBF0: prev 0/304DB78; xid 898; len 8 - Transaction/COMMIT: 2024-01-18 15:21:02.258739+00
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DBF0; LSN 0/304DC40: prev 0/304DBC8; xid 899; len 3; blkref #0: rel 1663/5/16384, blk 1 - Heap/INSERT: off: 12, flags: 0x00
2024-01-18 15:21:02.478 UTC [663816] LOG:  REDO @ 0/304DC40; LSN 0/304DC68: prev 0/304DBF0; xid 899; len 8 - Transaction/COMMIT: 2024-01-18 15:21:02.261828+00
...
2024-01-18 15:21:02.481 UTC [663819] LOG:  fetching timeline history file for timeline 20 from primary server
2024-01-18 15:21:02.481 UTC [663819] LOG:  started streaming WAL from primary at 0/3000000 on timeline 19
...
2024-01-18 15:21:02.481 UTC [663819] DETAIL:  End of WAL reached on timeline 19 at 0/304DBF0.
...
2024-01-18 15:21:02.481 UTC [663816] LOG:  new timeline 20 forked off current database system timeline 19 before current recovery point 0/304DC68

In this case, node1 wrote to its WAL record 0/304DC68, but sent to node2
only record 0/304DBF0; then node2, being promoted to primary, forked the next
timeline from that point, but when node1 was started as a standby, it first
replayed 0/304DC68 from its local WAL and then could not switch to the new
timeline, which starts from the earlier position.
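The refusal comes from the standby's timeline-validity check: a new timeline is usable only if its fork (switch) point is at or after the position the node has already replayed. Here is a toy model of just that condition (hypothetical helper name, LSNs simplified to plain integers; not PostgreSQL code):

```shell
# Toy model of the "new timeline ... forked off ... before current recovery
# point" refusal: a standby can follow a new timeline only if the fork point
# is not behind what it has already replayed locally.
check_new_timeline() {
    switch_point=$1   # where the new timeline forked off the old one
    recovery_point=$2 # how far this node has already replayed
    if [ "$switch_point" -lt "$recovery_point" ]; then
        echo "refuse: forked off before current recovery point"
    else
        echo "follow: switch point not yet replayed"
    fi
}

# node2 forked timeline 20 at 0/304DBF0, but node1 had already replayed its
# own WAL up to 0/304DC68, so the switch is refused:
check_new_timeline "$((0x304DBF0))" "$((0x304DC68))"
# had node1 not replayed past the fork point, it could follow the timeline:
check_new_timeline "$((0x304DBF0))" "$((0x304DBF0))"
```

So the failure hinges on node1 replaying local WAL past the point from which node2's timeline forked.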

Reproduced on REL_12_STABLE .. master.

Best regards,
Alexander
Attachments

Re: BUG: Former primary node might stuck when started as a standby

From:
Aleksander Alekseev
Date:
Hi,

> it might not go online, due to the error:
> new timeline N forked off current database system timeline M before current recovery point X/X
> [...]
> In this case, node1 wrote to its WAL record 0/304DC68, but sent to node2
> only record 0/304DBF0; then node2, being promoted to primary, forked the next
> timeline from that point, but when node1 was started as a standby, it first
> replayed 0/304DC68 from its local WAL and then could not switch to the new
> timeline, which starts from the earlier position.

Unless I'm missing something, this is just the right behavior of the system.

node1 has no way of knowing the promotion history of node1/node2/nodeN.
It sees that it has more data and/or an inconsistent timeline with
respect to another node, and it refuses to proceed until a DBA
intervenes. What else can node1 do, drop the data? That's not how
things are done in Postgres :) What if this is very important data
and node2 was promoted by mistake, either manually or by a buggy
script?

It's been a while since I seriously played with replication, but if
memory serves, the proper way to switch node1 to replica mode would be
to run pg_rewind on it first.
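As a sketch of that procedure (paths, the host name, and the connection string are placeholders, not taken from the thread; note that pg_rewind requires the cluster to have been initialized with data checksums or to run with wal_log_hints = on):

```shell
# Hypothetical layout: /path/to/node1_data is the former primary's data
# directory; node2 is the current primary, reachable via libpq.
pg_ctl -D /path/to/node1_data stop -m fast

# Rewind node1 back to the point where node2's timeline forked off,
# discarding the local WAL that was never replicated.
pg_rewind --target-pgdata=/path/to/node1_data \
          --source-server='host=node2 port=5432 dbname=postgres'

# Configure primary_conninfo in postgresql.conf, then start node1 as a
# standby (standby.signal, v12 and later).
touch /path/to/node1_data/standby.signal
pg_ctl -D /path/to/node1_data start
```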

-- 
Best regards,
Aleksander Alekseev



Re: BUG: Former primary node might stuck when started as a standby

From:
Alexander Lakhin
Date:
Hi Aleksander,

19.01.2024 14:45, Aleksander Alekseev wrote:
>
>> it might not go online, due to the error:
>> new timeline N forked off current database system timeline M before current recovery point X/X
>> [...]
>> In this case, node1 wrote to its WAL record 0/304DC68, but sent to node2
>> only record 0/304DBF0; then node2, being promoted to primary, forked the next
>> timeline from that point, but when node1 was started as a standby, it first
>> replayed 0/304DC68 from its local WAL and then could not switch to the new
>> timeline, which starts from the earlier position.
> Unless I'm missing something, this is just the right behavior of the system.

Thank you for the answer!

> node1 has no way of knowing the promotion history of node1/node2/nodeN.
> It sees that it has more data and/or an inconsistent timeline with
> respect to another node, and it refuses to proceed until a DBA
> intervenes.

But node1 knows that it's a standby now and that it's expected to get all
the WAL records from the primary, doesn't it?
Maybe it could redo as few records as possible from its own WAL before
requesting records from the authoritative source...
Or is it assumed that replaying local WAL first is more efficient (not on
the first restart, but on later ones)?

>   What else can node1 do, drop the data? That's not how
> things are done in Postgres :)

If no other options exist (that is, this behavior is really correct and
the only one possible), maybe the server should just stop?
Can a DBA intervene somehow to make the server proceed without stopping it?

> It's been a while since I seriously played with replication, but if
> memory serves, a proper way to switch node1 to a replica mode would be
> to use pg_rewind on it first.

Perhaps that's true generally, but as we can see, this scenario works just
fine without the extra records replayed. Moreover, existing tests rely on
it, e.g., 009_twophase.pl or 012_subtransactions.pl (in fact, my
investigation of the issue was prompted by a test failure).

Best regards,
Alexander



Re: BUG: Former primary node might stuck when started as a standby

From:
Aleksander Alekseev
Date:
Hi,

> But node1 knows that it's a standby now and it's expected to get all the
> WAL records from the primary, doesn't it?

Yes, but node1 doesn't know whether it has always been a standby or not.
What if node1 was always a standby, node2 was the primary, then node2 died
and node3 is the new primary? If node1 sees an inconsistency in the WAL
records, it should report it and stop doing anything, since it doesn't
have all the information needed to resolve the inconsistencies in all
possible cases. Only the DBA has this information.

> > It's been a while since I seriously played with replication, but if
> > memory serves, a proper way to switch node1 to a replica mode would be
> > to use pg_rewind on it first.
>
> Perhaps that's true generally, but as we can see, this scenario works just
> fine without the extra records replayed. Moreover, existing tests rely on
> it, e.g., 009_twophase.pl or 012_subtransactions.pl (in fact, my
> investigation of the issue was prompted by a test failure).

I suggest focusing on particular flaky tests then and how to fix them.

-- 
Best regards,
Aleksander Alekseev



Re: BUG: Former primary node might stuck when started as a standby

From:
Alexander Lakhin
Date:
Hi Aleksander,

[ I'm writing this off-list to minimize noise, but we can continue the discussion in -hackers, if you wish ]

22.01.2024 14:00, Aleksander Alekseev wrote:

> Hi,
>
>> But node1 knows that it's a standby now and it's expected to get all the
>> WAL records from the primary, doesn't it?
> Yes, but node1 doesn't know whether it has always been a standby or not.
> What if node1 was always a standby, node2 was the primary, then node2 died
> and node3 is the new primary?

Excuse me, but I still can't understand what could go wrong in this case.
Let's suppose node1 has WAL with the following contents before start:
CPLOC | TL1R1 | TL1R2 | TL1R3 |

while node2's WAL contains:
TL1R1 | TL2R1 | TL2R2 | ...

where CPLOC is a checkpoint location and TLxRy is record y on timeline x.

I assume that requesting all WAL records from node2 without redoing local
records should be the right thing.

And even in the situation you propose:
CPLOC | TL2R5 | TL2R6 | TL2R7 |

while node3's WAL contains:
TL2R5 | TL3R1 | TL3R2 | ...

I see no issue with applying records from node3...

> If node1 sees an inconsistency in the WAL
> records, it should report it and stop doing anything, since it doesn't
> have all the information needed to resolve the inconsistencies in all
> possible cases. Only the DBA has this information.

I still wonder what can be considered an inconsistency in this situation.
Doesn't redoing exactly all of the local WAL records create the
inconsistency here?
For me, it's a question of an authoritative source, and if we had such a
source, we should trust its records only.

Or in other words, what if the record TL1R3, which node1 wrote to its
WAL but didn't send to node2, happened to have an incorrect checksum (due
to a partial write, for example)?
If I understand correctly, node1 would just stop redoing WAL at that
position, receive all the following records from node2, and move forward
without reporting the inconsistency (an extra WAL record).

Best regards,
Alexander



Re: BUG: Former primary node might stuck when started as a standby

From:
Alexander Lakhin
Date:
Hello,

22.01.2024 13:00, Aleksander Alekseev wrote:
>> Perhaps that's true generally, but as we can see, without the extra
>> records replayed, this scenario works just fine. Moreover, existing tests
>> rely on it, e.g., 009_twophase.pl or 012_subtransactions.pl (in fact, my
>> research of the issue was initiated per a test failure).
> I suggest focusing on particular flaky tests then and how to fix them.

For the record: skink produced the same failure with 009_twophase.pl
yesterday [1]:

366/366 postgresql:recovery / recovery/009_twophase                                     TIMEOUT        15000.31s   killed by signal 15 SIGTERM

[15:11:30.591](0.807s) ok 11 - Restore of prepared transaction on promoted standby
### Enabling streaming replication for node "london"
### Starting node "london"
# Running: pg_ctl --wait --pgdata /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_london_data/pgdata --log /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/log/009_twophase_london.log --options --cluster-name=london start
waiting for server to start....... done
server started
# Postmaster PID for node "london" is 4181556
death by signal at /home/bf/bf-build/skink-master/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm line 181.
# Postmaster PID for node "london" is 4181556
### Stopping node "london" using mode immediate
# Running: pg_ctl --pgdata /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_london_data/pgdata --mode immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "london"
# Postmaster PID for node "paris" is 4147391
### Stopping node "paris" using mode immediate
# Running: pg_ctl --pgdata /home/bf/bf-build/skink-master/HEAD/pgsql.build/testrun/recovery/009_twophase/data/t_009_twophase_paris_data/pgdata --mode immediate stop
waiting for server to shut down.... done
server stopped
# No postmaster PID for node "paris"
[19:19:46.999](14896.408s) # Tests were run but no plan was declared and done_testing() was not seen.
[19:19:46.999](0.000s) # Looks like your test exited with 4 just after 11.

pgsql.build/testrun/recovery/009_twophase/log/009_twophase_london.log
2026-02-04 15:11:35.675 CET [4181556][postmaster][:0] LOG:  starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-15.2.0, 64-bit
2026-02-04 15:11:35.692 CET [4181556][postmaster][:0] LOG:  listening on Unix socket "/tmp/IzuUID03TM/.s.PGSQL.22911"
2026-02-04 15:11:35.904 CET [4183900][startup][:0] LOG:  database system was interrupted; last known up at 2026-02-04 15:11:14 CET
2026-02-04 15:11:36.017 CET [4183900][startup][:0] LOG:  entering standby mode
2026-02-04 15:11:36.019 CET [4183900][startup][:0] LOG:  database system was not properly shut down; automatic recovery in progress
2026-02-04 15:11:36.104 CET [4183900][startup][38/0:0] LOG:  redo starts at 0/03021600
2026-02-04 15:11:36.206 CET [4183900][startup][38/0:0] LOG:  invalid record length at 0/030220B8: expected at least 24, got 0
2026-02-04 15:11:36.209 CET [4183900][startup][38/0:0] LOG:  consistent recovery state reached at 0/030220B8
2026-02-04 15:11:36.210 CET [4181556][postmaster][:0] LOG:  database system is ready to accept read-only connections
2026-02-04 15:11:36.919 CET [4184142][walreceiver][:0] LOG:  fetching timeline history file for timeline 2 from primary server
2026-02-04 15:11:36.997 CET [4184142][walreceiver][:0] LOG:  started streaming WAL from primary at 0/03000000 on timeline 1
2026-02-04 15:11:37.076 CET [4184142][walreceiver][:0] LOG:  replication terminated by primary server
2026-02-04 15:11:37.076 CET [4184142][walreceiver][:0] DETAIL:  End of WAL reached on timeline 1 at 0/03022070.
2026-02-04 15:11:37.098 CET [4183900][startup][38/0:0] LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/030220B8
2026-02-04 15:11:37.103 CET [4184142][walreceiver][:0] LOG:  restarted WAL streaming at 0/03000000 on timeline 1
2026-02-04 15:11:37.107 CET [4184142][walreceiver][:0] LOG:  replication terminated by primary server
...

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2026-02-04%2013%3A36%3A40&stg=recovery-check

Best regards,
Alexander

RE: BUG: Former primary node might stuck when started as a standby

From:
"Hayato Kuroda (Fujitsu)"
Date:

Dear Alexander, Aleksander,

Is it enough to wait until all WAL is replayed before the promotion?
Like attached.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments

Re: BUG: Former primary node might stuck when started as a standby

From:
Alexander Lakhin
Date:
Dear Kuroda-san,

09.02.2026 04:38, Hayato Kuroda (Fujitsu) wrote:

> Dear Alexander, Aleksander,
>
> Is it enough to wait until all WAL is replayed before the promotion?
> Like attached.

Thank you for paying attention to this!

I tried to add
$primary->wait_for_replay_catchup($standby);
in my test (see attached), but that doesn't help -- it still fails as below:
iteration 61
# +++ tap check in src/test/recovery +++
t/099_change_roles.pl .. 1/? make: *** [Makefile:28: check] Terminated

src/test/recovery/tmp_check/log/099_change_roles_node1.log
2026-02-12 20:30:42.953 EET [2716149] LOG:  new timeline 4 forked off current database system timeline 3 before current recovery point 0/3025B48

Best regards,
Alexander
Attachments

RE: BUG: Former primary node might stuck when started as a standby

From:
"Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander,

I checked your test and reproduced the issue with it.
Is it possible that an INSERT happened between wait_for_replay_catchup and
teardown_node? In that case we cannot ensure that the WAL records generated
in that time window reached the standby, right?
Similar stuff won't happen in 009_twophase.pl because it does not have the
background activities.

The log obtained around that point is shown in [1].

[1]
```
2026-02-13 10:34:56.751 JST client backend[521764] 099_change_roles.pl LOG:  statement: INSERT INTO t VALUES (1000 * 1 + 2812, 'background activity');
2026-02-13 10:34:56.753 JST client backend[521762] 099_change_roles.pl LOG:  statement: SELECT '0/030ED4F8' <= replay_lsn AND state = 'streaming'        * from wait_for_replay_catchup
                 FROM pg_catalog.pg_stat_replication
                 WHERE application_name IN ('node2', 'walreceiver')
2026-02-13 10:34:56.762 JST client backend[521767] 099_change_roles.pl LOG:  statement: INSERT INTO t VALUES (1000 * 1 + 2813, 'background activity');        * do INSERT from background activities
2026-02-13 10:34:56.764 JST client backend[521768] 099_change_roles.pl LOG:  statement: INSERT INTO t VALUES (1000 * 2 + 2629, 'background activity');
2026-02-13 10:34:56.765 JST postmaster[521622] LOG:  received immediate shutdown request        * from teardown_node
2026-02-13 10:34:56.767 JST postmaster[521622] LOG:  database system is shut down
2026-02-13 10:34:56.996 JST postmaster[521844] LOG:  starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-11.5.0, 64-bit
....
2026-02-13 10:34:57.124 JST walreceiver[521864] LOG:  replication terminated by primary server
2026-02-13 10:34:57.124 JST walreceiver[521864] DETAIL:  End of WAL reached on timeline 153 at 0/030ED6D8.
2026-02-13 10:34:57.124 JST startup[521855] LOG:  new timeline 154 forked off current database system timeline 153 before current recovery point 0/030ED750
2026-02-13 10:34:57.124 JST walreceiver[521864] LOG:  restarted WAL streaming at 0/03000000 on timeline 153
```

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Re: BUG: Former primary node might stuck when started as a standby

From:
Alexander Lakhin
Date:
Dear Kuroda-san,

13.02.2026 04:03, Hayato Kuroda (Fujitsu) wrote:
> Dear Alexander,
>
> I checked your test and reproduced the issue with it.
> Was it possible that INSERT happened in-between wait_for_replay_catchup and
> teardown_node? In this case we may not ensure WAL records generated in the time
> window were reached, right?
> Similar stuff won't happen in 009_twophase.pl because it does not have the bg activities.

From my old records, 009_twophase.pl failed exactly due to background
(namely, bgwriter's) activity.

I modified bgwriter.c to make the failure easier to reproduce:
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -67,7 +67,7 @@ int            BgWriterDelay = 200;
  * Interval in which standby snapshots are logged into the WAL stream, in
  * milliseconds.
  */
-#define LOG_SNAPSHOT_INTERVAL_MS 15000
+#define LOG_SNAPSHOT_INTERVAL_MS 1
 
 /*
  * LSN and timestamp at which we last issued a LogStandbySnapshot(), to avoid
@@ -306,7 +306,7 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
          */
         rc = WaitLatch(MyLatch,
                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
-                       BgWriterDelay /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
+                       1 /* ms */ , WAIT_EVENT_BGWRITER_MAIN);
 
         /*
          * If no latch event and BgBufferSync says nothing's happening, extend
@@ -339,6 +339,5 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len)
             StrategyNotifyBgWriter(-1);
         }
 
-        prev_hibernate = can_hibernate;
     }
 }

multiplied the test to increase the probability of the failure:
for i in {1..20}; do cp -r src/test/recovery/ src/test/recovery_$i/; sed "s|src/test/recovery|src/test/recovery_$i|" -i src/test/recovery_$i/Makefile; done

and executed it in a loop:
for i in {1..100}; do echo "ITERATION $i"; parallel --halt now,fail=1 -j20 --linebuffer --tag PROVE_TESTS="t/009*" NO_TEMP_INSTALL=1 timeout 60 make check -s -C src/test/recovery_{} ::: `seq 20` || break; done

It failed for me on iterations 27, 4, 22 as below:
ITERATION 22
...
18      t/009_twophase.pl .. ok
18      All tests successful.
18      Files=1, Tests=30, 12 wallclock secs ( 0.01 usr  0.01 sys +  0.26 cusr  0.58 csys =  0.86 CPU)
18      Result: PASS
5       make: *** wait: No child processes.  Stop.
5       make: *** Waiting for unfinished jobs....
5       make: *** wait: No child processes.  Stop.
parallel: This job failed:
PROVE_TESTS=t/009* NO_TEMP_INSTALL=1 timeout 60 make check -s -C src/test/recovery_5

src/test/recovery_5/tmp_check/log/009_twophase_london.log contains:
2026-02-13 21:03:28.248 EET [3987222] LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/3029190
...

(Without "timeout 60", the test just hangs — we can see the same in [1],
the test was killed with SIGTERM after 15000 seconds...)

[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=skink&dt=2026-02-04%2013%3A36%3A40&stg=recovery-check

Best regards,
Alexander

RE: BUG: Former primary node might stuck when started as a standby

From:
"Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander,

> From my old records, 009_twophase.pl failed exactly due to background
> (namely, bgwriter's) activity.

Okay, so I think there are two reasons why the test could fail.

1) The old primary shut down before all changes were replicated. This can be
avoided by adding wait_for_replay_catchup() before the teardown.
2) bgwriter on the old primary generated a RUNNING_XACTS record, and the node
shut down before sending it.

... and you mentioned case 2), right? I recalled that the injection point
"skip-log-running-xacts" can be used to suppress generating that WAL record;
see 035_standby_logical_decoding.pl. My idea is to attach the injection point
before the switchover and avoid adding the record.
Attached patch implements the idea.

How do you feel?

Best regards,
Hayato Kuroda
FUJITSU LIMITED


Attachments

Re: BUG: Former primary node might stuck when started as a standby

From:
Alexander Lakhin
Date:
Dear Kuroda-san,

16.02.2026 05:10, Hayato Kuroda (Fujitsu) wrote:
>> From my old records, 009_twophase.pl failed exactly due to background
>> (namely, bgwriter's) activity.
> Okay, so I think there are two reasons why the test could fail.
>
> 1) The old primary shut down before all changes were replicated. This can be
> avoided by adding wait_for_replay_catchup() before the teardown.
> 2) bgwriter on the old primary generated a RUNNING_XACTS record, and the node
> shut down before sending it.
>
> ... and you mentioned case 2), right? I recalled that the injection point
> "skip-log-running-xacts" can be used to suppress generating that WAL record;
> see 035_standby_logical_decoding.pl. My idea is to attach the injection point
> before the switchover and avoid adding the record.
> Attached patch implements the idea.
>
> How do you feel?

Unfortunately, the testing procedure I shared above still produces failures
with the patched 009_twophase.pl. With
wal_debug = on
in /tmp/temp.config, I can see:
...
ITERATION 3
...
11      t/009_twophase.pl .. ok
11      All tests successful.
11      Files=1, Tests=30, 10 wallclock secs ( 0.02 usr  0.00 sys +  0.27 cusr  0.63 csys =  0.92 CPU)
11      Result: PASS
20      make: *** [Makefile:28: check] Terminated
parallel: This job failed:
TEMP_CONFIG=/tmp/temp.config PROVE_TESTS=t/009* NO_TEMP_INSTALL=1 timeout 60 make check -s -C src/test/recovery_20

2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  statement: PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  INSERT @ 0/030227F8:  - Transaction/PREPARE: gid xact_009_10: 2026-02-17 07:06:44.313023+02; subxacts: 790
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  xlog flush request 0/030227F8; write 0/00000000; flush 0/00000000
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET background writer[754333] LOG:  INSERT @ 0/03022838:  - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790
2026-02-17 07:06:44.322 EET client backend[754949] 009_twophase.pl LOG:  statement: SELECT pg_current_wal_flush_lsn()
2026-02-17 07:06:44.330 EET client backend[754974] 009_twophase.pl LOG:  statement: SELECT '0/030227F8' <= replay_lsn AND state = 'streaming'
             FROM pg_catalog.pg_stat_replication
             WHERE application_name IN ('paris', 'walreceiver')
2026-02-17 07:06:44.337 EET postmaster[754306] LOG:  received immediate shutdown request
2026-02-17 07:06:44.339 EET postmaster[754306] LOG:  database system is shut down
...
2026-02-17 07:06:44.566 EET startup[755326] LOG:  REDO @ 0/03022618; LSN 0/03022650: prev 0/030225E0; xid 0; len 24 - Standby/RUNNING_XACTS: nextXid 789 latestCompletedXid 788 oldestRunningXid 789
2026-02-17 07:06:44.566 EET startup[755326] LOG:  REDO @ 0/03022650; LSN 0/030226A0: prev 0/03022618; xid 789; len 3; blkref #0: rel 1663/5/16384, blk 0 - Heap/INSERT: off: 21, flags: 0x00
2026-02-17 07:06:44.566 EET startup[755326] LOG:  REDO @ 0/030226A0; LSN 0/030226F0: prev 0/03022650; xid 790; len 3; blkref #0: rel 1663/5/16384, blk 0 - Heap/INSERT: off: 22, flags: 0x00
2026-02-17 07:06:44.566 EET startup[755326] LOG:  REDO @ 0/030226F0; LSN 0/030227F8: prev 0/030226A0; xid 789; len 232 - Transaction/PREPARE: gid xact_009_10: 2026-02-17 07:06:44.313023+02; subxacts: 790
2026-02-17 07:06:44.566 EET startup[755326] LOG:  REDO @ 0/030227F8; LSN 0/03022838: prev 0/030226F0; xid 0; len 32 - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790
2026-02-17 07:06:44.566 EET startup[755326] LOG:  invalid record length at 0/03022838: expected at least 24, got 0
2026-02-17 07:06:44.566 EET startup[755326] LOG:  consistent recovery state reached at 0/03022838
2026-02-17 07:06:44.566 EET postmaster[755283] LOG:  database system is ready to accept read-only connections
2026-02-17 07:06:44.569 EET walreceiver[755338] LOG:  fetching timeline history file for timeline 2 from primary server
2026-02-17 07:06:44.569 EET walreceiver[755338] LOG:  started streaming WAL from primary at 0/03000000 on timeline 1
2026-02-17 07:06:44.569 EET walreceiver[755338] LOG:  replication terminated by primary server
2026-02-17 07:06:44.569 EET walreceiver[755338] DETAIL:  End of WAL reached on timeline 1 at 0/030227F8.
2026-02-17 07:06:44.570 EET startup[755326] LOG:  new timeline 2 forked off current database system timeline 1 before current recovery point 0/03022838
...

Regarding the approach with "skip-log-running-xacts", I wonder if there is
a strong guarantee that no other WAL write can happen in the same window
(e.g., from autovacuum)?

Best regards,
Alexander

RE: BUG: Former primary node might stuck when started as a standby

From:
"Hayato Kuroda (Fujitsu)"
Date:
Dear Alexander,

> Unfortunately, the testing procedure I shared above still produces failures
> with the patched 009_twophase.pl.

Hmm, I ran the test for hours, but I could not reproduce the failure. But let
me analyze based on your log.

> With wal_debug = on
> in /tmp/temp.config, I can see:

> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  statement: PREPARE TRANSACTION 'xact_009_10';
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  INSERT @ 0/030227F8:  - Transaction/PREPARE: gid xact_009_10: 2026-02-17 07:06:44.313023+02; subxacts: 790
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  xlog flush request 0/030227F8; write 0/00000000; flush 0/00000000
> 2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
> 2026-02-17 07:06:44.313 EET background writer[754333] LOG:  INSERT @ 0/03022838:  - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790
> 2026-02-17 07:06:44.322 EET client backend[754949] 009_twophase.pl LOG:  statement: SELECT pg_current_wal_flush_lsn()
> 2026-02-17 07:06:44.330 EET client backend[754974] 009_twophase.pl LOG:  statement: SELECT '0/030227F8' <= replay_lsn AND state = 'streaming'
>             FROM pg_catalog.pg_stat_replication
>             WHERE application_name IN ('paris', 'walreceiver')
> 2026-02-17 07:06:44.337 EET postmaster[754306] LOG:  received immediate shutdown request
> 2026-02-17 07:06:44.339 EET postmaster[754306] LOG:  database system is shut down

I have little experience with reading wal_debug output, but the background
writer seems to generate the RUNNING_XACTS record. That's different from my
expectation. To confirm, did you really enable the injection point? For now
009_twophase can work without `-Dinjection_points=true`, but it should be set
to avoid random failures.

Another possibility is that bgwriter had already passed
`IS_INJECTION_POINT_ATTACHED("skip-log-running-xacts")` before the injection
point was attached. In that case the background writer would generate the
RUNNING_XACTS record at the bad timing. One possible workaround is to attach
the injection point a bit earlier; it does not solve the root cause, though.
Attached patch applies the band-aid.

> Regarding the approach with "skip-log-running-xacts", I wonder if there is
> a strong guarantee that no other WAL write can happen in the same window
> (e.g., from autovacuum)?

Good point, autovacuum and checkpoints should also be avoided. IIUC that can
be done by setting GUCs.

Best regards,
Hayato Kuroda
FUJITSU LIMITED

Вложения

Re: BUG: Former primary node might stuck when started as a standby

From:
Alexander Lakhin
Date:
Dear Kuroda-san,

19.02.2026 05:50, Hayato Kuroda (Fujitsu) wrote:
> Dear Alexander,
>
>> Unfortunately, the testing procedure I shared above still produces failures
>> with the patched 009_twophase.pl.
> Hmm, I ran the test for hours, but I could not reproduce the failure. But let me analyze
> based on your log.

Please look at the attached self-contained script. It works for me (it failed
on iterations 6, 12, 2 just now, on my workstation with a Ryzen 7900X);
perhaps you could adjust the number of parallel jobs to reproduce it on your
hardware.

> I have little experience with reading wal_debug output, but the background
> writer seems to generate the RUNNING_XACTS record. That's different from my
> expectation. To confirm, did you really enable the injection point? For now
> 009_twophase can work without `-Dinjection_points=true`, but it should be set
> to avoid random failures.

I think it failed before the injection point was set. My log contains:
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl LOG:  xlog flush request 0/030227F8; write 0/00000000; flush 0/00000000
2026-02-17 07:06:44.313 EET client backend[754908] 009_twophase.pl STATEMENT:  PREPARE TRANSACTION 'xact_009_10';
2026-02-17 07:06:44.313 EET background writer[754333] LOG:  INSERT @ 0/03022838:  - Standby/RUNNING_XACTS: nextXid 791 latestCompletedXid 788 oldestRunningXid 789; 1 xacts: 789; 1 subxacts: 790

As far as I can see, it corresponds to this place in the test:
      SAVEPOINT s1;
      INSERT INTO t_009_tbl VALUES (22, 'issued to ${cur_primary_name}');
      PREPARE TRANSACTION 'xact_009_10';");
+$cur_primary->wait_for_replay_catchup($cur_standby);
  $cur_primary->teardown_node;
  $cur_standby->promote;

And as we found out before, wait_for_replay_catchup() before teardown
doesn't help... I can't say for sure, but from my experiments, the test
didn't fail with $cur_primary->stop instead of $cur_primary->teardown_node.

Best regards,
Alexander
Attachments