Discussion: BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18

BUG #19093: Behavioral change in walreceiver termination between PostgreSQL 14.17 and 14.18

From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      19093
Logged by:          Ryan bird
Email address:      ryanzxg@gmail.com
PostgreSQL version: 14.18
Operating system:   Linux (CentOS 8)
Description:

Behavioral change in walreceiver termination between PostgreSQL 14.17 and
14.18 affects standby health detection

We observed a behavioral change in PostgreSQL 14.18 that affects how the
walreceiver process behaves when a standby reaches the end of WAL on a
diverged timeline.
This change impacts our monitoring service, which relies on
pg_stat_wal_receiver.status to detect unhealthy replication.
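
A minimal sketch of the kind of check we rely on (illustrative only; our
production query is not reproduced here, and the "unhealthy unless streaming"
rule is an assumption stated below):

  -- Hypothetical probe: treat anything other than 'streaming' as unhealthy.
  SELECT COALESCE(
           (SELECT status = 'streaming' FROM pg_stat_wal_receiver),
           false) AS standby_looks_healthy;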

Test Scenario
PostgreSQL cluster with two instances:
- Postgres_1 (Primary)
- Postgres_2 (Standby)
- Synchronous replication

Test Steps:
1. Restart the device hosting Postgres_1 (this takes time).
2. Failover occurs: Postgres_2 is promoted to primary.
3. Postgres_1 (old primary) comes back online and starts as a standby.
4. Postgres_1 (old primary): streaming replication begins from the new
primary, Postgres_2, but fails due to timeline divergence.

We observed a behavioral difference between PostgreSQL 14.17 and 14.18:
- In 14.17, when the standby reaches the end of the WAL on the old timeline
(before the new primary was promoted), the walreceiver process does not
terminate.
- In 14.18, under the same condition, the walreceiver process is terminated.
This leads to different results when querying: SELECT status FROM
pg_stat_wal_receiver;
Observed Behavior:
- PostgreSQL 14.17:
The status is typically 'waiting'. After reaching the end of the timeline, there
is a 5-second delay (wal_retrieve_retry_interval) before the next attempt.
During this interval, the status remains 'waiting'.
Our monitoring service detects this as an unhealthy standby and triggers
recovery using pg_rewind or pg_basebackup.
- PostgreSQL 14.18:
The status is typically 'streaming'. The walreceiver process is repeatedly
terminated and restarted without the 5-second delay seen in 14.17.
As a result, our monitoring service sees the standby as healthy ('streaming'),
even though it is unable to stream WAL correctly. This causes the cluster to
become unwritable under synchronous replication.

Log Comparison:
We’ve attached excerpts from both versions showing the difference in
walreceiver behavior when reaching the end of WAL on a diverged timeline.
PostgreSQL 14.17:
Oct 20 10:35:52 S2 postgres LOG:  replication terminated by primary server
Oct 20 10:35:52 S2 postgres DETAIL:  End of WAL reached on timeline 7 at
1/9A7C1020.
Oct 20 10:35:52 S2 postgres DEBUG:  sending write 1/9A7C1020 flush
1/9A806380 apply 1/9A806380
Oct 20 10:35:52 S2 postgres DEBUG:  walreceiver ended streaming and awaits
new instructions
Oct 20 10:35:52 S2 postgres LOG:  new timeline 9 forked off current database
system timeline 7 before current recovery point 1/9A806380
Oct 20 10:35:57 S2 postgres DEBUG:  skipping restartpoint, already performed
at 0/0
Oct 20 10:35:57 S2 postgres DEBUG:  switched WAL source from stream to
archive after failure
Oct 20 10:35:57 S2 postgres DEBUG:  invalid record length at 1/9A806380:
wanted 24, got 0
Oct 20 10:35:57 S2 postgres DEBUG:  switched WAL source from archive to
stream after failure
Oct 20 10:35:57 S2 postgres LOG:  restarted WAL streaming at 1/9A000000 on
timeline 7

PostgreSQL 14.18:
Oct 21 06:17:08 S2 postgres[90033]: [476-1] :[29287] LOG:  replication
terminated by primary server
Oct 21 06:17:08 S2 postgres[90033]: [476-2] :[29287] DETAIL:  End of WAL
reached on timeline 5 at 0/69970A80.
Oct 21 06:17:08 S2 postgres[90033]: [477-1] :[29287] DEBUG:  sending write
0/69970A80 flush 0/69970AF8 apply 0/69970AF8
Oct 21 06:17:08 S2 postgres[90033]: [478-1] :[29287] DEBUG:  walreceiver
ended streaming and awaits new instructions
Oct 21 06:17:08 S2 postgres[90033]: [479-1] :[29287] FATAL:  terminating
walreceiver process due to administrator command
Oct 21 06:17:08 S2 postgres[90033]: [480-1] :[29287] DEBUG:  shmem_exit(1):
1 before_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [481-1] :[29287] DEBUG:  shmem_exit(1):
7 on_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[85600]: [319-1] :[27280] LOG:  new timeline 7
forked off current database system timeline 5 before current recovery point
0/69970AF8
Oct 21 06:17:08 S2 postgres[85600]: [320-1] :[27280] DEBUG:  switched WAL
source from stream to archive after failure
Oct 21 06:17:08 S2 postgres[85600]: [321-1] :[27280] DEBUG:  invalid record
length at 0/69970AF8: wanted 24, got 0
Oct 21 06:17:08 S2 postgres[85600]: [322-1] :[27280] DEBUG:  switched WAL
source from archive to stream after failure
Oct 21 06:17:08 S2 postgres[90033]: [482-1] :[29287] DEBUG:  proc_exit(1): 2
callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [483-1] :[29287] DEBUG:  exit(1)
Oct 21 06:17:08 S2 postgres[90033]: [484-1] :[29287] DEBUG:  shmem_exit(-1):
0 before_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [485-1] :[29287] DEBUG:  shmem_exit(-1):
0 on_shmem_exit callbacks to make
Oct 21 06:17:08 S2 postgres[90033]: [486-1] :[29287] DEBUG:  proc_exit(-1):
0 callbacks to make
Oct 21 06:17:08 S2 postgres[85555]: [322-1] :[27235] DEBUG:  reaping dead
processes
Oct 21 06:17:08 S2 postgres[90837]: [323-1] :[29637] DEBUG:
find_in_dynamic_libpath: trying
"/usr/lib/postgresql14/lib64/libpqwalreceiver"
Oct 21 06:17:08 S2 postgres[90837]: [324-1] :[29637] DEBUG:
find_in_dynamic_libpath: trying
"/usr/lib/postgresql14/lib64/libpqwalreceiver.so"

Root Cause Analysis:
We traced the change to the following commit:
Reset InstallXLogFileSegmentActive after walreceiver self-initiated exit.
https://github.com/postgres/postgres/commit/3635a0a35aafd3bfa80b7a809bc6e91ccd36606a
This commit was introduced to fix a regression reported by Noah Misch
(commit 1f95181), which caused spurious failures in archive recovery.
In PostgreSQL 14.17, the shutdown logic was conditional:
if (WalRcvStreaming())
    XLogShutdownWalRcv();
In PostgreSQL 14.18, the condition was removed:
XLogShutdownWalRcv();
As a result, the walreceiver process is terminated even when it is in a
waiting state. In 14.17, the condition would prevent termination in this
scenario.
The intention was not to terminate the walreceiver process, but rather to
reset the InstallXLogFileSegmentActive flag, which was previously only
cleared inside XLogShutdownWalRcv().
As noted in the commit discussion:
The problem is that whenever we are going for streaming we always set
XLogCtl->InstallXLogFileSegmentActive to true, but while switching
from streaming to archive we do not always reset it so it hits
assertion in some cases. Basically we reset it inside
XLogShutdownWalRcv() but while switching from the streaming mode we
only call it conditionally if WalRcvStreaming(). But it is very much
possible that even before we call WalRcvStreaming() the walreceiver
might have set walrcv->walRcvState to WALRCV_STOPPED. So now
WalRcvStreaming() will return false. So I agree now we do not want to
really shut down the walreceiver but who will reset the flag

This change inadvertently causes walreceiver to terminate in scenarios where
it previously remained alive, altering the observable behavior of streaming
replication.

Questions
1. Is terminating the walreceiver process in this scenario (end of WAL on a
diverged timeline) the expected behavior in 14.18 or later release?
2. Is it expected that pg_stat_wal_receiver.status = 'streaming' may not
accurately reflect streaming health in this case?
3. Would it be safer to decouple flag-reset logic from XLogShutdownWalRcv()
to avoid unintended process termination?

This change impacts monitoring logic and cluster availability under
synchronous replication. We’d appreciate clarification and guidance on best
practices for detecting unhealthy standby nodes in this context.


Hi,

Thanks for reporting this issue!

On Fri, Oct 24, 2025 at 2:43 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
> This change inadvertently causes walreceiver to terminate in scenarios where
> it previously remained alive, altering the observable behavior of streaming
> replication.

Thanks for your analysis here.

> Questions
> 1. Is terminating the walreceiver process in this scenario (end of WAL on a
> diverged timeline) the expected behavior in 14.18 or later release?

I think this is not the expected behavior. It is likely a bug
introduced in commit 3635a0a.

> 2. Is it expected that pg_stat_wal_receiver.status = 'streaming' may not
> accurately reflect streaming health in this case?

IMO, the expected behavior is for the walreceiver to remain in
WALRCV_WAITING state with status 'waiting', clearly indicating that
replication has stalled due to timeline issues.

> 3. Would it be safer to decouple flag-reset logic from XLogShutdownWalRcv()
> to avoid unintended process termination?

+1 for decoupling here. Please see the attached patch — I’ve tested it
on HEAD, and the walreceiver process is not being terminated.

Best,
Xuneng

Attachments
Hi Xuneng,

Thank you very much for your reply.
I’d like to ask about the fix plan for this issue. Will there be any releases after 14.19, and if so, is this issue expected to be fixed in the 14 branch? Also, will it be addressed in version 16?

Best regards,
Xingguo 
Hi xingguo,

On Fri, Oct 24, 2025 at 4:33 PM xingguo <ryanzxg@gmail.com> wrote:
>
> Hi Xuneng,
>
> Thank you very much for your reply.
> I’d like to ask about the fix plan for this issue. Will there be any releases after 14.19, and if so, is this issue
> expected to be fixed in the 14 branch? Also, will it be addressed in version 16?

It seems we may need to investigate this further. The earlier patch
appears similar to the first version of Dilip’s patch in [1], but the
community ultimately favored the current unconditional shutdown
approach.

If this turns out to be a genuine bug and the proposed fix is correct,
it may need to be backpatched to 14.18. In that case, since Heikki
split xlog.c into xlog.c and xlogrecovery.c in commit 70e8186, we
would likely need to prepare at least two patch versions as well.

[1]
https://www.postgresql.org/message-id/flat/CAFiTN-sE3ry%3DycMPVtC%2BDjw4Fd7gbUGVv_qqw6qfzp%3DJLvqT3g%40mail.gmail.com




On Fri, Oct 24, 2025, at 5:33 AM, xingguo wrote:
> I’d like to ask about the fix plan for this issue. Will there be any
> releases after 14.19, and if so, is this issue expected to be fixed in
> the 14 branch? Also, will it be addressed in version 16?
>

I don't have an opinion about this behavior change. Maybe Noah can comment on
it. Until we have an agreement that this is a regression, there is no fix plan.
Regarding v14, it will be supported for another year [1]. If the regression was
backpatched to old stable branches, it should be fixed there as well. Sometimes we
need a totally different patch for back branches. There are cases where the
regression is not triggered in old branches and the committer decides it is not
worth applying the fix there. Sometimes the fix is invasive (for example, it
requires other commits to be backported) and the bug is rare enough that we
leave it alone.

In summary, stable branches usually don't change behavior (unless we
consider the existing behavior totally broken).


[1] https://www.postgresql.org/support/versioning/


--
Euler Taveira
EDB   https://www.enterprisedb.com/



On Fri, Oct 24, 2025 at 02:20:39PM +0800, Xuneng Zhou wrote:
> Thanks for reporting this issue!

Yes, thank you for the detailed report.

> On Fri, Oct 24, 2025 at 2:43 AM PG Bug reporting form <noreply@postgresql.org> wrote:
> > 1. Is terminating the walreceiver process in this scenario (end of WAL on a
> > diverged timeline) the expected behavior in 14.18 or later release?
> 
> I think this is not the expected behavior. It is likely a bug
> introduced in commit 3635a0a.

Right.

> > 2. Is it expected that pg_stat_wal_receiver.status = 'streaming' may not
> > accurately reflect streaming health in this case?

Yes.  Even before the regression from commit 3635a0a, status='streaming' was
not a reliable indicator of health.  walreceiver sets that status during early
startup, before it has attempted a connection.

Perhaps a better health check for your application would be one of these:
- If you have multiple replicas with quorum, track per-replica lag.  If a
  replica isn't advancing, it has a problem regardless of walreceiver status.
- If you have just one replica, issue a commit from the primary and see if it
  hangs.
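
For example, a per-replica lag check run on the primary could look like this
sketch (assumption: each standby shows up as a row in pg_stat_replication; a
lag that only grows, or a missing row, signals a problem regardless of the
walreceiver's reported status):

  SELECT application_name,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
  FROM pg_stat_replication;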

(Long-term, in master only, perhaps we should introduce another status like
'connecting'.  Perhaps enact the connecting->streaming status transition just
before tendering the first byte of streamed WAL to the startup process.
Alternatively, enact that transition when the startup process accepts the
first streamed byte.  Then your application's health check would get what it
wants.)

> IMO, the expected behavior is for the walreceiver to remain in
> WALRCV_WAITING state with status 'waiting', clearly indicating that
> replication has stalled due to timeline issues.

Right.  While status='streaming' doesn't mean healthy in general, seeing a lot
of 'streaming' is still not expected for this particular scenario.
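
As a rough sketch of how to observe that churn from a monitoring script:
sample pg_stat_wal_receiver periodically and compare the pid across samples; a
pid that keeps changing means the walreceiver is being restarted even while
status reads 'streaming'.

  -- Sample periodically; a changing pid between samples indicates restarts.
  SELECT pid, status FROM pg_stat_wal_receiver;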

> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -3687,8 +3687,18 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
>                       * Before we leave XLOG_FROM_STREAM state, make sure that
>                       * walreceiver is not active, so that it won't overwrite
>                       * WAL that we restore from archive.
> +                     *
> +                     * If walreceiver is actively streaming (or attempting to
> +                     * connect), we must shut it down. However, if it's already
> +                     * in WAITING state (e.g., due to timeline divergence), we
> +                     * only need to reset the install flag to allow archive
> +                     * restoration, while keeping the process alive for
> +                     * monitoring visibility.
>                       */
> -                    XLogShutdownWalRcv();
> +                    if (WalRcvStreaming())
> +                        XLogShutdownWalRcv();
> +                    else
> +                        ResetInstallXLogFileSegmentActive();

I think this will do the right thing for all reachable cases.  In
WALRCV_WAITING, the walreceiver won't install segments until the startup
process moves the walreceiver to WALRCV_RESTARTING.  There's no reason to take
the further step of terminating the walreceiver.  The discussion before commit
3635a0a wasn't convinced status='waiting' was reachable at this location, but
this thread shows it is reachable.

WalRcvStreaming() return values map to WalRcvState values as follows:

  true: STREAMING STARTING RESTARTING
  false: STOPPED WAITING STOPPING

This change would be wrong if WALRCV_STOPPING were a reachable state here.
That state is the startup process asking walreceiver to stop.  walreceiver may
then still be installing segments, so this location would want to call
XLogShutdownWalRcv() to wait for WALRCV_STOPPED.  That said, WALRCV_STOPPING
is a transient state while the startup process is in ShutdownWalRcv().  Hence,
I expect STOPPING never appears here, and there's no bug.  An assertion may be
in order.

Can you add a TAP test for this case?  Since it was wrong in v15+ for >3y and
wrong in v14 for 5mon before this report, clearly we had a blind spot.

postgr.es/m/YyACvP++zgDphlcm@paquier.xyz discusses a
"standby.signal+primary_conninfo" case.  How will this patch interact with
that case?

Thanks,
nm



On Sun, Oct 26, 2025 at 09:12:41PM -0700, Noah Misch wrote:
> On Fri, Oct 24, 2025 at 02:20:39PM +0800, Xuneng Zhou wrote:
> (Long-term, in master only, perhaps we should introduce another status like
> 'connecting'.  Perhaps enact the connecting->streaming status transition just
> before tendering the first byte of streamed WAL to the startup process.
> Alternatively, enact that transition when the startup process accepts the
> first streamed byte.  Then your application's health check would get what it
> wants.)

Having a "connecting" status here would make sense to me, yep.

> This change would be wrong if WALRCV_STOPPING were a reachable state here.
> That state is the startup process asking walreceiver to stop.  walreceiver may
> then still be installing segments, so this location would want to call
> XLogShutdownWalRcv() to wait for WALRCV_STOPPED.  That said, WALRCV_STOPPING
> is a transient state while the startup process is in ShutdownWalRcv().  Hence,
> I expect STOPPING never appears here, and there's no bug.  An assertion may be
> in order.

A WAL receiver in STOPPING state should never be reachable in
this code path, as far as I recall.  An assertion sounds
like cheap insurance anyway.

> Can you add a TAP test for this case?  Since it was wrong in v15+ for >3y and
> wrong in v14 for 5mon before this report, clearly we had a blind spot.

Hmm.  I think that we could tweak the recovery test 004_timeline_switch.pl,
which covers TLI switch scenarios, which is what this is about.  And
there is no need to rely on a failover of a previous primary to a
promoted standby; we could do the same with a cascading standby.

Based on the logs I can see at 14.18:
2025-10-29 11:09:36.964 JST walreceiver[711] LOG:  replication
terminated by primary server
2025-10-29 11:09:36.964 JST walreceiver[711] DETAIL:  End of WAL
reached on timeline 1 at 0/03024D18.
2025-10-29 11:09:36.964 JST walreceiver[711] FATAL:  terminating
walreceiver process due to administrator command
2025-10-29 11:09:36.964 JST startup[710] LOG:  new target timeline is 2

And on older versions, at 14.17:
2025-10-29 11:14:32.745 JST walreceiver[5857] LOG:  replication
terminated by primary server
2025-10-29 11:14:32.745 JST walreceiver[5857] DETAIL:  End of WAL
reached on timeline 1 at 0/03024D18.
2025-10-29 11:14:32.745 JST startup[5856] LOG:  new target timeline
is 2

I was thinking about two options to provide some coverage:
- Check that the PID of the WAL receiver is still the same before and
after the TLI switch, with a restart or a reload after changing
primary_conninfo.  This cannot be made stable without an injection
point or equivalent to stop the WAL receiver from doing any work before
it does the TLI jump and triggers the code path we are discussing here.
- Check in the logs that we never issue a "terminating walreceiver
process due to administrator command".  This would be simpler in the
long term.

Hence I would suggest just the following addition:
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -66,6 +66,11 @@ my $result =
   $node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
 is($result, qq(2000), 'check content of standby 2');

+# Check the logs, WAL receiver should not have been stopped.  There is no need
+# to rely on a position in the logs: a new log file is used on node restart.
+ok( !$node_standby_2->log_contains(
+  "FATAL: .* terminating walreceiver process due to administrator command"),
+  'WAL receiver should not be stopped across timeline jumps');

 # Ensure that a standby is able to follow a primary on a newer timeline
 # when WAL archiving is enabled.

Note: the issue is also reachable on REL_13_STABLE, through
69a498eb6465.  Are you confident that we would be able to get
something into this branch by the next minor release?  Based on the
state of the problem and the advanced analysis, it sounds to me that
we should be able to conclude by the middle of next week, or something
close to that.

> postgr.es/m/YyACvP++zgDphlcm@paquier.xyz discusses a
> "standby.signal+primary_conninfo" case.  How will this patch interact with
> that case?

This does not sound like an issue to me?  This mentions a case where
we don't have a restore_command and where changes are pushed to the
local pg_wal/ by an external source.  If a WAL receiver is waiting,
the patch means that ResetInstallXLogFileSegmentActive() would be
called and the InstallXLogFileSegmentActive flag reset.  I don't see
why that's not OK.  Or are you foreseeing something I don't?
--
Michael

Attachments
Hi Michael, Noah,

Thanks for looking into this.

On Wed, Oct 29, 2025 at 1:01 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sun, Oct 26, 2025 at 09:12:41PM -0700, Noah Misch wrote:
> > On Fri, Oct 24, 2025 at 02:20:39PM +0800, Xuneng Zhou wrote:
> > (Long-term, in master only, perhaps we should introduce another status like
> > 'connecting'.  Perhaps enact the connecting->streaming status transition just
> > before tendering the first byte of streamed WAL to the startup process.
> > Alternatively, enact that transition when the startup process accepts the
> > first streamed byte.  Then your application's health check would get what it
> > wants.)
>
> Having a "connecting" status here would make sense to me, yep.

+1, I am also interested in hacking on this.

> > This change would be wrong if WALRCV_STOPPING were a reachable state here.
> > That state is the startup process asking walreceiver to stop.  walreceiver may
> > then still be installing segments, so this location would want to call
> > XLogShutdownWalRcv() to wait for WALRCV_STOPPED.  That said, WALRCV_STOPPING
> > is a transient state while the startup process is in ShutdownWalRcv().  Hence,
> > I expect STOPPING never appears here, and there's no bug.  An assertion may be
> > in order.
>
> A WAL receiver marked as in STOPPING state should never be reached in
> this code path as far as I recall the matter.  An assertion sounds
> like a cheap insurance anyway.

Should we wrap this assertion within #ifdef USE_ASSERT_CHECKING?

#ifdef USE_ASSERT_CHECKING
{
    WalRcvData *walrcv = WalRcv;
    WalRcvState state;

    SpinLockAcquire(&walrcv->mutex);
    state = walrcv->walRcvState;
    SpinLockRelease(&walrcv->mutex);
    Assert(state != WALRCV_STOPPING);
}
#endif

> > Can you add a TAP test for this case?  Since it was wrong in v15+ for >3y and
> > wrong in v14 for 5mon before this report, clearly we had a blind spot.
>
> Hmm.  I think that we could tweak the recovery 004_timeline_switch.pl
> that has scenarios for TLI switches, which is what this is about.  And
> there is no need to rely on a failover of a previous primary to a
> promoted standby, we could do the same with a cascading standby.
>
> Based on the logs I can see at 14.18:
> 2025-10-29 11:09:36.964 JST walreceiver[711] LOG:  replication
> terminated by primary server
> 2025-10-29 11:09:36.964 JST walreceiver[711] DETAIL:  End of WAL
> reached on timeline 1 at 0/03024D18.
> 2025-10-29 11:09:36.964 JST walreceiver[711] FATAL:  terminating
> walreceiver process due to administrator command
> 2025-10-29 11:09:36.964 JST startup[710] LOG:  new target timeline is 2
>
> And on older versions, at 14.17:
> 2025-10-29 11:14:32.745 JST walreceiver[5857] LOG:  replication
> terminated by primary server
> 2025-10-29 11:14:32.745 JST walreceiver[5857] DETAIL:  End of WAL
> reached on timeline 1 at 0/03024D18.
> 2025-10-29 11:14:32.745 JST startup[5856] LOG:  new target timeline
> is 2
>
> I was thinking about two options to provide some coverage:
> - Check that the PID of the WAL receiver is still the same before and
> after the TLI switch, with a restart or a reload after changing
> primary_conninfo.  This cannot be made stable without an injection
> point or equivalent to stop the WAL receiver from doing any job before
> it does the TLI jump and trigger the code path we are discussing here.
> - Check for the logs that we never issue a "terminating walreceiver
> process due to administrator command".  This would be simpler in the
> long term.
>
> Hence I would suggest just the following addition:
> --- a/src/test/recovery/t/004_timeline_switch.pl
> +++ b/src/test/recovery/t/004_timeline_switch.pl
> @@ -66,6 +66,11 @@ my $result =
>    $node_standby_2->safe_psql('postgres', "SELECT count(*) FROM tab_int");
>  is($result, qq(2000), 'check content of standby 2');
>
> +# Check the logs, WAL receiver should not have been stopped.  There is no need
> +# to rely on a position in the logs: a new log file is used on node restart.
> +ok( !$node_standby_2->log_contains(
> +  "FATAL: .* terminating walreceiver process due to administrator command"),
> +  'WAL receiver should not be stopped across timeline jumps');
>
>  # Ensure that a standby is able to follow a primary on a newer timeline
>  # when WAL archiving is enabled.

I attempted to write a TAP test to reproduce this scenario. Since I’m
not very familiar with the recovery-related TAP framework, it was
somewhat difficult to replicate the exact conditions of the issue.
Here are the logs from the attached TAP test, run against both HEAD and
the patched version:

Patched
2025-10-29 13:24:18.016 CST walreceiver[4244] LOG:  replication
terminated by primary server
2025-10-29 13:24:18.016 CST walreceiver[4244] DETAIL:  End of WAL
reached on timeline 1 at 0/030000A0.
2025-10-29 13:24:18.016 CST startup[4243] DEBUG:  transaction ID wrap
limit is 2147484395, limited by database with OID 1
2025-10-29 13:24:18.016 CST startup[4243] CONTEXT:  WAL redo at
0/03000028 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/03000028; tli 1; prev
tli 1; fpw true; wal_level replica; xid 0:758; oid 16389; multi 1;
offset 0; oldest xid 748 in DB 1; oldest multi 1 in DB 1;
oldest/newest commit timestamp xid: 0/0; oldest running xid 0;
shutdown
2025-10-29 13:24:18.016 CST walreceiver[4244] DEBUG:  walreceiver
ended streaming and awaits new instructions
2025-10-29 13:24:18.016 CST startup[4243] LOG:  new target timeline is 2

Head
2025-10-29 13:29:03.697 CST walreceiver[7140] LOG:  replication
terminated by primary server
2025-10-29 13:29:03.697 CST walreceiver[7140] DETAIL:  End of WAL
reached on timeline 1 at 0/030000A0.
2025-10-29 13:29:03.697 CST startup[7139] DEBUG:  transaction ID wrap
limit is 2147484395, limited by database with OID 1
2025-10-29 13:29:03.697 CST startup[7139] CONTEXT:  WAL redo at
0/03000028 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/03000028; tli 1; prev
tli 1; fpw true; wal_level replica; xid 0:758; oid 16389; multi 1;
offset 0; oldest xid 748 in DB 1; oldest multi 1 in DB 1;
oldest/newest commit timestamp xid: 0/0; oldest running xid 0;
shutdown
2025-10-29 13:29:03.697 CST walreceiver[7140] DEBUG:  walreceiver
ended streaming and awaits new instructions
2025-10-29 13:29:03.697 CST walreceiver[7140] FATAL:  terminating
walreceiver process due to administrator command
2025-10-29 13:29:03.697 CST startup[7139] LOG:  new target timeline is 2

With the prior patch applied, the walreceiver process remains alive as
expected. However, in the unpatched (HEAD) version, I didn’t observe
the repeated restart behavior shown in the original bug report logs. I
guess that to reproduce it precisely, we need to reach a state similar
to the following log entry:

Oct 20 10:35:52 S2 postgres LOG:  new timeline 9 forked off current
database system timeline 7 before current recovery point 1/9A806380

At that point, the sleep should occur, allowing us to monitor the
walreceiver’s “waiting” state and verify process stability by checking
whether the PID remains consistent.

However, I think Michael’s approach is better — it’s simpler, more
straightforward, and captures the buggy behavior well. I plan to
incorporate it into the next version of the patch.

> Note: the issue is also reachable on REL_13_STABLE, through
> 69a498eb6465.  Are you confident that we would be able to get
> something into this branch by the next minor release?  Based on the
> state of the problem and the advanced analysis, it sounds to me that
> we should be able to conclude by the middle of next week, or something
> close to that.
>
> > postgr.es/m/YyACvP++zgDphlcm@paquier.xyz discusses a
> > "standby.signal+primary_conninfo" case.  How will this patch interact with
> > that case?
>
> This does not sound like an issue to me?  This mentions a case where
> we don't have a restore_command and where changes are pushed to the
> local pg_wal/ by an external source.  If a WAL receiver is waiting,
> the patch means that ResetInstallXLogFileSegmentActive() would be
> called and the InstallXLogFileSegmentActive flag reset.  I don't see
> why that's not OK.  Or are you foreseeing something I don't?

I agree that the patch would handle that case correctly as well.
When the walreceiver stops streaming—regardless of the
reason—ResetInstallXLogFileSegmentActive() is invoked, which clears
the InstallXLogFileSegmentActive flag.
This ensures that the startup process can safely read WAL files from
pg_wal/ without conflicts, allowing the assertion
Assert(!IsInstallXLogFileSegmentActive()) in the archive recovery code
to pass as expected.

Best,
Xuneng

Attachments
On Wed, Oct 29, 2025 at 02:36:10PM +0800, Xuneng Zhou wrote:
> Should we wrap this assertion within #ifdef USE_ASSERT_CHECKING?
>
> #ifdef USE_ASSERT_CHECKING
> {
>     WalRcvData *walrcv = WalRcv;
>     WalRcvState state;
>
>     SpinLockAcquire(&walrcv->mutex);
>     state = walrcv->walRcvState;
>     SpinLockRelease(&walrcv->mutex);
>     Assert(state != WALRCV_STOPPING);
> }
> #endif

Yep.  Or provide a wrapper routine that's able to retrieve the state
of a WAL receiver, then add it in xlogrecovery.c within an Assert().
It may be better to only do that on HEAD; the back-branches stress me
a bit when we enforce such new assumptions.  Perhaps Noah has a
different opinion.

> However, I think Michael’s approach is better — it’s simpler, more
> straightforward, and captures the buggy behavior well. I plan to
> incorporate it into the next version of the patch.

Thanks!

One weakness of this approach is that the test would break if the
error message is reworded for some reason.  On stable branches, I
doubt that this error message will ever change, but it could be
updated to something else on HEAD.

Hence, I would suggest one extra improvement for HEAD: let's also add
a check for a valid case where a WAL receiver is stopped.  We have one
in 040_standby_failover_slots_sync.pl, where primary_conninfo is
updated in standby1 and its configuration reloaded, so I'd be tempted
to add a valid check in the server logs of $standby1.  One difference
is that we should grab the offset of the logs before the reload, then
make sure that the log entry gets created with a wait_for_log(), as
there may be a delay between the reload and the moment the log entry
is generated.
--
Michael

Attachments
Hi,

On Thu, Oct 30, 2025 at 7:36 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Oct 29, 2025 at 02:36:10PM +0800, Xuneng Zhou wrote:
> > Should we wrap this assertion within #ifdef USE_ASSERT_CHECKING?
> >
> > #ifdef USE_ASSERT_CHECKING
> > {
> >     WalRcvData *walrcv = WalRcv;
> >     WalRcvState state;
> >
> >     SpinLockAcquire(&walrcv->mutex);
> >     state = walrcv->walRcvState;
> >     SpinLockRelease(&walrcv->mutex);
> >     Assert(state != WALRCV_STOPPING);
> > }
> > #endif
>
> Yep.  Or provide a wrapper routine that's able to retrieve the state
> of a WAL receiver, then add it in xlogrecovery.c within an Assert().
> It may be better to only do that on HEAD, the back-branches stress me
> a bit when we enforce such new assumptions..  Perhaps Noah has a
> different opinion.

Using a wrapper function seems better here.

> > However, I think Michael’s approach is better — it’s simpler, more
> > straightforward, and captures the buggy behavior well. I plan to
> > incorporate it into the next version of the patch.
>
> Thanks!
>
> One weakness of this approach is that this test would not be able to
> work if the error message is reworded for some reason.  On stable
> branches, I doubt that this error message will ever change, but it
> could be updated to something else on HEAD.
>
> Hence, I would suggest one extra improvement for HEAD: let's also add
> a check for a valid case where a WAL receiver is stopped.  We have one
> in 040_standby_failover_slots_sync.pl, where primary_conninfo is
> updated in standby1 and its configuration reloaded, so I'd be tempted
> to add a valid check in the server logs of $standby1.  One difference
> is that we should grab the offset of the logs before the reload, then
> make sure that the log entry gets created with a wait_for_log(), as
> there may be a delay between the reload and the moment the log entry
> is generated.

Thanks for your suggestion!

I’ve incorporated both tests and the assertion into the patch targeting master.
For earlier stable branches, the test in
040_standby_failover_slots_sync.pl and the assertion for the
WAL receiver state check are not included (if Noah agrees too). Please check.

Best,
Xuneng

Attachments
On Fri, Oct 31, 2025 at 12:07:01AM +0800, Xuneng Zhou wrote:
> Using a wrapper function seems better here.

For a cleaner git history, the wrapper function and the assertion
should be split out into their own commit, with the actual fix applied
first.  Just an opinion.

> I’ve incorporated both tests and the assertion into the patch applied to master.
> For earlier stable branches, the test in
> 040_standby_failover_slots_sync.pl and the assertion for
> WAL receiver state checking are not included (if Noah agrees too). Please check.

Thanks for doing that.  I have extracted the test to check the "valid"
WAL receiver shutdown case, and applied that separately.

Noah, what do you think?
--
Michael

Attachments
Hi,

On Fri, Oct 31, 2025 at 11:16 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Oct 31, 2025 at 12:07:01AM +0800, Xuneng Zhou wrote:
> > Using a wrapper function seems better here.
>
> For a cleaner git history, the wrapper function and the assertion
> should be split as its own commit, with the actual fix applied first.
> Just an opinion.
>

It makes sense to me. I've split the HEAD patch into a set of three
patches.  Please see the attached files.

> > I’ve incorporated both tests and the assertion into the patch applied to master.
> > For earlier stable branches, the test in
> > 040_standby_failover_slots_sync.pl and the assertion for
> > WAL receiver state checking are not included (if Noah agrees too). Please check.
>
> Thanks for doing that.  I have extracted the test to check the "valid"
> WAL receiver shutdown case, and applied that separately.

Thanks for checking and pushing this.

Best,
Xuneng

Attachments
On Fri, Oct 31, 2025 at 06:50:06PM +0800, Xuneng Zhou wrote:
> On Fri, Oct 31, 2025 at 11:16 AM Michael Paquier <michael@paquier.xyz> wrote:
>>
>> On Fri, Oct 31, 2025 at 12:07:01AM +0800, Xuneng Zhou wrote:
>> > Using a wrapper function seems better here.
>>
>> For a cleaner git history, the wrapper function and the assertion
>> should be split as its own commit, with the actual fix applied first.
>> Just an opinion.
>
> It makes sense to me. I've split the patch for HEAD into three patch
> sets.  Please see the attached files.

Thanks, that looks sensible.  I'll revisit what you have at the
beginning of next week (local Tuesday) with a backpatch down to v13 in
mind.  If others have comments and/or objections, please feel free to
chime in.

Note that the changes in 040_standby_failover_slots_sync.pl are not
needed in v3-0001.  No need to send an updated patch just for that.
--
Michael

Attachments
Hi,

On Sat, Nov 1, 2025 at 8:17 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Oct 31, 2025 at 06:50:06PM +0800, Xuneng Zhou wrote:
> > On Fri, Oct 31, 2025 at 11:16 AM Michael Paquier <michael@paquier.xyz> wrote:
> >>
> >> On Fri, Oct 31, 2025 at 12:07:01AM +0800, Xuneng Zhou wrote:
> >> > Using a wrapper function seems better here.
> >>
> >> For a cleaner git history, the wrapper function and the assertion
> >> should be split as its own commit, with the actual fix applied first.
> >> Just an opinion.
> >
> > It makes sense to me. I've split the patch for HEAD into three patch
> > sets.  Please see the attached files.
>
> Thanks, that looks sensible.  I'll revisit what you have at the
> beginning of next week (local Tuesday) with a backpatch down to v13 in
> mind.  If others have comments and/or objections, please feel free to
> chime in.
>
> Note that the changes in 040_standby_failover_slots_sync.pl are not
> needed in v3-0001.  No need to send an updated patch just for that.

It's an oversight. Thanks for catching it.

Best,
Xuneng



On Sat, Nov 01, 2025 at 08:44:55AM +0800, Xuneng Zhou wrote:
> On Sat, Nov 1, 2025 at 8:17 AM Michael Paquier <michael@paquier.xyz> wrote:
>> Thanks, that looks sensible.  I'll revisit what you have at the
>> beginning of next week (local Tuesday) with a backpatch down to v13 in
>> mind.  If others have comments and/or objections, please feel free to
>> chime in.
>
> It's an oversight. Thanks for catching it.

Phew, done.  While looking at the whole thing, I was wondering if we
should strengthen a bit the expectations on the calling context of some
of the WAL routines, like XLogShutdownWalRcv(), and I ended up with the
bonus patch attached.  What do you think?
--
Michael

Attachments
Hi, Michael!

On Tue, Nov 4, 2025 at 12:52 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sat, Nov 01, 2025 at 08:44:55AM +0800, Xuneng Zhou wrote:
> > On Sat, Nov 1, 2025 at 8:17 AM Michael Paquier <michael@paquier.xyz> wrote:
> >> Thanks, that looks sensible.  I'll revisit what you have at the
> >> beginning of next week (local Tuesday) with a backpatch down to v13 in
> >> mind.  If others have comments and/or objections, please feel free to
> >> chime in.
> >
> > It's an oversight. Thanks for catching it.
>
> Phew, done.

Thanks for pushing the patches!

> While looking at the whole thing, I was wondering if we
> should strengthen a little bit what's expected of the context for some
> of the callers of the WAL routines, like XLogShutdownWalRcv(), and
> finished with the bonus patch attached.  What do you think?

The change LGTM — it ensures that the walreceiver’s lifecycle is
orchestrated entirely by the startup process.

Best,
Xuneng



On Tue, Nov 04, 2025 at 05:31:24PM +0800, xingguo wrote:
> I’d like to ask whether this fix will be included in the patch release for
> version 14 - specifically 14.20. I noticed the commit mentions
> Backpatch-through: 13, and I want to confirm whether that means the fix
> will be backpatched to version 13 and all subsequent supported versions.

This has been applied across all the stable branches still supported
by the community, down to REL_13_STABLE, so it will be included in next
week's set of minor releases.
--
Michael

Attachments