Thread: BUG: Cascading standby fails to reconnect after falling back to archive recovery
BUG: Cascading standby fails to reconnect after falling back to archive recovery
From:
Marco Nenciarini
Date:
Hi hackers,
I've encountered a bug in PostgreSQL's streaming replication where cascading
standbys fail to reconnect after falling back to archive recovery. The issue
occurs when the upstream standby uses archive-only recovery.
The standby requests streaming from the wrong WAL position (next segment boundary
instead of the current position), causing connection failures with this error:
ERROR: requested starting point 0/A000000 is ahead of the WAL flush
position of this server 0/9000000
Attached are two shell scripts that reliably reproduce the issue on PostgreSQL
17.x and 18.x:
1. reproducer_restart_upstream_portable.sh - triggers by restarting upstream
2. reproducer_cascade_restart_portable.sh - triggers by restarting the cascade
The scripts set up this topology:
- Primary with archiving enabled
- Standby using only archive recovery (no streaming from primary)
- Cascading standby streaming from the archive-only standby
When the cascade loses its streaming connection and falls back to archive recovery,
it cannot reconnect. The issue appears to be in xlogrecovery.c around line 3880,
where the position passed to RequestXLogStreaming() determines which segment
boundary is requested.
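To make the mismatch concrete, here is a small standalone C program. This is not
PostgreSQL source code; it only mimics the start-at-segment-boundary behavior of
RequestXLogStreaming(), assuming the default 16 MB wal_segment_size and the LSNs
from the error above:

--------------------------------------------
#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* default wal_segment_size */
#define WAL_SEGMENT_SIZE  (16 * 1024 * 1024)

/* byte offset of an LSN within its segment (segment size is a power of two) */
#define SEG_OFFSET(ptr)   ((ptr) & (WAL_SEGMENT_SIZE - 1))

int
main(void)
{
    /* where the cascade wants to resume reading: start of the next segment,
     * reached by restoring the previous segment from the archive */
    XLogRecPtr RecPtr = 0x0A000000;

    /* what the archive-only upstream has actually flushed/replayed */
    XLogRecPtr upstreamFlush = 0x09000000;

    /* streaming is always requested from the start of a segment */
    XLogRecPtr startpoint = RecPtr - SEG_OFFSET(RecPtr);

    printf("requested start: %X/%X\n",
           (unsigned int) (startpoint >> 32), (unsigned int) startpoint);
    printf("upstream flush:  %X/%X\n",
           (unsigned int) (upstreamFlush >> 32), (unsigned int) upstreamFlush);
    printf("rejected: %s\n",
           startpoint > upstreamFlush ? "yes (ahead of flush position)" : "no");
    return 0;
}
--------------------------------------------

With these inputs the requested start point is 0/A000000, one full segment ahead of
what the upstream can serve, which is exactly the rejected request shown above.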
The cascade restart reproducer shows that even restarting the cascade itself
triggers the bug, which affects routine maintenance operations.
Scripts require PostgreSQL binaries in PATH and use ports 15432-15434.
Best regards,
Marco
Attachments
Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
From:
Fujii Masao
Date:
On Thu, Jan 29, 2026 at 2:03 AM Marco Nenciarini <marco.nenciarini@enterprisedb.com> wrote:
>
> Hi hackers,
>
> I've encountered a bug in PostgreSQL's streaming replication where cascading
> standbys fail to reconnect after falling back to archive recovery. The issue
> occurs when the upstream standby uses archive-only recovery.
>
> The standby requests streaming from the wrong WAL position (next segment boundary
> instead of the current position), causing connection failures with this error:
>
> ERROR: requested starting point 0/A000000 is ahead of the WAL flush
> position of this server 0/9000000

Thanks for the report! I was also able to reproduce this issue on the master branch.

Interestingly, I couldn't reproduce it on v11 using the same test case.
This makes me wonder whether the issue was introduced in v12 or later.
Do you see the same behavior in your environment?

Regards,

--
Fujii Masao
Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
From:
Xuneng Zhou
Date:
Hi Marco,

On Thu, Jan 29, 2026 at 1:03 AM Marco Nenciarini <marco.nenciarini@enterprisedb.com> wrote:
>
> Hi hackers,
>
> I've encountered a bug in PostgreSQL's streaming replication where cascading
> standbys fail to reconnect after falling back to archive recovery. The issue
> occurs when the upstream standby uses archive-only recovery.
>
> The standby requests streaming from the wrong WAL position (next segment boundary
> instead of the current position), causing connection failures with this error:
>
> ERROR: requested starting point 0/A000000 is ahead of the WAL flush
> position of this server 0/9000000
>
> Attached are two shell scripts that reliably reproduce the issue on PostgreSQL
> 17.x and 18.x:
>
> 1. reproducer_restart_upstream_portable.sh - triggers by restarting upstream
> 2. reproducer_cascade_restart_portable.sh - triggers by restarting the cascade
>
> The scripts set up this topology:
> - Primary with archiving enabled
> - Standby using only archive recovery (no streaming from primary)
> - Cascading standby streaming from the archive-only standby
>
> When the cascade loses its streaming connection and falls back to archive recovery,
> it cannot reconnect. The issue appears to be in xlogrecovery.c around line 3880,
> where the position passed to RequestXLogStreaming() determines which segment
> boundary is requested.
>
> The cascade restart reproducer shows that even restarting the cascade itself
> triggers the bug, which affects routine maintenance operations.
>
> Scripts require PostgreSQL binaries in PATH and use ports 15432-15434.
>
> Best regards,
> Marco
>

Thanks for your report. I can reliably reproduce the issue on HEAD
using your scripts. I’ve analyzed the problem and am proposing a patch
to fix it.

--- Analysis
When a cascading standby streams from an archive-only upstream:

1. The upstream's GetStandbyFlushRecPtr() returns only replay position
(no received-but-not-replayed buffer since there's no walreceiver)
2. When streaming ends and the cascade falls back to archive recovery,
it can restore WAL segments from its own archive access
3. The cascade's read position (RecPtr) advances beyond what the
upstream has replayed
4. On reconnect, the cascade requests streaming from RecPtr, which the
upstream rejects as "ahead of flush position"

--- Proposed Fix

Track the last confirmed flush position from streaming
(lastStreamedFlush) and clamp the streaming start request when it
exceeds that position:

- Same timeline: clamp to lastStreamedFlush if RecPtr > lastStreamedFlush
- Timeline switch: fall back to timeline switchpoint as safe boundary

This ensures the cascade requests from a position the upstream definitely has,
rather than assuming the upstream can serve whatever the cascade restored
locally from archive.

I’m not a fan of using sleep in TAP tests, but I haven’t found a better way to
reproduce this behavior yet.

--
Best,
Xuneng
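To sketch the intended behavior, here is a minimal standalone C illustration of the
clamping rule above. This is not the actual patch: only lastStreamedFlush comes from
the description, choose_stream_start and the switchpoint parameter are illustrative,
and the sample LSNs are taken from the reported error.

--------------------------------------------
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

#define InvalidXLogRecPtr       ((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(r)  ((r) == InvalidXLogRecPtr)

/*
 * Pick the LSN to request streaming from: never ask the upstream for WAL
 * beyond what it confirmed during the previous streaming session.
 */
static XLogRecPtr
choose_stream_start(XLogRecPtr RecPtr,             /* local read position */
                    XLogRecPtr lastStreamedFlush,  /* flush confirmed last session */
                    XLogRecPtr tliSwitchPoint,     /* switchpoint, or invalid */
                    bool sameTimeline)
{
    if (!sameTimeline && !XLogRecPtrIsInvalid(tliSwitchPoint))
        return tliSwitchPoint;      /* safe boundary across a timeline switch */

    if (!XLogRecPtrIsInvalid(lastStreamedFlush) && RecPtr > lastStreamedFlush)
        return lastStreamedFlush;   /* clamp: upstream never confirmed past this */

    return RecPtr;
}

int
main(void)
{
    /* cascade restored ahead of the upstream via its own archive access */
    XLogRecPtr start = choose_stream_start(0x0A000000, 0x09000000,
                                           InvalidXLogRecPtr, true);

    printf("start streaming at %X/%X\n",
           (unsigned int) (start >> 32), (unsigned int) start);
    return 0;
}
--------------------------------------------

With the reproducer's values this returns 0/9000000 instead of 0/A000000, so the
reconnect request stays within what the upstream has confirmed.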
Attachments
Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
From:
Fujii Masao
Date:
On Thu, Jan 29, 2026 at 9:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> Thanks for your report. I can reliably reproduce the issue on HEAD
> using your scripts. I’ve analyzed the problem and am proposing a patch
> to fix it.
>
> --- Analysis
> When a cascading standby streams from an archive-only upstream:
>
> 1. The upstream's GetStandbyFlushRecPtr() returns only replay position
> (no received-but-not-replayed buffer since there's no walreceiver)
> 2. When streaming ends and the cascade falls back to archive recovery,
> it can restore WAL segments from its own archive access
> 3. The cascade's read position (RecPtr) advances beyond what the
> upstream has replayed
> 4. On reconnect, the cascade requests streaming from RecPtr, which the
> upstream rejects as "ahead of flush position"
>
> --- Proposed Fix
>
> Track the last confirmed flush position from streaming
> (lastStreamedFlush) and clamp the streaming start request when it
> exceeds that position:

I haven't read the patch yet, but doesn't lastStreamedFlush represent
the same LSN as tliRecPtr or replayLSN (the arguments to
WaitForWALToBecomeAvailable())? If so, we may not need to introduce
a new variable to track this LSN.

The choice of which LSN is used as the replication start point has varied
over time to handle corner cases (for example, commit 06687198018).
That makes me wonder whether we should first better understand
why WaitForWALToBecomeAvailable() currently uses RecPtr as
the starting point.

BTW, with v1 patch, I was able to reproduce the issue using the following steps:

--------------------------------------------
initdb -D data
mkdir arch
cat <<EOF >> data/postgresql.conf
archive_mode = on
archive_command = 'cp %p ../arch/%f'
restore_command = 'cp ../arch/%f %p'
EOF
pg_ctl -D data start
pg_basebackup -D sby1 -c fast
cp -a sby1 sby2
cat <<EOF >> sby1/postgresql.conf
port = 5433
EOF
touch sby1/standby.signal
pg_ctl -D sby1 start
cat <<EOF >> sby2/postgresql.conf
port = 5434
primary_conninfo = 'port=5433'
EOF
touch sby2/standby.signal
pg_ctl -D sby2 start
pgbench -i -s2
pg_ctl -D sby2 restart
--------------------------------------------

In this case, after restarting the standby connecting to another
(cascading) standby, I observed the following error.

FATAL: could not receive data from WAL stream: ERROR: requested
starting point 0/04000000 is ahead of the WAL flush position of this
server 0/03FFE8D0

Regards,

--
Fujii Masao
Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
From:
Xuneng Zhou
Date:
Hi Fujii-san,
Thanks for looking into this.
On Fri, Jan 30, 2026 at 11:12 AM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Thu, Jan 29, 2026 at 9:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > Thanks for your report. I can reliably reproduce the issue on HEAD
> > using your scripts. I’ve analyzed the problem and am proposing a patch
> > to fix it.
> >
> > --- Analysis
> > When a cascading standby streams from an archive-only upstream:
> >
> > 1. The upstream's GetStandbyFlushRecPtr() returns only replay position
> > (no received-but-not-replayed buffer since there's no walreceiver)
> > 2. When streaming ends and the cascade falls back to archive recovery,
> > it can restore WAL segments from its own archive access
> > 3. The cascade's read position (RecPtr) advances beyond what the
> > upstream has replayed
> > 4. On reconnect, the cascade requests streaming from RecPtr, which the
> > upstream rejects as "ahead of flush position"
> >
> > --- Proposed Fix
> >
> > Track the last confirmed flush position from streaming
> > (lastStreamedFlush) and clamp the streaming start request when it
> > exceeds that position:
>
> I haven't read the patch yet, but doesn't lastStreamedFlush represent
> the same LSN as tliRecPtr or replayLSN (the arguments to
> WaitForWALToBecomeAvailable())? If so, we may not need to introduce
> a new variable to track this LSN.
I think they refer to different types of LSNs. I don’t have access to my computer at the moment, but I’ll look into it and get back to you shortly.
> The choice of which LSN is used as the replication start point has varied
> over time to handle corner cases (for example, commit 06687198018).
> That makes me wonder whether we should first better understand
> why WaitForWALToBecomeAvailable() currently uses RecPtr as
> the starting point.
>
> BTW, with v1 patch, I was able to reproduce the issue using the following steps:
>
> --------------------------------------------
> initdb -D data
> mkdir arch
> cat <<EOF >> data/postgresql.conf
> archive_mode = on
> archive_command = 'cp %p ../arch/%f'
> restore_command = 'cp ../arch/%f %p'
> EOF
> pg_ctl -D data start
> pg_basebackup -D sby1 -c fast
> cp -a sby1 sby2
> cat <<EOF >> sby1/postgresql.conf
> port = 5433
> EOF
> touch sby1/standby.signal
> pg_ctl -D sby1 start
> cat <<EOF >> sby2/postgresql.conf
> port = 5434
> primary_conninfo = 'port=5433'
> EOF
> touch sby2/standby.signal
> pg_ctl -D sby2 start
> pgbench -i -s2
> pg_ctl -D sby2 restart
> --------------------------------------------
>
> In this case, after restarting the standby connecting to another
> (cascading) standby, I observed the following error.
>
> FATAL: could not receive data from WAL stream: ERROR: requested
> starting point 0/04000000 is ahead of the WAL flush position of this
> server 0/03FFE8D0
>
> Regards,
>
> --
> Fujii Masao
Best,
Xuneng
Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
From:
Xuneng Zhou
Date:
Hi,

On Fri, Jan 30, 2026 at 11:12 AM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Thu, Jan 29, 2026 at 9:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > Thanks for your report. I can reliably reproduce the issue on HEAD
> > using your scripts. I’ve analyzed the problem and am proposing a patch
> > to fix it.
> >
> > --- Analysis
> > When a cascading standby streams from an archive-only upstream:
> >
> > 1. The upstream's GetStandbyFlushRecPtr() returns only replay position
> > (no received-but-not-replayed buffer since there's no walreceiver)
> > 2. When streaming ends and the cascade falls back to archive recovery,
> > it can restore WAL segments from its own archive access
> > 3. The cascade's read position (RecPtr) advances beyond what the
> > upstream has replayed
> > 4. On reconnect, the cascade requests streaming from RecPtr, which the
> > upstream rejects as "ahead of flush position"
> >
> > --- Proposed Fix
> >
> > Track the last confirmed flush position from streaming
> > (lastStreamedFlush) and clamp the streaming start request when it
> > exceeds that position:
>
> I haven't read the patch yet, but doesn't lastStreamedFlush represent
> the same LSN as tliRecPtr or replayLSN (the arguments to
> WaitForWALToBecomeAvailable())? If so, we may not need to introduce
> a new variable to track this LSN.

lastStreamedFlush is the upstream's confirmed flush point from the last streaming
session, i.e. what the sender guaranteed it had. tliRecPtr is the LSN of the start
of the current WAL record, which is used to determine which timeline that record
belongs to (tliOfPointInHistory), and replayLSN is how far we've applied locally.
After archive fallback, both tliRecPtr and replayLSN can be ahead of what the
upstream has, so they can't safely cap a reconnect. lastStreamedFlush is used as
the upstream-capability bound.

> The choice of which LSN is used as the replication start point has varied
> over time to handle corner cases (for example, commit 06687198018).
> That makes me wonder whether we should first better understand
> why WaitForWALToBecomeAvailable() currently uses RecPtr as
> the starting point.

AFAICS, commit 06687198018 addresses a scenario where a standby gets stuck reading
a continuation record that spans multiple pages/segments when the pages must come
from different sources. The problem: if the first page is read successfully from
local pg_wal but the second page contains garbage from a recycled segment, the old
code would enter an infinite loop. This happened because:

- Late failure detection: page header validation occurred inside XLogReadRecord(),
which triggered ReadRecord()'s retry-from-beginning logic, restarting the entire
record read from local sources without ever trying streaming.
- Wrong streaming start position: even if streaming was eventually attempted, it
started from tliRecPtr (record start) rather than RecPtr (current read position),
potentially re-requesting segments the primary had already recycled.

The fix has two parts:

- Early page header validation: validate the page header immediately after reading,
before returning to the caller. If garbage is detected (typically via an
xlp_pageaddr mismatch), jump directly to next_record_is_invalid to try an
alternative source (streaming), bypassing ReadRecord()'s retry loop.
- Correct streaming start position: change ptr = tliRecPtr to ptr = RecPtr, so
streaming begins at the position where data is actually needed.

The record start position (tliRecPtr) is still used for timeline determination,
but no longer for the streaming start LSN. Together, these changes ensure the
standby escapes the local-read retry loop and fetches the continuation data from
the correct position via streaming.

> BTW, with v1 patch, I was able to reproduce the issue using the following steps:
>
> --------------------------------------------
> initdb -D data
> mkdir arch
> cat <<EOF >> data/postgresql.conf
> archive_mode = on
> archive_command = 'cp %p ../arch/%f'
> restore_command = 'cp ../arch/%f %p'
> EOF
> pg_ctl -D data start
> pg_basebackup -D sby1 -c fast
> cp -a sby1 sby2
> cat <<EOF >> sby1/postgresql.conf
> port = 5433
> EOF
> touch sby1/standby.signal
> pg_ctl -D sby1 start
> cat <<EOF >> sby2/postgresql.conf
> port = 5434
> primary_conninfo = 'port=5433'
> EOF
> touch sby2/standby.signal
> pg_ctl -D sby2 start
> pgbench -i -s2
> pg_ctl -D sby2 restart
> --------------------------------------------
>
> In this case, after restarting the standby connecting to another
> (cascading) standby, I observed the following error.
>
> FATAL: could not receive data from WAL stream: ERROR: requested
> starting point 0/04000000 is ahead of the WAL flush position of this
> server 0/03FFE8D0
>

After sby2 restarts, its WAL read position (RecPtr) is set to the segment boundary
0/04000000, but the upstream sby1 (an archive-only standby with no walreceiver) can
only serve up to its replay position 0/03FFE8D0. The cascade requests WAL ahead of
what the upstream can provide. The issue is that no in-memory state survives the
restart to cap the streaming start request. Before restart, the walreceiver knew
what the upstream had confirmed; after restart, that information is lost.

One potential solution is a "handshake clamp": after connecting, obtain the
upstream's current flush LSN from IDENTIFY_SYSTEM and clamp the streaming start
position to Min(startpoint, primaryFlush) before sending START_REPLICATION. But I
think this is somewhat complicated.

--
Best,
Xuneng
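For reference, here is a rough standalone sketch of that handshake-clamp idea. It is
illustrative only: primaryFlush stands for the flush LSN reported in the
IDENTIFY_SYSTEM reply, clamp_to_upstream is a made-up helper, Min() mirrors
PostgreSQL's macro of the same name, and the LSNs come from the restart reproducer
above.

--------------------------------------------
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

/* same idea as PostgreSQL's Min() macro */
#define Min(a, b) ((a) < (b) ? (a) : (b))

/*
 * Hypothetical clamp applied between IDENTIFY_SYSTEM and START_REPLICATION:
 * never request a start point beyond the flush LSN the upstream just reported.
 */
static XLogRecPtr
clamp_to_upstream(XLogRecPtr startpoint, XLogRecPtr primaryFlush)
{
    return Min(startpoint, primaryFlush);
}

int
main(void)
{
    XLogRecPtr startpoint = 0x04000000;     /* what the restarted cascade asks for */
    XLogRecPtr primaryFlush = 0x03FFE8D0;   /* what the upstream reports at handshake */
    XLogRecPtr clamped = clamp_to_upstream(startpoint, primaryFlush);

    printf("START_REPLICATION at %X/%X\n",
           (unsigned int) (clamped >> 32), (unsigned int) clamped);
    return 0;
}
--------------------------------------------

With these values the request would go out at 0/3FFE8D0 rather than 0/4000000; as
noted above, though, this approach has its own complications.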