Thread: BUG: Cascading standby fails to reconnect after falling back to archive recovery


BUG: Cascading standby fails to reconnect after falling back to archive recovery

From: Marco Nenciarini
Date:
Hi hackers,

I've encountered a bug in PostgreSQL's streaming replication where cascading
standbys fail to reconnect after falling back to archive recovery. The issue
occurs when the upstream standby uses archive-only recovery.

The standby requests streaming from the wrong WAL position (next segment boundary
instead of the current position), causing connection failures with this error:

    ERROR: requested starting point 0/A000000 is ahead of the WAL flush
    position of this server 0/9000000

Attached are two shell scripts that reliably reproduce the issue on PostgreSQL
17.x and 18.x:

1. reproducer_restart_upstream_portable.sh - triggers by restarting upstream
2. reproducer_cascade_restart_portable.sh - triggers by restarting the cascade

The scripts set up this topology:
- Primary with archiving enabled
- Standby using only archive recovery (no streaming from primary)
- Cascading standby streaming from the archive-only standby

When the cascade loses its streaming connection and falls back to archive recovery,
it cannot reconnect. The issue appears to be in xlogrecovery.c around line 3880,
where the position passed to RequestXLogStreaming() determines which segment
boundary is requested.
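
For reference, the streaming request is built roughly like this
(paraphrased from the sources, so details may differ between branches):

    /* xlogrecovery.c, WaitForWALToBecomeAvailable(), approximate */
    ptr = RecPtr;   /* current read position; after restoring from archive
                     * this can be a segment boundary the upstream standby
                     * has not flushed yet */
    tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
    ...
    RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
                         PrimarySlotName,
                         wal_receiver_create_temp_slot);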

The cascade restart reproducer shows that even restarting the cascade itself
triggers the bug, which affects routine maintenance operations.

Scripts require PostgreSQL binaries in PATH and use ports 15432-15434.

Best regards,
Marco

Attachments
On Thu, Jan 29, 2026 at 2:03 AM Marco Nenciarini
<marco.nenciarini@enterprisedb.com> wrote:
>
> Hi hackers,
>
> I've encountered a bug in PostgreSQL's streaming replication where cascading
> standbys fail to reconnect after falling back to archive recovery. The issue
> occurs when the upstream standby uses archive-only recovery.
>
> The standby requests streaming from the wrong WAL position (next segment boundary
> instead of the current position), causing connection failures with this error:
>
>     ERROR: requested starting point 0/A000000 is ahead of the WAL flush
>     position of this server 0/9000000

Thanks for the report!
I was also able to reproduce this issue on the master branch.

Interestingly, I couldn't reproduce it on v11 using the same test case.
This makes me wonder whether the issue was introduced in v12 or later.

Do you see the same behavior in your environment?

Regards,

--
Fujii Masao



Hi Marco,

On Thu, Jan 29, 2026 at 1:03 AM Marco Nenciarini
<marco.nenciarini@enterprisedb.com> wrote:
>
> Hi hackers,
>
> I've encountered a bug in PostgreSQL's streaming replication where cascading
> standbys fail to reconnect after falling back to archive recovery. The issue
> occurs when the upstream standby uses archive-only recovery.
>
> The standby requests streaming from the wrong WAL position (next segment boundary
> instead of the current position), causing connection failures with this error:
>
>     ERROR: requested starting point 0/A000000 is ahead of the WAL flush
>     position of this server 0/9000000
>
> Attached are two shell scripts that reliably reproduce the issue on PostgreSQL
> 17.x and 18.x:
>
> 1. reproducer_restart_upstream_portable.sh - triggers by restarting upstream
> 2. reproducer_cascade_restart_portable.sh - triggers by restarting the cascade
>
> The scripts set up this topology:
> - Primary with archiving enabled
> - Standby using only archive recovery (no streaming from primary)
> - Cascading standby streaming from the archive-only standby
>
> When the cascade loses its streaming connection and falls back to archive recovery,
> it cannot reconnect. The issue appears to be in xlogrecovery.c around line 3880,
> where the position passed to RequestXLogStreaming() determines which segment
> boundary is requested.
>
> The cascade restart reproducer shows that even restarting the cascade itself
> triggers the bug, which affects routine maintenance operations.
>
> Scripts require PostgreSQL binaries in PATH and use ports 15432-15434.
>
> Best regards,
> Marco
>

Thanks for your report. I can reliably reproduce the issue on HEAD
using your scripts. I’ve analyzed the problem and am proposing a patch
to fix it.

--- Analysis

When a cascading standby streams from an archive-only upstream:

1. The upstream's GetStandbyFlushRecPtr() returns only the replay position
(no received-but-not-replayed WAL, since there's no walreceiver; see the
sketch after this list)
2. When streaming ends and the cascade falls back to archive recovery,
it can restore WAL segments from its own archive access
3. The cascade's read position (RecPtr) advances beyond what the
upstream has replayed
4. On reconnect, the cascade requests streaming from RecPtr, which the
upstream rejects as "ahead of flush position"
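
For step 1, the relevant walsender logic looks roughly like this
(paraphrased, not a verbatim copy of the sources):

    /* walsender.c, GetStandbyFlushRecPtr(), approximate */
    receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI);
    replayPtr = GetXLogReplayRecPtr(&replayTLI);

    result = replayPtr;
    if (receiveTLI == replayTLI && receivePtr > replayPtr)
        result = receivePtr;

    /* an archive-only standby runs no walreceiver, so receivePtr never
     * advances and the cascading walsender can serve only up to replayPtr */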

--- Proposed Fix

Track the last confirmed flush position from streaming
(lastStreamedFlush) and clamp the streaming start request when it
exceeds that position:

- Same timeline: clamp to lastStreamedFlush if RecPtr > lastStreamedFlush
- Timeline switch: fall back to timeline switchpoint as safe boundary

This ensures the cascade requests from a position the upstream
definitely has, rather than assuming the upstream can serve whatever
the cascade restored locally from archive.
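
In rough pseudocode (lastStreamedFlush comes from the patch;
lastStreamedTLI and switchpoint are illustrative names here, and the
exact placement in WaitForWALToBecomeAvailable() may differ from the
actual patch):

    /* sketch of the proposed clamp, not the literal patch */
    ptr = RecPtr;
    if (tli == lastStreamedTLI)
    {
        /* same timeline: never request beyond the last confirmed flush */
        if (!XLogRecPtrIsInvalid(lastStreamedFlush) &&
            ptr > lastStreamedFlush)
            ptr = lastStreamedFlush;
    }
    else
    {
        /* timeline switch: fall back to the switchpoint as a safe bound */
        ptr = Min(ptr, switchpoint);
    }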

I’m not a fan of using sleep in TAP tests, but I haven’t found a
better way to reproduce this behavior yet.

--
Best,
Xuneng

Attachments
On Thu, Jan 29, 2026 at 9:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> Thanks for your report. I can reliably reproduce the issue on HEAD
> using your scripts. I’ve analyzed the problem and am proposing a patch
> to fix it.
>
> --- Analysis
> When a cascading standby streams from an archive-only upstream:
>
> 1. The upstream's GetStandbyFlushRecPtr() returns only replay position
> (no received-but-not-replayed buffer since there's no walreceiver)
> 2. When streaming ends and the cascade falls back to archive recovery,
> it can restore WAL segments from its own archive access
> 3. The cascade's read position (RecPtr) advances beyond what the
> upstream has replayed
> 4. On reconnect, the cascade requests streaming from RecPtr, which the
> upstream rejects as "ahead of flush position"
>
> --- Proposed Fix
>
> Track the last confirmed flush position from streaming
> (lastStreamedFlush) and clamp the streaming start request when it
> exceeds that position:

I haven't read the patch yet, but doesn't lastStreamedFlush represent
the same LSN as tliRecPtr or replayLSN (the arguments to
WaitForWALToBecomeAvailable())? If so, we may not need to introduce
a new variable to track this LSN.

The choice of which LSN is used as the replication start point has varied
over time to handle corner cases (for example, commit 06687198018).
That makes me wonder whether we should first better understand
why WaitForWALToBecomeAvailable() currently uses RecPtr as
the starting point.

BTW, even with the v1 patch applied, I was able to reproduce the issue using the following steps:

--------------------------------------------
initdb -D data
mkdir arch
cat <<EOF >> data/postgresql.conf
archive_mode = on
archive_command = 'cp %p ../arch/%f'
restore_command = 'cp ../arch/%f %p'
EOF
pg_ctl -D data start
pg_basebackup -D sby1 -c fast
cp -a sby1 sby2
cat <<EOF >> sby1/postgresql.conf
port = 5433
EOF
touch sby1/standby.signal
pg_ctl -D sby1 start
cat <<EOF >> sby2/postgresql.conf
port = 5434
primary_conninfo = 'port=5433'
EOF
touch sby2/standby.signal
pg_ctl -D sby2 start
pgbench -i -s2
pg_ctl -D sby2 restart
--------------------------------------------

In this case, after restarting the cascading standby (the one
connecting to another standby), I observed the following error.

FATAL:  could not receive data from WAL stream: ERROR:  requested
starting point 0/04000000 is ahead of the WAL flush position of this
server 0/03FFE8D0

Regards,

--
Fujii Masao



Hi Fujii-san,

Thanks for looking into this.

On Fri, Jan 30, 2026 at 11:12 AM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Thu, Jan 29, 2026 at 9:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > Thanks for your report. I can reliably reproduce the issue on HEAD
> > using your scripts. I’ve analyzed the problem and am proposing a patch
> > to fix it.
> >
> > --- Analysis
> > When a cascading standby streams from an archive-only upstream:
> >
> > 1. The upstream's GetStandbyFlushRecPtr() returns only replay position
> > (no received-but-not-replayed buffer since there's no walreceiver)
> > 2. When streaming ends and the cascade falls back to archive recovery,
> > it can restore WAL segments from its own archive access
> > 3. The cascade's read position (RecPtr) advances beyond what the
> > upstream has replayed
> > 4. On reconnect, the cascade requests streaming from RecPtr, which the
> > upstream rejects as "ahead of flush position"
> >
> > --- Proposed Fix
> >
> > Track the last confirmed flush position from streaming
> > (lastStreamedFlush) and clamp the streaming start request when it
> > exceeds that position:
>
> I haven't read the patch yet, but doesn't lastStreamedFlush represent
> the same LSN as tliRecPtr or replayLSN (the arguments to
> WaitForWALToBecomeAvailable())? If so, we may not need to introduce
> a new variable to track this LSN.

I think they refer to different types of LSNs. I don’t have access to my computer at the moment, but I’ll look into it and get back to you shortly.

> The choice of which LSN is used as the replication start point has varied
> over time to handle corner cases (for example, commit 06687198018).
> That makes me wonder whether we should first better understand
> why WaitForWALToBecomeAvailable() currently uses RecPtr as
> the starting point.
>
> BTW, even with the v1 patch applied, I was able to reproduce the issue using the following steps:
>
> --------------------------------------------
> initdb -D data
> mkdir arch
> cat <<EOF >> data/postgresql.conf
> archive_mode = on
> archive_command = 'cp %p ../arch/%f'
> restore_command = 'cp ../arch/%f %p'
> EOF
> pg_ctl -D data start
> pg_basebackup -D sby1 -c fast
> cp -a sby1 sby2
> cat <<EOF >> sby1/postgresql.conf
> port = 5433
> EOF
> touch sby1/standby.signal
> pg_ctl -D sby1 start
> cat <<EOF >> sby2/postgresql.conf
> port = 5434
> primary_conninfo = 'port=5433'
> EOF
> touch sby2/standby.signal
> pg_ctl -D sby2 start
> pgbench -i -s2
> pg_ctl -D sby2 restart
> --------------------------------------------
>
> In this case, after restarting the standby connecting to another
> (cascading) standby, I observed the following error.
>
> FATAL:  could not receive data from WAL stream: ERROR:  requested
> starting point 0/04000000 is ahead of the WAL flush position of this
> server 0/03FFE8D0
>
> Regards,
>
> --
> Fujii Masao


Best,
Xuneng

Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery

From: Xuneng Zhou
Date:
Hi,

On Fri, Jan 30, 2026 at 11:12 AM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Thu, Jan 29, 2026 at 9:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > Thanks for your report. I can reliably reproduce the issue on HEAD
> > using your scripts. I’ve analyzed the problem and am proposing a patch
> > to fix it.
> >
> > --- Analysis
> > When a cascading standby streams from an archive-only upstream:
> >
> > 1. The upstream's GetStandbyFlushRecPtr() returns only replay position
> > (no received-but-not-replayed buffer since there's no walreceiver)
> > 2. When streaming ends and the cascade falls back to archive recovery,
> > it can restore WAL segments from its own archive access
> > 3. The cascade's read position (RecPtr) advances beyond what the
> > upstream has replayed
> > 4. On reconnect, the cascade requests streaming from RecPtr, which the
> > upstream rejects as "ahead of flush position"
> >
> > --- Proposed Fix
> >
> > Track the last confirmed flush position from streaming
> > (lastStreamedFlush) and clamp the streaming start request when it
> > exceeds that position:
>
> I haven't read the patch yet, but doesn't lastStreamedFlush represent
> the same LSN as tliRecPtr or replayLSN (the arguments to
> WaitForWALToBecomeAvailable())? If so, we may not need to introduce
> a new variable to track this LSN.

lastStreamedFlush is the upstream's confirmed flush point from the
last streaming session: what the sender guaranteed it had. tliRecPtr is
the LSN of the start of the current WAL record, which is used to
determine which timeline that record belongs to (via
tliOfPointInHistory), and replayLSN is how far we've applied locally.
After archive fallback, both tliRecPtr and replayLSN can be ahead of
what the upstream has, so they can't safely cap a reconnect.
lastStreamedFlush is used as the upstream-capability bound.
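
Schematically, after the cascade has fallen back to archive recovery
(informal, just to show the ordering):

    /*
     * lastStreamedFlush <= what the upstream walsender confirmed it had;
     * replayLSN, tliRecPtr and RecPtr may all exceed lastStreamedFlush,
     * because the cascade restored further WAL via its own archive access.
     */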

> The choice of which LSN is used as the replication start point has varied
> over time to handle corner cases (for example, commit 06687198018).
> That makes me wonder whether we should first better understand
> why WaitForWALToBecomeAvailable() currently uses RecPtr as
> the starting point.

AFAICS, commit 06687198018 addresses a scenario where a standby gets
stuck reading a continuation record that spans multiple pages/segments
when the pages must come from different sources.

The problem: if the first page is read successfully from local pg_wal
but the second page contains garbage from a recycled segment, the old
code would enter an infinite loop. This happened because:

Late failure detection: Page header validation occurred inside
XLogReadRecord(), which triggered ReadRecord()'s retry-from-beginning
logic—restarting the entire record read from local sources without
ever trying streaming.

Wrong streaming start position: Even if streaming was eventually
attempted, it started from tliRecPtr (record start) rather than RecPtr
(current read position), potentially re-requesting segments the
primary had already recycled.

The fix has two parts:

Early page header validation: Validate the page header immediately
after reading, before returning to the caller. If garbage is detected
(typically via xlp_pageaddr mismatch), jump directly to
next_record_is_invalid to try an alternative source (streaming),
bypassing ReadRecord()'s retry loop.

Correct streaming start position: Change from ptr = tliRecPtr to ptr =
RecPtr, so streaming begins at the position where data is actually
needed. The record start position (tliRecPtr) is still used for
timeline determination, but no longer for the streaming start LSN.

Together, these changes ensure the standby escapes the local-read
retry loop and fetches the continuation data from the correct position
via streaming.
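
In code terms, that second part amounts to roughly the following change
in WaitForWALToBecomeAvailable() (paraphrased):

    /* before: request streaming from the start of the record */
    ptr = tliRecPtr;
    /* after: request streaming from the current read position */
    ptr = RecPtr;
    /* the record start is still used, but only to pick the timeline */
    tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);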

> BTW, even with the v1 patch applied, I was able to reproduce the issue using the following steps:
>
> --------------------------------------------
> initdb -D data
> mkdir arch
> cat <<EOF >> data/postgresql.conf
> archive_mode = on
> archive_command = 'cp %p ../arch/%f'
> restore_command = 'cp ../arch/%f %p'
> EOF
> pg_ctl -D data start
> pg_basebackup -D sby1 -c fast
> cp -a sby1 sby2
> cat <<EOF >> sby1/postgresql.conf
> port = 5433
> EOF
> touch sby1/standby.signal
> pg_ctl -D sby1 start
> cat <<EOF >> sby2/postgresql.conf
> port = 5434
> primary_conninfo = 'port=5433'
> EOF
> touch sby2/standby.signal
> pg_ctl -D sby2 start
> pgbench -i -s2
> pg_ctl -D sby2 restart
> --------------------------------------------
>
> In this case, after restarting the cascading standby (the one
> connecting to another standby), I observed the following error.
>
> FATAL:  could not receive data from WAL stream: ERROR:  requested
> starting point 0/04000000 is ahead of the WAL flush position of this
> server 0/03FFE8D0
>

After sby2 restarts, its WAL read position (RecPtr) is set to the
segment boundary 0/04000000, but the upstream sby1 (an archive-only
standby with no walreceiver) can only serve up to its replay position
0/03FFE8D0. The cascade requests WAL ahead of what the upstream can
provide. The issue is that no in-memory state survives the restart to
cap the streaming start request. Before the restart, the walreceiver
knew what the upstream had confirmed; after the restart, that
information is lost.

One potential solution is a "handshake clamp": after connecting,
obtain the upstream's current flush LSN from IDENTIFY_SYSTEM and clamp
the streaming start position to Min(startpoint, primaryFlush) before
sending START_REPLICATION. But I think this is somewhat complicated.
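
A very rough sketch of that idea (walrcv_get_flush_lsn is hypothetical:
IDENTIFY_SYSTEM does report an xlogpos column, which on a standby
walsender comes from GetStandbyFlushRecPtr(), but
libpqrcv_identify_system() currently discards it):

    /* hypothetical handshake clamp in the walreceiver, sketch only */
    XLogRecPtr  primaryFlush;

    primaryFlush = walrcv_get_flush_lsn(wrconn);    /* does not exist today */

    if (!XLogRecPtrIsInvalid(primaryFlush) && startpoint > primaryFlush)
        startpoint = primaryFlush;  /* never ask beyond upstream's flush */

    /* then proceed with START_REPLICATION as usual */
    walrcv_startstreaming(wrconn, &options);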

--
Best,
Xuneng