Discussion: [PATCH] Add archive_mode=follow_primary to prevent unarchived WAL on standby promotion

[PATCH] Add archive_mode=follow_primary to prevent unarchived WAL on standby promotion

From:
Andrey Borodin
Date:
Hi hackers,

I'd like to propose a new archive_mode setting to address a gap in WAL
archiving for high availability streaming replication configurations.

## Problem

In HA setups using streaming replication, standbys can be
promoted when the primary has failed, while some WAL segments are not yet
archived. This creates gaps in the WAL archive, breaking point-in-time
recovery:

1. Primary generates WAL, streams to standby
2. Standby receives WAL, marks segments as .done immediately
3. Standby deletes WAL during checkpoints
4. Primary hasn't archived yet (archiver lag, network issues, etc.)
5. Primary vanishes
6. Standby gets promoted
7. WAL history lost from archive

This is particularly problematic in synchronous replication where
promotion might happen while the primary is still catching up on archival.

A promoted standby might have some WAL segments from the walreceiver and some
from the archive. In this case we need to archive only those segments which
were received but not confirmed as archived by the primary.

## Proposed Solution

Add archive_mode=follow_primary, where standbys defer WAL deletion until
the primary confirms archival:

- During recovery: standby creates .ready files for received segments
- Periodically: standby queries primary for archive status via replication protocol
- Primary responds: which segments are archived (no .ready file exists)
- Standby marks those as .done and can safely delete them
- On promotion: standby automatically archives remaining .ready segments

## Implementation

The patch adds two replication protocol messages:
- 'a' (PqReplMsg_ArchiveStatusQuery): standby → primary, sends (timeline, segno) pairs
- 'A' (PqReplMsg_ArchiveStatusResponse): primary → standby, responds with archived pairs
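
In sketch form, the walreceiver side builds the query roughly like this,
modeled on XLogWalRcvSendReply() in walreceiver.c. The integer widths and the
pending[] bookkeeping below are illustrative, not the literal patch code:

    /* pending[]/npending: hypothetical list of segments with .ready files */
    static void
    XLogWalRcvSendArchiveQuery(void)
    {
        StringInfoData buf;
        int         nsegs = Min(npending, 64);     /* cap pairs per query */

        initStringInfo(&buf);
        pq_sendbyte(&buf, 'a');     /* PqReplMsg_ArchiveStatusQuery */
        pq_sendint32(&buf, nsegs);
        for (int i = 0; i < nsegs; i++)
        {
            pq_sendint32(&buf, pending[i].tli);             /* TimeLineID */
            pq_sendint64(&buf, (int64) pending[i].segno);   /* XLogSegNo */
        }
        walrcv_send(wrconn, buf.data, buf.len);
        pfree(buf.data);
    }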

Key changes:
- walreceiver: XLogWalRcvSendArchiveQuery() scans archive_status, sends
queries. I particularly dislike the necessity to read the whole archive_status
directory, but found no better way.
- walsender: ProcessStandbyArchiveQueryMessage() checks .ready files, responds.
Fortunately, there are no potentially FS-heavy operations on the primary.
- archiver: skips archiving during recovery if archive_mode=follow_primary.
I considered creating a new kind of status file, but rejected the idea.
- XLogWalRcvClose(): creates .ready files instead of .done in follow_primary mode
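
For the last item, the change is essentially one more arm of the existing
branch in walreceiver.c (a sketch; ARCHIVE_MODE_FOLLOW_PRIMARY is the new
enum value):

    if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS ||
        XLogArchiveMode == ARCHIVE_MODE_FOLLOW_PRIMARY)
        XLogArchiveNotify(xlogfname);       /* create .ready */
    else
        XLogArchiveForceDone(xlogfname);    /* create .done immediately */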

Status requests happen at wal_receiver_status_interval (similar to hot_standby_feedback).
This works with cascading replication: each standby queries its immediate upstream.
The primary can be configured with archive_mode=follow_primary too.
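
A standby would then be configured along these lines (values illustrative,
the archive script is hypothetical):

    archive_mode = follow_primary
    archive_command = '/usr/local/bin/wal-archive %p %f'
    wal_receiver_status_interval = 10s    # also drives archive status queries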

## Testing

Included TAP tests cover:
- Basic archive status synchronization
- Standby promotion triggering archival
- Cascading standby configurations
- Multiple standbys from same primary

## Performance Impact

The overhead is minimal:
- Standby: One archive_status directory scan per wal_receiver_status_interval
- Primary: O(n) stat() calls where n = number of .ready files on standby
- Network: Small message (~1KB for 64 segments; see the estimate below)
- Some space occupied by unarchived WALs on all standbys
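
(For the message size: assuming a 4-byte timeline ID and an 8-byte segment
number per pair, 64 pairs are 64 * 12 = 768 bytes of payload plus a small
header, hence the ~1KB estimate.)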

## Open Questions

1. **Naming**: Is "follow_primary" the best name? Alternatives considered:
   - standby
   - synchronized/sync
   - coordinated
   - primary_sync

2. **Query frequency**: Currently tied to wal_receiver_status_interval.
   Should this be a separate GUC?

3. **Message protocol**: Should we batch more segments per message?
   Current limit is 64 per query. Maybe sort requests by LSN to pick the 64 oldest segments?

4. **Backwards compatibility**: Primary must understand the protocol.
   Should we version-check or gracefully degrade? I don't think an additional
   check is necessary, but I'm not sure.
   Currently, if a walreceiver with follow_primary connects to an old primary that
   doesn't understand the 'a' message, the primary will log a protocol error
   but replication will continue (the standby just won't get responses).

## Future work

I'd like to extend the archiver design to distribute archival work between
cluster nodes. But that would be too big a project to do at once, so I decided
to address the PITR continuity issue first.

## Patch

The attached patch implements the feature with documentation and tests, but the
main purpose is, of course, discussion. Does this approach seem like the right
direction of development?
Looking forward to feedback on the approach and any concerns.


Best regards, Andrey Borodin.

Attachments
Hi,

On Thu, Oct 23, 2025 at 9:25 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers,
>
> I'd like to propose a new archive_mode setting to address a gap in WAL
> archiving for high availability streaming replication configurations.
>
> In HA setups using streaming replication, standbys can be
> promoted when the primary has failed, while some WAL segments are not yet
> archived. This creates gaps in the WAL archive, breaking point-in-time
> recovery:
>
> 1. Primary generates WAL, streams to standby
> 2. Standby receives WAL, marks segments as .done immediately

+1 to the idea.
If I understand correctly, the assumption we're making is that the standby
doesn't really "archive", it just marks segments as .done, even though in
theory it could do the same thing as the primary and avoid this issue. It
would be wasted work if the primary and replica archived the same WAL, and
that's what we want to avoid?

>
> ## Implementation
>
> The patch adds two replication protocol messages:
> - 'a' (PqReplMsg_ArchiveStatusQuery): standby → primary, sends (timeline, segno) pairs
> - 'A' (PqReplMsg_ArchiveStatusResponse): primary → standby, responds with archived pairs
>

I might be missing something, but isn't it enough for the writer to send the
last_archived_wal in PgStat_ArchiverStats? That way we can avoid doing the
full directory scan of archive_status. Or do we not feel comfortable assuming
that WAL files are archived in order?

Thanks,
--
John Hsu - Amazon Web Services



On Fri, Oct 24, 2025 at 1:25 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers,
>
> I'd like to propose a new archive_mode setting to address a gap in WAL
> archiving for high availability streaming replication configurations.
>
> ## Problem
>
> In HA setups using streaming replication, standbys can be
> promoted when the primary has failed, while some WAL segments are not yet
> archived. This creates gaps in the WAL archive, breaking point-in-time
> recovery:
>
> 1. Primary generates WAL, streams to standby
> 2. Standby receives WAL, marks segments as .done immediately
> 3. Standby deletes WAL during checkpoints
> 4. Primary hasn't archived yet (archiver lag, network issues, etc.)
> 5. Primary vanishes
> 6. Standby gets promoted
> 7. WAL history lost from archive
>
> This is particularly problematic in synchronous replication where
> promotion might happen while the primary is still catching up on archival.
>
> A promoted standby might have some WAL segments from the walreceiver and some
> from the archive. In this case we need to archive only those segments which
> were received but not confirmed as archived by the primary.
>
> ## Proposed Solution
>
> Add archive_mode=follow_primary, where standbys defer WAL deletion until
> the primary confirms archival:

Can't we achieve nearly the same behavior by setting archive_mode to
always and configuring archive_command on the standby to check
whether the WAL file already exists in the shared archive area
(e.g., test -f <archive directory>/%f (probably also the WAL file size
should be checked))? In this setup, archive_command would fail
until the WAL file appears in the archive, preventing the standby
from removing it while the command is failing.
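
For example, a minimal sketch of that setup, using cmp to check both
existence and content (the mount point is illustrative):

    archive_mode = 'always'
    # succeed only once the archived file is identical to the local copy
    archive_command = 'cmp -s %p /mnt/archive/%f'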

Regards,

--
Fujii Masao



Re: [PATCH] Add archive_mode=follow_primary to prevent unarchived WAL on standby promotion

From:
Andrey Borodin
Date:

> On 24 Oct 2025, at 03:19, John H <johnhyvr@gmail.com> wrote:
>
> Hi,
>
> On Thu, Oct 23, 2025 at 9:25 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>
>> Hi hackers,
>>
>> I'd like to propose a new archive_mode setting to address a gap in WAL
>> archiving for high availability streaming replication configurations.
>>
>> In HA setups using streaming replication, standbys can be
>> promoted when the primary has failed, while some WAL segments are not yet
>> archived. This creates gaps in the WAL archive, breaking point-in-time
>> recovery:
>>
>> 1. Primary generates WAL, streams to standby
>> 2. Standby receives WAL, marks segments as .done immediately
>
> +1 to the idea.
> If I understand correctly, the assumption we're making is that the Standby
> doesn't really "archive" just makes it as .done, even though in theory
> it could do the same
> thing as the primary and avoid this issue. It would be wasted work if
> the primary and replica
> archives the same WAL and that's what we want to avoid?

Yes, I'd like to avoid the cost of archiving the same file many times, and the
cost of asking the storage whether a given file is archived.

>>
>> ## Implementation
>>
>> The patch adds two replication protocol messages:
>> - 'a' (PqReplMsg_ArchiveStatusQuery): standby → primary, sends (timeline, segno) pairs
>> - 'A' (PqReplMsg_ArchiveStatusResponse): primary → standby, responds with archived pairs
>>
>
> I might be missing something but isn't it enough for the writer to
> send the last_archived_wal
> in PgStat_ArchiverStats? That way we can avoid doing the full
> directory scan of archive_status.
> Or do we not feel comfortable assuming that WAL files are archived in order?

AFAIU the archiver archives in the order it reads the archive_status
directory, i.e. random order in the worst case.
Anyway, we could send .done signals to the standby, but we cannot be sure the
standby already has the WAL for which we are telling it to skip archiving...
And the standby might have these WAL segments from the archive already, thus
not needing a .done file at all.

So, I implemented a basic design that works for the worst case. We can add
some heuristics on top, but they must be negligibly cheap in any possible
archiving scenario.


> On 27 Oct 2025, at 10:26, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Fri, Oct 24, 2025 at 1:25 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>>
>> Hi hackers,
>>
>> I'd like to propose a new archive_mode setting to address a gap in WAL
>> archiving for high availability streaming replication configurations.
>>
>> ## Problem
>>
>> In HA setups using streaming replication, standbys can be
>> promoted when the primary has failed, while some WAL segments are not yet
>> archived. This creates gaps in the WAL archive, breaking point-in-time
>> recovery:
>>
>> 1. Primary generates WAL, streams to standby
>> 2. Standby receives WAL, marks segments as .done immediately
>> 3. Standby deletes WAL during checkpoints
>> 4. Primary hasn't archived yet (archiver lag, network issues, etc.)
>> 5. Primary vanishes
>> 6. Standby gets promoted
>> 7. WAL history lost from archive
>>
>> This is particularly problematic in synchronous replication where
>> promotion might happen while the primary is still catching up on archival.
>>
>> A promoted standby might have some WAL segments from the walreceiver and some
>> from the archive. In this case we need to archive only those segments which
>> were received but not confirmed as archived by the primary.
>>
>> ## Proposed Solution
>>
>> Add archive_mode=follow_primary, where standbys defer WAL deletion until
>> the primary confirms archival:
>
> Can't we achieve nearly the same behavior by setting archive_mode to
> always and configuring archive_command on the standby to check
> whether the WAL file already exists in the shared archive area
> (e.g., test -f <archive directory>/%f (probably also the WAL file size
> should be checked))? In this setup, archive_command would fail
> until the WAL file appears in the archive, preventing the standby
> from removing it while the command is failing.

Many storage systems charge per request. If the archive tool issues a HEAD
request to S3, it might cost the user some money. Other storages cap request
frequency at some RPS. In the worst case we might affect the archiving
capabilities of the primary.

The key idea here is that the archive storage might be a disaster recovery
system that is optimized for storing data, but not for listing that data
frequently. So the cluster should not delegate the archive_status function to
some distant storage if it can be cheaply tracked within the HA cluster
internally.

Thanks for your interest!


Best regards, Andrey Borodin.




Hi,

On Fri, Oct 31, 2025 at 11:14 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> AFAIU the archiver archives in the order it reads the archive_status directory, i.e. random order in the worst case.
>

My understanding is the archiver uses a heap to build the batch of
files that will be archived, to avoid scanning the directory
every time. [0] The comparison is by name, so it would only contain the
oldest WAL segments, in order [1].

> Anyway, we could send .done signals to the standby, but we cannot be sure the
> standby already has the WAL for which we are telling it to skip archiving...
> And the standby might have these WAL segments from the archive already, thus
> not needing a .done file at all.
>
> So, I implemented a basic design that works for the worst case. We can add
> some heuristics on top, but they must be negligibly cheap in any possible
> archiving scenario.
>

I was thinking that, at a high level, pgarch.c just keeps the latest WAL
segment archived by the writer. Then, every time before it attempts to
archive a segment in pgarch_archiveXlog, it just checks whether the xlog
is < lastArchivedSegmentOnWriter. If it is earlier than the writer's
archived segment, return true / skip the segment. It wouldn't matter if
the archived segment on the writer is ahead of what has been streamed to
the standby, because the standby archiver would only do comparisons against
what it has locally.

If the writer has archived WAL 10, it should be safe for the standby to skip
WAL 1-9. This way we don't need to stream every .done file from the writer to
the standby, because we can rely on the fact that the segments are archived in
order.
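
Roughly, in the standby's archiver (a sketch; lastArchivedSegnoOnWriter is a
hypothetical value taken from the writer's feedback):

    XLogSegNo   this_segno;
    TimeLineID  this_tli;

    XLogFromFileName(xlog, &this_tli, &this_segno, wal_segment_size);
    if (this_segno < lastArchivedSegnoOnWriter)
        return true;    /* writer archives in order, so this segment is
                         * already covered; the caller marks it .done */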

[0] https://github.com/postgres/postgres/blob/master/src/backend/postmaster/pgarch.c#L739-L742
[1]  https://github.com/postgres/postgres/blob/master/src/backend/postmaster/pgarch.c#L792-L797

Thanks,
--
John Hsu - Amazon Web Services



On Sat, Nov 1, 2025 at 3:14 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Many storage systems charge per request. If the archive tool issues a HEAD
> request to S3, it might cost the user some money. Other storages cap request
> frequency at some RPS. In the worst case we might affect the archiving
> capabilities of the primary.
>
> The key idea here is that the archive storage might be a disaster recovery
> system that is optimized for storing data, but not for listing that data
> frequently. So the cluster should not delegate the archive_status function to
> some distant storage if it can be cheaply tracked within the HA cluster
> internally.

Just an idea: as another approach, we could check whether the specified WAL
file has already been archived by querying pg_stat_archiver on the primary,
instead of sending a request to the storage service. So it seems we could set
the standby's archive_command to a script that performs this check via
pg_stat_archiver to achieve the same goal. Thoughts?
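
A sketch of that setup (the helper script is hypothetical):

    archive_mode = 'always'
    # exit 0 only once the primary's pg_stat_archiver reports %f, or a later
    # segment, as last_archived_wal
    archive_command = '/usr/local/bin/check-archived-on-primary %f'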

Regards,

--
Fujii Masao



> On 23 Oct 2025, at 21:25, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi hackers,
>
> I'd like to propose a new archive_mode setting to address a gap in WAL
> archiving for high availability streaming replication configurations.
>
> Best regards, Andrey Borodin.
> <v1-0001-Add-archive_mode-follow_primary-to-prevent-WAL-lo.patch>

Hi!

As discussed offline: there is one small part of your patch to improve - make
the walsender send an answer for every requested segment, not only for the
first 64 (the walreceiver still limits the segment count to 64).

Also I noticed some improvement possibilities in the tests:

 - the polling logic for several conditions is repeated across the test archive_follow_primary.pl;
it seems like we need some general polling functions in PostgreSQL::Test::Utils?
I added poll_until and poll_cmd_until functions for that, and use them now in the test
(and also in poll_query_until, which was my inspiration)

 - there were some places that check invariants that are always true:
done_files_appeared, standby2_done_found and standby3_done_found were checked for >= 0,
but their initial value was 0; ready_files_found was polled but not verified; and
ready_count_after <= ready_count_before was not checked correctly (in case polling was
stopped by a timeout). I replaced all of them with stricter checks - is that right,
or did I miss some points?

--
Best regards,
Roman Khapov



Attachments