[BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation

From: Joao Foltran
Subject: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation
Msg-id: CAF8B20CZaF+D81vp1fSd=38YU-i=1RsrLFbm_uucXCH-dOJgWg@mail.gmail.com
Responses: RE: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation
List: pgsql-hackers

Hi hackers,

I'd like to report a regression in PostgreSQL 18 regarding physical
replication slot invalidation and propose a fix.

This is my first contribution of any kind, so please let me know if I
did anything incorrectly and I'll fix it ASAP.

It's also my first time writing code for the PostgreSQL project, so if
the logic or anything else I used is incorrect, let me know.

CCing Amit, since he committed f41d8468 and 8709dcc.

## Problem

Commit f41d8468 introduced an ERROR when trying to acquire an invalidated
replication slot. While this is correct for logical replication slots
(which cannot safely recover after invalidation), it breaks recovery
for physical replication slots.

Later, commit 8709dcc improved upon this code to prevent a race
condition and moved the check to after the slot was already acquired.

In PostgreSQL 17 and earlier, when a physical replication slot was
invalidated due to max_slot_wal_keep_size, the slot could still be
reacquired if the required WAL became available through restore_command
or archive recovery on the standby. This is a common operational scenario:

- Temporary network issues
- Planned maintenance windows
- Standbys temporarily falling behind

After commit f41d8468, physical replication slots cannot be reacquired
once invalidated, even when the required WAL is available via archive
recovery. The standby remains stuck recovering from archive and cannot
resume streaming replication without manual intervention (recreating the slot).

## Reproduction

1. Set up primary with physical replication slot and small
max_slot_wal_keep_size
2. Configure standby with restore_command for archive recovery
3. Stop standby and generate enough WAL on primary to invalidate the slot
4. Restart standby - it can access WAL from archive but gets:
   "FATAL: can no longer access replication slot"

In PG17, streaming would resume. In PG18, it stays permanently broken.

## Proposed Fix

The attached patch differentiates between logical and physical slots in
ReplicationSlotAcquire():

- Logical slots: Raise ERROR as before (cannot safely recover)
- Physical slots: Log a warning but allow acquisition (can recover)

This restores the PG17 behavior for physical slots while maintaining
safety guarantees for logical slots.
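
To make the intended decision easier to review, here is a minimal
standalone model of it. This is only a sketch, not the patch itself: the
names ModelSlot, SlotKind, AcquireResult and model_slot_acquire are
hypothetical and do not exist in the PostgreSQL tree, and the real
ReplicationSlotAcquire() does much more (locking, ownership checks, the
actual ereport calls).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the slot state; not PostgreSQL's structs. */
typedef enum { SLOT_PHYSICAL, SLOT_LOGICAL } SlotKind;

typedef enum {
    ACQUIRE_OK,              /* slot is valid, acquire normally */
    ACQUIRE_OK_WITH_WARNING, /* invalidated physical slot: warn, allow */
    ACQUIRE_ERROR            /* invalidated logical slot: refuse */
} AcquireResult;

typedef struct {
    const char *name;
    SlotKind    kind;
    bool        invalidated;
} ModelSlot;

/*
 * Model of the proposed check: only an invalidated *logical* slot is
 * rejected; an invalidated physical slot can still be acquired.
 */
static AcquireResult
model_slot_acquire(const ModelSlot *slot)
{
    assert(slot != NULL);

    if (!slot->invalidated)
        return ACQUIRE_OK;

    if (slot->kind == SLOT_LOGICAL)
        return ACQUIRE_ERROR;

    return ACQUIRE_OK_WITH_WARNING;
}
```

In this model, an invalidated physical slot yields
ACQUIRE_OK_WITH_WARNING (the PG17-like outcome), while an invalidated
logical slot still yields ACQUIRE_ERROR as it does today.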

The patch includes a TAP test that:
- Demonstrates the issue
- Verifies the fix works
- Ensures physical slots can recover after invalidation

## Testing

Tested on both master and REL_18_STABLE:
- All existing regression tests pass
- New TAP test passes
- Manual testing confirms standbys can recover

## Backpatching

This should be backpatched to PostgreSQL 18 where the regression was
introduced.

Patches attached:
- v1-0001-Allow-physical-replication-slots-to-recover-after-invalidation.patch (for master)
- v1-0001-Allow-physical-replication-slots-to-recover-after-invalidation-pg18.patch (for REL_18_STABLE)

Best regards,

Joao Foltran
