[BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation
| From | Joao Foltran |
|---|---|
| Subject | [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation |
| Date | |
| Msg-id | CAF8B20CZaF+D81vp1fSd=38YU-i=1RsrLFbm_uucXCH-dOJgWg@mail.gmail.com |
| Responses | RE: [BUG] [PATCH] Allow physical replication slots to recover from archive after invalidation |
| List | pgsql-hackers |
Hi hackers,

I'd like to report a regression in PostgreSQL 18 regarding physical replication slot invalidation and propose a fix.

This is my first contribution of any kind, so please let me know if I did anything incorrectly and I'll fix it ASAP. It's also my first time writing code for the PostgreSQL project, so let me know if the logic or anything else I used is incorrect.

CCing Amit, since he committed f41d8468 and 8709dcc.

## Problem

Commit f41d8468 introduced an ERROR when trying to acquire an invalidated replication slot. While this is correct for logical replication slots (which cannot safely recover after invalidation), it breaks recovery for physical replication slots. Later, commit 8709dcc improved on this code to prevent a race condition and moved the check to after the slot has been acquired.

In PostgreSQL 17 and earlier, when a physical replication slot was invalidated due to max_slot_wal_keep_size, the slot could still be reacquired if the required WAL became available through restore_command or archive recovery on the standby. This is a common operational scenario:

- Temporary network issues
- Planned maintenance windows
- Standbys temporarily falling behind

After commit f41d8468, physical replication slots cannot be reacquired once invalidated, even when the required WAL is available via archive recovery. The standby remains stuck recovering from archive and cannot resume streaming replication, which requires manual intervention (slot recreation).

## Reproduction

1. Set up primary with physical replication slot and small max_slot_wal_keep_size
2. Configure standby with restore_command for archive recovery
3. Stop standby and generate enough WAL on primary to invalidate the slot
4. Restart standby - it can access WAL from archive but gets: "FATAL: can no longer access replication slot"

In PG17, streaming would resume. In PG18, it stays permanently broken.

## Proposed Fix

The attached patch differentiates between logical and physical slots in ReplicationSlotAcquire():

- Logical slots: Raise ERROR as before (cannot safely recover)
- Physical slots: Log a warning but allow acquisition (can recover)

This restores the PG17 behavior for physical slots while maintaining safety guarantees for logical slots.

The patch includes a TAP test that:

- Demonstrates the issue
- Verifies the fix works
- Ensures physical slots can recover after invalidation

## Testing

Tested on both master and REL_18_STABLE:

- All existing regression tests pass
- New TAP test passes
- Manual testing confirms standbys can recover

## Backpatching

This should be backpatched to PostgreSQL 18 where the regression was introduced.

Patches attached:

- v1-0001-Allow-physical-replication-slots-to-recover-after-invalidation.patch (for master)
- v1-0001-Allow-physical-replication-slots-to-recover-after-invalidation-pg18.patch (for REL_18_STABLE)

Best regards,
Joao Foltran
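
Editorial sketch (not part of the original message or the attached patches): the acquisition rule described under "Proposed Fix" can be modeled in a few lines of standalone C. Every name here (`ToySlot`, `SlotKind`, `AcquireOutcome`, `toy_acquire_outcome`, `toy_acquire_outcome_pg18`) is hypothetical and chosen only for illustration; none of it is PostgreSQL's real slot structure or the real body of ReplicationSlotAcquire(). The second function models the PG18 behavior only as the message reports it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT PostgreSQL's real definitions. */
typedef enum { SLOT_PHYSICAL, SLOT_LOGICAL } SlotKind;

typedef enum {
    ACQUIRE_OK,            /* slot is healthy: acquire silently */
    ACQUIRE_WITH_WARNING,  /* invalidated physical slot: warn, still acquire */
    ACQUIRE_ERROR          /* acquisition refused */
} AcquireOutcome;

typedef struct {
    SlotKind kind;
    bool     invalidated;  /* e.g. invalidated via max_slot_wal_keep_size */
} ToySlot;

/* Rule described in the proposal: only invalidated logical slots are refused. */
AcquireOutcome toy_acquire_outcome(const ToySlot *slot)
{
    assert(slot != NULL);
    if (!slot->invalidated)
        return ACQUIRE_OK;
    return (slot->kind == SLOT_LOGICAL) ? ACQUIRE_ERROR : ACQUIRE_WITH_WARNING;
}

/* PG18 behavior as reported in the message: any invalidated slot is refused. */
AcquireOutcome toy_acquire_outcome_pg18(const ToySlot *slot)
{
    assert(slot != NULL);
    return slot->invalidated ? ACQUIRE_ERROR : ACQUIRE_OK;
}
```

For an invalidated physical slot, `toy_acquire_outcome` yields `ACQUIRE_WITH_WARNING` where `toy_acquire_outcome_pg18` yields `ACQUIRE_ERROR`; for logical slots and for healthy slots the two functions agree, which is the narrow behavior change the proposal describes.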