Re: Assertion failure in SnapBuildInitialSnapshot()
| От | Masahiko Sawada |
|---|---|
| Тема | Re: Assertion failure in SnapBuildInitialSnapshot() |
| Дата | |
| Msg-id | CAD21AoDuka_sb+aLs7iUf0u+FHfw-3qzuXzfj6jWF8kjrG=88g@mail.gmail.com обсуждение исходный текст |
| Ответ на | Re: Assertion failure in SnapBuildInitialSnapshot() (Pradeep Kumar <spradeepkumar29@gmail.com>) |
| Ответы |
Re: Assertion failure in SnapBuildInitialSnapshot()
Re: Assertion failure in SnapBuildInitialSnapshot() |
| Список | pgsql-hackers |
On Mon, Oct 27, 2025 at 5:22 AM Pradeep Kumar <spradeepkumar29@gmail.com> wrote:
>
> Hi All,
> In this thread they proposed fix_concurrent_slot_xmin_update.patch will solve this assert failure. After applying
thispatch I execute pg_sync_replication_slots() (which calls SyncReplicationSlots → synchronize_slots() →
synchronize_one_slot()→ ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in
ReplicationSlotsComputeRequiredXmin()because the ReplicationSlotControlLock is not held in that code path. By default
sync_replication_slotsis off, so the background slot-sync worker is not spawned; invoking the UDF directly exercises
thepath without the lock. I have a small patch that acquires ReplicationSlotControlLock in the manual sync path; that
stopsthe assert.
>
> Call Stack :
> TRAP: failed Assert("!already_locked || (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) &&
LWLockHeldByMeInMode(ProcArrayLock,LW_EXCLUSIVE))"), File: "slot.
> c", Line: 1061, PID: 67056
> 0 postgres 0x000000010104aad4 ExceptionalCondition + 216
> 1 postgres 0x0000000100d8718c ReplicationSlotsComputeRequiredXmin + 180
> 2 postgres 0x0000000100d6fba8 synchronize_one_slot + 1488
> 3 postgres 0x0000000100d6e8cc synchronize_slots + 1480
> 4 postgres 0x0000000100d6efe4 SyncReplicationSlots + 164
> 5 postgres 0x0000000100d8da84 pg_sync_replication_slots + 476
> 6 postgres 0x0000000100b34c58 ExecInterpExpr + 2388
> 7 postgres 0x0000000100b33ee8 ExecInterpExprStillValid + 76
> 8 postgres 0x00000001008acd5c ExecEvalExprSwitchContext + 64
> 9 postgres 0x0000000100b54d48 ExecProject + 76
> 10 postgres 0x0000000100b925d4 ExecResult + 312
> 11 postgres 0x0000000100b5083c ExecProcNodeFirst + 92
> 12 postgres 0x0000000100b48b88 ExecProcNode + 60
> 13 postgres 0x0000000100b44410 ExecutePlan + 184
> 14 postgres 0x0000000100b442dc standard_ExecutorRun + 644
> 15 postgres 0x0000000100b44048 ExecutorRun + 104
> 16 postgres 0x0000000100e3053c PortalRunSelect + 308
> 17 postgres 0x0000000100e2ff40 PortalRun + 736
> 18 postgres 0x0000000100e2b21c exec_simple_query + 1368
> 19 postgres 0x0000000100e2a42c PostgresMain + 2508
> 20 postgres 0x0000000100e22ce4 BackendInitialize + 0
> 21 postgres 0x0000000100d1fd4c postmaster_child_launch + 304
> 22 postgres 0x0000000100d26d9c BackendStartup + 448
> 23 postgres 0x0000000100d23f18 ServerLoop + 372
> 24 postgres 0x0000000100d22f18 PostmasterMain + 6396
> 25 postgres 0x0000000100bcffd4 init_locale + 0
> 26 dyld 0x0000000186d82b98 start + 6076
>
> The assert is raised inside ReplicationSlotsComputeRequiredXmin() because that function expects either that
already_lockedis false (and it will acquire what it needs), or that callers already hold both
ReplicationSlotControlLock(exclusive) and ProcArrayLock (exclusive). In the manual-sync path called by the UDF, neither
lockis held, so the assertion trips.
>
> Why this happens:
> The background slot sync worker (spawned when sync_replication_slots = on) acquires the necessary locks before
callingthe routines that update/compute slot xmins, so the worker path is safe.The manual path through the SQL-callable
UDFdoes not take the same locks before calling synchronize_slots()/synchronize_one_slot(). As a result the invariant
assumedby ReplicationSlotsComputeRequiredXmin() can be violated, leading to the assert.
>
> Proposed fix:
> In synchronize_slots() (the code path used by SyncReplicationSlots()/pg_sync_replication_slots()), acquire
ReplicationSlotControlLockbefore any call that can end up calling ReplicationSlotsComputeRequiredXmin(true).
It would be great if we have a test case for this issue possibly using
injection points.
Also, I think it's worth considering the idea Robert shared before[1]:
---
But what about just surgically preventing that?
ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
perhaps? If it computes an older value than what's there, it just does
nothing?
---
We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582,
and it sounds reasonable to me that ProcArraySetReplicationSlotXmin()
refuses to retreat the values.
Regards,
[1] https://www.postgresql.org/message-id/CA%2BTgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ%40mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
В списке pgsql-hackers по дате отправления: