Re: Assertion failure in SnapBuildInitialSnapshot()
| От | Pradeep Kumar |
|---|---|
| Тема | Re: Assertion failure in SnapBuildInitialSnapshot() |
| Дата | |
| Msg-id | CAJ4xhPn8QMojKo-+5c_66YL_dZegL+Z9sQUX5UjV2sd58do2UQ@mail.gmail.com обсуждение исходный текст |
| Ответ на | Assertion failure in SnapBuildInitialSnapshot() (Amit Kapila <amit.kapila16@gmail.com>) |
| Список | pgsql-hackers |
Hi All,
In this thread they proposed fix_concurrent_slot_xmin_update.patch will solve this assert failure. After applying this patch I execute pg_sync_replication_slots() (which calls SyncReplicationSlots → synchronize_slots() → synchronize_one_slot() → ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in ReplicationSlotsComputeRequiredXmin() because the ReplicationSlotControlLock is not held in that code path. By default sync_replication_slots is off, so the background slot-sync worker is not spawned; invoking the UDF directly exercises the path without the lock. I have a small patch that acquires ReplicationSlotControlLock in the manual sync path; that stops the assert.
Call Stack :
TRAP: failed Assert("!already_locked || (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) && LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE))"), File: "slot.
c", Line: 1061, PID: 67056
0 postgres 0x000000010104aad4 ExceptionalCondition + 216
1 postgres 0x0000000100d8718c ReplicationSlotsComputeRequiredXmin + 180
2 postgres 0x0000000100d6fba8 synchronize_one_slot + 1488
3 postgres 0x0000000100d6e8cc synchronize_slots + 1480
4 postgres 0x0000000100d6efe4 SyncReplicationSlots + 164
5 postgres 0x0000000100d8da84 pg_sync_replication_slots + 476
6 postgres 0x0000000100b34c58 ExecInterpExpr + 2388
7 postgres 0x0000000100b33ee8 ExecInterpExprStillValid + 76
8 postgres 0x00000001008acd5c ExecEvalExprSwitchContext + 64
9 postgres 0x0000000100b54d48 ExecProject + 76
10 postgres 0x0000000100b925d4 ExecResult + 312
11 postgres 0x0000000100b5083c ExecProcNodeFirst + 92
12 postgres 0x0000000100b48b88 ExecProcNode + 60
13 postgres 0x0000000100b44410 ExecutePlan + 184
14 postgres 0x0000000100b442dc standard_ExecutorRun + 644
15 postgres 0x0000000100b44048 ExecutorRun + 104
16 postgres 0x0000000100e3053c PortalRunSelect + 308
17 postgres 0x0000000100e2ff40 PortalRun + 736
18 postgres 0x0000000100e2b21c exec_simple_query + 1368
19 postgres 0x0000000100e2a42c PostgresMain + 2508
20 postgres 0x0000000100e22ce4 BackendInitialize + 0
21 postgres 0x0000000100d1fd4c postmaster_child_launch + 304
22 postgres 0x0000000100d26d9c BackendStartup + 448
23 postgres 0x0000000100d23f18 ServerLoop + 372
24 postgres 0x0000000100d22f18 PostmasterMain + 6396
25 postgres 0x0000000100bcffd4 init_locale + 0
26 dyld 0x0000000186d82b98 start + 6076
The assert is raised inside ReplicationSlotsComputeRequiredXmin() because that function expects either that already_locked is false (and it will acquire what it needs), or that callers already hold both ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In the manual-sync path called by the UDF, neither lock is held, so the assertion trips.
Why this happens:
The background slot sync worker (spawned when sync_replication_slots = on) acquires the necessary locks before calling the routines that update/compute slot xmins, so the worker path is safe.The manual path through the SQL-callable UDF does not take the same locks before calling synchronize_slots()/synchronize_one_slot(). As a result the invariant assumed by ReplicationSlotsComputeRequiredXmin() can be violated, leading to the assert.
Proposed fix:
In synchronize_slots() (the code path used by SyncReplicationSlots()/pg_sync_replication_slots()), acquire ReplicationSlotControlLock before any call that can end up calling ReplicationSlotsComputeRequiredXmin(true).
Please refer the attached updated patch.
In this thread they proposed fix_concurrent_slot_xmin_update.patch will solve this assert failure. After applying this patch I execute pg_sync_replication_slots() (which calls SyncReplicationSlots → synchronize_slots() → synchronize_one_slot() → ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in ReplicationSlotsComputeRequiredXmin() because the ReplicationSlotControlLock is not held in that code path. By default sync_replication_slots is off, so the background slot-sync worker is not spawned; invoking the UDF directly exercises the path without the lock. I have a small patch that acquires ReplicationSlotControlLock in the manual sync path; that stops the assert.
Call Stack :
TRAP: failed Assert("!already_locked || (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) && LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE))"), File: "slot.
c", Line: 1061, PID: 67056
0 postgres 0x000000010104aad4 ExceptionalCondition + 216
1 postgres 0x0000000100d8718c ReplicationSlotsComputeRequiredXmin + 180
2 postgres 0x0000000100d6fba8 synchronize_one_slot + 1488
3 postgres 0x0000000100d6e8cc synchronize_slots + 1480
4 postgres 0x0000000100d6efe4 SyncReplicationSlots + 164
5 postgres 0x0000000100d8da84 pg_sync_replication_slots + 476
6 postgres 0x0000000100b34c58 ExecInterpExpr + 2388
7 postgres 0x0000000100b33ee8 ExecInterpExprStillValid + 76
8 postgres 0x00000001008acd5c ExecEvalExprSwitchContext + 64
9 postgres 0x0000000100b54d48 ExecProject + 76
10 postgres 0x0000000100b925d4 ExecResult + 312
11 postgres 0x0000000100b5083c ExecProcNodeFirst + 92
12 postgres 0x0000000100b48b88 ExecProcNode + 60
13 postgres 0x0000000100b44410 ExecutePlan + 184
14 postgres 0x0000000100b442dc standard_ExecutorRun + 644
15 postgres 0x0000000100b44048 ExecutorRun + 104
16 postgres 0x0000000100e3053c PortalRunSelect + 308
17 postgres 0x0000000100e2ff40 PortalRun + 736
18 postgres 0x0000000100e2b21c exec_simple_query + 1368
19 postgres 0x0000000100e2a42c PostgresMain + 2508
20 postgres 0x0000000100e22ce4 BackendInitialize + 0
21 postgres 0x0000000100d1fd4c postmaster_child_launch + 304
22 postgres 0x0000000100d26d9c BackendStartup + 448
23 postgres 0x0000000100d23f18 ServerLoop + 372
24 postgres 0x0000000100d22f18 PostmasterMain + 6396
25 postgres 0x0000000100bcffd4 init_locale + 0
26 dyld 0x0000000186d82b98 start + 6076
The assert is raised inside ReplicationSlotsComputeRequiredXmin() because that function expects either that already_locked is false (and it will acquire what it needs), or that callers already hold both ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In the manual-sync path called by the UDF, neither lock is held, so the assertion trips.
Why this happens:
The background slot sync worker (spawned when sync_replication_slots = on) acquires the necessary locks before calling the routines that update/compute slot xmins, so the worker path is safe.The manual path through the SQL-callable UDF does not take the same locks before calling synchronize_slots()/synchronize_one_slot(). As a result the invariant assumed by ReplicationSlotsComputeRequiredXmin() can be violated, leading to the assert.
Proposed fix:
In synchronize_slots() (the code path used by SyncReplicationSlots()/pg_sync_replication_slots()), acquire ReplicationSlotControlLock before any call that can end up calling ReplicationSlotsComputeRequiredXmin(true).
Please refer the attached updated patch.
Вложения
В списке pgsql-hackers по дате отправления: