Re: Conflict detection for update_deleted in logical replication
From | shveta malik |
---|---|
Subject | Re: Conflict detection for update_deleted in logical replication |
Date | |
Msg-id | CAJpy0uDm1+JP=SBkuSp=513cK6BJYZn5Yd_00aLLxpHGvdGEtQ@mail.gmail.com |
In response to | Re: Conflict detection for update_deleted in logical replication (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
On Mon, Sep 8, 2025 at 3:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Coverity is not happy with commit a850be2fe:
>
> /srv/coverity/git/pgsql-git/postgresql/src/backend/replication/logical/worker.c: 3276 in FindDeletedTupleInLocalRel()
> 3270              * maybe_advance_nonremovable_xid() for details).
> 3271              */
> 3272             LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> 3273             leader = logicalrep_worker_find(MyLogicalRepWorker->subid,
> 3274                                             InvalidOid, false);
> 3275
> >>>     CID 1665367:         Null pointer dereferences  (NULL_RETURNS)
> >>>     Dereferencing a pointer that might be "NULL" "&leader->relmutex" when calling "tas".
> 3276             SpinLockAcquire(&leader->relmutex);
> 3277             oldestxmin = leader->oldest_nonremovable_xid;
> 3278             SpinLockRelease(&leader->relmutex);
> 3279             LWLockRelease(LogicalRepWorkerLock);
> 3280         }
>
> I think Coverity has a point.  AFAICS every other call of
> logicalrep_worker_find() guards against a NULL result,
> so why is it okay for this one to dump core on NULL?
>

Thanks for pointing it out. It was an oversight.

I attempted to reproduce a SIGSEGV in this flow. It appears that a
SIGSEGV can occur when the tablesync worker is catching up and is in
FindDeletedTupleInLocalRel() while a DROP SUBSCRIPTION is executed in
another session.

Here's the sequence that triggers the issue:

1) Pause the tablesync worker while it's in FindDeletedTupleInLocalRel().
2) Issue a 'DROP SUBSCRIPTION sub'.
3) Allow DropSubscription to proceed to logicalrep_worker_stop() for
   the apply worker, but block it using the debugger before it attempts
   to stop the tablesync worker.
4) Simultaneously, hold the launcher process using the debugger before
   it can restart the apply worker.
5) Now, resume the tablesync worker. logicalrep_worker_find() returns
   NULL for the leader worker, and the tablesync worker hits a SIGSEGV.

Since this issue can be reliably reproduced with a simple DROP
SUBSCRIPTION, I thought it would be appropriate to add the new error as
a user-facing error.
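For illustration only, the kind of guard being discussed can be sketched as a stand-alone C program. This is not the attached patch: MockWorker, mock_worker_find(), get_leader_oldest_xid() and the message text are hypothetical stand-ins, the locking is omitted, and in the server the failure path would presumably be reported with ereport(ERROR, ...) rather than a return code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/* Hypothetical stand-in for LogicalRepWorker; only the fields used here. */
typedef struct MockWorker
{
	int			in_use;
	unsigned int subid;
	TransactionId oldest_nonremovable_xid;
} MockWorker;

#define MAX_WORKERS 4
static MockWorker workers[MAX_WORKERS];

/*
 * Hypothetical mock of logicalrep_worker_find(): returns NULL when no
 * leader apply worker is registered for the subscription.
 */
static MockWorker *
mock_worker_find(unsigned int subid)
{
	for (int i = 0; i < MAX_WORKERS; i++)
	{
		if (workers[i].in_use && workers[i].subid == subid)
			return &workers[i];
	}
	return NULL;
}

/*
 * Fetch the leader's oldest_nonremovable_xid.  Returns 0 on success and
 * -1 if the leader has already exited, instead of dereferencing NULL.
 */
static int
get_leader_oldest_xid(unsigned int subid, TransactionId *out)
{
	MockWorker *leader;

	assert(out != NULL);

	leader = mock_worker_find(subid);
	if (leader == NULL)
	{
		/* Assumed wording; the real message is defined by the patch. */
		fprintf(stderr, "leader apply worker for subscription %u has exited\n",
				subid);
		return -1;
	}

	*out = leader->oldest_nonremovable_xid;
	return 0;
}
```

The point of the sketch is only that the lookup result is checked before any field of the leader is touched, which is what the other callers of logicalrep_worker_find() already do.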
The issue can also be reproduced if the apply worker is forced to error
out in wait_for_relation_state_change() while the tablesync worker is
applying changes and is in FindDeletedTupleInLocalRel().

Attached is a patch to address the issue.

thanks
Shveta