Re: Conflict detection for update_deleted in logical replication

Поиск
Список
Период
Сортировка
От shveta malik
Тема Re: Conflict detection for update_deleted in logical replication
Дата
Msg-id CAJpy0uDm1+JP=SBkuSp=513cK6BJYZn5Yd_00aLLxpHGvdGEtQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Conflict detection for update_deleted in logical replication  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On Mon, Sep 8, 2025 at 3:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Coverity is not happy with commit a850be2fe:
>
> /srv/coverity/git/pgsql-git/postgresql/src/backend/replication/logical/worker.c: 3276             in
FindDeletedTupleInLocalRel()
> 3270                     * maybe_advance_nonremovable_xid() for details).
> 3271                     */
> 3272                    LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
> 3273                    leader = logicalrep_worker_find(MyLogicalRepWorker->subid,
> 3274                                                                                    InvalidOid, false);
> 3275
> >>>     CID 1665367:         Null pointer dereferences  (NULL_RETURNS)
> >>>     Dereferencing a pointer that might be "NULL" "&leader->relmutex" when calling "tas".
> 3276                    SpinLockAcquire(&leader->relmutex);
> 3277                    oldestxmin = leader->oldest_nonremovable_xid;
> 3278                    SpinLockRelease(&leader->relmutex);
> 3279                    LWLockRelease(LogicalRepWorkerLock);
> 3280            }
>
> I think Coverity has a point.  AFAICS every other call of
> logicalrep_worker_find() guards against a NULL result,
> so why is it okay for this one to dump core on NULL?
>

Thanks for pointing it out. It was a miss.
I attempted to reproduce a SIGSEGV in this flow. It appears that a
SIGSEGV can occur when the tablesync worker is catching up and is in
FindDeletedTupleInLocalRel() and meanwhile drop-subscription is done
in another session. Here’s the sequence that triggers the issue:

1) Pause the tablesync worker while it's in FindDeletedTupleInLocalRel().
2) Issue a 'DROP SUBSCRIPTION sub'.
3) Allow DropSubscription to proceed to logicalrep_worker_stop() for
the apply worker, but block it using the debugger before it attempts
to stop the tablesync worker.
4) Simultaneously, hold the launcher process using the debugger before
it can restart the apply worker.
5) Now, resume the tablesync worker. It ends up with a NULL leader
worker and hits a SIGSEGV.

Since this issue can be reliably reproduced with a simple DROP
SUBSCRIPTION, I thought it would be appropriate to add the new error
as a user-facing error.

Additionally, the issue can also be reproduced  if the apply worker is
forcefully made to error out in wait_for_relation_state_change() while
the tablesync worker is applying changes and is in
FindDeletedTupleInLocalRel().

Attached a patch to address the issue.

thanks
Shveta

Вложения

В списке pgsql-hackers по дате отправления: