Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C
Subject: Re: Deadlock between logrep apply worker and tablesync worker
Msg-id: CALDaNm19uLVJAqrQpaKDKXqJ6HCOg-vTQznPeKmiPtu_FrDKBw@mail.gmail.com
In reply to: RE: Deadlock between logrep apply worker and tablesync worker ("houzj.fnst@fujitsu.com" <houzj.fnst@fujitsu.com>)
List: pgsql-hackers
On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > One thing that looks a bit odd is that we will anyway have a similar
> > > > check in replorigin_drop_guts() which is a static function and
> > > > called from only one place, so, will it be required to check at both places?
> > >
> > > There is a possibility that the initial check to verify if replication
> > > origin exists in replorigin_drop_by_name was successful but later one
> > > of either table sync worker or apply worker process might have dropped
> > > the replication origin,
> > >
> >
> > Won't locking on the particular origin prevent concurrent drops? IIUC, the
> > drop happens after the patch acquires the lock on the origin.
>
> Yes, I think the existence check in replorigin_drop_guts is unnecessary as we
> already lock the origin before that. I think the check in replorigin_drop_guts
> is a custom check after calling SearchSysCache1 to get the tuple, but the error
> should not happen as no concurrent drop can be performed.

This scenario is possible while creating a subscription. The apply
worker will try to drop the replication origin if the relation state is
SUBREL_STATE_SYNCDONE. The table sync worker sets the state to
SUBREL_STATE_SYNCDONE and updates the relation state before calling
replorigin_drop_by_name. Since the table sync worker has committed that
transaction, the state is visible to the apply worker, which will then
try to drop the replication origin in parallel.
There is a race condition here: one of the two processes (table sync
worker or apply worker) will acquire the lock and drop the replication
origin, and the other will get the lock only after the first process
has dropped the origin and committed its transaction. Once the second
process acquires the lock, it will also try to drop the replication
origin and get this error (from replorigin_drop_guts): cache lookup
failed for replication origin with ID.
So a concurrent drop is possible in this case.
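
To make the interleaving concrete, here is a toy simulation in Python. It
is only a sketch and not PostgreSQL code: the OriginCatalog class, the
origin name, and the result strings are all made up for illustration, and
a plain mutex stands in for the lock on the origin. A barrier forces both
workers past the initial existence check before either one takes the
lock, which is the ordering described above.

```python
import threading


class OriginCatalog:
    """Toy stand-in for the replication origin catalog (hypothetical)."""

    def __init__(self, names):
        self._origins = set(names)
        self._lock = threading.Lock()  # stands in for the lock on the origin

    def drop_by_name(self, name, barrier, results, who):
        # Step 1: existence check made before the lock is taken.
        found = name in self._origins
        # Hold both workers here so that each has passed the check
        # before either one acquires the lock.
        barrier.wait()
        if not found:
            results[who] = "skipped"
            return
        # Step 2: take the lock, then look the origin up again, the way a
        # second lookup inside the drop routine would.
        with self._lock:
            if name not in self._origins:
                # The other worker dropped it while we waited for the lock.
                results[who] = "cache lookup failed"
            else:
                self._origins.remove(name)
                results[who] = "dropped"


def run_race():
    name = "pg_16394_16385"  # hypothetical origin name
    catalog = OriginCatalog({name})
    barrier = threading.Barrier(2)
    results = {}
    workers = [
        threading.Thread(
            target=catalog.drop_by_name, args=(name, barrier, results, who)
        )
        for who in ("tablesync", "apply")
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return results


if __name__ == "__main__":
    print(run_race())
```

Which worker wins the lock varies from run to run, but one of them always
drops the origin and the other always finds it gone after acquiring the
lock. Locking alone serializes the two drops; it does not make the second
lookup succeed.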

Regards,
Vignesh


