Re: Failure in subscription test 004_sync.pl

From Masahiko Sawada
Subject Re: Failure in subscription test 004_sync.pl
Date
Msg-id CAD21AoBWqUPeXbZviHWZhdZVMP3vop_k43ipzFwVa=emE7M3Og@mail.gmail.com
In response to Re: Failure in subscription test 004_sync.pl  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Failure in subscription test 004_sync.pl  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Failure in subscription test 004_sync.pl  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
On Sat, Jun 12, 2021 at 9:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > wrasse has just failed with what looks like a timing error with a
> > replication slot drop:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30
> >
> > Here is the error:
> > error running SQL: 'psql:<stdin>:1: ERROR:  could not drop replication
> > slot "tap_sub" on publisher: ERROR:  replication slot "tap_sub" is
> > active for PID 1641'
> >
> > It seems to me that this just lacks a poll_query_until() doing some
> > slot monitoring?
> >
>
> I think it is showing a race condition issue in the code. In
> DropSubscription, we first stop the worker that is receiving the WAL,
> and then in a separate connection with the publisher, it tries to drop
> the slot which leads to this error. The reason is that walsender is
> still active as we just wait for wal receiver (or apply worker) to
> stop. Normally, as soon as the apply worker is stopped the walsender
> detects it and exits but in this case, it took some time to exit, and
> in the meantime, we tried to drop the slot which is still in use by
> walsender.

That might be possible.

That's weird, though, since DROP SUBSCRIPTION executes the
DROP_REPLICATION_SLOT command with the WAIT option. I found a bug that
is possibly an oversight in commit 1632ea4368. The commit changed the
code around the error as follows:

    if (active_pid != MyProcPid)
    {
-       if (behavior == SAB_Error)
+       if (!nowait)
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_IN_USE),
                     errmsg("replication slot \"%s\" is active for PID %d",
                            NameStr(s->data.name), active_pid)));
-       else if (behavior == SAB_Inquire)
-           return active_pid;

        /* Wait here until we get signaled, and then restart */
        ConditionVariableSleep(&s->active_cv,

The condition should be the opposite; we should raise the error when
'nowait' is true. I think this is the cause of the test failure. Even
if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we
don't wait but raise the error.
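
To make the inversion easier to see, here is a minimal standalone sketch
of the decision, not the actual slot acquisition code. The enum and the
functions decide_buggy() and decide_fixed() are hypothetical names made
up for this illustration; only the 'nowait' flag and the comparison of
active_pid against our own PID come from the snippet above.

```c
#include <stdbool.h>

/* Hypothetical outcome codes for this sketch; not PostgreSQL identifiers. */
typedef enum
{
    SLOT_ACQUIRED,          /* slot is ours, nothing to wait for */
    SLOT_RAISE_ERROR,       /* report "replication slot is active for PID" */
    SLOT_WAIT_AND_RETRY     /* sleep on the condition variable, then retry */
} SlotDecision;

/* Decision as the code reads after the commit: the test is inverted. */
SlotDecision
decide_buggy(int active_pid, int my_pid, bool nowait)
{
    if (active_pid != my_pid)
    {
        if (!nowait)
            return SLOT_RAISE_ERROR;    /* errors although the caller asked to wait */
        return SLOT_WAIT_AND_RETRY;
    }
    return SLOT_ACQUIRED;
}

/* Decision with the condition corrected: error only when nowait is true. */
SlotDecision
decide_fixed(int active_pid, int my_pid, bool nowait)
{
    if (active_pid != my_pid)
    {
        if (nowait)
            return SLOT_RAISE_ERROR;
        return SLOT_WAIT_AND_RETRY;
    }
    return SLOT_ACQUIRED;
}
```

With the slot held by another process (say PID 1641, as in the buildfarm
log) and nowait = false, which is what the WAIT option amounts to,
decide_buggy() returns SLOT_RAISE_ERROR while decide_fixed() returns
SLOT_WAIT_AND_RETRY.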

Attached is a small patch that fixes it.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
