On Sat, Jun 12, 2021 at 9:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > wrasse has just failed with what looks like a timing error with a
> > replication slot drop:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30
> >
> > Here is the error:
> > error running SQL: 'psql:<stdin>:1: ERROR: could not drop replication
> > slot "tap_sub" on publisher: ERROR: replication slot "tap_sub" is
> > active for PID 1641'
> >
> > It seems to me that this just lacks a poll_query_until() doing some
> > slot monitoring?
> >
>
> I think it is showing a race condition issue in the code. In
> DropSubscription, we first stop the worker that is receiving the WAL,
> and then in a separate connection with the publisher, it tries to drop
> the slot which leads to this error. The reason is that walsender is
> still active as we just wait for wal receiver (or apply worker) to
> stop. Normally, as soon as the apply worker is stopped the walsender
> detects it and exits but in this case, it took some time to exit, and
> in the meantime, we tried to drop the slot which is still in use by
> walsender.

That might be possible.

That's weird, though, since DROP SUBSCRIPTION executes the
DROP_REPLICATION_SLOT command with the WAIT option. I found a bug that
is possibly an oversight in commit 1632ea4368. The commit changed the
code around the error as follows:

        if (active_pid != MyProcPid)
        {
-           if (behavior == SAB_Error)
+           if (!nowait)
                ereport(ERROR,
                        (errcode(ERRCODE_OBJECT_IN_USE),
                         errmsg("replication slot \"%s\" is active for PID %d",
                                NameStr(s->data.name), active_pid)));
-           else if (behavior == SAB_Inquire)
-               return active_pid;
            /* Wait here until we get signaled, and then restart */
            ConditionVariableSleep(&s->active_cv,

The condition should be the opposite; we should raise the error when
'nowait' is true. I think this is the cause of the test failure. Even
if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we
don't wait but raise the error.
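
To make the intended behavior concrete, here is a minimal standalone
sketch of the decision the corrected condition should produce. It is
not the PostgreSQL code and not the attached patch: the AcquireOutcome
enum and the acquire_outcome() function are hypothetical names used
only for illustration, and the check is reduced to the two inputs
visible in the hunk above (the PID comparison and 'nowait').

```c
#include <stdbool.h>

/* Hypothetical outcome type, for illustration only; not a PostgreSQL type. */
typedef enum
{
    ACQUIRE_OK,     /* slot is not held by another process */
    ACQUIRE_ERROR,  /* caller asked not to wait: report "slot is active" */
    ACQUIRE_WAIT    /* caller is willing to wait: sleep, then retry */
} AcquireOutcome;

/*
 * Model of the corrected check: raise the error only when the slot is
 * held by another process AND the caller passed nowait = true.
 * my_pid stands in for MyProcPid.
 */
static AcquireOutcome
acquire_outcome(int active_pid, int my_pid, bool nowait)
{
    if (active_pid != my_pid)
    {
        if (nowait)
            return ACQUIRE_ERROR;
        return ACQUIRE_WAIT;
    }
    return ACQUIRE_OK;
}
```

With this logic, a drop requested with the WAIT option (nowait = false)
while the walsender still holds the slot yields ACQUIRE_WAIT rather
than the error seen on wrasse.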

Attached is a small patch that fixes it.

Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/