Обсуждение: Failure in subscription test 004_sync.pl

Поиск
Список
Период
Сортировка

Failure in subscription test 004_sync.pl

От
Michael Paquier
Дата:
Hi all,

wrasse has just failed with what looks like a timing error with a
replication slot drop:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30

Here is the error:
error running SQL: 'psql:<stdin>:1: ERROR:  could not drop replication
slot "tap_sub" on publisher: ERROR:  replication slot "tap_sub" is
active for PID 1641'

It seems to me that this just lacks a poll_query_until() doing some
slot monitoring?  I have not checked in details if we need to do that
in more places.  The code path that failed has been added in 7c4f524
from 2017.

Thanks,
--
Michael

Вложения

Re: Failure in subscription test 004_sync.pl

От
Amit Kapila
Дата:
On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> wrasse has just failed with what looks like a timing error with a
> replication slot drop:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30
>
> Here is the error:
> error running SQL: 'psql:<stdin>:1: ERROR:  could not drop replication
> slot "tap_sub" on publisher: ERROR:  replication slot "tap_sub" is
> active for PID 1641'
>
> It seems to me that this just lacks a poll_query_until() doing some
> slot monitoring?
>

I think it is showing a race condition issue in the code. In
DropSubscription, we first stop the worker that is receiving the WAL,
and then in a separate connection with the publisher, it tries to drop
the slot which leads to this error. The reason is that walsender is
still active as we just wait for wal receiver (or apply worker) to
stop. Normally, as soon as the apply worker is stopped the walsender
detects it and exits but in this case, it took some time to exit, and
in the meantime, we tried to drop the slot which is still in use by
walsender.

If we want to fix this, we might want to wait till the slot is active
on the publisher before trying to drop it but not sure if it is a good
idea. In the worst case, if the user retries this operation (Drop
Subscription), it will succeed.

-- 
With Regards,
Amit Kapila.



Re: Failure in subscription test 004_sync.pl

От
Tom Lane
Дата:
Amit Kapila <amit.kapila16@gmail.com> writes:
> On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <michael@paquier.xyz> wrote:
>> wrasse has just failed with what looks like a timing error with a
>> replication slot drop:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30

> If we want to fix this, we might want to wait till the slot is active
> on the publisher before trying to drop it but not sure if it is a good
> idea. In the worst case, if the user retries this operation (Drop
> Subscription), it will succeed.

wrasse's not the only animal reporting this type of failure.
See also

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=komodoensis&dt=2021-06-12%2011%3A32%3A04

error running SQL: 'psql:<stdin>:1: ERROR:  could not drop replication slot "pg_16387_sync_16384_6972886888894805957"
onpublisher: ERROR:  replication slot "pg_16387_sync_16384_6972886888894805957" is active for PID 2971625' 
while running 'psql -XAtq -d port=60321 host=/tmp/vdQmH7ijFI dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP
SUBSCRIPTIONtestsub2' at /home/bf/build/buildfarm-komodoensis/HEAD/pgsql.build/../pgsql/src/test/perl/PostgresNode.pm
line1771. 

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2021-06-11%2020%3A30%3A28

error running SQL: 'psql:<stdin>:1: ERROR:  could not drop replication slot "testsub2" on publisher: ERROR:
replicationslot "testsub2" is active for PID 27175' 
while running 'psql -XAtq -d port=59579 host=/tmp/9Qchjsykek dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'DROP
SUBSCRIPTIONtestsub2' at
/home/pgbf/buildroot/HEAD/pgsql.build/src/test/subscription/../../../src/test/perl/PostgresNode.pmline 1771. 

Those are both in the t/100_bugs.pl test script, but otherwise they
look like the exact same thing.

I don't think that it's optional to fix a problem that occurs as
often as three times in 10 days in the buildfarm.

            regards, tom lane



Re: Failure in subscription test 004_sync.pl

От
Masahiko Sawada
Дата:
On Sat, Jun 12, 2021 at 9:57 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jun 12, 2021 at 1:13 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > wrasse has just failed with what looks like a timing error with a
> > replication slot drop:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2021-06-12%2006%3A16%3A30
> >
> > Here is the error:
> > error running SQL: 'psql:<stdin>:1: ERROR:  could not drop replication
> > slot "tap_sub" on publisher: ERROR:  replication slot "tap_sub" is
> > active for PID 1641'
> >
> > It seems to me that this just lacks a poll_query_until() doing some
> > slot monitoring?
> >
>
> I think it is showing a race condition issue in the code. In
> DropSubscription, we first stop the worker that is receiving the WAL,
> and then in a separate connection with the publisher, it tries to drop
> the slot which leads to this error. The reason is that walsender is
> still active as we just wait for wal receiver (or apply worker) to
> stop. Normally, as soon as the apply worker is stopped the walsender
> detects it and exits but in this case, it took some time to exit, and
> in the meantime, we tried to drop the slot which is still in use by
> walsender.

There might be possible.

That's weird since DROP SUBSCRIPTION executes DROP_REPLICATION_SLOT
command with WAIT option. I found a bug that is possibly an oversight
of commit 1632ea4368. The commit changed the code around the error as
follows:

    if (active_pid != MyProcPid)
    {
-       if (behavior == SAB_Error)
+       if (!nowait)
            ereport(ERROR,
                    (errcode(ERRCODE_OBJECT_IN_USE),
                     errmsg("replication slot \"%s\" is active for PID %d",
                            NameStr(s->data.name), active_pid)));
-       else if (behavior == SAB_Inquire)
-           return active_pid;

        /* Wait here until we get signaled, and then restart */
        ConditionVariableSleep(&s->active_cv,

The condition should be the opposite; we should raise the error when
'nowait' is true. I think this is the cause of the test failure. Even
if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we
don't wait but raise the error.

Attached a small patch fixes it.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Вложения

Re: Failure in subscription test 004_sync.pl

От
Amit Kapila
Дата:
On Mon, Jun 14, 2021 at 10:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> >
> > I think it is showing a race condition issue in the code. In
> > DropSubscription, we first stop the worker that is receiving the WAL,
> > and then in a separate connection with the publisher, it tries to drop
> > the slot which leads to this error. The reason is that walsender is
> > still active as we just wait for wal receiver (or apply worker) to
> > stop. Normally, as soon as the apply worker is stopped the walsender
> > detects it and exits but in this case, it took some time to exit, and
> > in the meantime, we tried to drop the slot which is still in use by
> > walsender.
>
> There might be possible.
>
> That's weird since DROP SUBSCRIPTION executes DROP_REPLICATION_SLOT
> command with WAIT option. I found a bug that is possibly an oversight
> of commit 1632ea4368.
>
..
>
> The condition should be the opposite; we should raise the error when
> 'nowait' is true. I think this is the cause of the test failure. Even
> if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we
> don't wait but raise the error.
>
> Attached a small patch fixes it.
>

Yes, this should fix the recent buildfarm failures. Alvaro, would you
like to take care of this?

-- 
With Regards,
Amit Kapila.



Re: Failure in subscription test 004_sync.pl

От
Alvaro Herrera
Дата:
On 2021-Jun-14, Amit Kapila wrote:

> On Mon, Jun 14, 2021 at 10:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> > The condition should be the opposite; we should raise the error when
> > 'nowait' is true. I think this is the cause of the test failure. Even
> > if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we
> > don't wait but raise the error.
> 
> Yes, this should fix the recent buildfarm failures. Alvaro, would you
> like to take care of this?

Ugh, thanks for CCing me. Yes, I'll get this fixed ASAP.

-- 
Álvaro Herrera       Valdivia, Chile
"Saca el libro que tu religión considere como el indicado para encontrar la
oración que traiga paz a tu alma. Luego rebootea el computador
y ve si funciona" (Carlos Duclós)



Re: Failure in subscription test 004_sync.pl

От
Alvaro Herrera
Дата:
On 2021-Jun-14, Masahiko Sawada wrote:

> The condition should be the opposite; we should raise the error when
> 'nowait' is true. I think this is the cause of the test failure. Even
> if DROP SUBSCRIPTION tries to drop the slot with the WAIT option, we
> don't wait but raise the error.

Hi, thanks for diagnosing this and producing a patch!  I ended up
turning the condition around, so that all three stanzas still test
"!nowait"; which seems a bit easier to follow.

TBH I'm quite troubled by the fact that this test only failed once on
each animal; they all had a lot of successful runs after that.  I wonder
if this is because coverage is insufficient, or is it just bad luck.

I also wonder if this bug is what caused the random failures in the test
case I tried to add.  I should look at that some more now ...

-- 
Álvaro Herrera       Valdivia, Chile
Al principio era UNIX, y UNIX habló y dijo: "Hello world\n".
No dijo "Hello New Jersey\n", ni "Hello USA\n".