Обсуждение: [HACKERS] More replication race conditions

Поиск
Список
Период
Сортировка

[HACKERS] More replication race conditions

От
Tom Lane
Дата:
sungazer just failed with

pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command
"START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR:  replication slot
"test_slot"is active for PID 8913148 
pg_recvlogical: disconnected
' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10

Looks like we're still not there on preventing replication startup
race conditions.
        regards, tom lane



Re: [HACKERS] More replication race conditions

От
Petr Jelinek
Дата:
On 24/08/17 19:54, Tom Lane wrote:
> sungazer just failed with
> 
> pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command
"START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR:  replication slot
"test_slot"is active for PID 8913148
 
> pg_recvlogical: disconnected
> ' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657.
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10
> 
> Looks like we're still not there on preventing replication startup
> race conditions.

Hmm, that looks like "by design" behavior. Slot acquiring will throw
error if the slot is already used by somebody else (slots use their own
locking mechanism that does not wait). In this case it seems the
walsender which was using slot for previous previous step didn't finish
releasing the slot by the time we start new command. We can work around
this by changing the test to wait perhaps.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] More replication race conditions

От
Noah Misch
Дата:
On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote:
> On 24/08/17 19:54, Tom Lane wrote:
> > sungazer just failed with
> > 
> > pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command
"START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR:  replication slot
"test_slot"is active for PID 8913148
 
> > pg_recvlogical: disconnected
> > ' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657.
> > 
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10
> > 
> > Looks like we're still not there on preventing replication startup
> > race conditions.
> 
> Hmm, that looks like "by design" behavior. Slot acquiring will throw
> error if the slot is already used by somebody else (slots use their own
> locking mechanism that does not wait). In this case it seems the
> walsender which was using slot for previous previous step didn't finish
> releasing the slot by the time we start new command. We can work around
> this by changing the test to wait perhaps.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Simon,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] More replication race conditions

От
Tom Lane
Дата:
And *another* replication test race condition just now:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-26%2019%3A37%3A08

As best I can interpret this, it's pointing out that this bit in
src/test/recovery/t/009_twophase.pl:

$cur_master->psql('postgres', "BEGIN;CREATE TABLE t_009_tbl2 (id int, msg text);SAVEPOINT s1;INSERT INTO t_009_tbl2
VALUES(27, 'issued to ${cur_master_name}');PREPARE TRANSACTION 'xact_009_13';-- checkpoint will issue XLOG_STANDBY_LOCK
thatcan conflict with lock-- held by 'create table' statementCHECKPOINT;COMMIT PREPARED 'xact_009_13';"); 

$cur_standby->psql('postgres',"SELECT count(*) FROM t_009_tbl2",stdout => \$psql_out);
is($psql_out, '1', "Replay prepared transaction with DDL");

contains exactly no means of ensuring that the master's transaction has
been replayed on the standby before we check for its results.  It's not
real clear why it seems to work 99.99% of the time, because, well, there
isn't any frickin' interlock there ...
        regards, tom lane



Re: [HACKERS] More replication race conditions

От
Michael Paquier
Дата:
On Sun, Aug 27, 2017 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> And *another* replication test race condition just now:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-26%2019%3A37%3A08
>
> As best I can interpret this, it's pointing out that this bit in
> src/test/recovery/t/009_twophase.pl:
>
> $cur_master->psql(
>         'postgres', "
>         BEGIN;
>         CREATE TABLE t_009_tbl2 (id int, msg text);
>         SAVEPOINT s1;
>         INSERT INTO t_009_tbl2 VALUES (27, 'issued to ${cur_master_name}');
>         PREPARE TRANSACTION 'xact_009_13';
>         -- checkpoint will issue XLOG_STANDBY_LOCK that can conflict with lock
>         -- held by 'create table' statement
>         CHECKPOINT;
>         COMMIT PREPARED 'xact_009_13';");
>
> $cur_standby->psql(
>         'postgres',
>         "SELECT count(*) FROM t_009_tbl2",
>         stdout => \$psql_out);
> is($psql_out, '1', "Replay prepared transaction with DDL");
>
> contains exactly no means of ensuring that the master's transaction has
> been replayed on the standby before we check for its results.  It's not
> real clear why it seems to work 99.99% of the time, because, well, there
> isn't any frickin' interlock there ...

I have noticed this one this morning, and I am planning to address it
with a proper patch soonishly. (I am still fighting a bit to get
dangomushi in a more stable stable, and things run slow on it, so it
is good at catching race conditions of this kind).
-- 
Michael



Re: [HACKERS] More replication race conditions

От
Petr Jelinek
Дата:
On 27/08/17 04:32, Noah Misch wrote:
> On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote:
>> On 24/08/17 19:54, Tom Lane wrote:
>>> sungazer just failed with
>>>
>>> pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command
"START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR:  replication slot
"test_slot"is active for PID 8913148
 
>>> pg_recvlogical: disconnected
>>> ' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657.
>>>
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10
>>>
>>> Looks like we're still not there on preventing replication startup
>>> race conditions.
>>
>> Hmm, that looks like "by design" behavior. Slot acquiring will throw
>> error if the slot is already used by somebody else (slots use their own
>> locking mechanism that does not wait). In this case it seems the
>> walsender which was using slot for previous previous step didn't finish
>> releasing the slot by the time we start new command. We can work around
>> this by changing the test to wait perhaps.
> 
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Simon,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.
> 
> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 

Attached should fix this.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] More replication race conditions

От
Michael Paquier
Дата:
On Sun, Aug 27, 2017 at 3:34 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sun, Aug 27, 2017 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> contains exactly no means of ensuring that the master's transaction has
>> been replayed on the standby before we check for its results.  It's not
>> real clear why it seems to work 99.99% of the time, because, well, there
>> isn't any frickin' interlock there ...
>
> I have noticed this one this morning, and I am planning to address it
> with a proper patch soonishly. (I am still fighting a bit to get
> dangomushi in a more stable stable, and things run slow on it, so it
> is good at catching race conditions of this kind).

Today's run has finished with the same failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13
Attached is a patch to make this code path wait that the transaction
has been replayed. We could use as well synchronous_commit = apply,
but I prefer the solution of this patch with a wait query.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] More replication race conditions

От
Tom Lane
Дата:
Michael Paquier <michael.paquier@gmail.com> writes:
> Attached is a patch to make this code path wait that the transaction
> has been replayed. We could use as well synchronous_commit = apply,
> but I prefer the solution of this patch with a wait query.

Petr proposed a different patch to fix the same problem at
https://www.postgresql.org/message-id/1636c52e-c144-993a-6665-9358f322deda%402ndquadrant.com

Which one is better?
        regards, tom lane



Re: [HACKERS] More replication race conditions

От
Michael Paquier
Дата:
On Sun, Aug 27, 2017 at 6:32 PM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> Attached should fix this.

+$node_master->poll_query_until('postgres',
+"SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name =
'test_slot' AND active_pid IS NULL)"
+)
+  or die "slot never became inactive";
+$stdout_recv = $node_master->pg_recvlogical_upto(
I am wondering if a similar check should actually go into
pg_recvlogical_upto() instead. I tend to think that we've learned
things the hard way with 3043c1dd and such.
-- 
Michael



Re: [HACKERS] More replication race conditions

От
Michael Paquier
Дата:
On Mon, Aug 28, 2017 at 8:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> Attached is a patch to make this code path wait that the transaction
>> has been replayed. We could use as well synchronous_commit = apply,
>> but I prefer the solution of this patch with a wait query.
>
> Petr proposed a different patch to fix the same problem at
> https://www.postgresql.org/message-id/1636c52e-c144-993a-6665-9358f322deda%402ndquadrant.com
>
> Which one is better?

Petr's patch is here to fix the race condition in test 006 for logical
decoding. My patch involves the failures that dangomushi sees since
yesterday with 2PC test 009. (Note: It took me a while to put this
animal in an environment with a more stable connection, so now I hope
that we'll get again daily reports)
-- 
Michael



Re: [HACKERS] More replication race conditions

От
Petr Jelinek
Дата:
On 28/08/17 01:36, Michael Paquier wrote:
> On Sun, Aug 27, 2017 at 6:32 PM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> Attached should fix this.
> 
> +$node_master->poll_query_until('postgres',
> +"SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name =
> 'test_slot' AND active_pid IS NULL)"
> +)
> +  or die "slot never became inactive";
> +
>  $stdout_recv = $node_master->pg_recvlogical_upto(
> I am wondering if a similar check should actually go into
> pg_recvlogical_upto() instead. I tend to think that we've learned
> things the hard way with 3043c1dd and such.
> 

Putting it there can lead to tests that take forever to finish if
written incorrectly. This way they at least just fail.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] More replication race conditions

От
Michael Paquier
Дата:
On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Today's run has finished with the same failure:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13
> Attached is a patch to make this code path wait that the transaction
> has been replayed. We could use as well synchronous_commit = apply,
> but I prefer the solution of this patch with a wait query.

I have added an open item for this issue for PG10. The 2PC tests in
src/test/recovery are new in PG10, introduced by 3082098.
-- 
Michael



Re: [HACKERS] More replication race conditions

От
Noah Misch
Дата:
On Sun, Aug 27, 2017 at 02:32:49AM +0000, Noah Misch wrote:
> On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote:
> > On 24/08/17 19:54, Tom Lane wrote:
> > > sungazer just failed with
> > > 
> > > pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command
"START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR:  replication slot
"test_slot"is active for PID 8913148
 
> > > pg_recvlogical: disconnected
> > > ' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657.
> > > 
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10
> > > 
> > > Looks like we're still not there on preventing replication startup
> > > race conditions.
> > 
> > Hmm, that looks like "by design" behavior. Slot acquiring will throw
> > error if the slot is already used by somebody else (slots use their own
> > locking mechanism that does not wait). In this case it seems the
> > walsender which was using slot for previous previous step didn't finish
> > releasing the slot by the time we start new command. We can work around
> > this by changing the test to wait perhaps.
> 
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Simon,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.
> 
> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] More replication race conditions

От
Noah Misch
Дата:
On Tue, Aug 29, 2017 at 08:44:42PM +0900, Michael Paquier wrote:
> On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > Today's run has finished with the same failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13
> > Attached is a patch to make this code path wait that the transaction
> > has been replayed. We could use as well synchronous_commit = apply,
> > but I prefer the solution of this patch with a wait query.
> 
> I have added an open item for this issue for PG10. The 2PC tests in
> src/test/recovery are new in PG10, introduced by 3082098.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Álvaro,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] More replication race conditions

От
Simon Riggs
Дата:
On 27 August 2017 at 03:32, Noah Misch <noah@leadboat.com> wrote:
> On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote:
>> On 24/08/17 19:54, Tom Lane wrote:
>> > sungazer just failed with
>> >
>> > pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command
"START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR:  replication slot
"test_slot"is active for PID 8913148 
>> > pg_recvlogical: disconnected
>> > ' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657.
>> >
>> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10
>> >
>> > Looks like we're still not there on preventing replication startup
>> > race conditions.
>>
>> Hmm, that looks like "by design" behavior. Slot acquiring will throw
>> error if the slot is already used by somebody else (slots use their own
>> locking mechanism that does not wait). In this case it seems the
>> walsender which was using slot for previous previous step didn't finish
>> releasing the slot by the time we start new command. We can work around
>> this by changing the test to wait perhaps.
>
> [Action required within three days.  This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 10 open item.  Simon,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.
>
> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I acknowledge receipt of the reminder and will fix by end of day tomorrow.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] More replication race conditions

От
Simon Riggs
Дата:
On 31 August 2017 at 12:54, Simon Riggs <simon@2ndquadrant.com> wrote:

>> The above-described topic is currently a PostgreSQL 10 open item.  Simon,
>> since you committed the patch believed to have created it, you own this open
>> item.  If some other commit is more relevant or if this does not belong as a
>> v10 open item, please let us know.  Otherwise, please observe the policy on
>> open item ownership[1] and send a status update within three calendar days of
>> this message.  Include a date for your subsequent status update.  Testers may
>> discover new open items at any time, and I want to plan to get them all fixed
>> well in advance of shipping v10.  Consequently, I will appreciate your efforts
>> toward speedy resolution.  Thanks.
>>
>> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
>
> I acknowledge receipt of the reminder and will fix by end of day tomorrow.

Applied. Thanks for the patch, Petr.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] More replication race conditions

От
Alvaro Herrera
Дата:
Michael Paquier wrote:
> On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > Today's run has finished with the same failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13
> > Attached is a patch to make this code path wait that the transaction
> > has been replayed. We could use as well synchronous_commit = apply,
> > but I prefer the solution of this patch with a wait query.
> 
> I have added an open item for this issue for PG10. The 2PC tests in
> src/test/recovery are new in PG10, introduced by 3082098.

Thanks, pushed.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] More replication race conditions

От
Michael Paquier
Дата:
On Sat, Sep 2, 2017 at 12:03 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Michael Paquier wrote:
>> On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>> > Today's run has finished with the same failure:
>> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13
>> > Attached is a patch to make this code path wait that the transaction
>> > has been replayed. We could use as well synchronous_commit = apply,
>> > but I prefer the solution of this patch with a wait query.
>>
>> I have added an open item for this issue for PG10. The 2PC tests in
>> src/test/recovery are new in PG10, introduced by 3082098.
>
> Thanks, pushed.

Thanks, the error has now not happened for 5 runs. Let's see if it
ever shows up again.
-- 
Michael