Обсуждение: [HACKERS] More replication race conditions
sungazer just failed with pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command "START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR: replication slot "test_slot"is active for PID 8913148 pg_recvlogical: disconnected ' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10 Looks like we're still not there on preventing replication startup race conditions. regards, tom lane
On 24/08/17 19:54, Tom Lane wrote: > sungazer just failed with > > pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command "START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR: replication slot "test_slot"is active for PID 8913148 > pg_recvlogical: disconnected > ' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657. > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10 > > Looks like we're still not there on preventing replication startup > race conditions. Hmm, that looks like "by design" behavior. Slot acquiring will throw error if the slot is already used by somebody else (slots use their own locking mechanism that does not wait). In this case it seems the walsender which was using slot for previous previous step didn't finish releasing the slot by the time we start new command. We can work around this by changing the test to wait perhaps. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote: > On 24/08/17 19:54, Tom Lane wrote: > > sungazer just failed with > > > > pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command "START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR: replication slot "test_slot"is active for PID 8913148 > > pg_recvlogical: disconnected > > ' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657. > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10 > > > > Looks like we're still not there on preventing replication startup > > race conditions. > > Hmm, that looks like "by design" behavior. Slot acquiring will throw > error if the slot is already used by somebody else (slots use their own > locking mechanism that does not wait). In this case it seems the > walsender which was using slot for previous previous step didn't finish > releasing the slot by the time we start new command. We can work around > this by changing the test to wait perhaps. [Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Simon, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
And *another* replication test race condition just now: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-26%2019%3A37%3A08 As best I can interpret this, it's pointing out that this bit in src/test/recovery/t/009_twophase.pl: $cur_master->psql('postgres', "BEGIN;CREATE TABLE t_009_tbl2 (id int, msg text);SAVEPOINT s1;INSERT INTO t_009_tbl2 VALUES(27, 'issued to ${cur_master_name}');PREPARE TRANSACTION 'xact_009_13';-- checkpoint will issue XLOG_STANDBY_LOCK thatcan conflict with lock-- held by 'create table' statementCHECKPOINT;COMMIT PREPARED 'xact_009_13';"); $cur_standby->psql('postgres',"SELECT count(*) FROM t_009_tbl2",stdout => \$psql_out); is($psql_out, '1', "Replay prepared transaction with DDL"); contains exactly no means of ensuring that the master's transaction has been replayed on the standby before we check for its results. It's not real clear why it seems to work 99.99% of the time, because, well, there isn't any frickin' interlock there ... regards, tom lane
On Sun, Aug 27, 2017 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > And *another* replication test race condition just now: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-26%2019%3A37%3A08 > > As best I can interpret this, it's pointing out that this bit in > src/test/recovery/t/009_twophase.pl: > > $cur_master->psql( > 'postgres', " > BEGIN; > CREATE TABLE t_009_tbl2 (id int, msg text); > SAVEPOINT s1; > INSERT INTO t_009_tbl2 VALUES (27, 'issued to ${cur_master_name}'); > PREPARE TRANSACTION 'xact_009_13'; > -- checkpoint will issue XLOG_STANDBY_LOCK that can conflict with lock > -- held by 'create table' statement > CHECKPOINT; > COMMIT PREPARED 'xact_009_13';"); > > $cur_standby->psql( > 'postgres', > "SELECT count(*) FROM t_009_tbl2", > stdout => \$psql_out); > is($psql_out, '1', "Replay prepared transaction with DDL"); > > contains exactly no means of ensuring that the master's transaction has > been replayed on the standby before we check for its results. It's not > real clear why it seems to work 99.99% of the time, because, well, there > isn't any frickin' interlock there ... I have noticed this one this morning, and I am planning to address it with a proper patch soonishly. (I am still fighting a bit to get dangomushi in a more stable stable, and things run slow on it, so it is good at catching race conditions of this kind). -- Michael
On 27/08/17 04:32, Noah Misch wrote: > On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote: >> On 24/08/17 19:54, Tom Lane wrote: >>> sungazer just failed with >>> >>> pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command "START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR: replication slot "test_slot"is active for PID 8913148 >>> pg_recvlogical: disconnected >>> ' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657. >>> >>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10 >>> >>> Looks like we're still not there on preventing replication startup >>> race conditions. >> >> Hmm, that looks like "by design" behavior. Slot acquiring will throw >> error if the slot is already used by somebody else (slots use their own >> locking mechanism that does not wait). In this case it seems the >> walsender which was using slot for previous previous step didn't finish >> releasing the slot by the time we start new command. We can work around >> this by changing the test to wait perhaps. > > [Action required within three days. This is a generic notification.] > > The above-described topic is currently a PostgreSQL 10 open item. Simon, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > v10 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within three calendar days of > this message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping v10. Consequently, I will appreciate your efforts > toward speedy resolution. Thanks. > > [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > Attached should fix this. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
On Sun, Aug 27, 2017 at 3:34 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Sun, Aug 27, 2017 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> contains exactly no means of ensuring that the master's transaction has >> been replayed on the standby before we check for its results. It's not >> real clear why it seems to work 99.99% of the time, because, well, there >> isn't any frickin' interlock there ... > > I have noticed this one this morning, and I am planning to address it > with a proper patch soonishly. (I am still fighting a bit to get > dangomushi in a more stable stable, and things run slow on it, so it > is good at catching race conditions of this kind). Today's run has finished with the same failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13 Attached is a patch to make this code path wait that the transaction has been replayed. We could use as well synchronous_commit = apply, but I prefer the solution of this patch with a wait query. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
Michael Paquier <michael.paquier@gmail.com> writes: > Attached is a patch to make this code path wait that the transaction > has been replayed. We could use as well synchronous_commit = apply, > but I prefer the solution of this patch with a wait query. Petr proposed a different patch to fix the same problem at https://www.postgresql.org/message-id/1636c52e-c144-993a-6665-9358f322deda%402ndquadrant.com Which one is better? regards, tom lane
On Sun, Aug 27, 2017 at 6:32 PM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > Attached should fix this. +$node_master->poll_query_until('postgres', +"SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'test_slot' AND active_pid IS NULL)" +) + or die "slot never became inactive"; +$stdout_recv = $node_master->pg_recvlogical_upto( I am wondering if a similar check should actually go into pg_recvlogical_upto() instead. I tend to think that we've learned things the hard way with 3043c1dd and such. -- Michael
On Mon, Aug 28, 2017 at 8:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Paquier <michael.paquier@gmail.com> writes: >> Attached is a patch to make this code path wait that the transaction >> has been replayed. We could use as well synchronous_commit = apply, >> but I prefer the solution of this patch with a wait query. > > Petr proposed a different patch to fix the same problem at > https://www.postgresql.org/message-id/1636c52e-c144-993a-6665-9358f322deda%402ndquadrant.com > > Which one is better? Petr's patch is here to fix the race condition in test 006 for logical decoding. My patch involves the failures that dangomushi sees since yesterday with 2PC test 009. (Note: It took me a while to put this animal in an environment with a more stable connection, so now I hope that we'll get again daily reports) -- Michael
On 28/08/17 01:36, Michael Paquier wrote: > On Sun, Aug 27, 2017 at 6:32 PM, Petr Jelinek > <petr.jelinek@2ndquadrant.com> wrote: >> Attached should fix this. > > +$node_master->poll_query_until('postgres', > +"SELECT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = > 'test_slot' AND active_pid IS NULL)" > +) > + or die "slot never became inactive"; > + > $stdout_recv = $node_master->pg_recvlogical_upto( > I am wondering if a similar check should actually go into > pg_recvlogical_upto() instead. I tend to think that we've learned > things the hard way with 3043c1dd and such. > Putting it there can lead to tests that take forever to finish if written incorrectly. This way they at least just fail. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > Today's run has finished with the same failure: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13 > Attached is a patch to make this code path wait that the transaction > has been replayed. We could use as well synchronous_commit = apply, > but I prefer the solution of this patch with a wait query. I have added an open item for this issue for PG10. The 2PC tests in src/test/recovery are new in PG10, introduced by 3082098. -- Michael
On Sun, Aug 27, 2017 at 02:32:49AM +0000, Noah Misch wrote: > On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote: > > On 24/08/17 19:54, Tom Lane wrote: > > > sungazer just failed with > > > > > > pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command "START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR: replication slot "test_slot"is active for PID 8913148 > > > pg_recvlogical: disconnected > > > ' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657. > > > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10 > > > > > > Looks like we're still not there on preventing replication startup > > > race conditions. > > > > Hmm, that looks like "by design" behavior. Slot acquiring will throw > > error if the slot is already used by somebody else (slots use their own > > locking mechanism that does not wait). In this case it seems the > > walsender which was using slot for previous previous step didn't finish > > releasing the slot by the time we start new command. We can work around > > this by changing the test to wait perhaps. > > [Action required within three days. This is a generic notification.] > > The above-described topic is currently a PostgreSQL 10 open item. Simon, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > v10 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within three calendar days of > this message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping v10. Consequently, I will appreciate your efforts > toward speedy resolution. Thanks. > > [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com This PostgreSQL 10 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Refer to the policy on open item ownership: https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
On Tue, Aug 29, 2017 at 08:44:42PM +0900, Michael Paquier wrote: > On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > Today's run has finished with the same failure: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13 > > Attached is a patch to make this code path wait that the transaction > > has been replayed. We could use as well synchronous_commit = apply, > > but I prefer the solution of this patch with a wait query. > > I have added an open item for this issue for PG10. The 2PC tests in > src/test/recovery are new in PG10, introduced by 3082098. [Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Álvaro, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
On 27 August 2017 at 03:32, Noah Misch <noah@leadboat.com> wrote: > On Fri, Aug 25, 2017 at 12:09:00PM +0200, Petr Jelinek wrote: >> On 24/08/17 19:54, Tom Lane wrote: >> > sungazer just failed with >> > >> > pg_recvlogical exited with code '256', stdout '' and stderr 'pg_recvlogical: could not send replication command "START_REPLICATIONSLOT "test_slot" LOGICAL 0/0 ("include-xids" '0', "skip-empty-xacts" '1')": ERROR: replication slot "test_slot"is active for PID 8913148 >> > pg_recvlogical: disconnected >> > ' at /home/nm/farm/gcc64/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1657. >> > >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-08-24%2015%3A16%3A10 >> > >> > Looks like we're still not there on preventing replication startup >> > race conditions. >> >> Hmm, that looks like "by design" behavior. Slot acquiring will throw >> error if the slot is already used by somebody else (slots use their own >> locking mechanism that does not wait). In this case it seems the >> walsender which was using slot for previous previous step didn't finish >> releasing the slot by the time we start new command. We can work around >> this by changing the test to wait perhaps. > > [Action required within three days. This is a generic notification.] > > The above-described topic is currently a PostgreSQL 10 open item. Simon, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > v10 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within three calendar days of > this message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping v10. Consequently, I will appreciate your efforts > toward speedy resolution. Thanks. > > [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com I acknowledge receipt of the reminder and will fix by end of day tomorrow. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 31 August 2017 at 12:54, Simon Riggs <simon@2ndquadrant.com> wrote: >> The above-described topic is currently a PostgreSQL 10 open item. Simon, >> since you committed the patch believed to have created it, you own this open >> item. If some other commit is more relevant or if this does not belong as a >> v10 open item, please let us know. Otherwise, please observe the policy on >> open item ownership[1] and send a status update within three calendar days of >> this message. Include a date for your subsequent status update. Testers may >> discover new open items at any time, and I want to plan to get them all fixed >> well in advance of shipping v10. Consequently, I will appreciate your efforts >> toward speedy resolution. Thanks. >> >> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > I acknowledge receipt of the reminder and will fix by end of day tomorrow. Applied. Thanks for the patch, Petr. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Michael Paquier wrote: > On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > Today's run has finished with the same failure: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13 > > Attached is a patch to make this code path wait that the transaction > > has been replayed. We could use as well synchronous_commit = apply, > > but I prefer the solution of this patch with a wait query. > > I have added an open item for this issue for PG10. The 2PC tests in > src/test/recovery are new in PG10, introduced by 3082098. Thanks, pushed. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Sep 2, 2017 at 12:03 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Michael Paquier wrote: >> On Mon, Aug 28, 2017 at 8:25 AM, Michael Paquier >> <michael.paquier@gmail.com> wrote: >> > Today's run has finished with the same failure: >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dangomushi&dt=2017-08-27%2018%3A00%3A13 >> > Attached is a patch to make this code path wait that the transaction >> > has been replayed. We could use as well synchronous_commit = apply, >> > but I prefer the solution of this patch with a wait query. >> >> I have added an open item for this issue for PG10. The 2PC tests in >> src/test/recovery are new in PG10, introduced by 3082098. > > Thanks, pushed. Thanks, the error has now not happened for 5 runs. Let's see if it ever shows up again. -- Michael