Discussion: Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders
Simon Riggs wrote:
> Replication lag tracking for walsenders
>
> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.

Did anyone notice that this seems to be causing buildfarm member 'tern' to fail the recovery check?  See here:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=tern&dt=2017-04-21%2012%3A48%3A09&stg=recovery-check

which has

TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)

Line 3331 was added by this commit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Simon Riggs wrote:
>> Replication lag tracking for walsenders
>>
>> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication.

> Did anyone notice that this seems to be causing buildfarm member 'tern'
> to fail the recovery check?  See here:
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=tern&dt=2017-04-21%2012%3A48%3A09&stg=recovery-check
> which has
> TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
> Line 3331 was added by this commit.

Note that while that commit was some time back, tern has only just started running recovery-check, following its update to the latest buildfarm script.  It looks like it's run that test four times and failed twice, so far.  So, not 100% reproducible, but there's something rotten there.  Timing-dependent, maybe?

Some excavation in the buildfarm database says that the coverage for the recovery-check test has been mighty darn thin up until just recently.  These are all the reports we have:

pgbfprod=> select sysname, min(snapshot) as oldest, count(*) from build_status_log
           where log_stage = 'recovery-check.log' group by 1 order by 2;
 sysname  |       oldest        | count
----------+---------------------+-------
 hamster  | 2016-03-01 02:34:26 |   182
 crake    | 2017-04-09 01:58:15 |    80
 nightjar | 2017-04-11 15:54:34 |    52
 longfin  | 2017-04-19 16:29:39 |     9
 hornet   | 2017-04-20 14:12:08 |     4
 mandrill | 2017-04-20 14:14:08 |     4
 sungazer | 2017-04-20 14:16:08 |     4
 tern     | 2017-04-20 14:18:08 |     4
 prion    | 2017-04-20 14:23:05 |     8
 jacana   | 2017-04-20 15:00:17 |     3
(10 rows)

So, other than hamster which is certainly going to have its own spin on the timing question, we have next to no track record for this test.  I wouldn't bet that this issue is unique to tern; more likely, that's just the first critter to show an intermittent issue.

			regards, tom lane
On 2017-04-21 17:04:08 -0400, Tom Lane wrote:
> Some excavation in the buildfarm database says that the coverage for
> the recovery-check test has been mighty darn thin up until just
> recently.

Hm, not good.  Just enabled it for most of my animals (there seems to be no point in having it on the C89 animal).  Very curious what it'll do to skink's runtime :/.  I'll add a few more cores to the machine, otherwise it'll probably wreak havoc.

Greetings,

Andres Freund
On Sat, Apr 22, 2017 at 9:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: >> Simon Riggs wrote: >>> Replication lag tracking for walsenders >>> >>> Adds write_lag, flush_lag and replay_lag cols to pg_stat_replication. > >> Did anyone notice that this seems to be causing buildfarm member 'tern' >> to fail the recovery check? See here: > >> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=tern&dt=2017-04-21%2012%3A48%3A09&stg=recovery-check >> which has >> TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331) > >> Line 3331 was added by this commit. > > Note that while that commit was some time back, tern has only just started > running recovery-check, following its update to the latest buildfarm > script. It looks like it's run that test four times and failed twice, > so far. So, not 100% reproducible, but there's something rotten there. > Timing-dependent, maybe? Thanks. I'm away from my computer right now but will investigate this and send a fix later today. -- Thomas Munro http://www.enterprisedb.com
On 22 April 2017 at 06:45, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Thanks. I'm away from my computer right now but will investigate this > and send a fix later today. Thanks. I'll review later today. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Apr 22, 2017 at 9:13 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 22 April 2017 at 06:45, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > >> Thanks. I'm away from my computer right now but will investigate this >> and send a fix later today. > > Thanks. I'll review later today. The assertion fails reliably for me, because standby2's reported write LSN jumps backwards after the timeline changes: for example I see 3020000 then 3028470 then 3020000 followed by a normal progression. Surprisingly, 004_timeline_switch.pl reports success anyway. I'm not sure why the test fails sometimes on tern, but you can see that even when it passed on tern the assertion had failed. Here is a fix for the assertion failure. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachments
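The zigzag Thomas describes, and the shape of a fix for it, can be modeled compactly.  Below is a hedged sketch (Python; the class name and the discard-stale-samples behavior are illustrative assumptions, not the actual C implementation in walsender.c) of a lag-sample buffer: the failing code effectively asserted `lsn >= prev.lsn`, but a standby that switches timeline restarts streaming from the start of the current segment, so its reported write LSN can legitimately move backwards.

```python
class LagTracker:
    """Toy model of a walsender-style lag sample buffer.  Names and the
    handling of backwards jumps are assumptions for illustration only."""

    def __init__(self):
        self.samples = []  # ascending (lsn, send_time) pairs

    def write(self, lsn, send_time):
        # The failing path effectively did: Assert(lsn >= prev.lsn).
        # Instead, treat a backwards jump (e.g. after a timeline switch)
        # as invalidating any buffered samples beyond the new position.
        while self.samples and self.samples[-1][0] >= lsn:
            self.samples.pop()
        self.samples.append((lsn, send_time))

tracker = LagTracker()
# The zigzag from the report: 0x3020000, 0x3028470, back to 0x3020000.
for lsn in (0x3020000, 0x3028470, 0x3020000, 0x3021000):
    tracker.write(lsn, 0.0)
```

After the backwards jump the buffer holds only positions at or below the new write LSN, so no later read can observe an LSN-ordering violation.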
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> The assertion fails reliably for me, because standby2's reported write
> LSN jumps backwards after the timeline changes: for example I see
> 3020000 then 3028470 then 3020000 followed by a normal progression.
> Surprisingly, 004_timeline_switch.pl reports success anyway.  I'm not
> sure why the test fails sometimes on tern, but you can see that even
> when it passed on tern the assertion had failed.

Whoa.  This just turned into a much larger can of worms than I expected.  How can it be that processes are getting assertion crashes and yet the test framework reports success anyway?  That's impossibly broken/unacceptable.

Looking closer at the tern report we started the thread with, there are actually TWO assertion trap reports, the one Alvaro noted and another one in 009_twophase_master.log:

TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)

When I run the recovery test on my own machine, it reports success (quite reliably, I tried a bunch of times yesterday), but now that I know to look:

$ grep TRAP tmp_check/log/*
tmp_check/log/009_twophase_master.log:TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)

So we now have three problems not just one:

* How is it that the TAP tests aren't noticing the failure?  This one, to my mind, is a code-red situation, as it basically invalidates every TAP test we've ever run.

* If Thomas's explanation for the timeline-switch assertion is correct, why isn't it reproducible everywhere?

* What's with that second TRAP?

> Here is a fix for the assertion failure.

As for this patch itself, is it reasonable to try to assert that the timeline has in fact changed?

			regards, tom lane
I wrote: > Whoa. This just turned into a much larger can of worms than I expected. > How can it be that processes are getting assertion crashes and yet the > test framework reports success anyway? That's impossibly > broken/unacceptable. Taking a quick census of other buildfarm machines that are known to be running the recovery test, it appears that most (not all) are seeing one or both traps. But the test is reporting success anyway, everywhere except on Noah's 32-bit AIX critters. regards, tom lane
I wrote:
> Taking a quick census of other buildfarm machines that are known to be
> running the recovery test, it appears that most (not all) are seeing
> one or both traps.  But the test is reporting success anyway, everywhere
> except on Noah's 32-bit AIX critters.

Or, to be a bit more scientific, let's dig into the buildfarm database.  A couple more critters have started running the recovery test since yesterday; these are the latest reports we have:

pgbfprod=> select sysname, max(snapshot) as newest, count(*) from build_status_log
           where log_stage = 'recovery-check.log' group by 1 order by 2;
    sysname    |       newest        | count
---------------+---------------------+-------
 hamster       | 2016-09-24 16:00:07 |   182
 jacana        | 2017-04-20 21:00:20 |     3
 skink         | 2017-04-22 05:00:01 |     2
 sungazer      | 2017-04-22 06:07:17 |     7
 tern          | 2017-04-22 06:38:09 |     8
 hornet        | 2017-04-22 06:41:12 |     7
 mandrill      | 2017-04-22 08:44:09 |     8
 nightjar      | 2017-04-22 13:54:24 |    55
 longfin       | 2017-04-22 14:29:17 |    13
 calliphoridae | 2017-04-22 14:30:01 |     4
 piculet       | 2017-04-22 14:30:01 |     3
 culicidae     | 2017-04-22 14:30:01 |     5
 francolin     | 2017-04-22 14:30:01 |     3
 prion         | 2017-04-22 14:33:05 |    12
 crake         | 2017-04-22 14:37:21 |    86
(15 rows)

Grepping those specific reports for "TRAP" yields

    sysname    |                                       l
---------------+------------------------------------------------------------------------
 jacana        | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "c:/mingw/msys/1.0/home/pgrunner/bf/root/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 sungazer      | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
 sungazer      | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
 tern          | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
 tern          | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
 hornet        | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
 hornet        | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
 mandrill      | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
 mandrill      | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
 nightjar      | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "/pgbuild/root/HEAD/pgsql.build/../pgsql/src/backend/replication/walsender.c", Line: 3331)
 nightjar      | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/pgbuild/root/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 longfin       | TRAP: FailedAssertion("!(lsn >= prev.lsn)", File: "walsender.c", Line: 3331)
 longfin       | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92)
 calliphoridae | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/home/andres/build/buildfarm-calliphoridae/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 piculet       | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/home/andres/build/buildfarm-piculet/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 culicidae     | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/home/andres/build/buildfarm-culicidae/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 francolin     | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/home/andres/build/buildfarm-francolin/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
 prion         | TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "/home/ec2-user/bf/root/HEAD/pgsql.build/../pgsql/src/backend/access/transam/subtrans.c", Line: 92)
(18 rows)

So 6 of 15 critters are getting the walsender.c assertion, and those six plus six more are seeing the subtrans.c one, and three are seeing neither one.  There's probably a pattern to that, don't know what it is.

(Actually, it looks like hamster stopped running this test a long time ago, so whatever is in its last report is probably not very relevant.  So more like 12 of 14 critters are seeing one or both traps.)

			regards, tom lane
I wrote: > So 6 of 15 critters are getting the walsender.c assertion, > and those six plus six more are seeing the subtrans.c one, > and three are seeing neither one. There's probably a pattern > to that, don't know what it is. Ah, got it: skink *is* seeing the subtrans.c assertion, but not the other. (I'd forgotten to limit the query to the HEAD branch, and it was looking at 9.6 for skink.) hamster, as stated, isn't giving us a recent report; and crake is running the test but it doesn't use --enable-cassert, hence no TRAP. So actually, every single buildfarm member that could be reporting the subtrans.c crash is doing so. But only about half of them are reporting the walsender.c crash. regards, tom lane
I wrote:
> Whoa.  This just turned into a much larger can of worms than I expected.
> How can it be that processes are getting assertion crashes and yet the
> test framework reports success anyway?  That's impossibly
> broken/unacceptable.

I poked into this on my laptop, where I'm able to reproduce both assertion failures.  The reason the one in subtrans.c is not being detected by the test framework is that it happens late in the run and the test doesn't do anything more with that server than shut it down.  "pg_ctl stop" actually notices there's a problem; for instance, the log on tern for this step shows

### Stopping node "master" using mode immediate
# Running: pg_ctl -D /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata -m immediate stop
pg_ctl: PID file "/home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/tmp_check/data_master_xHGA/pgdata/postmaster.pid" does not exist
Is server running?
# No postmaster PID

However, PostgresNode::stop blithely ignores the failure exit from pg_ctl.  I think it should use system_or_bail() not just system_log(), so that we'll notice if the server is not running when we expect it to be.  It's conceivable that there should also be a "stop_if_running" method that allows the case where the server is not running, but so far as I can find, no existing TAP test needs such a behavior --- and it certainly shouldn't be the default.

The walsender.c crash is harder to detect because the postmaster very nicely recovers and restarts its children.  That's great for robustness, but it sucks for testing.  I think that we ought to disable that in TAP tests, which we can do by having PostgresNode::init include "restart_after_crash = off" in the postgresql.conf modifications it applies.  Again, it's conceivable that some tests would not want that, but there is no existing test that seems to need crash recovery on, and I don't see a good argument for it being default for test purposes.

As best I've been able to tell, the reason why the walsender.c crash is detected when it's detected is that the TAP script chances to try to connect while the recovery is happening:

connection error: 'psql: FATAL:  the database system is in recovery mode'
while running 'psql -XAtq -d port=57718 host=/tmp/_uP8FKEynq dbname='postgres' -f - -v ON_ERROR_STOP=1' at /home/nm/farm/gcc32/HEAD/pgsql.build/src/test/recovery/../../../src/test/perl/PostgresNode.pm line 1173.

The window for that is not terribly wide, explaining why the detection is unreliable even when the crash does occur.

In short then, I propose the attached patch to make these cases fail more reliably.  We might extend this later to allow the old behaviors to be explicitly opted-into, but we don't seem to need that today.

			regards, tom lane

diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index cb84f1f..f8a0581 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -402,6 +402,7 @@ sub init
 	open my $conf, '>>', "$pgdata/postgresql.conf";
 	print $conf "\n# Added by PostgresNode.pm\n";
 	print $conf "fsync = off\n";
+	print $conf "restart_after_crash = off\n";
 	print $conf "log_line_prefix = '%m [%p] %q%a '\n";
 	print $conf "log_statement = all\n";
 	print $conf "port = $port\n";
@@ -675,7 +676,7 @@ sub stop
 	$mode = 'fast' unless defined $mode;
 	return unless defined $self->{_pid};
 	print "### Stopping node \"$name\" using mode $mode\n";
-	TestLib::system_log('pg_ctl', '-D', $pgdata, '-m', $mode, 'stop');
+	TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-m', $mode, 'stop');
 	$self->{_pid} = undef;
 	$self->_update_pid;
 }
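The behavioral difference the patch relies on — log-and-ignore the child's exit status versus bail out on any nonzero status — can be sketched in Python (hypothetical function names mirroring TestLib's `system_log`/`system_or_bail`; this is a model of the idea, not the actual Perl code):

```python
import subprocess
import sys

def system_log(*cmd):
    """Old behavior: run the command, log it, and ignore its exit status.
    A failed 'pg_ctl stop' (server already dead after an assertion crash)
    goes unnoticed and the test suite still reports success."""
    print("# Running:", " ".join(cmd))
    return subprocess.run(cmd).returncode  # callers discard this

def system_or_bail(*cmd):
    """Proposed behavior: any nonzero exit status aborts the whole test."""
    print("# Running:", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        raise SystemExit("Bailout called.  Further testing stopped: "
                         "system " + cmd[0] + " failed")
```

With the second variant, the crashed server that "pg_ctl stop" cannot find turns into a hard test failure instead of a silent pass.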
On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: >> The assertion fails reliably for me, because standby2's reported write >> LSN jumps backwards after the timeline changes: for example I see >> 3020000 then 3028470 then 3020000 followed by a normal progression. >> Surprisingly, 004_timeline_switch.pl reports success anyway. I'm not >> sure why the test fails sometimes on tern, but you can see that even >> when it passed on tern the assertion had failed. > > Whoa. This just turned into a much larger can of worms than I expected. > How can it be that processes are getting assertion crashes and yet the > test framework reports success anyway? That's impossibly > broken/unacceptable. Agreed, thanks for fixing that. > Looking closer at the tern report we started the thread with, there > are actually TWO assertion trap reports, the one Alvaro noted and > another one in 009_twophase_master.log: > > TRAP: FailedAssertion("!(*ptr == ((TransactionId) 0) || (*ptr == parent && overwriteOK))", File: "subtrans.c", Line: 92) I see you started another thread for that one. I admit I spent a couple of hours trying to figure this out before I saw your email, but I was looking at the wrong bit of git history and didn't spot that it's likely a 7 year old problem. So this is a good result for these TAP tests, despite teething difficulties with, erm, "pass" vs "fail" and the fact that 009_twophase.pl was bombing from the moment it was committed. Hoping to use this framework in future work. >> Here is a fix for the assertion failure. > > As for this patch itself, is it reasonable to try to assert that the > timeline has in fact changed? The protocol doesn't include the timeline in reply messages, so it's not clear how the upstream server would know what timeline the standby thinks it's dealing with in any given reply message. 
The sending server has its own idea of the current timeline but it's not in sync with the stream of incoming replies. -- Thomas Munro http://www.enterprisedb.com
> On Apr 22, 2017, at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I poked into this on my laptop, where I'm able to reproduce both assertion
> failures.  The reason the one in subtrans.c is not being detected by the
> test framework is that it happens late in the run and the test doesn't
> do anything more with that server than shut it down.  "pg_ctl stop"
> actually notices there's a problem.
>
> However, PostgresNode::stop blithely ignores the failure exit from
> pg_ctl.  I think it should use system_or_bail() not just system_log(),
> so that we'll notice if the server is not running when we expect it
> to be.
>
> The walsender.c crash is harder to detect because the postmaster very
> nicely recovers and restarts its children.  That's great for robustness,
> but it sucks for testing.  I think that we ought to disable that in
> TAP tests, which we can do by having PostgresNode::init include
> "restart_after_crash = off" in the postgresql.conf modifications it
> applies.  Again, it's conceivable that some tests would not want that,
> but there is no existing test that seems to need crash recovery on,
> and I don't see a good argument for it being default for test purposes.
>
> As best I've been able to tell, the reason why the walsender.c crash
> is detected when it's detected is that the TAP script chances to try
> to connect while the recovery is happening.  The window for that is not
> terribly wide, explaining why the detection is unreliable even when the
> crash does occur.
>
> In short then, I propose the attached patch to make these cases fail
> more reliably.  We might extend this later to allow the old behaviors
> to be explicitly opted-into, but we don't seem to need that today.
>
> 			regards, tom lane

I pulled fresh sources with your latest commit, 7d68f2281a4b56834c8e5648fc7da0b73b674c45, and the tests consistently fail (5 out of 5 attempts) for me on my laptop with:

/Applications/Xcode.app/Contents/Developer/usr/bin/make -C recovery check
for extra in contrib/test_decoding; do /Applications/Xcode.app/Contents/Developer/usr/bin/make -C '../../..'/$extra DESTDIR='/Users/mark/vanilla/bitoctetlength'/tmp_install install >>'/Users/mark/vanilla/bitoctetlength'/tmp_install/log/install.log || exit; done
rm -rf /Users/mark/vanilla/bitoctetlength/src/test/recovery/tmp_check/log
cd . && TESTDIR='/Users/mark/vanilla/bitoctetlength/src/test/recovery' PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/bin:$PATH" DYLD_LIBRARY_PATH="/Users/mark/vanilla/bitoctetlength/tmp_install/usr/local/pgsql/lib" PGPORT='65432' PG_REGRESS='/Users/mark/vanilla/bitoctetlength/src/test/recovery/../../../src/test/regress/pg_regress' prove -I ../../../src/test/perl/ -I . t/*.pl
t/001_stream_rep.pl .................. ok
t/002_archiving.pl ................... ok
t/003_recovery_targets.pl ............ ok
t/004_timeline_switch.pl ............. Bailout called.  Further testing stopped: system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed
make[2]: *** [check] Error 255
make[1]: *** [check-recovery-recurse] Error 2
make: *** [check-world-src/test-recurse] Error 2

The first time, I had some local changes, but I've stashed them and am still getting the error each of these next four times.  I'm compiling as follows:

make distclean; make clean; ./configure --enable-cassert --enable-tap-tests --enable-depend && make -j4 && make check-world

Are the errors expected now?

mark
Mark Dilger <hornschnorter@gmail.com> writes: >> On Apr 22, 2017, at 11:40 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> In short then, I propose the attached patch to make these cases fail >> more reliably. We might extend this later to allow the old behaviors >> to be explicitly opted-into, but we don't seem to need that today. > I pulled fresh sources with your latest commit, 7d68f2281a4b56834c8e5648fc7da0b73b674c45, > and the tests consistently fail (5 out of 5 attempts) for me on my laptop with: > Are the errors expected now? Yes, as of right now, if HEAD *doesn't* fail the src/test/recovery tests for you in an assert-enabled build, that would be surprising. The next step is to actually fix the bugs ... but the point of commit 7d68f2281 was to make these failures reproducible, which they were not before. regards, tom lane
Thomas Munro <thomas.munro@enterprisedb.com> writes: > On Sun, Apr 23, 2017 at 3:41 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> As for this patch itself, is it reasonable to try to assert that the >> timeline has in fact changed? > The protocol doesn't include the timeline in reply messages, so it's > not clear how the upstream server would know what timeline the standby > thinks it's dealing with in any given reply message. The sending > server has its own idea of the current timeline but it's not in sync > with the stream of incoming replies. Fair enough. But I'd still like an explanation of why only about half of the population is showing a failure here. Seems like every machine should be seeing the LSN as moving backwards in this test. So (a) why aren't they all failing, and (b) should we change the test to make sure every platform sees that happening? regards, tom lane
On 22 April 2017 at 16:41, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@enterprisedb.com> writes: >> The assertion fails reliably for me, because standby2's reported write >> LSN jumps backwards after the timeline changes: for example I see >> 3020000 then 3028470 then 3020000 followed by a normal progression. >> Surprisingly, 004_timeline_switch.pl reports success anyway. I'm not >> sure why the test fails sometimes on tern, but you can see that even >> when it passed on tern the assertion had failed. > > Whoa. This just turned into a much larger can of worms than I expected. > How can it be that processes are getting assertion crashes and yet the > test framework reports success anyway? That's impossibly > broken/unacceptable. Agreed, thanks for fixing. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Fair enough.  But I'd still like an explanation of why only about
> half of the population is showing a failure here.  Seems like every
> machine should be seeing the LSN as moving backwards in this test.
> So (a) why aren't they all failing, and (b) should we change the
> test to make sure every platform sees that happening?

Every machine sees the LSN moving backwards, but the code path that had the assertion is only reached if it decides to interpolate, which is timing dependent: there needs to be a future sample in the lag tracking buffer, which I guess is not the case in those runs.

For example, "jacana" is one of the animals that reported the assertion failure for 009_twophase.pl but not the one for 004_timeline_switch.pl, in this run:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2017-04-20%2021%3A00%3A20

This standby2 log fragment corresponds to the LSN zigzag manoeuvre:

2017-04-20 22:04:51.744 GMT [1984] LOG:  fetching timeline history file for timeline 2 from primary server
2017-04-20 22:04:51.744 GMT [1984] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2017-04-20 22:04:51.744 GMT [1984] LOG:  replication terminated by primary server
2017-04-20 22:04:51.744 GMT [1984] DETAIL:  End of WAL reached on timeline 1 at 0/3028D50.
2017-04-20 22:04:51.744 GMT [1708] LOG:  new target timeline is 2
2017-04-20 22:04:51.744 GMT [1984] LOG:  restarted WAL streaming at 0/3000000 on timeline 2

It looks approximately the same on my development machine where I got the assertion failure until commit 84638808:

2017-04-24 09:04:46.563 NZST [95164] LOG:  fetching timeline history file for timeline 2 from primary server
2017-04-24 09:04:46.563 NZST [95164] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2017-04-24 09:04:46.564 NZST [95164] LOG:  replication terminated by primary server
2017-04-24 09:04:46.564 NZST [95164] DETAIL:  End of WAL reached on timeline 1 at 0/3028470.
2017-04-24 09:04:46.565 NZST [95160] LOG:  new target timeline is 2
2017-04-24 09:04:46.565 NZST [95164] LOG:  restarted WAL streaming at 0/3000000 on timeline 2

-- 
Thomas Munro
http://www.enterprisedb.com
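The interpolation condition Thomas describes can be made concrete with a simplified model: a send time for an acknowledged LSN is only computed through the assertion-bearing path when the buffer still holds a sample at or beyond that LSN, so whether the path is ever reached depends on what happens to be buffered at reply time (illustrative Python; `interpolate_send_time` and the linear interpolation are assumptions, not the walsender.c code):

```python
def interpolate_send_time(samples, lsn):
    """samples: ascending (lsn, send_time) pairs buffered by the sender.
    Return the estimated time `lsn` was sent, interpolating linearly
    between the surrounding samples; return None when no buffered sample
    reaches that far, in which case the interpolating code path (the one
    that carried the failing assertion) is never taken."""
    prev = None
    for s_lsn, s_time in samples:
        if s_lsn >= lsn:
            if prev is None or s_lsn == lsn:
                return s_time          # exact hit, or no earlier sample
            p_lsn, p_time = prev
            fraction = (lsn - p_lsn) / (s_lsn - p_lsn)
            return p_time + fraction * (s_time - p_time)
        prev = (s_lsn, s_time)
    return None  # no future sample buffered: path not exercised
```

Under this model a run only exercises the suspect path when a "future" sample exists, matching the observation that some animals never hit the assertion even though their LSNs also moved backwards.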
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Fair enough.  But I'd still like an explanation of why only about
>> half of the population is showing a failure here.  Seems like every
>> machine should be seeing the LSN as moving backwards in this test.
>> So (a) why aren't they all failing, and (b) should we change the
>> test to make sure every platform sees that happening?

> Every machine sees the LSN moving backwards, but the code path that
> had the assertion is only reached if it decides to interpolate, which is
> timing dependent: there needs to be a future sample in the lag
> tracking buffer, which I guess is not the case in those runs.

I'm dissatisfied with this explanation because if it's just timing, it doesn't seem very likely that some machines would reproduce the failure every single time while others never would.  Maybe that can be blamed on kernel scheduler vagaries + different numbers of cores, but I can't escape the feeling that there's something here we've not fully understood.

While chasing after this earlier today, I turned on some debug logging and noted that the standby's reports look like

2017-04-23 15:46:46.206 EDT [34829] LOG:  database system is ready to accept read only connections
2017-04-23 15:46:46.212 EDT [34834] LOG:  fetching timeline history file for timeline 2 from primary server
2017-04-23 15:46:46.212 EDT [34834] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2017-04-23 15:46:46.213 EDT [34834] LOG:  sending write 0/3020000 flush 0/3028470 apply 0/3028470
2017-04-23 15:46:46.214 EDT [34834] LOG:  replication terminated by primary server
2017-04-23 15:46:46.214 EDT [34834] DETAIL:  End of WAL reached on timeline 1 at 0/3028470.
2017-04-23 15:46:46.214 EDT [34834] LOG:  sending write 0/3028470 flush 0/3028470 apply 0/3028470
2017-04-23 15:46:46.214 EDT [34830] LOG:  new target timeline is 2
2017-04-23 15:46:46.214 EDT [34834] LOG:  restarted WAL streaming at 0/3000000 on timeline 2
2017-04-23 15:46:46.228 EDT [34834] LOG:  sending write 0/3020000 flush 0/3028470 apply 0/3028470

So you're right that the standby's reported "write" position can go backward, but it seems pretty darn odd that the flush and apply positions didn't go backward too.  Is there a bug there?

I remain of the opinion that if we can't tell from the transmitted data whether a timeline switch has caused the position to go backward, then that's a protocol shortcoming that ought to be fixed.

			regards, tom lane
> On Apr 23, 2017, at 7:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Thomas Munro <thomas.munro@enterprisedb.com> writes: >> On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Fair enough. But I'd still like an explanation of why only about >>> half of the population is showing a failure here. Seems like every >>> machine should be seeing the LSN as moving backwards in this test. >>> So (a) why aren't they all failing, and (b) should we change the >>> test to make sure every platform sees that happening? > >> Every machine sees the LSN moving backwards, but the code path that >> had the assertion only reached if it decides to interpolate, which is >> timing dependent: there needs to be a future sample in the lag >> tracking buffer, which I guess is not the case in those runs. > > I'm dissatisfied with this explanation because if it's just timing, > it doesn't seem very likely that some machines would reproduce the > failure every single time while others never would. Maybe that can be > blamed on kernel scheduler vagaries + different numbers of cores, but > I can't escape the feeling that there's something here we've not > fully understood. > > While chasing after this earlier today, I turned on some debug logging > and noted that the standby's reports look like > > 2017-04-23 15:46:46.206 EDT [34829] LOG: database system is ready to accept read only connections > 2017-04-23 15:46:46.212 EDT [34834] LOG: fetching timeline history file for timeline 2 from primary server > 2017-04-23 15:46:46.212 EDT [34834] LOG: started streaming WAL from primary at 0/3000000 on timeline 1 > 2017-04-23 15:46:46.213 EDT [34834] LOG: sending write 0/3020000 flush 0/3028470 apply 0/3028470 > 2017-04-23 15:46:46.214 EDT [34834] LOG: replication terminated by primary server > 2017-04-23 15:46:46.214 EDT [34834] DETAIL: End of WAL reached on timeline 1 at 0/3028470. 
> 2017-04-23 15:46:46.214 EDT [34834] LOG: sending write 0/3028470 flush 0/3028470 apply 0/3028470
> 2017-04-23 15:46:46.214 EDT [34830] LOG: new target timeline is 2
> 2017-04-23 15:46:46.214 EDT [34834] LOG: restarted WAL streaming at 0/3000000 on timeline 2
> 2017-04-23 15:46:46.228 EDT [34834] LOG: sending write 0/3020000 flush 0/3028470 apply 0/3028470
>
> So you're right that the standby's reported "write" position can go
> backward, but it seems pretty darn odd that the flush and apply
> positions didn't go backward too.  Is there a bug there?
>
> I remain of the opinion that if we can't tell from the transmitted
> data whether a timeline switch has caused the position to go backward,
> then that's a protocol shortcoming that ought to be fixed.

The recent fix in 546c13e11b29a5408b9d6a6e3cca301380b47f7f leaves the
local variable overwriteOK assigned but unused in twophase.c's
RecoverPreparedTransactions(void).  I'm not sure whether that's
future-proofing or an oversight; it is used in other functions.
Just FYI.

Mark Dilger
On 2017-04-24 15:41:25 -0700, Mark Dilger wrote:
> The recent fix in 546c13e11b29a5408b9d6a6e3cca301380b47f7f has local
> variable overwriteOK assigned but not used in twophase.c
> RecoverPreparedTransactions(void).  I'm not sure if that's
> future-proofing or an oversight.  It seems to be used in other
> functions.  Just FYI.

That's just a temporary workaround while the problem's being analyzed.

- Andres
On Mon, Apr 24, 2017 at 2:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Every machine sees the LSN moving backwards, but the code path that
>> had the assertion only reached if it decides to interpolate, which is
>> timing dependent: there needs to be a future sample in the lag
>> tracking buffer, which I guess is not the case in those runs.
>
> I'm dissatisfied with this explanation because if it's just timing,
> it doesn't seem very likely that some machines would reproduce the
> failure every single time while others never would.  Maybe that can be
> blamed on kernel scheduler vagaries + different numbers of cores, but
> I can't escape the feeling that there's something here we've not
> fully understood.

Ahh.  It is timing-dependent, but what I said before wasn't right.
The write LSN always jumps back to the start of the current segment,
but some machines don't report that upstream because of timing
effects.

I found a machine that doesn't reproduce the earlier assertion failure
(or, rather, an elog I put into the code path where the assertion
was).  It's a VirtualBox VM running Debian 8 on amd64, and it doesn't
matter whether I tell it to use 1 or 4 CPU cores.

After the timeline change, standby2 restarts WAL streaming from
0/3000000, and standby1 sends it two messages: one 128KB message, and
then another smaller one.  On this virtual machine, the inner message
processing loop in walreceiver.c reliably manages to read both
messages from the socket one after the other, taking it all the way
to 0/3028470 (end of WAL), and only then calls XLogWalRcvSendReply,
so the position reported to the upstream server never goes backwards.
On my other machines it reliably reads the first message, runs out of
data on the socket so it falls out of the loop, and calls
XLogWalRcvSendReply to report the intermediate location 0/3020000.
Then the socket's soon ready again, we go around the outer loop again
and finally report 0/3028470.
> [...]
>
> So you're right that the standby's reported "write" position can go
> backward, but it seems pretty darn odd that the flush and apply
> positions didn't go backward too.  Is there a bug there?

Flush doesn't go backwards because XLogWalRcvFlush does nothing unless
Flush < Write.  I am not sure whether it constitutes a bug that we
rewrite the start of the current segment without flushing it.

A timeline change isn't actually necessary for the
rewind-to-start-of-segment behaviour, by the way: we always do that at
the start of streaming, as described in RequestXLogStreaming.  But
it's only a problem for timeline changes, because in other cases we're
talking to a fresh walsender that doesn't have any state that would
have triggered the (now removed) assertion.  I think the change that
was committed in 84638808 was therefore correct.

> I remain of the opinion that if we can't tell from the transmitted
> data whether a timeline switch has caused the position to go backward,
> then that's a protocol shortcoming that ought to be fixed.

Another approach might have been to teach the switch case for
T_StartReplicationCmd in exec_replication_command to clear all
LagTracker state, since that is the point at which the remote
walreceiver has requested a new stream and all subsequent messages
will refer to that stream.

-- 
Thomas Munro
http://www.enterprisedb.com