Обсуждение: Adding a TAP test checking data consistency on standby withminRecoveryPoint
Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Michael Paquier
Дата:
Hi all, While working on a fix for c186ba13 which fixes the way minRecoveryPoint is updated for other processes than the startup process, I was struggling about being able to get that into a reproducible test case. I have been thinking what Andrew Gierth has mentioned yesterday, and roughly designed a test case mentioned here which is able to see the problem: https://www.postgresql.org/message-id/20181107044915.GF1677@paquier.xyz I have also been trying to shape that into a TAP test which can be added into the in-core recovery test suite, and it happens that the part which scans if all the pages of a relation are not newer than what minRecoveryPoint is set to in the control file can be easily calculated by using pageinspect and pg_control_recovery() with a simple SQL query. So, digging into it, I have been able to get a reproducible TAP test case which is in the patch attached. On HEAD, if you revert c186ba13 and then trigger the test the inconsistency shows up immediately. Keeping the fix makes the test pass. This test suite will make sure that we don't break again how minRecoveryPoint is handled across multiple processes, so I think that it would be a good addition for HEAD and the future. Thoughts? -- Michael
Вложения
Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint
От
Andrew Gierth
Дата:
>>>>> "Michael" == Michael Paquier <michael@paquier.xyz> writes: Michael> I have also been trying to shape that into a TAP test which Michael> can be added into the in-core recovery test suite, and it Michael> happens that the part which scans if all the pages of a Michael> relation are not newer than what minRecoveryPoint is set to in Michael> the control file can be easily calculated by using pageinspect Michael> and pg_control_recovery() with a simple SQL query. How? It's OK (and normal) for in-core pages to have newer LSNs than minRecoveryPoint, it's only on-disk pages that must not be more recent than that. And pageinspect will show the in-core page... -- Andrew (irc:RhodiumToad)
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Michael Paquier
Дата:
On Thu, Nov 08, 2018 at 05:04:35AM +0000, Andrew Gierth wrote: > How? It's OK (and normal) for in-core pages to have newer LSNs than > minRecoveryPoint, it's only on-disk pages that must not be more recent > than that. And pageinspect will show the in-core page... If the standby is stopped cleanly once, what's in shared buffers gets flushed to disk by the checkpointer once at shutdown, forcing minRecoveryPoint to be updated by the checkpointer, and that update was not happening beforethe fix. And once the standby is started again, what's on disk gets reloaded, showing the inconsistency. -- Michael
Вложения
Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint
От
Andrew Gierth
Дата:
>>>>> "Michael" == Michael Paquier <michael@paquier.xyz> writes: >> How? It's OK (and normal) for in-core pages to have newer LSNs than >> minRecoveryPoint, it's only on-disk pages that must not be more >> recent than that. And pageinspect will show the in-core page... Michael> If the standby is stopped cleanly once, what's in shared Michael> buffers gets flushed to disk by the checkpointer once at Michael> shutdown, forcing minRecoveryPoint to be updated by the Michael> checkpointer, and that update was not happening beforethe fix. Michael> And once the standby is started again, what's on disk gets Michael> reloaded, showing the inconsistency. Ah, I missed that you were shutting down. I think it would be nice to have a test that didn't rely on that, too. -- Andrew (irc:RhodiumToad)
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Michael Paquier
Дата:
On Thu, Nov 08, 2018 at 06:00:29AM +0000, Andrew Gierth wrote: > I think it would be nice to have a test that didn't rely on that, too. Yes, I don't disagree with you and I thought about it. Fetching the value from the control file is easy, doing the comparison between two LSNs is also simple by doing it directly with pg_lsn in the database (and I don't want to add math logics about LSNs as a TAP API). Now I am less sure about how portable it is possible to make the read of 8 bytes on the page header for the last page of a relation portable across many architectures in perl. And I am not sure that we actually need this addition that as the standby spawned in the TAP test will not have any clients scanning the pages and generating WAL, except autovacuum which would be better switched off in the test. -- Michael
Вложения
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Michael Paquier
Дата:
On Thu, Nov 08, 2018 at 04:00:26PM +0900, Michael Paquier wrote: > Yes, I don't disagree with you and I thought about it. Fetching the > value from the control file is easy, doing the comparison between two > LSNs is also simple by doing it directly with pg_lsn in the database > (and I don't want to add math logics about LSNs as a TAP API). Now I am > less sure about how portable it is possible to make the read of 8 bytes > on the page header for the last page of a relation portable across many > architectures in perl. Has a perl monk a specific idea here? It seems to me that the amount of specific facility which would be needed overweights the current simplicity of the test. -- Michael
Вложения
Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint
От
Andrew Gierth
Дата:
>>>>> "Michael" == Michael Paquier <michael@paquier.xyz> writes: > On Thu, Nov 08, 2018 at 04:00:26PM +0900, Michael Paquier wrote: >> Yes, I don't disagree with you and I thought about it. Fetching the >> value from the control file is easy, doing the comparison between >> two LSNs is also simple by doing it directly with pg_lsn in the >> database (and I don't want to add math logics about LSNs as a TAP >> API). Now I am less sure about how portable it is possible to make >> the read of 8 bytes on the page header for the last page of a >> relation portable across many architectures in perl. Michael> Has a perl monk a specific idea here? It seems to me that the Michael> amount of specific facility which would be needed overweights Michael> the current simplicity of the test. Seems simple enough - the LSN on the page is actually two 4-byte values with the most significant one first, regardless of platform byte order (though each individual word is in native order), so my ($hi,$lo) = unpack("LL", $buf); should suffice. ("L" is always 32 bits regardless of platform, and it has the platform's endianness.) Looking only at the last page seems questionable. Something like this should work to return the largest LSN of any page in the specified list of files: # find_largest_lsn(blocksize,filenames...) sub find_largest_lsn { my $blocksize = shift; my @filenames = ref($_[0]) eq 'ARRAY' ? @{$_[0]} : @_; my ($max_hi,$max_lo) = (0,0); for my $filename (@filenames) { open(my $fh, "<:raw", $filename) or die "failed to open $filename: $!"; my ($buf,$len); while ($len = read($fh, $buf, $blocksize)) { $len == $blocksize or die "read only $len of $blocksize bytes from $filename"; my ($hi,$lo) = unpack("LL", $buf); ($max_hi,$max_lo) = ($hi,$lo) if ($hi > $max_hi or ($hi == $max_hi and $lo > $max_lo)); } defined($len) or die "read error on $filename: $!"; close($fh); } return sprintf("%X/%X", $max_hi, $max_lo); } -- Andrew (irc:RhodiumToad)
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Michael Paquier
Дата:
On Fri, Nov 09, 2018 at 12:42:27PM +0000, Andrew Gierth wrote: > Seems simple enough - the LSN on the page is actually two 4-byte values > with the most significant one first, regardless of platform byte order > (though each individual word is in native order), so > > my ($hi,$lo) = unpack("LL", $buf); > > should suffice. ("L" is always 32 bits regardless of platform, and it > has the platform's endianness.) Thanks for the review, Andrew. And I completely forgot that this is at the beginning of the page. > Looking only at the last page seems questionable. I have switched also the online check so as it also looks at the full range of blocks instead of only the last one. > Something like this should work to return the largest LSN of any page > in the specified list of files: > > # find_largest_lsn(blocksize,filenames...) > sub find_largest_lsn Thanks! I am stealing that stuff, and I have added an offline check by comparing the value of minRecoveryPoint in pg_controldata. Again, if you revert c186ba13 and run the tests, both online and offline failures are showing up. What do you think? -- Michael
Вложения
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Michael Paquier
Дата:
On Mon, Nov 12, 2018 at 01:08:38PM +0900, Michael Paquier wrote: > Thanks! I am stealing that stuff, and I have added an offline check by > comparing the value of minRecoveryPoint in pg_controldata. Again, if > you revert c186ba13 and run the tests, both online and offline failures > are showing up. > > What do you think? I think that this test has some value, and it did not get any reviews, so I am moving it to the next CF. -- Michael
Вложения
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Georgios Kokolatos
Дата:
Hi, I applied the patch on current master and run the tests, but I am afraid that the newly introduced test failed on installcheck-world: ```t/016_min_consistency.pl ............. # Looks like your test exited with 29 before it could output anything. t/016_min_consistency.pl ............. Dubious, test returned 29 (wstat 7424, 0x1d00) Failed 2/2 subtests Test Summary Report ------------------- t/016_min_consistency.pl (Wstat: 7424 Tests: 0 Failed: 0) Non-zero exit status: 29 Parse errors: Bad plan. You planned 2 tests but ran 0. Files=16, Tests=143, 65 wallclock secs ( 0.04 usr 0.04 sys + 6.53 cusr 5.08 csys = 11.69 CPU) Result: FAIL``` To be honest, I have not checked closely on the failure, still it is the only test failing which by itself should be worthwhilementioning.
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Michael Paquier
Дата:
On Mon, Mar 11, 2019 at 02:35:49PM +0000, Georgios Kokolatos wrote: > To be honest, I have not checked closely on the failure, still it is > the only test failing which by itself should be worthwhile > mentioning. This works for me, as well as a plain installcheck and check from the subpath. Are you sure that contrib/pageinspect is installed in the set of binaries your tests are using? What do the logs of the server tell you? These are located in src/test/recovery/tmp_check/. Please note that when running installcheck we assume that all the needed components are installed. -- Michael
Вложения
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Arthur Zakirov
Дата:
On 12.03.2019 08:19, Michael Paquier wrote: > On Mon, Mar 11, 2019 at 02:35:49PM +0000, Georgios Kokolatos wrote: >> To be honest, I have not checked closely on the failure, still it is >> the only test failing which by itself should be worthwhile >> mentioning. > > This works for me, as well as a plain installcheck and check from the > subpath. Are you sure that contrib/pageinspect is installed in the > set of binaries your tests are using? What do the logs of the server > tell you? These are located in src/test/recovery/tmp_check/. Please > note that when running installcheck we assume that all the needed > components are installed. On my environment installcheck passes as well as check. Also I ran tests with reverted c186ba13, 016_min_consistency.pl doesn't pass as expected. -- Arthur Zakirov Postgres Professional: http://www.postgrespro.com Russian Postgres Company
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Georgios Kokolatos
Дата:
Hi, you are right in saying that my comment didn't offer much of a constructive explanation. Apologies for that. To the issue at hand. Tests were run in the same manner as in all other cases and the test in question was the only one to fail in the whole tree. By looking a bit closer to the error, the culprit seems to be the missing installation of `pageinspect`, as it was correctly pointed out. I did notice before sending the first message in the thread that `contrib/pageinspect` was added to the `EXTRA_INSTALL` variable and I *assumed* that it should be properly installed to the system. However that was not the case. Without being a recursive Makefile guru, I noticed one has to explicitly $(call recurse,checkprep, $(recurse_alldirs_targets)) in `src/test/Makefile` in order for the call to `checkprep` to be made for running make check or installcheck or any other of the variants. Having said all the above, and after spending more time reviewing the whole patch I do think it is worthwhile committing, as long as it is not tripping any users up with unexpected test errors due to incomplete installations. I hope this comment helps a bit more.
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Michael Paquier
Дата:
On Tue, Mar 12, 2019 at 01:16:14PM +0300, Arthur Zakirov wrote: > On my environment installcheck passes as well as check. Also I ran tests > with reverted c186ba13, 016_min_consistency.pl doesn't pass as expected. Thanks all for the feedback! I have done an extra pass on the new, tweaking some sentences in the comments, and committed it. -- Michael
Вложения
Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint
От
Peter Geoghegan
Дата:
On Wed, Mar 13, 2019 at 12:22 AM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Mar 12, 2019 at 01:16:14PM +0300, Arthur Zakirov wrote: > > On my environment installcheck passes as well as check. Also I ran tests > > with reverted c186ba13, 016_min_consistency.pl doesn't pass as expected. Seems like there might be a problem either caused by or detected by 016_min_consistency.pl on piculet: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59 -- Peter Geoghegan
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Michael Paquier
Дата:
On Sat, Mar 23, 2019 at 04:08:42PM -0700, Peter Geoghegan wrote: > Seems like there might be a problem either caused by or detected by > 016_min_consistency.pl on piculet: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2019-03-23%2022%3A28%3A59 Interesting. Based on what regress_log_016_min_consistency tells, the test attempts to stop the standby in fast mode but it fails because of a timeout: ### Stopping node "standby" using mode fast [...] pg_ctl: server does not shut down Bail out! system pg_ctl failed There is only one place in the tests where that happens, and before attempting to stop the standby we issue a checkpoint on it with its primary killed: # Issue a restart point on the standby now, which makes the checkpointer # update minRecoveryPoint. $standby->safe_psql('postgres', 'CHECKPOINT;'); [...] $primary->stop('immediate'); $standby->stop('fast'); The failure is a bit weird, as I would expect all those three actions to be sequential. piculet is the only failure happening on the buildfarm and it uses --disable-atomics, so I am wondering if that is related and if 0dfe3d0 is part of that. With a primary/standby set, it could be possible to test that scenario pretty easily. I'll give it a shot. -- Michael
Вложения
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Michael Paquier
Дата:
On Sun, Mar 24, 2019 at 09:47:58PM +0900, Michael Paquier wrote: > The failure is a bit weird, as I would expect all those three actions > to be sequential. piculet is the only failure happening on the > buildfarm and it uses --disable-atomics, so I am wondering if that is > related and if 0dfe3d0 is part of that. With a primary/standby set, > it could be possible to test that scenario pretty easily. I'll give > it a shot. The buildfarm has just failed with a similar failure, for another test aka 009_twophase.pl: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dragonet&dt=2019-04-16%2006%3A14%3A01 I think that we have race conditions with checkpointing and shutdown on HEAD. -- Michael
Вложения
Re: Adding a TAP test checking data consistency on standby with minRecoveryPoint
От
Robert Haas
Дата:
On Tue, Apr 16, 2019 at 2:45 AM Michael Paquier <michael@paquier.xyz> wrote: > I think that we have race conditions with checkpointing and shutdown > on HEAD. Any idea whether it's something newly-introduced or of long standing? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Adding a TAP test checking data consistency on standby withminRecoveryPoint
От
Michael Paquier
Дата:
On Thu, Apr 18, 2019 at 03:16:17PM -0400, Robert Haas wrote: > On Tue, Apr 16, 2019 at 2:45 AM Michael Paquier <michael@paquier.xyz> wrote: >> I think that we have race conditions with checkpointing and shutdown >> on HEAD. > > Any idea whether it's something newly-introduced or of long standing? I would suggest to keep the discussion on the more general thread I have created a couple of days ago: https://www.postgresql.org/message-id/20190416070119.GK2673@paquier.xyz Thanks, -- Michael