Обсуждение: New instability in stats regression test
In the past few days we've had two buildfarm failures[1][2] in the stats regression test that look like @@ -1582,7 +1582,7 @@ SELECT :io_stats_post_reset < :io_stats_pre_reset; ?column? ---------- - t + f (1 row) -- test BRIN index doesn't block HOT update I'm a bit mystified by this. This test was introduced in Andres' commit 10a082bf7 of 2023-02-11, and it seems to have been stable since then. I trawled the buildfarm logs going back three months and found no similar failures. So why's it failing now? The most plausible theory seems to be that Michael's recent commits adding pg_stat_reset_xxx features destabilized the test somehow ... but I sure don't see how/why. Failure [1] was on my own animal longfin, so I tried to reproduce it on that animal's host, but no luck so far. Thoughts? regards, tom lane [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2023-11-21%2001%3A55%3A00 [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=guaibasaurus&dt=2023-11-25%2016%3A20%3A04
I wrote: > I'm a bit mystified by this. This test was introduced in Andres' > commit 10a082bf7 of 2023-02-11, and it seems to have been stable > since then. I trawled the buildfarm logs going back three months > and found no similar failures. So why's it failing now? The > most plausible theory seems to be that Michael's recent commits > adding pg_stat_reset_xxx features destabilized the test somehow ... > but I sure don't see how/why. After a bit more looking around, I have part of a theory. Commit 23c8c0c8f of 2023-11-12 added this, a little ways before the problematic test: -- Test that reset_shared with no argument resets all the stats types -- supported (providing NULL as argument has the same effect). SELECT pg_stat_reset_shared(); The test that is failing is of course -- Test IO stats reset SELECT pg_stat_have_stats('io', 0, 0); SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits)AS io_stats_pre_reset FROM pg_stat_io \gset SELECT pg_stat_reset_shared('io'); SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits)AS io_stats_post_reset FROM pg_stat_io \gset SELECT :io_stats_post_reset < :io_stats_pre_reset; So the observed failure could be explained if, between the "pg_stat_reset_shared('io')" call and the subsequent scan of pg_stat_io, concurrent sessions had done more I/O operations than happened since that new pg_stat_reset_shared() call. Previously, the "pre_reset" counts would be large enough to make that a pretty ridiculous theory, but after 23c8c0c8f maybe it's not. To test this idea, I made the test print out the actual values of the counts, like this: @@ -1585,10 +1585,10 @@ SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits)AS io_stats_post_reset FROM pg_stat_io \gset -SELECT :io_stats_post_reset < :io_stats_pre_reset; - ?column? ----------- - t +SELECT :io_stats_post_reset, :io_stats_pre_reset; + ?column? | ?column? +----------+---------- + 10452 | 190087 (1 row) Of course, this makes it fail every time, but the idea is to get a sense of the magnitude of the counts; and what I'm seeing is that the "pre reset" counts are typically 10x more than the "post reset" ones, even after 23c8c0c8f. If I remove the suspicious pg_stat_reset_shared() call, there's about 3 orders of magnitude difference; but still you'd think a 10x safety margin would be enough. So this theory doesn't seem to quite work as-is. Perhaps there's some additional contributing factor I didn't think to control. Nonetheless, it seems like a really bad idea that this test of I/O stats reset happens after the newly-added test. It is clearly now dependent on timing and the amount of concurrent activity whether it will pass or not. We should probably re-order the tests to do the old test first; or else abandon this test methodology and just test I/O reset the same way we test the other cases (checking only for timestamp advance). Or maybe we don't really need the pg_stat_reset_shared() test? regards, tom lane
On Sat, Nov 25, 2023 at 02:34:40PM -0500, Tom Lane wrote: > -- Test that reset_shared with no argument resets all the stats types > -- supported (providing NULL as argument has the same effect). > SELECT pg_stat_reset_shared(); Right, this has switched pg_stat_reset_shared() from doing nothing to do a full reset. Removing this reset switches the results of io_stats_pre_reset from a 7-digit number to a 4-digit number, thanks to all the previous I/O activity generated by all the tests. > Nonetheless, it seems like a really bad idea that this test > of I/O stats reset happens after the newly-added test. It > is clearly now dependent on timing and the amount of concurrent > activity whether it will pass or not. We should probably > re-order the tests to do the old test first; or else abandon > this test methodology and just test I/O reset the same way > we test the other cases (checking only for timestamp advance). > Or maybe we don't really need the pg_stat_reset_shared() test? I was ready to argue that we'd better keep this test and keep it close to the end of stats.sql while documenting why things are kept in this order, but two resets done on the same shared stats type would still be prone to race conditions without all the previous activity done in the tests (like pg_stat_wal). With all that in mind and because we have checks for the individual targets with pg_stat_reset_shared(), I would agree to just remove it entirely. Say as of the attached? -- Michael
Вложения
Michael Paquier <michael@paquier.xyz> writes: > With all that in mind and because we have checks for the individual > targets with pg_stat_reset_shared(), I would agree to just remove it > entirely. Say as of the attached? I'm good with that answer --- I doubt that this test sequence is proving anything that's worth the cycles it takes. If it'd catch oversights like failing to add new stats types to the "reset all" code path, then I'd be for keeping it; but I don't see how the test could notice that. regards, tom lane
On Mon, Nov 27, 2023 at 8:26 AM Michael Paquier <michael@paquier.xyz> wrote: > > I was ready to argue that we'd better keep this test and keep it close > to the end of stats.sql while documenting why things are kept in this > order, It's easy for someone to come and add pg_stat_reset_shared() before the end without noticing the comment as the test failure is sporadic in nature. > but two resets done on the same shared stats type would still > be prone to race conditions without all the previous activity done in > the tests (like pg_stat_wal). Can running stats.sql in non-parallel mode help stabilize the tests as-is? > With all that in mind and because we have checks for the individual > targets with pg_stat_reset_shared(), I would agree to just remove it > entirely. Say as of the attached? I tend to agree with this approach, the code is anyways covered. I think the attached patch also needs to remove setting archiver_reset_ts (and friends) after pg_stat_reset_shared('archiver') (and friends), something like [1]. Can we also remove pg_stat_reset_slru() with no argument test to keep things consistent? -- Test that multiple SLRUs are reset when no specific SLRU provided to reset function SELECT pg_stat_reset_slru(); SELECT stats_reset > :'slru_commit_ts_reset_ts'::timestamptz FROM pg_stat_slru WHERE name = 'CommitTs'; SELECT stats_reset > :'slru_notify_reset_ts'::timestamptz FROM pg_stat_slru WHERE name = 'Notify'; [1] diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql index d867fb406f..e3b4ca96e8 100644 --- a/src/test/regress/sql/stats.sql +++ b/src/test/regress/sql/stats.sql @@ -462,37 +462,31 @@ SELECT stats_reset > :'slru_notify_reset_ts'::timestamptz FROM pg_stat_slru WHER SELECT stats_reset AS archiver_reset_ts FROM pg_stat_archiver \gset SELECT pg_stat_reset_shared('archiver'); SELECT stats_reset > :'archiver_reset_ts'::timestamptz FROM pg_stat_archiver; -SELECT stats_reset AS archiver_reset_ts FROM pg_stat_archiver \gset -- Test that reset_shared with bgwriter specified as the stats type works SELECT stats_reset AS bgwriter_reset_ts FROM pg_stat_bgwriter \gset SELECT pg_stat_reset_shared('bgwriter'); SELECT stats_reset > :'bgwriter_reset_ts'::timestamptz FROM pg_stat_bgwriter; -SELECT stats_reset AS bgwriter_reset_ts FROM pg_stat_bgwriter \gset -- Test that reset_shared with checkpointer specified as the stats type works SELECT stats_reset AS checkpointer_reset_ts FROM pg_stat_checkpointer \gset SELECT pg_stat_reset_shared('checkpointer'); SELECT stats_reset > :'checkpointer_reset_ts'::timestamptz FROM pg_stat_checkpointer; -SELECT stats_reset AS checkpointer_reset_ts FROM pg_stat_checkpointer \gset -- Test that reset_shared with recovery_prefetch specified as the stats type works SELECT stats_reset AS recovery_prefetch_reset_ts FROM pg_stat_recovery_prefetch \gset SELECT pg_stat_reset_shared('recovery_prefetch'); SELECT stats_reset > :'recovery_prefetch_reset_ts'::timestamptz FROM pg_stat_recovery_prefetch; -SELECT stats_reset AS recovery_prefetch_reset_ts FROM pg_stat_recovery_prefetch \gset -- Test that reset_shared with slru specified as the stats type works SELECT max(stats_reset) AS slru_reset_ts FROM pg_stat_slru \gset SELECT pg_stat_reset_shared('slru'); SELECT max(stats_reset) > :'slru_reset_ts'::timestamptz FROM pg_stat_slru; -SELECT max(stats_reset) AS slru_reset_ts FROM pg_stat_slru \gset -- Test that reset_shared with wal specified as the stats type works SELECT stats_reset AS wal_reset_ts FROM pg_stat_wal \gset SELECT pg_stat_reset_shared('wal'); SELECT stats_reset > :'wal_reset_ts'::timestamptz FROM pg_stat_wal; -SELECT stats_reset AS wal_reset_ts FROM pg_stat_wal \gset -- Test error case for reset_shared with unknown stats type SELECT pg_stat_reset_shared('unknown'); -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 2023-11-27 15:49:01 +0530, Bharath Rupireddy wrote: > On Mon, Nov 27, 2023 at 8:26 AM Michael Paquier <michael@paquier.xyz> wrote: > > but two resets done on the same shared stats type would still > > be prone to race conditions without all the previous activity done in > > the tests (like pg_stat_wal). > > Can running stats.sql in non-parallel mode help stabilize the tests as-is? I think that'd be a cure *way* worse than the disease. Having concurrent stats activity isn't exactly uncommon. And because of checkpoints, autovacuum etc, you'd still have rare situations of concurrency. Greetings, Andres Freund
Hi, On 2023-11-27 11:56:19 +0900, Michael Paquier wrote: > I was ready to argue that we'd better keep this test and keep it close > to the end of stats.sql while documenting why things are kept in this > order, but two resets done on the same shared stats type would still > be prone to race conditions without all the previous activity done in > the tests (like pg_stat_wal). I am probably under-caffeinated: What precisely is the potential race? Just that the timestamps on some system might not be granular enough? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > I am probably under-caffeinated: What precisely is the potential race? Just > that the timestamps on some system might not be granular enough? The problem as I see it is that this test: SELECT :io_stats_post_reset < :io_stats_pre_reset; requires an assumption that less I/O has happened since the commanded reset action than happened before it (extending back to the previous reset, or cluster start). Since concurrent processes might be doing I/O, this has a race condition. If we are slow enough about obtaining :io_stats_post_reset, the test *will* fail eventually. But the shorter the distance back to the previous reset, the bigger the odds of observable trouble; thus Michael's concern that adding more reset tests in future would increase the risk of failure. regards, tom lane
On Mon, Nov 27, 2023 at 02:01:51PM -0500, Tom Lane wrote: > The problem as I see it is that this test: > > SELECT :io_stats_post_reset < :io_stats_pre_reset; > > requires an assumption that less I/O has happened since the commanded > reset action than happened before it (extending back to the previous > reset, or cluster start). Since concurrent processes might be doing > I/O, this has a race condition. If we are slow enough about obtaining > :io_stats_post_reset, the test *will* fail eventually. But the shorter > the distance back to the previous reset, the bigger the odds of > observable trouble; thus Michael's concern that adding more reset > tests in future would increase the risk of failure. The new reset added just before checking the contents of pg_stat_io reduces :io_stats_pre_reset from 7M to 50k. That's a threshold easy to reach if you have a checkpoint or an autovacuum running in parallel. I have not checked the buildfarm logs in details, but I'd put a coin on a checkpoint triggered by time if the issue happened on a slow machine. -- Michael
Вложения
On Sun, Nov 26, 2023 at 10:34:59PM -0500, Tom Lane wrote: > I'm good with that answer --- I doubt that this test sequence is > proving anything that's worth the cycles it takes. If it'd catch > oversights like failing to add new stats types to the "reset all" > code path, then I'd be for keeping it; but I don't see how the > test could notice that. For now I've applied a patch that removes the whole sequence. I'll keep an eye on the buildfarm for a few days in case there are more failures. -- Michael