Обсуждение: New instability in stats regression test

Поиск
Список
Период
Сортировка

New instability in stats regression test

От
Tom Lane
Дата:
In the past few days we've had two buildfarm failures[1][2] in the
stats regression test that look like

@@ -1582,7 +1582,7 @@
 SELECT :io_stats_post_reset < :io_stats_pre_reset;
  ?column?
 ----------
- t
+ f
 (1 row)

 -- test BRIN index doesn't block HOT update

I'm a bit mystified by this.  This test was introduced in Andres'
commit 10a082bf7 of 2023-02-11, and it seems to have been stable
since then.  I trawled the buildfarm logs going back three months
and found no similar failures.  So why's it failing now?  The
most plausible theory seems to be that Michael's recent commits
adding pg_stat_reset_xxx features destabilized the test somehow ...
but I sure don't see how/why.

Failure [1] was on my own animal longfin, so I tried to reproduce
it on that animal's host, but no luck so far.

Thoughts?

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=longfin&dt=2023-11-21%2001%3A55%3A00
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=guaibasaurus&dt=2023-11-25%2016%3A20%3A04



Re: New instability in stats regression test

От
Tom Lane
Дата:
I wrote:
> I'm a bit mystified by this.  This test was introduced in Andres'
> commit 10a082bf7 of 2023-02-11, and it seems to have been stable
> since then.  I trawled the buildfarm logs going back three months
> and found no similar failures.  So why's it failing now?  The
> most plausible theory seems to be that Michael's recent commits
> adding pg_stat_reset_xxx features destabilized the test somehow ...
> but I sure don't see how/why.

After a bit more looking around, I have part of a theory.
Commit 23c8c0c8f of 2023-11-12 added this, a little ways before
the problematic test:

-- Test that reset_shared with no argument resets all the stats types
-- supported (providing NULL as argument has the same effect).
SELECT pg_stat_reset_shared();

The test that is failing is of course

-- Test IO stats reset
SELECT pg_stat_have_stats('io', 0, 0);
SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) +
sum(hits)AS io_stats_pre_reset 
  FROM pg_stat_io \gset
SELECT pg_stat_reset_shared('io');
SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) +
sum(hits)AS io_stats_post_reset 
  FROM pg_stat_io \gset
SELECT :io_stats_post_reset < :io_stats_pre_reset;

So the observed failure could be explained if, between the
"pg_stat_reset_shared('io')" call and the subsequent scan of
pg_stat_io, concurrent sessions had done more I/O operations
than happened since that new pg_stat_reset_shared() call.
Previously, the "pre_reset" counts would be large enough to
make that a pretty ridiculous theory, but after 23c8c0c8f maybe
it's not.

To test this idea, I made the test print out the actual values
of the counts, like this:

@@ -1585,10 +1585,10 @@

 SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) +
sum(hits)AS io_stats_post_reset 
   FROM pg_stat_io \gset
-SELECT :io_stats_post_reset < :io_stats_pre_reset;
- ?column?
-----------
- t
+SELECT :io_stats_post_reset, :io_stats_pre_reset;
+ ?column? | ?column?
+----------+----------
+    10452 |   190087
 (1 row)

Of course, this makes it fail every time, but the idea is to get
a sense of the magnitude of the counts; and what I'm seeing is
that the "pre reset" counts are typically 10x more than the
"post reset" ones, even after 23c8c0c8f.  If I remove the
suspicious pg_stat_reset_shared() call, there's about 3 orders
of magnitude difference; but still you'd think a 10x safety
margin would be enough.  So this theory doesn't seem to quite
work as-is.  Perhaps there's some additional contributing factor
I didn't think to control.

Nonetheless, it seems like a really bad idea that this test
of I/O stats reset happens after the newly-added test.  It
is clearly now dependent on timing and the amount of concurrent
activity whether it will pass or not.  We should probably
re-order the tests to do the old test first; or else abandon
this test methodology and just test I/O reset the same way
we test the other cases (checking only for timestamp advance).
Or maybe we don't really need the pg_stat_reset_shared() test?

            regards, tom lane



Re: New instability in stats regression test

От
Michael Paquier
Дата:
On Sat, Nov 25, 2023 at 02:34:40PM -0500, Tom Lane wrote:
> -- Test that reset_shared with no argument resets all the stats types
> -- supported (providing NULL as argument has the same effect).
> SELECT pg_stat_reset_shared();

Right, this has switched pg_stat_reset_shared() from doing nothing to
do a full reset.  Removing this reset switches the results of
io_stats_pre_reset from a 7-digit number to a 4-digit number, thanks
to all the previous I/O activity generated by all the tests.

> Nonetheless, it seems like a really bad idea that this test
> of I/O stats reset happens after the newly-added test.  It
> is clearly now dependent on timing and the amount of concurrent
> activity whether it will pass or not.  We should probably
> re-order the tests to do the old test first; or else abandon
> this test methodology and just test I/O reset the same way
> we test the other cases (checking only for timestamp advance).
> Or maybe we don't really need the pg_stat_reset_shared() test?

I was ready to argue that we'd better keep this test and keep it close
to the end of stats.sql while documenting why things are kept in this
order, but two resets done on the same shared stats type would still
be prone to race conditions without all the previous activity done in
the tests (like pg_stat_wal).

With all that in mind and because we have checks for the individual
targets with pg_stat_reset_shared(), I would agree to just remove it
entirely.  Say as of the attached?
--
Michael

Вложения

Re: New instability in stats regression test

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> With all that in mind and because we have checks for the individual
> targets with pg_stat_reset_shared(), I would agree to just remove it
> entirely.  Say as of the attached?

I'm good with that answer --- I doubt that this test sequence is
proving anything that's worth the cycles it takes.  If it'd catch
oversights like failing to add new stats types to the "reset all"
code path, then I'd be for keeping it; but I don't see how the
test could notice that.

            regards, tom lane



Re: New instability in stats regression test

От
Bharath Rupireddy
Дата:
On Mon, Nov 27, 2023 at 8:26 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> I was ready to argue that we'd better keep this test and keep it close
> to the end of stats.sql while documenting why things are kept in this
> order,

It's easy for someone to come and add pg_stat_reset_shared() before
the end without noticing the comment as the test failure is sporadic
in nature.

> but two resets done on the same shared stats type would still
> be prone to race conditions without all the previous activity done in
> the tests (like pg_stat_wal).

Can running stats.sql in non-parallel mode help stabilize the tests as-is?

> With all that in mind and because we have checks for the individual
> targets with pg_stat_reset_shared(), I would agree to just remove it
> entirely.  Say as of the attached?

I tend to agree with this approach, the code is anyways covered. I
think the attached patch also needs to remove setting
archiver_reset_ts (and friends) after pg_stat_reset_shared('archiver')
(and friends), something like [1].

Can we also remove pg_stat_reset_slru() with no argument test to keep
things consistent?

-- Test that multiple SLRUs are reset when no specific SLRU provided
to reset function
SELECT pg_stat_reset_slru();
SELECT stats_reset > :'slru_commit_ts_reset_ts'::timestamptz FROM
pg_stat_slru WHERE name = 'CommitTs';
SELECT stats_reset > :'slru_notify_reset_ts'::timestamptz FROM
pg_stat_slru WHERE name = 'Notify';

[1]
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index d867fb406f..e3b4ca96e8 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -462,37 +462,31 @@ SELECT stats_reset >
:'slru_notify_reset_ts'::timestamptz FROM pg_stat_slru WHER
 SELECT stats_reset AS archiver_reset_ts FROM pg_stat_archiver \gset
 SELECT pg_stat_reset_shared('archiver');
 SELECT stats_reset > :'archiver_reset_ts'::timestamptz FROM pg_stat_archiver;
-SELECT stats_reset AS archiver_reset_ts FROM pg_stat_archiver \gset

 -- Test that reset_shared with bgwriter specified as the stats type works
 SELECT stats_reset AS bgwriter_reset_ts FROM pg_stat_bgwriter \gset
 SELECT pg_stat_reset_shared('bgwriter');
 SELECT stats_reset > :'bgwriter_reset_ts'::timestamptz FROM pg_stat_bgwriter;
-SELECT stats_reset AS bgwriter_reset_ts FROM pg_stat_bgwriter \gset

 -- Test that reset_shared with checkpointer specified as the stats type works
 SELECT stats_reset AS checkpointer_reset_ts FROM pg_stat_checkpointer \gset
 SELECT pg_stat_reset_shared('checkpointer');
 SELECT stats_reset > :'checkpointer_reset_ts'::timestamptz FROM
pg_stat_checkpointer;
-SELECT stats_reset AS checkpointer_reset_ts FROM pg_stat_checkpointer \gset

 -- Test that reset_shared with recovery_prefetch specified as the
stats type works
 SELECT stats_reset AS recovery_prefetch_reset_ts FROM
pg_stat_recovery_prefetch \gset
 SELECT pg_stat_reset_shared('recovery_prefetch');
 SELECT stats_reset > :'recovery_prefetch_reset_ts'::timestamptz FROM
pg_stat_recovery_prefetch;
-SELECT stats_reset AS recovery_prefetch_reset_ts FROM
pg_stat_recovery_prefetch \gset

 -- Test that reset_shared with slru specified as the stats type works
 SELECT max(stats_reset) AS slru_reset_ts FROM pg_stat_slru \gset
 SELECT pg_stat_reset_shared('slru');
 SELECT max(stats_reset) > :'slru_reset_ts'::timestamptz FROM pg_stat_slru;
-SELECT max(stats_reset) AS slru_reset_ts FROM pg_stat_slru \gset

 -- Test that reset_shared with wal specified as the stats type works
 SELECT stats_reset AS wal_reset_ts FROM pg_stat_wal \gset
 SELECT pg_stat_reset_shared('wal');
 SELECT stats_reset > :'wal_reset_ts'::timestamptz FROM pg_stat_wal;
-SELECT stats_reset AS wal_reset_ts FROM pg_stat_wal \gset

 -- Test error case for reset_shared with unknown stats type
 SELECT pg_stat_reset_shared('unknown');


--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: New instability in stats regression test

От
Andres Freund
Дата:
Hi,

On 2023-11-27 15:49:01 +0530, Bharath Rupireddy wrote:
> On Mon, Nov 27, 2023 at 8:26 AM Michael Paquier <michael@paquier.xyz> wrote:
> > but two resets done on the same shared stats type would still
> > be prone to race conditions without all the previous activity done in
> > the tests (like pg_stat_wal).
> 
> Can running stats.sql in non-parallel mode help stabilize the tests as-is?

I think that'd be a cure *way* worse than the disease. Having concurrent stats
activity isn't exactly uncommon. And because of checkpoints, autovacuum etc,
you'd still have rare situations of concurrency.

Greetings,

Andres Freund



Re: New instability in stats regression test

От
Andres Freund
Дата:
Hi,

On 2023-11-27 11:56:19 +0900, Michael Paquier wrote:
> I was ready to argue that we'd better keep this test and keep it close
> to the end of stats.sql while documenting why things are kept in this
> order, but two resets done on the same shared stats type would still
> be prone to race conditions without all the previous activity done in
> the tests (like pg_stat_wal).

I am probably under-caffeinated: What precisely is the potential race? Just
that the timestamps on some system might not be granular enough?

Greetings,

Andres Freund



Re: New instability in stats regression test

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> I am probably under-caffeinated: What precisely is the potential race? Just
> that the timestamps on some system might not be granular enough?

The problem as I see it is that this test:

SELECT :io_stats_post_reset < :io_stats_pre_reset;

requires an assumption that less I/O has happened since the commanded
reset action than happened before it (extending back to the previous
reset, or cluster start).  Since concurrent processes might be doing
I/O, this has a race condition.  If we are slow enough about obtaining
:io_stats_post_reset, the test *will* fail eventually.  But the shorter
the distance back to the previous reset, the bigger the odds of
observable trouble; thus Michael's concern that adding more reset
tests in future would increase the risk of failure.

            regards, tom lane



Re: New instability in stats regression test

От
Michael Paquier
Дата:
On Mon, Nov 27, 2023 at 02:01:51PM -0500, Tom Lane wrote:
> The problem as I see it is that this test:
>
> SELECT :io_stats_post_reset < :io_stats_pre_reset;
>
> requires an assumption that less I/O has happened since the commanded
> reset action than happened before it (extending back to the previous
> reset, or cluster start).  Since concurrent processes might be doing
> I/O, this has a race condition.  If we are slow enough about obtaining
> :io_stats_post_reset, the test *will* fail eventually.  But the shorter
> the distance back to the previous reset, the bigger the odds of
> observable trouble; thus Michael's concern that adding more reset
> tests in future would increase the risk of failure.

The new reset added just before checking the contents of pg_stat_io
reduces :io_stats_pre_reset from 7M to 50k.  That's a threshold easy
to reach if you have a checkpoint or an autovacuum running in
parallel.  I have not checked the buildfarm logs in details, but I'd
put a coin on a checkpoint triggered by time if the issue happened on
a slow machine.
--
Michael

Вложения

Re: New instability in stats regression test

От
Michael Paquier
Дата:
On Sun, Nov 26, 2023 at 10:34:59PM -0500, Tom Lane wrote:
> I'm good with that answer --- I doubt that this test sequence is
> proving anything that's worth the cycles it takes.  If it'd catch
> oversights like failing to add new stats types to the "reset all"
> code path, then I'd be for keeping it; but I don't see how the
> test could notice that.

For now I've applied a patch that removes the whole sequence.  I'll
keep an eye on the buildfarm for a few days in case there are more
failures.
--
Michael

Вложения