Обсуждение: Intermittent test plan change in "privileges" test on BF animal prion

Поиск
Список
Период
Сортировка

Intermittent test plan change in "privileges" test on BF animal prion

От
Thomas Munro
Дата:
Hello,

Two recent failures show plan changes in RLS queries on master.  Based
on nearby comments, the choice plan is being used to verify access (or
lack of access) to row estimates, so I guess that means something
could be amiss here.  (Or it could be due to the dropped UDP flaky
stats problem, but then why in the same place twice, and why twice in
a week, only on master, and not for months before that?)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-06-08%2002%3A58%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-06-06%2003%3A18%3A03



Re: Intermittent test plan change in "privileges" test on BF animal prion

От
Tom Lane
Дата:
Thomas Munro <thomas.munro@gmail.com> writes:
> Two recent failures show plan changes in RLS queries on master.

Yeah.  I assume this is related to 0c882e52a, but I'm not sure how.
The fact that we've only seen it on prion (which runs
-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE) is suggestive,
but it's not clear why those options would lead to unstable
planner estimates.  I've been waiting to see if we start to get
similar reports from the full-fledged CLOBBER_CACHE_ALWAYS critters.

            regards, tom lane



Re: Intermittent test plan change in "privileges" test on BF animal prion

От
David Rowley
Дата:
On Tue, 9 Jun 2020 at 14:27, Thomas Munro <thomas.munro@gmail.com> wrote:
> Two recent failures show plan changes in RLS queries on master.  Based
> on nearby comments, the choice plan is being used to verify access (or
> lack of access) to row estimates, so I guess that means something
> could be amiss here.  (Or it could be due to the dropped UDP flaky
> stats problem, but then why in the same place twice, and why twice in
> a week, only on master, and not for months before that?)

I see 0c882e52a did change the number of statistics targets on that
table. The first failure was on the commit directly after that one.
I'm not sure what instability Tom meant when he wrote "-- results
below depend on having quite accurate stats for atest12".

It does seem plausible, given how slow prion is that autovacuum might
be trigger after the manual vacuum somehow and building stats with
just 1k buckets instead of 10k. 0936d1b6 made some changes to disable
autovacuum because it was sometimes coming in and messing with the
statistics, maybe we need to do the same here, or at least do
something less temporary than changing default_statistics_target.

select attname,array_length(histogram_bounds,1) from pg_stats where
tablename = 'atest12' order by attname;

should mention the array length is 10000 if it's working as intended.
Is it worth sticking that query in there before and after the failures
to ensure we're working with the stats we think we are?

David



Re: Intermittent test plan change in "privileges" test on BF animal prion

От
Tom Lane
Дата:
David Rowley <dgrowleyml@gmail.com> writes:
> I see 0c882e52a did change the number of statistics targets on that
> table. The first failure was on the commit directly after that one.
> I'm not sure what instability Tom meant when he wrote "-- results
> below depend on having quite accurate stats for atest12".

See [1], particularly the para about "When I went to test 0002".
At least one of those test cases fails if the planner estimates more
than one row being selected by the user-defined operator, and since the
table has 10K rows, that means we need 1/10000 selectivity precision.

> It does seem plausible, given how slow prion is that autovacuum might
> be trigger after the manual vacuum somehow and building stats with
> just 1k buckets instead of 10k.

Hmm ... that's a plausible theory, perhaps.  I forget: does autovac
recheck, after acquiring the requisite table lock, whether the table
still needs to be processed?

> 0936d1b6 made some changes to disable
> autovacuum because it was sometimes coming in and messing with the
> statistics, maybe we need to do the same here, or at least do
> something less temporary than changing default_statistics_target.

Yeah, setting that as a table parameter seems like a better idea than
setting default_statistics_target.

            regards, tom lane



Re: Intermittent test plan change in "privileges" test on BF animal prion

От
David Rowley
Дата:
On Tue, 9 Jun 2020 at 15:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > It does seem plausible, given how slow prion is that autovacuum might
> > be trigger after the manual vacuum somehow and building stats with
> > just 1k buckets instead of 10k.
>
> Hmm ... that's a plausible theory, perhaps.  I forget: does autovac
> recheck, after acquiring the requisite table lock, whether the table
> still needs to be processed?

It does, but I wondered if there was a window after the manual vacuum
resets n_ins_since_vacuum and between when autovacuum looks at it.

David



Re: Intermittent test plan change in "privileges" test on BF animal prion

От
Tom Lane
Дата:
David Rowley <dgrowleyml@gmail.com> writes:
> On Tue, 9 Jun 2020 at 15:41, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hmm ... that's a plausible theory, perhaps.  I forget: does autovac
>> recheck, after acquiring the requisite table lock, whether the table
>> still needs to be processed?

> It does, but I wondered if there was a window after the manual vacuum
> resets n_ins_since_vacuum and between when autovacuum looks at it.

Oh, there surely is, because of the lag in the stats collection mechanism.
I'm trying to reproduce this now, but it's sounding pretty plausible.

BTW, it looks like I managed to trim the reference off my prior message,
but I meant [1] to refer to
https://www.postgresql.org/message-id/666679.1591138428%40sss.pgh.pa.us

            regards, tom lane



Re: Intermittent test plan change in "privileges" test on BF animal prion

От
Tom Lane
Дата:
I wrote:
> I'm trying to reproduce this now, but it's sounding pretty plausible.

Yeah, that's definitely it.  I was able to reproduce the failure semi
reliably (every two or three tries) after adding -DRELCACHE_FORCE_RELEASE
-DCATCACHE_FORCE_RELEASE and inserting a "pg_sleep(1)" just after the
manual vacuum in privileges.sql; and as you'd guessed, the stats arrays
were just normal size in the failing runs.  After disabling autovac on
the table, the failure went away.

Thanks for the insight!

            regards, tom lane