Обсуждение: increased duration of stats_ext tests with -DCLOBBER_CACHE_ALWAYS
Hi, I have two buildfarm animals running tests with -DCLOBBER_CACHE_ALWAYS, and both of them started having issues with timeouts Last file mtime in snapshot: Sat Nov 29 21:41:49 2025 GMT =================================================== timed out after 14400 secs The animals are avocet and trilobite, and it's actually a single VM, with one animal for gcc or clang. It's fine that these tests take quite a bit of time, but I'm puzzled why it started to fail. And it seems the stats_ext regression test got much slower at the end of September. The last couple runs from avocet have these durations: 11:44:42 ago timedout [713d9a8] (01:54:23) Details 7 days 23:33:36 ago OK [bc32a12] (34:19:40) Config 16 days 12:19:09 ago OK [* 23792] (34:32:51) Config 63 days 03:55:34 ago OK [91df046] (31:14:46) Config 71 days 10:57:06 ago OK [9fc7f6a] (31:51:02) Config 79 days 05:46:56 ago OK [454c046] (19:35:25) Config 87 days 00:36:55 ago OK [06473f5] (19:31:16) Config 94 days 19:50:32 ago OK [f727b63] (19:05:05) Config 102 days 14:43:13 ago OK [ef03ea0] (19:09:05) Config ... So it was ~19h for a while (started at ~14h about 4y back). And then between September 14 and 22 it jumped to ~32h. Which seems like a lot. And that seems to be due to stats_ext changing from ok 157 + stats_ext 855865 ms to ok 157 + stats_ext 15158426 ms That's ~14300 seconds, i.e. ~4h. Pretty close to the timeout limit. The machine is running the test with two locales, so there are multiple runs of this regression test. Any ideas what might be causing this? I see exactly the same issue on the other machine (trilobite). There were no upgrades or other changes around that time, and no other regression test has this issue. regards -- Tomas Vondra
On 12/2/25 17:05, Tomas Vondra wrote:
> ...
>
> So it was ~19h for a while (started at ~14h about 4y back). And then
> between September 14 and 22 it jumped to ~32h. Which seems like a lot.
>
> And that seems to be due to stats_ext changing from
>
> ok 157 + stats_ext 855865 ms
>
> to
>
> ok 157 + stats_ext 15158426 ms
>
> That's ~14300 seconds, i.e. ~4h. Pretty close to the timeout limit. The
> machine is running the test with two locales, so there are multiple runs
> of this regression test.
>
After bisecting this, it seems to have changed in this commit:
commit 1eccb93150707acfcc8f24556a15742a6313c8ac (HEAD -> stats-ext)
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat Sep 20 12:44:52 2025 -0400
Re-allow using statistics for bool-valued functions in WHERE.
Commit a391ff3c3, which added the ability for a function's support
function to provide a custom selectivity estimate for "WHERE ...
And it seems to be "stuck" on the last few queries doing this:
SELECT * FROM check_estimated_rows('SELECT * FROM sb_2 WHERE
extstat_small(y)');
Attached is a perf-report for this, but the interesting part is:
--99.51%--plpgsql_exec_function
exec_toplevel_block.constprop.0
exec_stmt_block
exec_stmts
exec_stmt_return (inlined)
exec_eval_expr
exec_eval_simple_expr (inlined)
|
--99.37%--SPI_plan_get_cached_plan
GetCachedPlan
|
|--96.58%--RevalidateCachedQuery
| |
| |--96.00%--pg_analyze_and_rewrite_withcb
| | parse_analyze_withcb
| | |
| | --96.00%--transformTopLevelStmt (inlined)
I'm not sure if this a problem in the code, or whether it's just that
the test case happens to be this expensive with cache clobbering. Maybe
I simply need to increase the timeout for this machine, so that it does
not fail after 4h? It'd be unfortunate, as the runs are already long,
and this pushes them over 24h ... :-(
regards
--
Tomas Vondra
Вложения
Tomas Vondra <tomas@vondra.me> writes:
> On 12/2/25 17:05, Tomas Vondra wrote:
>> So it was ~19h for a while (started at ~14h about 4y back). And then
>> between September 14 and 22 it jumped to ~32h. Which seems like a lot.
> After bisecting this, it seems to have changed in this commit:
> commit 1eccb93150707acfcc8f24556a15742a6313c8ac (HEAD -> stats-ext)
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date: Sat Sep 20 12:44:52 2025 -0400
> Re-allow using statistics for bool-valued functions in WHERE.
Ugh.
> Attached is a perf-report for this, but the interesting part is:
> --99.51%--plpgsql_exec_function
Yeah, I can imagine that constantly flushing the cached plan for
that plpgsql function would be bad. Let me see if I can reformulate
that test without using a plpgsql function --- right offhand, it's
not obvious why a built-in function wouldn't serve the purpose
just as well.
regards, tom lane
I wrote:
> Yeah, I can imagine that constantly flushing the cached plan for
> that plpgsql function would be bad. Let me see if I can reformulate
> that test without using a plpgsql function --- right offhand, it's
> not obvious why a built-in function wouldn't serve the purpose
> just as well.
I pushed a change for this. On my Mac laptop, it brings the time
for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from
I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours.
BTW, I noticed that neither avocet nor trilobite seem to have
'use_installcheck_parallel' enabled in their BF config files.
That results in the installcheck steps taking longer than the
check step, when they should be the same time or shorter.
You could shave several hours off the animals' runtime by
enabling that.
(I'm not really sure why that's not on by default ... if the
check steps are automatically parallelized, why not installcheck?)
regards, tom lane
On 12/3/25 19:33, Tom Lane wrote: > I wrote: >> Yeah, I can imagine that constantly flushing the cached plan for >> that plpgsql function would be bad. Let me see if I can reformulate >> that test without using a plpgsql function --- right offhand, it's >> not obvious why a built-in function wouldn't serve the purpose >> just as well. > > I pushed a change for this. On my Mac laptop, it brings the time > for stats_ext with -DCLOBBER_CACHE_ALWAYS down to ~8 minutes, from > I-didn't-have-the-patience-to-wait-but-it-would-have-been-hours. > Thanks! > BTW, I noticed that neither avocet nor trilobite seem to have > 'use_installcheck_parallel' enabled in their BF config files. > That results in the installcheck steps taking longer than the > check step, when they should be the same time or shorter. > You could shave several hours off the animals' runtime by > enabling that. > Good idea, I enabled that for both animals. -- Tomas Vondra