Thread: Test timings are increasing too fast for cfbot


Test timings are increasing too fast for cfbot

From: Andres Freund
Date:
Hi,

I was working on giving at least committers, and perhaps a few more folks,
access to our custom compute setup for running CI, as I had gotten a few
reports about folks running out of CI cycles during busy periods like right
now.

I had assumed that we had enough compute credits to make that possible.
Prodded by a question from Melanie, I luckily did check.

Unfortunately it seems that our usage has increased substantially enough to
make that a problem, even with just cfbot.

Most of the tests are quite cheap to run; however, Windows costs a fair bit in
licensing (we spend twice as much on Windows licenses as on compute, RAM and
storage for all the tests together). There are a few other things we can
improve (e.g. we spend a fair bit on cross-region data transfers, which should
largely be fixable), but we do need to get the increase in Windows runtime
under control.


Here's a look at the test timings.  Unfortunately the trends are decidedly not
good, particularly on Windows. I looked at a few random cross-branch commits,
considering only the test times (since build times are affected by caching
etc.):

         CI msvc    CI mingw    CI linux    andres-local
17       11:21      10:38       3:32        1:22
18       16:03      14:22       5:39        1:38
master   18:13      15:00       6:69        1:49

I checked this across a few commits and got similar numbers
everywhere. There's a lot of noise, but the trendline is pretty darn clear.


Due to the number of times we run the main regression tests, they have a
particularly large effect on test resources.

             time     cycles     syscalls
17           37.13    239.2T     1.573M
18           44.27    295.7T     1.715M
master       48.19    323.5T     1.854M


Another interesting aspect is the time spent doing initdb has gone up
noticeably:

        assert    optimize
17      1.934     819.1
18      2.190     827.9
master  2.274     814.5


Which, given the number of times initdb is executed in the tests (even with
the template mechanism in place), is probably a factor in all this.  It's
certainly interesting that assertions somehow play a major role, given that
the non-assert time hasn't regressed.



Some increase in test cycles and time is inevitable, but to me the current
rate of increase is not sustainable: it's faster than the rate at which
hardware speed increases.


Greetings,

Andres Freund



Re: Test timings are increasing too fast for cfbot

From: Andres Freund
Date:
Hi,

On 2026-03-24 17:30:46 -0400, Andres Freund wrote:
> Due to the number of times we run the main regression tests, they have a
> particularly large effect on test resources.
>
>              time    cycles     syscalls
> 17           37.13    239.2T   1.573M
> 18           44.27      295.7T   1.715M
> master       48.19      323.5T   1.854M

A surprisingly large source of this is psql's internal queries.  I first ran the
regression tests with pg_stat_statements activated [1] and noticed that a lot
of the queries, by exec and/or plan time, are psql queries.

I patched psql to add "-- psql internal\n" to every internal query and then
used log_min_duration_statement=0 to see how much of the time is spent in
psql: ~13% of the reported duration of all queries.


A lot of that is because describeOneTableDetails() ends up with a sequential scan:

regression[1020428][1]=# EXPLAIN ANALYZE SELECT c.oid,
  n.nspname,
  c.relname
FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relname OPERATOR(pg_catalog.~) '^(pg_class)$' COLLATE pg_catalog.default
  AND pg_catalog.pg_table_is_visible(c.oid)
ORDER BY 2, 3;

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Sort  (cost=324.79..324.80 rows=1 width=132) (actual time=7.192..7.194 rows=1.00 loops=1)
   Sort Key: n.nspname, c.relname
   Sort Method: quicksort  Memory: 25kB
   Buffers: shared hit=234
   ->  Nested Loop Left Join  (cost=0.00..324.78 rows=1 width=132) (actual time=2.489..7.121 rows=1.00 loops=1)
         Join Filter: (n.oid = c.relnamespace)
         Buffers: shared hit=231
         ->  Seq Scan on pg_class c  (cost=0.00..313.79 rows=1 width=72) (actual time=2.473..7.105 rows=1.00 loops=1)
               Filter: ((relname ~ '^(pg_class)$'::text) AND pg_table_is_visible(oid))
               Rows Removed by Filter: 2260
               Buffers: shared hit=229
         ->  Seq Scan on pg_namespace n  (cost=0.00..8.22 rows=222 width=68) (actual time=0.011..0.011 rows=1.00 loops=1)
               Buffers: shared hit=2
 Planning:
   Buffers: shared hit=6
 Planning Time: 0.520 ms
 Execution Time: 7.265 ms
(17 rows)

It seems decidedly not optimal that "\d tablename", without any patterns, ends
up doing a seqscan.  That's bad enough in the regression database, but there
are many PG instances with many many entries in pg_class.

I don't think this was always the case?

If I remove the COLLATE pg_catalog.default, a sane plan is chosen. That's
obviously not the right fix, but seemed interesting enough to mention.


That alone makes a surprising 4.5% improvement in an unoptimized -O0
build.


Another psql query that stands out is \d's query for publications. Not because
of its runtime, but because of the planning cost. Nuking that (obviously not
correct) is worth another 3% of test time (and causes some test failures).


I've not analyzed the psql queries any further, but I'd bet that the rest
adds up as well.



The most expensive query in the entire regression tests is:
  DO $$
  BEGIN
      FOR r IN 1..1350 LOOP
          DELETE FROM dedup_unique_test_table;
          INSERT INTO dedup_unique_test_table SELECT 1;
      END LOOP;
  END$$;
which makes sense, as that's pretty clearly O(N^2).



There are a surprisingly large number of temporary file creations in the
regression tests, a lot of them due to shared file sets in the context of
parallelism.  I wonder whether that is a partial cause of the slowness on
Windows, which has very slow metadata operations and slower data caching.


The slowest test is stats_ext.sql. Not surprising: it does sequential scans
of tables with ~1000-10000 rows over and over again.  I don't see why it has
to do that with as many rows as it does.



Another thing we spend a decent amount of time on, distributed over many
places, is FROM generate_series(x, y), which ends up slow because
nodeFunctionscan.c always uses a tuplestore.


Code wise, the most immediately noticeable things are
1) MEMORY_CONTEXT_CHECKING
2) AssertCheckRanges() (only in the brin test, but there it's a very large
   portion of the runtime)
3) verify_compact_attribute(), pretty spread around


Greetings,

Andres Freund

[1] which annoyingly generates a bunch of test failures due to "Query
Identifier" getting added to tests



Re: Test timings are increasing too fast for cfbot

From: Julien Rouhaud
Date:
Hi,

On Tue, Mar 24, 2026 at 11:15:32PM -0400, Andres Freund wrote:
>
> A surprisingly large source of this is psql internal queries.  I first ran the
> regression tests with pg_stat_statements activated [1] and noticed that a lot
> of the queries, by exec and or plan time, are psql queries.
>
> [1] which annoyingly generates a bunch of test failures due to "Query
> Identifier" getting added to tests

FWIW `compute_query_id = regress` is supposed to avoid that problem.



Make \d tablename fast again, regression introduced by 85b7efa1cdd

From: "Jelte Fennema-Nio"
Date:
(forked from: Test timings are increasing too fast for cfbot)

On Wed, 25 Mar 2026 at 04:15, Andres Freund <andres@anarazel.de> wrote:
> It seems decidedly not optimal that "\d tablename", without any patterns, ends
> up doing a seqscan.  That's bad enough in the regression database, but there
> are many PG instances with many many entries in pg_class.
>
> I don't think this was always the case?
>
> If I remove the COLLATE pg_catalog.default, a sane plan is chosen. That's
> obviously not the right fix, but seemed interesting enough to mention.

Due to a very similar problem I faced in the past[1], I thought I had a
good sense of roughly where the problem was, and I indeed quickly found
it.

Attached is a patch that addresses this issue and starts using index
scans again for \d tablename.

This should be backpatched to PG18, where the regression was introduced
by 85b7efa1cdd.

[1]: https://www.postgresql.org/message-id/flat/CAGECzQRqysy0eJMKR5he3gwtLrT87f9u5CQQua6B_XNwMnUtFA%40mail.gmail.com

Attachments

Re: Test timings are increasing too fast for cfbot

From: Tomas Vondra
Date:
On 3/25/26 04:15, Andres Freund wrote:
> ...
> 
> The slowest test is stats_ext.sql. Not surprising: it does sequential scans
> of tables with ~1000-10000 rows over and over again.  I don't see why it has
> to do that with as many rows as it does.
> 

IIRC we needed to use non-trivial amounts of data to ensure building the
right type of statistics (e.g. no MCV). But I understand it can be
annoyingly expensive, so I'll try to make it cheaper.

>  ...
> 2) AssertCheckRanges() (only in the brin test, but there it's a very large
>    portion of the runtime)

True. It is a very comprehensive validation of the ranges, and it was
very useful - particularly during development. But I'll try to make it
more targeted at the stuff actually changed / called less often.

Both changes will require time (so that we don't lose test coverage),
but I assume it's OK if that happens sometime after the feature freeze.


regards

-- 
Tomas Vondra




Re: Test timings are increasing too fast for cfbot

From: Andres Freund
Date:
Hi,

On 2026-03-25 15:38:01 +0100, Tomas Vondra wrote:
> On 3/25/26 04:15, Andres Freund wrote:
> > ...
> > 
> > The slowest test is stats_ext.sql. Not surprising: it does sequential scans
> > of tables with ~1000-10000 rows over and over again.  I don't see why it has
> > to do that with as many rows as it does.
> > 
> 
> IIRC we needed to use non-trivial amounts of data to ensure building the
> right type of statistics (e.g. no MCV). But I understand it can be
> annoyingly expensive, so I'll try to make it cheaper.

Great.  It might also be worth checking whether any of the query conditions
can be made cheaper without actually changing what they test.


> >  ...
> > 2) AssertCheckRanges() (only in the brin test, but there it's a very large
> >    portion of the runtime)
> 
> True. It is a very comprehensive validation of the ranges, and it was
> very useful - particularly during development. But I'll try to make it
> more targeted at the stuff actually changed / called less often.

Makes sense.  There might also be some changes that make it faster without
losing any coverage, e.g. not using FunctionCall2Coll() - which initializes
stuff on every call - but doing InitFunctionCallInfoData() once and then
updating the arguments + calling FunctionCallInvoke() for each call.


> Both changes will require time (so that we don't lose test coverage),
> but I assume it's OK if that happens sometime after the feature freeze.

Yea. We're not running out of credits tomorrow or something like that.

Greetings,

Andres Freund



Re: Test timings are increasing too fast for cfbot

From: David Rowley
Date:
On Wed, 25 Mar 2026 at 16:15, Andres Freund <andres@anarazel.de> wrote:
> Code wise, the most immediately noticeable things are

> 3) verify_compact_attribute(), pretty spread around

We do now have TupleDescFinalize(), where those could be checked just
once rather than on each call to TupleDescCompactAttr(). That's not
quite as watertight a guarantee, since someone could change the
FormData_pg_attribute after TupleDescFinalize(). But doing it just in
TupleDescFinalize() would at least still catch the places where people
forget to call populate_compact_attribute() before
TupleDescFinalize().

I would have expected this to be a little less overhead now, since
d8a859d22 removed the calls to TupleDescCompactAttr() in the main
deforming routine. Maybe I should just make that change in the other
deformers...? Do you have an idea of which callers of
verify_compact_attribute() are causing the most overhead?

David



Re: Test timings are increasing too fast for cfbot

From: Andres Freund
Date:
On 2026-03-26 11:09:36 +1300, David Rowley wrote:
> On Wed, 25 Mar 2026 at 16:15, Andres Freund <andres@anarazel.de> wrote:
> > Code wise, the most immediately noticeable things are
> 
> > 3) verify_compact_attribute(), pretty spread around
> 
> We do now have TupleDescFinalize(), where those could be checked just
> once rather than on each call to TupleDescCompactAttr(). That's not
> quite as watertight a guarantee, since someone could change the
> FormData_pg_attribute after TupleDescFinalize(). But doing it just in
> TupleDescFinalize() would at least still catch the places where people
> forget to call populate_compact_attribute() before
> TupleDescFinalize().

Maybe verify_compact_attribute() could just do an assert comparison between
the underlying non-compact attribute and the compact one? Or even just an
assert checking if the compact attribute is initialized, with the full
checking happening in TupleDescFinalize(), as you suggest?


> I would have expected this to be a little less overhead now since
> d8a859d22 removed the calls to TupleDescCompactAttr() in the main
> deforming routine. Maybe I should just make that change in the other
> deformers...?

Might be worth it.


> Do you have an idea of which callers of verify_compact_attribute() are
> causing the most overhead?

I'm not sure how much to believe the profile when the costs are as distributed
as they are here. But according to the profile it's
- heap_form_tuple
- heap_form_minimal_tuple
- index_getattr
- nocachegetattr

Greetings,

Andres Freund



Re: Test timings are increasing too fast for cfbot

From: Andres Freund
Date:
Hi,

On 2026-03-25 12:36:12 +0800, Julien Rouhaud wrote:
> On Tue, Mar 24, 2026 at 11:15:32PM -0400, Andres Freund wrote:
> >
> > A surprisingly large source of this is psql internal queries.  I first ran the
> > regression tests with pg_stat_statements activated [1] and noticed that a lot
> > of the queries, by exec and or plan time, are psql queries.
> >
> > [1] which annoyingly generates a bunch of test failures due to "Query
> > Identifier" getting added to tests
> 
> FWIW `compute_query_id = regress` is supposed to avoid that problem.

Hm. I guess that should work. Thanks!

It's pretty annoying that one has to manually specify it to keep the tests
passing.  I wonder if we should make pg_regress specify it to avoid that.

Greetings,

Andres Freund