Discussion: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Radim Marek
Date:
Hello,
First, let me say this is my first bug report and my first attempt at patching PostgreSQL, fresh after the hacking webinar. Please let me know if I need to clarify anything.
TL;DR: EXPLAIN (ANALYZE, BUFFERS) reports different buffer counts for the Sort node between the first and subsequent executions in a session.
Buffers: shared hit=4
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
vs
Buffers: shared hit=1
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
I found it while working on a SQL regression tool (regresql) that uses the total buffers reported by EXPLAIN (ANALYZE, BUFFERS) to detect deviations in query performance.
The issue surfaced on smaller queries, where test cases showed a consistent "improvement" in performance (fewer buffers) when comparing baseline and actual test runs. I had not noticed it previously on larger data sets because the difference in buffers was negligible relative to the larger operations.
The example below is the test case I managed to isolate.
--- how to reproduce
CREATE TABLE sort_buffer_test (id serial PRIMARY KEY, val int);
INSERT INTO sort_buffer_test (val) SELECT generate_series(1, 100);
ANALYZE sort_buffer_test;
-- run twice in new psql session and observe reported buffer change
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;
---
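The counters above are also available in machine-readable form, which is how a tool can total them programmatically; a sketch using the same repro query, with only the FORMAT option added:

```sql
-- Each plan node in the JSON output carries "Shared Hit Blocks" and
-- "Shared Read Blocks" counters; planning-phase buffers appear under
-- the top-level "Planning" key.
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF, FORMAT JSON)
SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;
```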
This behaviour is session-specific; tested on 17.x, 18.x, and the current development version (just fyi, all verified on both amd64 and aarch64).
For ORDER BY, I traced the potential issue down to pg_amop/pg_amproc lookups. The specific lookups vary by operation, but the pattern is the same: the first execution incurs catalog reads that get cached for subsequent runs. This applies to DISTINCT (sorted, not hashed), GROUP BY, window functions, etc., which seems to isolate it to the Sort node.
All of those, I believe, are really metadata resolution (i.e. the planner already knows what to sort by).
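For instance, queries of this shape show the same first-execution difference (a sketch; run each in a fresh session, with enable_hashagg disabled only to force the sorted DISTINCT path):

```sql
SET enable_hashagg = off;  -- force the sorted (not hashed) DISTINCT path

-- The first run in a fresh session reports extra shared hits on the
-- sort step; repeat runs in the same session do not.
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT DISTINCT val FROM sort_buffer_test;

EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT val, row_number() OVER (ORDER BY val) FROM sort_buffer_test;
```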
Having said that, I'm happy to try to fix it. I'm just wondering what the right direction is:
1. Pre-warm the syscache during planning (tried locally and it works, but it does not feel conceptually right and is rather 'hacky').
2. Pre-compute and store the metadata, modifying each plan node struct for each type (lots of work). I believe that's what MergeJoin already does.
3. Or this is 'works as expected' - which I don't think is the case, and it would force me to do pre-warming on the regresql side.
Kind regards,
Radim
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
David Rowley
Date:
On Tue, 3 Feb 2026 at 11:54, Radim Marek <radim@boringsql.com> wrote:
> Buffers: shared hit=4
> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
> Filter: (lower(email) = 'email_1@example.com'::text)
> Rows Removed by Filter: 99
> Buffers: shared hit=1
>
> vs
>
> Buffers: shared hit=1
> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
> Filter: (lower(email) = 'email_1@example.com'::text)
> Rows Removed by Filter: 99
> Buffers: shared hit=1
>
> I traced the potential issue down in case of ORDER BY to pg_amop/pg_amproc lookups. The specific lookups vary by operation but the pattern is the same: first execution incurs catalog reads that get cached for subsequent runs. This applies to DISTINCT (sort, not hashed), GROUP BY, window functions, etc - which seems to isolate it to sort node.

I'm finding it hard to fathom why you think this is a bug. We have various caches that will require accessing various catalogue tables and probably indexes too, which will be accessed unless the cache has already been populated. Accessing the buffers for these is included in the buffers count in EXPLAIN.

What is it you expect to happen here? If we access buffers and don't report them, then IMO, that's a bug.

David
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Radim Marek
Date:
On Tue, 3 Feb 2026 at 00:37, David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 3 Feb 2026 at 11:54, Radim Marek <radim@boringsql.com> wrote:
>> Buffers: shared hit=4
>> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
>> Filter: (lower(email) = 'email_1@example.com'::text)
>> Rows Removed by Filter: 99
>> Buffers: shared hit=1
>>
>> vs
>>
>> Buffers: shared hit=1
>> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
>> Filter: (lower(email) = 'email_1@example.com'::text)
>> Rows Removed by Filter: 99
>> Buffers: shared hit=1
>> I traced the potential issue down in case of ORDER BY to pg_amop/pg_amproc lookups. The specific lookups vary by operation but the pattern is the same: first execution incurs catalog reads that get cached for subsequent runs. This applies to DISTINCT (sort, not hashed), GROUP BY, window functions, etc - which seems to isolate it to sort node.
>
> I'm finding it hard to fathom why you think this is a bug. We have
> various caches that will require accessing various catalogue tables
> and probably indexes too, which will be accessed unless the cache has
> already been populated. These accessing the buffers for these are
> included in the buffers count in EXPLAIN.
>
> What is it you expect to happen here? If we access buffers and don't
> report them, then IMO, that's a bug.
>
> David
Apologies - "bug" might be a bit harsh, and I didn't mean to imply those buffers won't be reported. My assumption was that catalogue lookups for resolving sort operator metadata would be reported as planning buffers rather than execution buffers. That is already what's reported for other lookups for the very same query above.
The executor re-resolves sort operator metadata that the planner already looked up. The first lookup counts as planning buffers, the second as execution buffers.
Would passing the already-resolved opfamily from the planner be a reasonable direction? My reasoning builds on the hypothesis that buffers (whether shared hits or reads) should be deterministic for the same query on the same data with the same plan. This re-resolution on first execution breaks that.
Radim
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Tomas Vondra
Date:
On 2/2/26 23:54, Radim Marek wrote:
> Hello,
>
> First let me say this is a first time bug report and trying to patch the
> PostgreSQL, fresh after the hacking webinar. Please let me know if I
> need to clarify something.
>
> TLDR; EXPLAIN ANALYZE BUFFERS reports different buffer counts for the
> Sort node between first and subsequent executions per session.
>
> Buffers: shared hit=4
> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
> Filter: (lower(email) = 'email_1@example.com'::text)
> Rows Removed by Filter: 99
> Buffers: shared hit=1
>
> vs
>
> Buffers: shared hit=1
> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
> Filter: (lower(email) = 'email_1@example.com'::text)
> Rows Removed by Filter: 99
> Buffers: shared hit=1
>
> I found it while working on SQL regression tool (regresql) that uses
> total reported buffers from EXPLAIN ANALYZE BUFFERS to detect deviations
> for query performance.
>
> The issue came on smaller queries where test cases showed consistent
> "improvement" on performance (fewer buffers) when comparing baseline and
> actual test runs. Previously did not notice it on larger data sets due
> to the difference in buffers being negligible given the larger operations.
>
> The example is the test case I managed to isolate.
>
> --- way how to reproduce
>
> CREATE TABLE sort_buffer_test (id serial PRIMARY KEY, val int);
> INSERT INTO sort_buffer_test (val) SELECT generate_series(1, 100);
> ANALYZE sort_buffer_test;
>
> -- run twice in new psql session and observe reported buffer change
> EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
> SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;
>
> ---
>
> This behaviour is session specific, tested on 17.x, 18.x and current
> development version (just fiy, all verified on both amd64 and aarch64).
>
> I traced the potential issue down in case of ORDER BY to pg_amop/
> pg_amproc lookups. The specific lookups vary by operation but the
> pattern is the same: first execution incurs catalog reads that get
> cached for subsequent runs. This applies to DISTINCT (sort, not hashed),
> GROUP BY, window functions, etc - which seems to isolate it to sort node.
>
> All those I believe are actually metadata resolution (i.e. planner
> already knows what to sort by).

Correct. AFAICS this happens because the execution may need additional metadata that was not needed for planning. In this particular case the planning checks that the operators are valid/compatible, but does not look up all the associated pg_amproc entries. That only happens during execution, when actually constructing the "tuplesort" object. But there are probably many similar cases.

> Having said that - I'm happy to try to fix it. Just wondering what is
> the right direction to go:
>
> 1. Pre-warm syscache during planning (tried locally and it works - but
> does not feel conceptually right and it's rather 'hacky')

Yeah, not going to happen. We'd have to preload the whole syscache, because during planning we don't even know if that particular path/plan will win. So we only do the bare minimum, to minimize the cost of planning. And then we initialize the additional fields later, once we have the cheapest plan.

> 2. Pre-compute and store metadata and modify each plan node struct for
> each type (lots of work). I believe that's what MergeJoin already does.

I suppose MergeJoin may simply need more metadata during planning. If it does more than that, it might count as a bug (and it should do less).

> 3. Or this is 'works as expected' - which I don't think it's the case
> and would force me to do pre-warming on regresql side

I think it mostly works as expected/designed. That however does not mean the issue you pointed out does not exist.

IMHO what regresql is aiming to do would be quite handy - e.g. in the index prefetching patch we have regression tests checking that we're not accessing unexpected number of buffers (or more buffers than master). I'm planning to check if regresql would be a more elegant way to do that.

For now we're relying on plain regression tests (SQL + expected output), and that happens to not have this issue because it executes the queries in a session with a "deterministic" state.

regards

--
Tomas Vondra
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Tomas Vondra
Date:
On 2/3/26 08:21, Radim Marek wrote:
> On Tue, 3 Feb 2026 at 00:37, David Rowley <dgrowleyml@gmail.com> wrote:
>> On Tue, 3 Feb 2026 at 11:54, Radim Marek <radim@boringsql.com> wrote:
>>> Buffers: shared hit=4
>>> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
>>> Filter: (lower(email) = 'email_1@example.com'::text)
>>> Rows Removed by Filter: 99
>>> Buffers: shared hit=1
>>>
>>> vs
>>>
>>> Buffers: shared hit=1
>>> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
>>> Filter: (lower(email) = 'email_1@example.com'::text)
>>> Rows Removed by Filter: 99
>>> Buffers: shared hit=1
>>> I traced the potential issue down in case of ORDER BY to pg_amop/
>>> pg_amproc lookups. The specific lookups vary by operation but the
>>> pattern is the same: first execution incurs catalog reads that get
>>> cached for subsequent runs. This applies to DISTINCT (sort, not
>>> hashed), GROUP BY, window functions, etc - which seems to isolate it
>>> to sort node.
>>
>> I'm finding it hard to fathom why you think this is a bug. We have
>> various caches that will require accessing various catalogue tables
>> and probably indexes too, which will be accessed unless the cache has
>> already been populated. These accessing the buffers for these are
>> included in the buffers count in EXPLAIN.
>>
>> What is it you expect to happen here? If we access buffers and don't
>> report them, then IMO, that's a bug.
>>
>> David
>
> Apologies, "bug" might be a bit harsh and didn't want to imply those
> buffers won't be reported. My assumption is that catalogue lookups for
> resolving sort operator metadata would be reported as planning buffers
> rather than execution buffers. This is already what's reported for other
> lookups for very same query above.

I wouldn't call this a "bug", but the behavior may be a bit surprising and working against what regresql aims to do.

> The executor re-resolves sort operator metadata that the planner already
> looked up. The first lookup counts as planning buffers, the second as
> execution buffers.

I don't think it re-resolves the metadata. AFAIK it does fetch additional information about the opclass, which was not needed for planning.

> Would passing that already resolved opfamily from the planner be a
> reasonable direction? My reasoning in this case is that I'm building on
> the hypothesis that buffers (whether shared hit or read) should be
> deterministic for the same query on the same data, and the same plan.
> This re-resolution on first execution breaks it.

No, not really. That would increase the cost of planning - we do need to do the lookup later, but we only do that for the one plan that "won". If we did that during planning, it'd have to happen for all plans.

Perhaps we could identify buffer accesses from syscache lookups, and track those separately? Then we'd know which of the hits/reads are from "actual" execution.

regards

--
Tomas Vondra
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Lukas Fittl
Date:
On Tue, Feb 3, 2026 at 3:02 AM Tomas Vondra <tomas@vondra.me> wrote:
> Perhaps we could identify buffer accesses from syscache lookups, and
> track those separately? Then we'd know which of the hits/reads are from
> "actual" execution.

Doing this in the current infrastructure (with the global pgBufferUsage counters) is pretty expensive I think, since we'd need to do an extra call to BufferUsageAccumDiff after syscache lookups, to get the buffer usage just for that activity.

If splitting out buffers by syscache vs others is of interest, I would welcome review of the stack-based instrumentation patch over at [0], which has an example of how it can be used to more easily break Index Scans out into index and table access - the same logic could be applied to syscache lookups I think.

[0]: https://www.postgresql.org/message-id/flat/CAP53Pkw85U-aMRzkZ%2BkRKfCh0pA5vyo%3D_W16gPK4sirZxiJy%3DA%40mail.gmail.com#170c740dbeb87d25d3f921ab28e4bb40

Thanks,
Lukas

--
Lukas Fittl
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
David Rowley
Date:
On Tue, 3 Feb 2026 at 20:21, Radim Marek <radim@boringsql.com> wrote:
> Apologies, "bug" might be a bit harsh and didn't want to imply those buffers won't be reported. My assumption is that catalogue lookups for resolving sort operator metadata would be reported as planning buffers rather than execution buffers. This is already what's reported for other lookups for very same query above.
>
> The executor re-resolves sort operator metadata that the planner already looked up. The first lookup counts as planning buffers, the second as execution buffers.

If it was looking up the same thing, then there shouldn't be a cache miss, and the buffers won't be accessed.

> Would passing that already resolved opfamily from the planner be a reasonable direction? My reasoning in this case is that I'm building on the hypothesis that buffers (whether shared hit or read) should be deterministic for the same query on the same data, and the same plan. This re-resolution on first execution breaks it.

I think generally, offloading more work into the planner for looking up things that are always the same for every invocation of a plan is generally good. For anything that moves things more in that direction, you'd need to ensure that the plan is correctly invalidated when something changes about the extra thing you looked up in planning. I've not looked into the specifics of this instance, other than the function being called which causes the buffer usage is get_opfamily_proc(). I don't see that called during planning for this query, so your statement about "already resolved opfamily from the planner" doesn't seem true.

However, I imagine this is just one of many things and if you're aiming to stabilise your tests doing this, then you'll likely be fixing things for a long time to come. I imagine a better way if you're just doing this for running test suites would be to invent some way to prepopulate the sys and cat caches. Maybe that could be a contrib module.

I don't think there's a way to get a relcache miss during execution, but I guess it could change with prepared statements if we ever decided to start evicting long unused relcache entries one day.

David
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Tomas Vondra
Date:
On 2/3/26 22:53, David Rowley wrote:
> On Tue, 3 Feb 2026 at 20:21, Radim Marek <radim@boringsql.com> wrote:
>> Apologies, "bug" might be a bit harsh and didn't want to imply those
>> buffers won't be reported. My assumption is that catalogue lookups for
>> resolving sort operator metadata would be reported as planning buffers
>> rather than execution buffers. This is already what's reported for
>> other lookups for very same query above.
>>
>> The executor re-resolves sort operator metadata that the planner
>> already looked up. The first lookup counts as planning buffers, the
>> second as execution buffers.
>
> If it was looking up the same thing, then there shouldn't be a cache
> miss, and the buffers won't be accessed.

Right. I think there's a bit of confusion because the planning phase resolves only some of the information, and some additional pieces are left to the execution phase. Those won't be counted in "planning" buffers, and that may seem surprising to users.

>> Would passing that already resolved opfamily from the planner be a
>> reasonable direction? My reasoning in this case is that I'm building on
>> the hypothesis that buffers (whether shared hit or read) should be
>> deterministic for the same query on the same data, and the same plan.
>> This re-resolution on first execution breaks it.
>
> I think generally, offloading more work into the planner for looking
> up things that are always the same for every invocation of a plan is
> generally good. For anything that moves things more in that direction,
> you'd need to ensure that the plan is correctly invalidated when
> something changes about the extra thing you looked up in planning.
> I've not looked into the specifics of this instance, other than the
> function being called which causes the buffer usage is
> get_opfamily_proc(). I don't see that called during planning for this
> query, so your statement about "already resolved opfamily from the
> planner" doesn't seem true.

Yes, that's essentially what I explained yesterday.

I'm not sure about moving all these lookups to the planner (from the executor). Could this easily hurt some cases? We'd need to do this for all fields the final plan might need, i.e. we'd need to initialize all fields for all paths we construct - not just the metadata needed by the final cheapest one. That could be a lot of additional lookups, right? That would make the first execution more expensive. Later executions of a query would be cached, but we'd still be paying the cost of the lookup (cheap).

Of course, there's a lot of such fields - this amproc is just an example. I don't think it's practical to rework all of that to initialize everything early.

> However, I imagine this is just one of many things and if you're
> aiming to stabilise your tests doing this, then you'll likely be
> fixing things for a long time to come. I imagine a better way if
> you're just doing this for running test suites would be to invent some
> way to prepopulate the sys and cat caches. Maybe that could be a
> contrib module. I don't think there's a way to get a relcache miss
> during execution, but I guess it could change with prepared statements
> if we ever decided to start evicting long unused relcache entries one
> day.

I'm not sure there's a good way to warm up the sys/cat caches, short of loading everything. Because how would you know what will be needed by the execution plan? It seems very complex.

One option would be to run the queries twice - the first one would warm up the caches, the second execution would be the measured one. But that makes the tool 2x as expensive.

But what if we had a way to *invalidate* all the caches? That'd also make the behavior deterministic - there would be no hits initially. And it seems much simpler to implement.

regards

--
Tomas Vondra
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
David Rowley
Date:
On Thu, 5 Feb 2026 at 04:59, Tomas Vondra <tomas@vondra.me> wrote:
> On 2/3/26 22:53, David Rowley wrote:
>> I think generally, offloading more work into the planner for looking
>> up things that are always the same for every invocation of a plan is
>> generally good. For anything that moves things more in that direction,
>> you'd need to ensure that the plan is correctly invalidated when
>> something changes about the extra thing you looked up in planning.
>> I've not looked into the specifics of this instance, other than the
>> function being called which causes the buffer usage is
>> get_opfamily_proc(). I don't see that called during planning for this
>> query, so your statement about "already resolved opfamily from the
>> planner" doesn't seem true.
>
> Yes, that's essentially what I explained yesterday.
>
> I'm not sure about moving all these lookups to the planner (from the
> executor). Could this easily hurt some cases? We'd need to do this for
> all fields the final plan might need, i.e. we'd need to initialize all
> fields for all paths we construct - not just the metadata needed by the
> final cheapest one. That could be a lot of additional lookups, right?
> That would make the first execution more expensive. Later executions of
> a query would be cached, but we'd still be paying the cost for the
> lookup (cheap).

We do look some things up at createplan time, e.g. calling get_opfamily_member_for_cmptype() from prepare_sort_from_pathkeys(). However, that could still be an additional lookup as the existing lookup for the case in question is in ExecSort rather than in ExecInitSort(), so there could be a surplus lookup if the sort node is never executed. I doubt that's worth worrying about too much. It's normally large memory allocations we want to defer until fetching the first row from the node.

> Of course, there's a lot of such fields - this amproc is just an
> example. I don't think it's practical to rework all of that to
> initialize everything early.
>
>> However, I imagine this is just one of many things and if you're
>> aiming to stabilise your tests doing this, then you'll likely be
>> fixing things for a long time to come. I imagine a better way if
>> you're just doing this for running test suites would be to invent some
>> way to prepopulate the sys and cat caches. Maybe that could be a
>> contrib module. I don't think there's a way to get a relcache miss
>> during execution, but I guess it could change with prepared statements
>> if we ever decided to start evicting long unused relcache entries one
>> day.
>
> I'm not sure there's a good way to warmup the sys/cat caches, short of
> loading everything. Because how would you know what will be needed by
> the execution plan? It seems very complex.

I imagined if it's just for machines running tests then you could just load everything. If it was coded in such a way that a tuple fetched by doing a Seq Scan on the catalogue table was what went into the cache, rather than the Seq Scan driving the normal cache lookup code, resulting in a subsequent Index Scan on the catalogue's index, then it could be done with fairly low overhead. I imagine in the order of <10ms from fresh initdb. That doesn't seem excessively long for machines running tests in the background.

> One option would be to run the queries twice - the first one would
> warmup caches, the second execution would be the measured one. But that
> makes the tool 2x as expensive.

Maybe. I don't know the tool or how people use it.

> But what if we had a way to *invalidate* all the caches? That'd also
> make the behavior deterministic - there would be no hits initially. And
> it seems much simpler to implement.

There is debug_discard_caches, but that requires an assert-enabled build.

David
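The setting mentioned above can be enabled per session on such a build, e.g.:

```sql
-- Requires a server built with assertions (--enable-cassert); on other
-- builds the GUC exists but only accepts 0.
SET debug_discard_caches = 1;
```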
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Tomas Vondra
Date:
On 2/5/26 01:15, David Rowley wrote:
> On Thu, 5 Feb 2026 at 04:59, Tomas Vondra <tomas@vondra.me> wrote:
>> On 2/3/26 22:53, David Rowley wrote:
>>> I think generally, offloading more work into the planner for looking
>>> up things that are always the same for every invocation of a plan is
>>> generally good. For anything that moves things more in that direction,
>>> you'd need to ensure that the plan is correctly invalidated when
>>> something changes about the extra thing you looked up in planning.
>>> I've not looked into the specifics of this instance, other than the
>>> function being called which causes the buffer usage is
>>> get_opfamily_proc(). I don't see that called during planning for this
>>> query, so your statement about "already resolved opfamily from the
>>> planner" doesn't seem true.
>>
>> Yes, that's essentially what I explained yesterday.
>>
>> I'm not sure about moving all these lookups to the planner (from the
>> executor). Could this easily hurt some cases? We'd need to do this for
>> all fields the final plan might need, i.e. we'd need to initialize all
>> fields for all paths we construct - not just the metadata needed by the
>> final cheapest one. That could be a lot of additional lookups, right?
>> That would make the first execution more expensive. Later executions of
>> a query would be cached, but we'd still be paying the cost for the
>> lookup (cheap).
>
> We do look some things up at createplan time. e.g calling
> get_opfamily_member_for_cmptype() from prepare_sort_from_pathkeys().
> However, that could still be an additional lookup as the existing
> lookup for the case in question is in ExecSort rather than in
> ExecInitSort(), so there could be a surplus lookup if the sort node is
> never executed. I doubt that's worth worrying about too much. It's
> normally large memory allocations we want to defer until fetching the
> first row from the node.

Perhaps, you may be right about this.

I'm still skeptical about having to figure out all the fields we'll need during execution. I doubt we'll go around and adjust all those places, so the discussion about the cost of additional lookups is purely theoretical.

>> Of course, there's a lot of such fields - this amproc is just an
>> example. I don't think it's practical to rework all of that to
>> initialize everything early.
>>
>>> However, I imagine this is just one of many things and if you're
>>> aiming to stabilise your tests doing this, then you'll likely be
>>> fixing things for a long time to come. I imagine a better way if
>>> you're just doing this for running test suites would be to invent some
>>> way to prepopulate the sys and cat caches. Maybe that could be a
>>> contrib module. I don't think there's a way to get a relcache miss
>>> during execution, but I guess it could change with prepared statements
>>> if we ever decided to start evicting long unused relcache entries one
>>> day.
>>
>> I'm not sure there's a good way to warmup the sys/cat caches, short of
>> loading everything. Because how would you know what will be needed by
>> the execution plan? It seems very complex.
>
> I imagined if it's just for machines running tests then you could just
> load everything. If it was coded in such a way that a tuple fetched by
> doing a Seq Scan on the catalogue table was what went into the cache,
> rather than the Seq Scan drives the normal cache lookup code,
> resulting in a subsequent Index Scan on the catalogue's index, then it
> could be done with fairly low overhead. I imagine in the order of
> <10ms from fresh initdb. That doesn't seem excessively long for
> machines running tests in the background.

So we'd just go through all the caches relcaches/catcaches/... and load all the stuff that's in pg_catalog? I guess that could work, although I'm not sure how convenient it would be for objects created in the tests themselves (I'm not sure if those end up in the cache right away).

>> One option would be to run the queries twice - the first one would
>> warmup caches, the second execution would be the measured one. But that
>> makes the tool 2x as expensive.
>
> Maybe. I don't know the tool or how people use it.
>
>> But what if we had a way to *invalidate* all the caches? That'd also
>> make the behavior deterministic - there would be no hits initially. And
>> it seems much simpler to implement.
>
> There is debug_discard_caches, but that requires an assert-enabled build.

Right. I suppose we could have a SQL function calling InvalidateSystemCachesExtended to do that. In fact, an extension can probably add such a function already.

regards

--
Tomas Vondra
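Such an extension-provided function could be declared along these lines (a sketch; the SQL and C function names here are hypothetical, and the C implementation would boil down to a call into the backend's cache-invalidation code):

```sql
-- Hypothetical wrapper: 'pg_invalidate_caches' would name a C function
-- in the extension's shared library that invalidates the system caches.
CREATE FUNCTION pg_invalidate_caches() RETURNS void
    AS 'MODULE_PATHNAME', 'pg_invalidate_caches'
    LANGUAGE C STRICT;
```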
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Tom Lane
Date:
Tomas Vondra <tomas@vondra.me> writes:
> On 2/5/26 01:15, David Rowley wrote:
>> I imagined if it's just for machines running tests then you could just
>> load everything. If it was coded in such a way that a tuple fetched by
>> doing a Seq Scan on the catalogue table was what went into the cache,
>> rather than the Seq Scan drives the normal cache lookup code,
>> resulting in a subsequent Index Scan on the catalogue's index, then it
>> could be done with fairly low overhead. I imagine in the order of
>> <10ms from fresh initdb. That doesn't seem excessively long for
>> machines running tests in the background.
> So we'd just go through all the caches relcaches/catcaches/... and load
> all the stuff that's in pg_catalog? I guess that could work,
... until there's a cache flush event. This whole discussion seems
to me to be based on a misconception.
regards, tom lane
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Radim Marek
Date:
Thank you all for the time and detailed responses.
To clarify - RegreSQL works with tolerances, and I only came across this when detecting improvements (i.e. the opposite of regressions), where smaller executions showed consistent "gains" that turned out to be cache-related.
Understanding the complexity of the underlying cause is in a way enough for me - there's much more depth here than I initially anticipated. My warm-up fix attempts had led me to believe it might be much easier.
I'll be honest - Lukas' patches will take me a few weeks to properly digest. But in the end that's exactly the kind of deeper understanding I was hoping to build by engaging here.
I really do appreciate the patience with a first-time poster :)
Radim
On Thu, 5 Feb 2026 at 16:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> On 2/5/26 01:15, David Rowley wrote:
>>> I imagined if it's just for machines running tests then you could just
>>> load everything. If it was coded in such a way that a tuple fetched by
>>> doing a Seq Scan on the catalogue table was what went into the cache,
>>> rather than the Seq Scan drives the normal cache lookup code,
>>> resulting in a subsequent Index Scan on the catalogue's index, then it
>>> could be done with fairly low overhead. I imagine in the order of
>>> <10ms from fresh initdb. That doesn't seem excessively long for
>>> machines running tests in the background.
>
>> So we'd just go through all the caches relcaches/catcaches/... and load
>> all the stuff that's in pg_catalog? I guess that could work,
>
> ... until there's a cache flush event. This whole discussion seems
> to me to be based on a misconception.
>
> regards, tom lane