Discussion: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Radim Marek
Date:
Hello,
First, let me say this is my first bug report and my first attempt at patching PostgreSQL, fresh after the hacking webinar. Please let me know if I need to clarify anything.
TL;DR: EXPLAIN (ANALYZE, BUFFERS) reports different buffer counts for the Sort node between the first and subsequent executions in a session.
Buffers: shared hit=4
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
vs
Buffers: shared hit=1
-> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
Filter: (lower(email) = 'email_1@example.com'::text)
Rows Removed by Filter: 99
Buffers: shared hit=1
I found it while working on a SQL regression tool (regresql) that uses the total buffers reported by EXPLAIN (ANALYZE, BUFFERS) to detect deviations in query performance.
The issue surfaced on smaller queries, where test cases showed a consistent "improvement" in performance (fewer buffers) when comparing baseline and actual test runs. I had not noticed it previously on larger data sets because the difference in buffers was negligible relative to the larger operations.
The example below is the test case I managed to isolate.
--- how to reproduce
CREATE TABLE sort_buffer_test (id serial PRIMARY KEY, val int);
INSERT INTO sort_buffer_test (val) SELECT generate_series(1, 100);
ANALYZE sort_buffer_test;
-- run twice in new psql session and observe reported buffer change
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;
---
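The counters above are also available in machine-readable form, which is how a tool can total them programmatically; a sketch using the same repro query, with only the FORMAT option added:

```sql
-- Each plan node in the JSON output carries "Shared Hit Blocks" and
-- "Shared Read Blocks" counters; planning-phase buffers appear under
-- the top-level "Planning" key.
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF, FORMAT JSON)
SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;
```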
This behaviour is session-specific; tested on 17.x, 18.x, and the current development version (just fyi, all verified on both amd64 and aarch64).
For ORDER BY, I traced the potential issue down to pg_amop/pg_amproc lookups. The specific lookups vary by operation, but the pattern is the same: the first execution incurs catalog reads that get cached for subsequent runs. This applies to DISTINCT (sorted, not hashed), GROUP BY, window functions, etc., which seems to isolate it to the Sort node.
All of those, I believe, are really metadata resolution (i.e. the planner already knows what to sort by).
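For instance, queries of this shape show the same first-execution difference (a sketch; run each in a fresh session, with enable_hashagg disabled only to force the sorted DISTINCT path):

```sql
SET enable_hashagg = off;  -- force the sorted (not hashed) DISTINCT path

-- The first run in a fresh session reports extra shared hits on the
-- sort step; repeat runs in the same session do not.
EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT DISTINCT val FROM sort_buffer_test;

EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
SELECT val, row_number() OVER (ORDER BY val) FROM sort_buffer_test;
```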
Having said that, I'm happy to try to fix it. I'm just wondering what the right direction is:
1. Pre-warm the syscache during planning (tried locally and it works, but it does not feel conceptually right and is rather 'hacky').
2. Pre-compute and store the metadata, modifying each plan node struct for each type (lots of work). I believe that's what MergeJoin already does.
3. Or this is 'works as expected' - which I don't think is the case, and it would force me to do pre-warming on the regresql side.
Kind regards,
Radim
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
David Rowley
Date:
On Tue, 3 Feb 2026 at 11:54, Radim Marek <radim@boringsql.com> wrote:
> Buffers: shared hit=4
> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
> Filter: (lower(email) = 'email_1@example.com'::text)
> Rows Removed by Filter: 99
> Buffers: shared hit=1
>
> vs
>
> Buffers: shared hit=1
> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
> Filter: (lower(email) = 'email_1@example.com'::text)
> Rows Removed by Filter: 99
> Buffers: shared hit=1
>
> I traced the potential issue down in case of ORDER BY to pg_amop/pg_amproc lookups. The specific lookups vary by operation but the pattern is the same: first execution incurs catalog reads that get cached for subsequent runs. This applies to DISTINCT (sort, not hashed), GROUP BY, window functions, etc - which seems to isolate it to sort node.

I'm finding it hard to fathom why you think this is a bug. We have various caches that will require accessing various catalogue tables and probably indexes too, which will be accessed unless the cache has already been populated. Accessing the buffers for these is included in the buffers count in EXPLAIN.

What is it you expect to happen here? If we access buffers and don't report them, then IMO, that's a bug.

David
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Radim Marek
Date:
On Tue, 3 Feb 2026 at 00:37, David Rowley <dgrowleyml@gmail.com> wrote:
> On Tue, 3 Feb 2026 at 11:54, Radim Marek <radim@boringsql.com> wrote:
>> Buffers: shared hit=4
>> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
>> Filter: (lower(email) = 'email_1@example.com'::text)
>> Rows Removed by Filter: 99
>> Buffers: shared hit=1
>>
>> vs
>>
>> Buffers: shared hit=1
>> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
>> Filter: (lower(email) = 'email_1@example.com'::text)
>> Rows Removed by Filter: 99
>> Buffers: shared hit=1
>> I traced the potential issue down in case of ORDER BY to pg_amop/pg_amproc lookups. The specific lookups vary by operation but the pattern is the same: first execution incurs catalog reads that get cached for subsequent runs. This applies to DISTINCT (sort, not hashed), GROUP BY, window functions, etc - which seems to isolate it to sort node.
>
> I'm finding it hard to fathom why you think this is a bug. We have
> various caches that will require accessing various catalogue tables
> and probably indexes too, which will be accessed unless the cache has
> already been populated. These accessing the buffers for these are
> included in the buffers count in EXPLAIN.
>
> What is it you expect to happen here? If we access buffers and don't
> report them, then IMO, that's a bug.
>
> David
Apologies - "bug" might be a bit harsh, and I didn't mean to imply those buffers won't be reported. My assumption was that catalogue lookups for resolving sort operator metadata would be reported as planning buffers rather than execution buffers. That is already what's reported for other lookups for the very same query above.
The executor re-resolves sort operator metadata that the planner already looked up. The first lookup counts as planning buffers, the second as execution buffers.
Would passing the already-resolved opfamily from the planner be a reasonable direction? My reasoning builds on the hypothesis that buffers (whether shared hits or reads) should be deterministic for the same query on the same data with the same plan. This re-resolution on first execution breaks that.
Radim
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Tomas Vondra
Date:
On 2/2/26 23:54, Radim Marek wrote:
> Hello,
>
> First let me say this is a first time bug report and trying to patch the
> PostgreSQL, fresh after the hacking webinar. Please let me know if I
> need to clarify something.
>
> TLDR; EXPLAIN ANALYZE BUFFERS reports different buffer counts for the
> Sort node between first and subsequent executions per session.
>
> Buffers: shared hit=4
> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
> Filter: (lower(email) = 'email_1@example.com'::text)
> Rows Removed by Filter: 99
> Buffers: shared hit=1
>
> vs
>
> Buffers: shared hit=1
> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
> Filter: (lower(email) = 'email_1@example.com'::text)
> Rows Removed by Filter: 99
> Buffers: shared hit=1
>
> I found it while working on SQL regression tool (regresql) that uses
> total reported buffers from EXPLAIN ANALYZE BUFFERS to detect deviations
> for query performance.
>
> The issue came on smaller queries where test cases showed consistent
> "improvement" on performance (fewer buffers) when comparing baseline and
> actual test runs. Previously did not notice it on larger data sets due
> to the difference in buffers being negligible given the larger operations.
>
> The example is the test case I managed to isolate.
>
> --- way how to reproduce
>
> CREATE TABLE sort_buffer_test (id serial PRIMARY KEY, val int);
> INSERT INTO sort_buffer_test (val) SELECT generate_series(1, 100);
> ANALYZE sort_buffer_test;
>
> -- run twice in new psql session and observe reported buffer change
> EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMING OFF)
> SELECT * FROM sort_buffer_test ORDER BY val DESC LIMIT 1;
>
> ---
>
> This behaviour is session specific, tested on 17.x, 18.x and current
> development version (just fiy, all verified on both amd64 and aarch64).
>
> I traced the potential issue down in case of ORDER BY to pg_amop/
> pg_amproc lookups. The specific lookups vary by operation but the
> pattern is the same: first execution incurs catalog reads that get
> cached for subsequent runs. This applies to DISTINCT (sort, not hashed),
> GROUP BY, window functions, etc - which seems to isolate it to sort node.
>
> All those I believe are actually metadata resolution (i.e. planner
> already knows what to sort by).

Correct. AFAICS this happens because the execution may need additional metadata that was not needed for planning. In this particular case the planning checks that the operators are valid/compatible, but does not look up all the associated pg_amproc entries. That only happens during execution, when actually constructing the "tuplesort" object. But there are probably many similar cases.

> Having said that - I'm happy to try to fix it. Just wondering what is
> the right direction to go:
>
> 1. Pre-warm syscache during planning (tried locally and it works - but
> does not feel conceptually right and it's rather 'hacky')

Yeah, not going to happen. We'd have to preload the whole syscache, because during planning we don't even know if that particular path/plan will win. So we only do the bare minimum, to minimize the cost of planning. And then we initialize the additional fields later, once we have the cheapest plan.

> 2. Pre-compute and store metadata and modify each plan node struct for
> each type (lots of work). I believe that's what MergeJoin already does.

I suppose MergeJoin may simply need more metadata during planning. If it does more than that, it might count as a bug (and it should do less).

> 3. Or this is 'works as expected' - which I don't think it's the case
> and would force me to do pre-warming on regresql side

I think it mostly works as expected/designed. That however does not mean the issue you pointed out does not exist.

IMHO what regresql is aiming to do would be quite handy - e.g. in the index prefetching patch we have regression tests checking that we're not accessing unexpected number of buffers (or more buffers than master). I'm planning to check if regresql would be a more elegant way to do that.

For now we're relying on plain regression tests (SQL + expected output), and that happens to not have this issue because it executes the queries in a session with a "deterministic" state.

regards

--
Tomas Vondra
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Tomas Vondra
Date:
On 2/3/26 08:21, Radim Marek wrote:
> On Tue, 3 Feb 2026 at 00:37, David Rowley <dgrowleyml@gmail.com> wrote:
>> On Tue, 3 Feb 2026 at 11:54, Radim Marek <radim@boringsql.com> wrote:
>>> Buffers: shared hit=4
>>> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.042..0.126 rows=1.00 loops=1)
>>> Filter: (lower(email) = 'email_1@example.com'::text)
>>> Rows Removed by Filter: 99
>>> Buffers: shared hit=1
>>>
>>> vs
>>>
>>> Buffers: shared hit=1
>>> -> Seq Scan on sort_buffer_test (cost=0.00..2.50 rows=1 width=32) (actual time=0.021..0.057 rows=1.00 loops=1)
>>> Filter: (lower(email) = 'email_1@example.com'::text)
>>> Rows Removed by Filter: 99
>>> Buffers: shared hit=1
>>> I traced the potential issue down in case of ORDER BY to pg_amop/
>>> pg_amproc lookups. The specific lookups vary by operation but the
>>> pattern is the same: first execution incurs catalog reads that get
>>> cached for subsequent runs. This applies to DISTINCT (sort, not
>>> hashed), GROUP BY, window functions, etc - which seems to isolate it
>>> to sort node.
>>
>> I'm finding it hard to fathom why you think this is a bug. We have
>> various caches that will require accessing various catalogue tables
>> and probably indexes too, which will be accessed unless the cache has
>> already been populated. These accessing the buffers for these are
>> included in the buffers count in EXPLAIN.
>>
>> What is it you expect to happen here? If we access buffers and don't
>> report them, then IMO, that's a bug.
>>
>> David
>
> Apologies, "bug" might be a bit harsh and didn't want to imply those
> buffers won't be reported. My assumption is that catalogue lookups for
> resolving sort operator metadata would be reported as planning buffers
> rather than execution buffers. This is already what's reported for other
> lookups for very same query above.

I wouldn't call this a "bug", but the behavior may be a bit surprising and working against what regresql aims to do.

> The executor re-resolves sort operator metadata that the planner already
> looked up. The first lookup counts as planning buffers, the second as
> execution buffers.

I don't think it re-resolves the metadata. AFAIK it does fetch additional information about the opclass, which was not needed for planning.

> Would passing that already resolved opfamily from the planner be a
> reasonable direction? My reasoning in this case is that I'm building on
> the hypothesis that buffers (whether shared hit or read) should be
> deterministic for the same query on the same data, and the same plan.
> This re-resolution on first execution breaks it.

No, not really. That would increase the cost of planning - we do need to do the lookup later, but we only do that for the one plan that "won". If we did that during planning, it'd have to happen for all plans.

Perhaps we could identify buffer accesses from syscache lookups, and track those separately? Then we'd know which of the hits/reads are from "actual" execution.

regards

--
Tomas Vondra
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Lukas Fittl
Date:
On Tue, Feb 3, 2026 at 3:02 AM Tomas Vondra <tomas@vondra.me> wrote:
> Perhaps we could identify buffer accesses from syscache lookups, and
> track those separately? Then we'd know which of the hits/reads are from
> "actual" execution.

Doing this in the current infrastructure (with the global pgBufferUsage counters) is pretty expensive I think, since we'd need to do an extra call to BufferUsageAccumDiff after syscache lookups, to get the buffer usage just for that activity.

If splitting out buffers by syscache vs others is of interest, I would welcome review of the stack-based instrumentation patch over at [0], which has an example of how it can be used to more easily break Index Scans out into index and table access - the same logic could be applied to syscache lookups I think.

[0]: https://www.postgresql.org/message-id/flat/CAP53Pkw85U-aMRzkZ%2BkRKfCh0pA5vyo%3D_W16gPK4sirZxiJy%3DA%40mail.gmail.com#170c740dbeb87d25d3f921ab28e4bb40

Thanks,
Lukas

--
Lukas Fittl
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
David Rowley
Date:
On Tue, 3 Feb 2026 at 20:21, Radim Marek <radim@boringsql.com> wrote:
> Apologies, "bug" might be a bit harsh and didn't want to imply those buffers won't be reported. My assumption is that catalogue lookups for resolving sort operator metadata would be reported as planning buffers rather than execution buffers. This is already what's reported for other lookups for very same query above.
>
> The executor re-resolves sort operator metadata that the planner already looked up. The first lookup counts as planning buffers, the second as execution buffers.

If it was looking up the same thing, then there shouldn't be a cache miss, and the buffers won't be accessed.

> Would passing that already resolved opfamily from the planner be a reasonable direction? My reasoning in this case is that I'm building on the hypothesis that buffers (whether shared hit or read) should be deterministic for the same query on the same data, and the same plan. This re-resolution on first execution breaks it.

I think generally, offloading more work into the planner for looking up things that are always the same for every invocation of a plan is generally good. For anything that moves things more in that direction, you'd need to ensure that the plan is correctly invalidated when something changes about the extra thing you looked up in planning. I've not looked into the specifics of this instance, other than the function being called which causes the buffer usage is get_opfamily_proc(). I don't see that called during planning for this query, so your statement about "already resolved opfamily from the planner" doesn't seem true.

However, I imagine this is just one of many things and if you're aiming to stabilise your tests doing this, then you'll likely be fixing things for a long time to come. I imagine a better way if you're just doing this for running test suites would be to invent some way to prepopulate the sys and cat caches. Maybe that could be a contrib module.

I don't think there's a way to get a relcache miss during execution, but I guess it could change with prepared statements if we ever decided to start evicting long unused relcache entries one day.

David
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Tomas Vondra
Date:
On 2/3/26 22:53, David Rowley wrote:
> On Tue, 3 Feb 2026 at 20:21, Radim Marek <radim@boringsql.com> wrote:
>> Apologies, "bug" might be a bit harsh and didn't want to imply those
>> buffers won't be reported. My assumption is that catalogue lookups for
>> resolving sort operator metadata would be reported as planning buffers
>> rather than execution buffers. This is already what's reported for
>> other lookups for very same query above.
>>
>> The executor re-resolves sort operator metadata that the planner
>> already looked up. The first lookup counts as planning buffers, the
>> second as execution buffers.
>
> If it was looking up the same thing, then there shouldn't be a cache
> miss, and the buffers won't be accessed.

Right. I think there's a bit of confusion because the planning phase resolves only some of the information, and some additional pieces are left to the execution phase. Those won't be counted in "planning" buffers, and that may seem surprising to users.

>> Would passing that already resolved opfamily from the planner be a
>> reasonable direction? My reasoning in this case is that I'm building on
>> the hypothesis that buffers (whether shared hit or read) should be
>> deterministic for the same query on the same data, and the same plan.
>> This re-resolution on first execution breaks it.
>
> I think generally, offloading more work into the planner for looking
> up things that are always the same for every invocation of a plan is
> generally good. For anything that moves things more in that direction,
> you'd need to ensure that the plan is correctly invalidated when
> something changes about the extra thing you looked up in planning.
> I've not looked into the specifics of this instance, other than the
> function being called which causes the buffer usage is
> get_opfamily_proc(). I don't see that called during planning for this
> query, so your statement about "already resolved opfamily from the
> planner" doesn't seem true.

Yes, that's essentially what I explained yesterday.

I'm not sure about moving all these lookups to the planner (from the executor). Could this easily hurt some cases? We'd need to do this for all fields the final plan might need, i.e. we'd need to initialize all fields for all paths we construct - not just the metadata needed by the final cheapest one. That could be a lot of additional lookups, right? That would make the first execution more expensive. Later executions of a query would be cached, but we'd still be paying the cost of the lookup (cheap).

Of course, there's a lot of such fields - this amproc is just an example. I don't think it's practical to rework all of that to initialize everything early.

> However, I imagine this is just one of many things and if you're
> aiming to stabilise your tests doing this, then you'll likely be
> fixing things for a long time to come. I imagine a better way if
> you're just doing this for running test suites would be to invent some
> way to prepopulate the sys and cat caches. Maybe that could be a
> contrib module. I don't think there's a way to get a relcache miss
> during execution, but I guess it could change with prepared statements
> if we ever decided to start evicting long unused relcache entries one
> day.

I'm not sure there's a good way to warm up the sys/cat caches, short of loading everything. Because how would you know what will be needed by the execution plan? It seems very complex.

One option would be to run the queries twice - the first one would warm up the caches, the second execution would be the measured one. But that makes the tool 2x as expensive.

But what if we had a way to *invalidate* all the caches? That'd also make the behavior deterministic - there would be no hits initially. And it seems much simpler to implement.

regards

--
Tomas Vondra
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
David Rowley
Date:
On Thu, 5 Feb 2026 at 04:59, Tomas Vondra <tomas@vondra.me> wrote:
> On 2/3/26 22:53, David Rowley wrote:
>> I think generally, offloading more work into the planner for looking
>> up things that are always the same for every invocation of a plan is
>> generally good. For anything that moves things more in that direction,
>> you'd need to ensure that the plan is correctly invalidated when
>> something changes about the extra thing you looked up in planning.
>> I've not looked into the specifics of this instance, other than the
>> function being called which causes the buffer usage is
>> get_opfamily_proc(). I don't see that called during planning for this
>> query, so your statement about "already resolved opfamily from the
>> planner" doesn't seem true.
>
> Yes, that's essentially what I explained yesterday.
>
> I'm not sure about moving all these lookups to the planner (from the
> executor). Could this easily hurt some cases? We'd need to do this for
> all fields the final plan might need, i.e. we'd need to initialize all
> fields for all paths we construct - not just the metadata needed by the
> final cheapest one. That could be a lot of additional lookups, right?
> That would make the first execution more expensive. Later executions of
> a query would be cached, but we'd still be paying the cost for the
> lookup (cheap).

We do look some things up at createplan time, e.g. calling get_opfamily_member_for_cmptype() from prepare_sort_from_pathkeys(). However, that could still be an additional lookup as the existing lookup for the case in question is in ExecSort rather than in ExecInitSort(), so there could be a surplus lookup if the sort node is never executed. I doubt that's worth worrying about too much. It's normally large memory allocations we want to defer until fetching the first row from the node.

> Of course, there's a lot of such fields - this amproc is just an
> example. I don't think it's practical to rework all of that to
> initialize everything early.
>
>> However, I imagine this is just one of many things and if you're
>> aiming to stabilise your tests doing this, then you'll likely be
>> fixing things for a long time to come. I imagine a better way if
>> you're just doing this for running test suites would be to invent some
>> way to prepopulate the sys and cat caches. Maybe that could be a
>> contrib module. I don't think there's a way to get a relcache miss
>> during execution, but I guess it could change with prepared statements
>> if we ever decided to start evicting long unused relcache entries one
>> day.
>
> I'm not sure there's a good way to warmup the sys/cat caches, short of
> loading everything. Because how would you know what will be needed by
> the execution plan? It seems very complex.

I imagined if it's just for machines running tests then you could just load everything. If it was coded in such a way that a tuple fetched by doing a Seq Scan on the catalogue table was what went into the cache, rather than the Seq Scan driving the normal cache lookup code, resulting in a subsequent Index Scan on the catalogue's index, then it could be done with fairly low overhead. I imagine in the order of <10ms from fresh initdb. That doesn't seem excessively long for machines running tests in the background.

> One option would be to run the queries twice - the first one would
> warmup caches, the second execution would be the measured one. But that
> makes the tool 2x as expensive.

Maybe. I don't know the tool or how people use it.

> But what if we had a way to *invalidate* all the caches? That'd also
> make the behavior deterministic - there would be no hits initially. And
> it seems much simpler to implement.

There is debug_discard_caches, but that requires an assert-enabled build.

David
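The setting mentioned above can be enabled per session on such a build, e.g.:

```sql
-- Requires a server built with assertions (--enable-cassert); on other
-- builds the GUC exists but only accepts 0.
SET debug_discard_caches = 1;
```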
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Tomas Vondra
Date:
On 2/5/26 01:15, David Rowley wrote:
> On Thu, 5 Feb 2026 at 04:59, Tomas Vondra <tomas@vondra.me> wrote:
>> On 2/3/26 22:53, David Rowley wrote:
>>> I think generally, offloading more work into the planner for looking
>>> up things that are always the same for every invocation of a plan is
>>> generally good. For anything that moves things more in that direction,
>>> you'd need to ensure that the plan is correctly invalidated when
>>> something changes about the extra thing you looked up in planning.
>>> I've not looked into the specifics of this instance, other than the
>>> function being called which causes the buffer usage is
>>> get_opfamily_proc(). I don't see that called during planning for this
>>> query, so your statement about "already resolved opfamily from the
>>> planner" doesn't seem true.
>>
>> Yes, that's essentially what I explained yesterday.
>>
>> I'm not sure about moving all these lookups to the planner (from the
>> executor). Could this easily hurt some cases? We'd need to do this for
>> all fields the final plan might need, i.e. we'd need to initialize all
>> fields for all paths we construct - not just the metadata needed by the
>> final cheapest one. That could be a lot of additional lookups, right?
>> That would make the first execution more expensive. Later executions of
>> a query would be cached, but we'd still be paying the cost for the
>> lookup (cheap).
>
> We do look some things up at createplan time. e.g calling
> get_opfamily_member_for_cmptype() from prepare_sort_from_pathkeys().
> However, that could still be an additional lookup as the existing
> lookup for the case in question is in ExecSort rather than in
> ExecInitSort(), so there could be a surplus lookup if the sort node is
> never executed. I doubt that's worth worrying about too much. It's
> normally large memory allocations we want to defer until fetching the
> first row from the node.

Perhaps, you may be right about this.

I'm still skeptical about having to figure out all the fields we'll need during execution. I doubt we'll go around and adjust all those places, so the discussion about the cost of additional lookups is purely theoretical.

>> Of course, there's a lot of such fields - this amproc is just an
>> example. I don't think it's practical to rework all of that to
>> initialize everything early.
>>
>>> However, I imagine this is just one of many things and if you're
>>> aiming to stabilise your tests doing this, then you'll likely be
>>> fixing things for a long time to come. I imagine a better way if
>>> you're just doing this for running test suites would be to invent some
>>> way to prepopulate the sys and cat caches. Maybe that could be a
>>> contrib module. I don't think there's a way to get a relcache miss
>>> during execution, but I guess it could change with prepared statements
>>> if we ever decided to start evicting long unused relcache entries one
>>> day.
>>
>> I'm not sure there's a good way to warmup the sys/cat caches, short of
>> loading everything. Because how would you know what will be needed by
>> the execution plan? It seems very complex.
>
> I imagined if it's just for machines running tests then you could just
> load everything. If it was coded in such a way that a tuple fetched by
> doing a Seq Scan on the catalogue table was what went into the cache,
> rather than the Seq Scan drives the normal cache lookup code,
> resulting in a subsequent Index Scan on the catalogue's index, then it
> could be done with fairly low overhead. I imagine in the order of
> <10ms from fresh initdb. That doesn't seem excessively long for
> machines running tests in the background.

So we'd just go through all the caches relcaches/catcaches/... and load all the stuff that's in pg_catalog? I guess that could work, although I'm not sure how convenient it would be for objects created in the tests themselves (I'm not sure if those end up in the cache right away).

>> One option would be to run the queries twice - the first one would
>> warmup caches, the second execution would be the measured one. But that
>> makes the tool 2x as expensive.
>
> Maybe. I don't know the tool or how people use it.
>
>> But what if we had a way to *invalidate* all the caches? That'd also
>> make the behavior deterministic - there would be no hits initially. And
>> it seems much simpler to implement.
>
> There is debug_discard_caches, but that requires an assert-enabled build.

Right. I suppose we could have a SQL function calling InvalidateSystemCachesExtended to do that. In fact, an extension can probably add such a function already.

regards

--
Tomas Vondra
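Such an extension-provided function could be declared along these lines (a sketch; the SQL and C function names here are hypothetical, and the C implementation would boil down to a call into the backend's cache-invalidation code):

```sql
-- Hypothetical wrapper: 'pg_invalidate_caches' would name a C function
-- in the extension's shared library that invalidates the system caches.
CREATE FUNCTION pg_invalidate_caches() RETURNS void
    AS 'MODULE_PATHNAME', 'pg_invalidate_caches'
    LANGUAGE C STRICT;
```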
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Tom Lane
Date:
Tomas Vondra <tomas@vondra.me> writes:
> On 2/5/26 01:15, David Rowley wrote:
>> I imagined if it's just for machines running tests then you could just
>> load everything. If it was coded in such a way that a tuple fetched by
>> doing a Seq Scan on the catalogue table was what went into the cache,
>> rather than the Seq Scan drives the normal cache lookup code,
>> resulting in a subsequent Index Scan on the catalogue's index, then it
>> could be done with fairly low overhead. I imagine in the order of
>> <10ms from fresh initdb. That doesn't seem excessively long for
>> machines running tests in the background.
> So we'd just go through all the caches relcaches/catcaches/... and load
> all the stuff that's in pg_catalog? I guess that could work,
... until there's a cache flush event. This whole discussion seems
to me to be based on a misconception.
regards, tom lane
Re: Non-deterministic buffer counts reported in execution with EXPLAIN ANALYZE BUFFERS
From
Radim Marek
Date:
Thank you all for the time and detailed responses.
To clarify - RegreSQL works with tolerances, and I only came across this when detecting improvements (i.e. the opposite of regressions), where smaller executions showed consistent "gains" that turned out to be cache-related.
Understanding the complexity of the underlying cause is in a way enough for me - there's much more depth here than I initially anticipated. My warm-up fix attempts had led me to believe it might be much easier.
I'll be honest - Lukas' patches will take me a few weeks to properly digest. But in the end that's exactly the kind of deeper understanding I was hoping to build by engaging here.
I really do appreciate the patience with a first-time poster :)
Radim
On Thu, 5 Feb 2026 at 16:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas@vondra.me> writes:
>> On 2/5/26 01:15, David Rowley wrote:
>>> I imagined if it's just for machines running tests then you could just
>>> load everything. If it was coded in such a way that a tuple fetched by
>>> doing a Seq Scan on the catalogue table was what went into the cache,
>>> rather than the Seq Scan drives the normal cache lookup code,
>>> resulting in a subsequent Index Scan on the catalogue's index, then it
>>> could be done with fairly low overhead. I imagine in the order of
>>> <10ms from fresh initdb. That doesn't seem excessively long for
>>> machines running tests in the background.
>
>> So we'd just go through all the caches relcaches/catcaches/... and load
>> all the stuff that's in pg_catalog? I guess that could work,
>
> ... until there's a cache flush event. This whole discussion seems
> to me to be based on a misconception.
>
> regards, tom lane