Обсуждение: Stack-based tracking of per-node WAL/buffer usage
Hi,
Please find attached a patch series that introduces a new paradigm for how per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated distinct buffer hits [0].
Currently we utilize pgWalUsage/pgBufferUsage as global counters, and in InstrStopNode we call the rather expensive BufferUsageAccumDiff/WalUsageAccumDiff to know how much activity happened within a given node cycle.
This proposal instead uses a stack, where each time we enter a node (InstrStartNode) we point a new global (pgInstrStack) to the current stack entry. Whilst we're in that node we increment buffer/WAL usage statistics to the stack entry. On exit (InstrStopNode) we restore the previous entry.
This change provides about a 10% performance benefit for EXPLAIN ANALYZE on paths that repeatedly enter InstrStopNode, e.g. SELECT COUNT(*):
CREATE TABLE test(id int);
INSERT INTO test SELECT * FROM generate_series(0, 1000000);
INSERT INTO test SELECT * FROM generate_series(0, 1000000);
master (124ms, best out of 3):
postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual time=124.910..124.910 rows=1.00 loops=1)
Buffers: shared hit=752 read=3673
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.201..62.228 rows=1000001.00 loops=1)
Buffers: shared hit=752 read=3673
Planning Time: 0.116 ms
Execution Time: 124.961 ms
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual time=124.910..124.910 rows=1.00 loops=1)
Buffers: shared hit=752 read=3673
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.201..62.228 rows=1000001.00 loops=1)
Buffers: shared hit=752 read=3673
Planning Time: 0.116 ms
Execution Time: 124.961 ms
patched (109ms, best out of 3):
postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual time=109.788..109.788 rows=1.00 loops=1)
Buffers: shared hit=940 read=3485
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.153..69.368 rows=1000001.00 loops=1)
Buffers: shared hit=940 read=3485
Planning Time: 0.134 ms
Execution Time: 109.837 ms
(6 rows)
postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual time=109.788..109.788 rows=1.00 loops=1)
Buffers: shared hit=940 read=3485
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.153..69.368 rows=1000001.00 loops=1)
Buffers: shared hit=940 read=3485
Planning Time: 0.134 ms
Execution Time: 109.837 ms
(6 rows)
I have also prototyped a more ambitious approach that completely removes pgWalUsage/pgBufferUsage (utilizing the stack-collected data for e.g. pg_stat_statements), but for now this patch set does not include that change, but instead keeps adding to these legacy globals as well.
Patches attached:
0001: Separate node instrumentation from other use of Instrumentation struct
Previously different places (e.g. query "total time") were repurposing the per-node Instrumentation struct. Instead, simplify the Instrumentation struct to only track time, WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop outside of per-node instrumentation. Introduce the NodeInstrumentation struct to carry forward the per-node instrumentation information.
Previously different places (e.g. query "total time") were repurposing the per-node Instrumentation struct. Instead, simplify the Instrumentation struct to only track time, WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop outside of per-node instrumentation. Introduce the NodeInstrumentation struct to carry forward the per-node instrumentation information.
0002: Replace direct changes of pgBufferUsage/pgWalUsage with INSTR_* macros
0003: Introduce stack for tracking per-node WAL/buffer usage
Feedback/thoughts welcome!
CCing Andres since he had expressed interest in this off-list.
[0]: See lightning talk slides from PGConf.Dev discussing an HLL-based EXPLAIN (BUFFERS DISTINCT): https://resources.pganalyze.com/pganalyze_PGConf.dev_2025_shared_blks_hit_distinct.pdf
Thanks,
Lukas
Lukas Fittl
Вложения
Hi,
On 2025-08-31 16:57:01 -0700, Lukas Fittl wrote:
> Please find attached a patch series that introduces a new paradigm for how
> per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
> overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
> distinct buffer hits [0].
I like this for a third reason: To separate out buffer access statistics for
the index and the table in index scans. Right now it's very hard to figure out
if a query is slow because of the index lookups or finding the tuples in the
table.
> 0001: Separate node instrumentation from other use of Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the per-node Instrumentation struct. Instead, simplify the Instrumentation
> struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
> drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
> the NodeInstrumentation struct to carry forward the per-node
> instrumentation information.
It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).
> 0003: Introduce stack for tracking per-node WAL/buffer usage
> From 4375fcb4141f18d6cd927659970518553aa3fe94 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v1 3/3] Introduce stack for tracking per-node WAL/buffer usage
> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index b83ced9a57a..1c2268bc608 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -312,6 +312,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
> DestReceiver *dest;
> bool sendTuples;
> MemoryContext oldcontext;
> + InstrStackResource *res;
>
> /* sanity checks */
> Assert(queryDesc != NULL);
> @@ -333,6 +334,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
> if (queryDesc->totaltime)
> InstrStart(queryDesc->totaltime);
>
> + /* Start up per-query node level instrumentation */
> + res = InstrStartQuery();
> +
> /*
> * extract information from the query descriptor and the query feature.
> */
> @@ -382,6 +386,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
> if (sendTuples)
> dest->rShutdown(dest);
>
> + /* Shut down per-query node level instrumentation */
> + InstrShutdownQuery(res);
> +
> if (queryDesc->totaltime)
> InstrStop(queryDesc->totaltime, estate->es_processed);
Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.
> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d286471254b..7436f307994 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
>
> /* Stop the node if we started it above, reporting 0 tuples. */
> if (node->instrument && node->instrument->running)
> + {
> InstrStopNode(node->instrument, 0);
>
> + /*
> + * Propagate WAL/buffer stats to the parent node on the
> + * instrumentation stack (which is where InstrStopNode returned us
> + * to).
> + */
> + InstrNodeAddToCurrent(&node->instrument->stack);
> + }
> +
> return false;
> }
Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:
/*
* If we know we won't need to back up, we can release resources at this
* point.
*/
if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
ExecShutdownNode(planstate);
> +static void
> +ResOwnerReleaseInstrStack(Datum res)
> +{
> + /*
> + * XXX: Registered resources are *not* called in reverse order, i.e. we'll
> + * get what was first registered first at shutdown. To avoid handling
> + * that, we are resetting the stack here on abort (instead of recovering
> + * to previous).
> + */
> + pgInstrStack = NULL;
> +}
Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?
I wonder if the solution could be to walk the stack and search for the
to-be-released element.
Greetings,
Andres Freund
Hi Andres,
Thanks for the review!
Attached an updated patch set that addresses the feedback, and also adds the complete removal of the global pgBufferUsage variable in later patches (0005-0007), to avoid counting both the stack and the variable.
FWIW, pgWalUsage would also be nice to remove, but it has some interesting interactions with the cumulative statistics system and heap_page_prune_and_freeze, and seems less performance critical.
On Thu, Sep 4, 2025 at 1:23 PM Andres Freund <andres@anarazel.de> wrote:
> 0001: Separate node instrumentation from other use of Instrumentation struct
...
It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).
Yeah, agreed. I added in a new 0001 patch that changes this to instr_time consistently. I don't see a good reason to keep the transformation to seconds in the Instrumentation logic, since all in-tree callers convert it to milliseconds anyway.
> 0003: Introduce stack for tracking per-node WAL/buffer usage
Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.
Makes sense, I've adjusted that to be conditional (in the now renamed 0004 patch).
In the updated patch I've also decided to piggyback on QueryDesc totaltime as the "owning" Instrumentation here for the query's lifetime. It seems simpler that way and avoids having special purpose methods. To go along with that I've changed the general purpose Instrumentation struct to use stack-based instrumentation at the same time.
> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d286471254b..7436f307994 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
...
> + InstrNodeAddToCurrent(&node->instrument->stack);
...
Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:
You are of course correct, I didn't consider cursors correctly here.
It seems there isn't an existing executor node walk that could be repurposed, so I added a new one in ExecutorFinish ("ExecAccumNodeInstrumentation"). From my read of the code there are no use cases where we need aggregated instrumentation data before ExecutorFinish.
> +static void
> +ResOwnerReleaseInstrStack(Datum res)
> +{
> + /*
> + * XXX: Registered resources are *not* called in reverse order, i.e. we'll
> + * get what was first registered first at shutdown. To avoid handling
> + * that, we are resetting the stack here on abort (instead of recovering
> + * to previous).
> + */
> + pgInstrStack = NULL;
> +}
Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?
I wonder if the solution could be to walk the stack and search for the
to-be-released element.
Yes, good point, I did not consider that case. I've addressed this by only updating the current stack if its not already a parent of the element being released. We are also always adding the element's statistics to the (updated) active stack element at abort.
Thanks,
Lukas
--
Lukas Fittl
Вложения
- v2-0006-Introduce-alternate-Instrumentation-stack-mechani.patch
- v2-0007-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch
- v2-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch
- v2-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch
- v2-0004-Introduce-stack-for-tracking-per-node-WAL-buffer-.patch
- v2-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch
- v2-0002-Separate-node-instrumentation-from-other-use-of-I.patch
On Tue, Sep 9, 2025 at 10:35 PM Lukas Fittl <lukas@fittl.com> wrote:
Attached an updated patch set that addresses the feedback, and also adds the complete removal of the global pgBufferUsage variable in later patches (0005-0007), to avoid counting both the stack and the variable.
See attached the same patch set rebased on latest master.
Thanks,
Lukas
Lukas Fittl
Вложения
- v3-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch
- v3-0004-Introduce-stack-for-tracking-per-node-WAL-buffer-.patch
- v3-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch
- v3-0002-Separate-node-instrumentation-from-other-use-of-I.patch
- v3-0006-Introduce-alternate-Instrumentation-stack-mechani.patch
- v3-0007-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch
- v3-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch
On 2025-10-22 14:28:24 +0300, Lukas Fittl wrote:
> On Tue, Sep 9, 2025 at 10:35 PM Lukas Fittl <lukas@fittl.com> wrote:
>
> > Attached an updated patch set that addresses the feedback, and also adds
> > the complete removal of the global pgBufferUsage variable in later patches
> > (0005-0007), to avoid counting both the stack and the variable.
> >
>
> See attached the same patch set rebased on latest master.
> From d40f69cce15dfa10479c8be31917b33a49d01477 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v3 1/7] Instrumentation: Keep time fields as instrtime,
> require caller to convert
>
> Previously the Instrumentation logic always converted to seconds, only for many
> of the callers to do unnecessary division to get to milliseconds. Since an upcoming
> refactoring will split the Instrumentation struct, utilize instrtime always to
> keep things simpler.
LGTM, think we should apply this regardless of the rest of the patches.
> @@ -173,14 +169,14 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
> dst->running = true;
> dst->firsttuple = add->firsttuple;
> }
> - else if (dst->running && add->running && dst->firsttuple > add->firsttuple)
> + else if (dst->running && add->running && INSTR_TIME_CMP_LT(dst->firsttuple, add->firsttuple))
> dst->firsttuple = add->firsttuple;
This isn't due to this patch, but it seems a bit odd that we use the minimum
time for the first tuple, but the average time for the node's completion...
> diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
> index f71a851b18d..646934020d1 100644
> --- a/src/include/portability/instr_time.h
> +++ b/src/include/portability/instr_time.h
> @@ -184,6 +184,8 @@ GetTimerFrequency(void)
> #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
> ((x).ticks += (y).ticks - (z).ticks)
>
> +#define INSTR_TIME_CMP_LT(x,y) \
> + ((x).ticks > (y).ticks)
>
> #define INSTR_TIME_GET_DOUBLE(t) \
> ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
> --
> 2.47.1
Any reason to actually have _CMP_ in the name? Other operations like _ADD
don't have such an additional verb in the name.
> From 7546f855d138d0dac0d8c22ea5915314810f13e5 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sat, 1 Mar 2025 19:31:30 -0800
> Subject: [PATCH v3 2/7] Separate node instrumentation from other use of
> Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the Instrumentation struct initially introduced for capturing per-node
> statistics during execution. This dual use of the struct is confusing,
> e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
> code paths, and prevents future refactorings.
>
> Instead, simplify the Instrumentation struct to only track time,
> WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop
> outside of per-node instrumentation. Introduce the NodeInstrumentation
> struct to carry forward the per-node instrumentation information.
> @@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
> */
> oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
>
> - /*
> - * Make sure stats accumulation is done. (Note: it's okay if several
> - * levels of hook all do this.)
> - */
> - InstrEndLoop(queryDesc->totaltime);
> -
> /* Log plan if duration is exceeded. */
> msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
> if (msec >= auto_explain_log_min_duration)
Maybe add a comment about the removal of these InstrEndLoop() calls to the
commit message? If I understand correctly they were superfluous before, but
that's not entirely obvious when just looking at the patch.
> +/*
> + * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
> + *
> + * Initialized through InstrAlloc, followed by one or more calls to a pair of
> + * InstrStart/InstrStop (activity is measured inbetween).
> + */
> typedef struct Instrumentation
> +{
> + /* Parameters set at creation: */
> + bool need_timer; /* true if we need timer data */
> + bool need_bufusage; /* true if we need buffer usage data */
> + bool need_walusage; /* true if we need WAL usage data */
> + /* Internal state keeping: */
> + instr_time starttime; /* start time of last InstrStart */
> + BufferUsage bufusage_start; /* buffer usage at start */
> + WalUsage walusage_start; /* WAL usage at start */
> + /* Accumulated statistics: */
> + instr_time total; /* total runtime */
> + double ntuples; /* total tuples counted in InstrStop */
> + BufferUsage bufusage; /* total buffer usage */
> + WalUsage walusage; /* total WAL usage */
> +} Instrumentation;
Maybe add a comment explaining why ntuples is in here?
> From aa1acccb3dfa6a5d81a9a049d8cb63762a3d7cf7 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v3 4/7] Introduce stack for tracking per-node WAL/buffer usage
Could use a commit message :)
> ---
> .../pg_stat_statements/pg_stat_statements.c | 4 +-
> src/backend/commands/explain.c | 8 +-
> src/backend/commands/trigger.c | 4 +-
> src/backend/executor/execMain.c | 25 ++-
> src/backend/executor/execProcnode.c | 29 +++
> src/backend/executor/instrument.c | 199 ++++++++++++++----
> src/include/executor/executor.h | 1 +
> src/include/executor/instrument.h | 53 ++++-
> 8 files changed, 260 insertions(+), 63 deletions(-)
>
> diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
> index f43a33b3787..eeabd820d8e 100644
> --- a/contrib/pg_stat_statements/pg_stat_statements.c
> +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> @@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
> PGSS_EXEC,
> INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
> queryDesc->estate->es_total_processed,
> - &queryDesc->totaltime->bufusage,
> - &queryDesc->totaltime->walusage,
> + &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
> + &INSTR_GET_WALUSAGE(queryDesc->totaltime),
Getting a pointer to something returned by a macro is a bit ugly... Perhaps
it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not
easily possible given how pgss_planner() currently tracks things :(
Maybe it's worth refactoring this a bit in a precursor patch?
> @@ -1266,7 +1280,12 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> palloc0(n * sizeof(ExprState *));
> if (instrument_options)
> - resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
> + {
> + if ((instrument_options & INSTRUMENT_TIMER) != 0)
> + resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_TIMER);
> + else
> + resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_ROWS);
> + }
> }
> else
> {
I'd not duplicate the InstrAlloc(), but compute the flags separately.
> /* ------------------------------------------------------------------------
> @@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context)
> return false;
> }
>
> +/*
> + * ExecAccumNodeInstrumentation
> + *
> + * Accumulate instrumentation stats from all execution nodes to their respective
> + * parents (or the original parent instrumentation stack).
> + */
> +void
> +ExecAccumNodeInstrumentation(PlanState *node)
> +{
> + (void) ExecAccumNodeInstrumentation_walker(node, NULL);
> +}
I wonder if this is too narrow a name. There might be other uses of a pass
across the node tree at that point. OTOH, it's probably better to just rename
it at that later point.
> +static bool
> +ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
> +{
> + if (node == NULL)
> + return false;
> +
> + check_stack_depth();
> +
> + planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);
There already is a check_stack_depth() in planstate_tree_walker().
> + if (node->instrument && node->instrument->stack.previous)
> + InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
> +
> + return false;
> +}
E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then also
have special handling for a few node types. Do we need something like that
here too? It probably is ok, but it's worth explicitly checking and adding a
comment.
> +/*
> + * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
> + */
> +static void ResOwnerReleaseInstrumentation(Datum res);
> +static const ResourceOwnerDesc instrumentation_resowner_desc =
> +{
> + .name = "instrumentation",
> + .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
> + .release_priority = RELEASE_PRIO_FIRST,
> + .ReleaseResource = ResOwnerReleaseInstrumentation,
> + .DebugPrint = NULL, /* default message is fine */
> +};
Is there a reason to do the release here before the lock release? And why
_FIRST?
> +static void
> +ResOwnerReleaseInstrumentation(Datum res)
> +{
> + Instrumentation *instr = (Instrumentation *) DatumGetPointer(res);
> +
> + /*
> + * Because registered resources are *not* called in reverse order, we'll
> + * get what was first registered first at shutdown. Thus, on any later
> + * resources we need to not change the stack, which was already set to the
> + * correct previous entry.
> + */
FWIW, the release order is not guaranteed to be in that order either,
e.g. once resowner switches to hashing, it'll essentially be random.
> + if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
> + pgInstrStack = instr->stack.previous;
Hm - this is effectively O(stack-depth^2), right? It's probably fine, given
that we have fairly limited nesting (explain + pg_stat_statements +
auto_explain is probably the current max), but seems worth noting in a
comment?
> + /*
> + * Always accumulate all collected stats before the abort, even if we
> + * already walked up the stack with an earlier resource.
> + */
> + if (pgInstrStack)
> + InstrStackAdd(pgInstrStack, &instr->stack);
Why are we accumulating stats in case of errors? It's probably fine, but doing
less as part of cleanup is pre ferrable...
> /* General purpose instrumentation handling */
> Instrumentation *
> InstrAlloc(int n, int instrument_options)
> {
> - Instrumentation *instr;
> + Instrumentation *instr = NULL;
> + bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> + bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> + bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> + int i;
> +
> + /*
> + * If resource owner will be used, we must allocate in the transaction
> + * context (not the calling context, usually a lower context), because the
> + * memory might otherwise be freed too early in an abort situation.
> + */
> + if (need_buffers || need_wal)
> + instr = MemoryContextAllocZero(CurTransactionContext, n * sizeof(Instrumentation));
> + else
> + instr = palloc0(n * sizeof(Instrumentation));
Is this long-lived enough? I'm e.g. wondering about utility statements that
internally starting transactions, wouldn't that cause problems with a user
like pgss tracking something like CIC?
> - /* initialize all fields to zeroes, then modify as needed */
> - instr = palloc0(n * sizeof(Instrumentation));
> - if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
> + for (i = 0; i < n; i++)
> {
> - bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> - bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> - bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> - int i;
> -
> - for (i = 0; i < n; i++)
> - {
> - instr[i].need_bufusage = need_buffers;
> - instr[i].need_walusage = need_wal;
> - instr[i].need_timer = need_timer;
> - }
> + instr[i].need_bufusage = need_buffers;
> + instr[i].need_walusage = need_wal;
> + instr[i].need_timer = need_timer;
> }
>
> return instr;
> }
> +
> void
> InstrStart(Instrumentation *instr)
> {
> + Assert(!instr->finalized);
> +
> if (instr->need_timer &&
> !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
> elog(ERROR, "InstrStart called twice in a row");
>
> - if (instr->need_bufusage)
> - instr->bufusage_start = pgBufferUsage;
> -
> - if (instr->need_walusage)
> - instr->walusage_start = pgWalUsage;
> + if (instr->need_bufusage || instr->need_walusage)
> + InstrPushStackResource(instr);
> }
> +
> void
> -InstrStop(Instrumentation *instr, double nTuples)
> +InstrStop(Instrumentation *instr, double nTuples, bool finalize)
> {
> instr_time endtime;
>
> @@ -84,14 +178,15 @@ InstrStop(Instrumentation *instr, double nTuples)
> INSTR_TIME_SET_ZERO(instr->starttime);
> }
>
> - /* Add delta of buffer usage since entry to node's totals */
> - if (instr->need_bufusage)
> - BufferUsageAccumDiff(&instr->bufusage,
> - &pgBufferUsage, &instr->bufusage_start);
> + if (instr->need_bufusage || instr->need_walusage)
> + InstrPopStackResource(instr);
>
> - if (instr->need_walusage)
> - WalUsageAccumDiff(&instr->walusage,
> - &pgWalUsage, &instr->walusage_start);
> + if (finalize)
> + {
> + instr->finalized = true;
> + if (pgInstrStack)
> + InstrStackAdd(pgInstrStack, &instr->stack);
> + }
> }
>
> /* Allocate new node instrumentation structure(s) */
> @@ -139,12 +234,14 @@ InstrStartNode(NodeInstrumentation * instr)
> !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
> elog(ERROR, "InstrStartNode called twice in a row");
>
> - /* save buffer usage totals at node entry, if needed */
> - if (instr->need_bufusage)
> - instr->bufusage_start = pgBufferUsage;
> + if (instr->need_bufusage || instr->need_walusage)
> + {
> + /* Ensure that we always have a parent, even at the top most node */
> + Assert(pgInstrStack != NULL);
>
> - if (instr->need_walusage)
> - instr->walusage_start = pgWalUsage;
> + instr->stack.previous = pgInstrStack;
> + pgInstrStack = &instr->stack;
> + }
> }
>
> /* Exit from a plan node */
> @@ -169,14 +266,12 @@ InstrStopNode(NodeInstrumentation * instr, double nTuples)
> INSTR_TIME_SET_ZERO(instr->starttime);
> }
>
> - /* Add delta of buffer usage since entry to node's totals */
> - if (instr->need_bufusage)
> - BufferUsageAccumDiff(&instr->bufusage,
> - &pgBufferUsage, &instr->bufusage_start);
> -
> - if (instr->need_walusage)
> - WalUsageAccumDiff(&instr->walusage,
> - &pgWalUsage, &instr->walusage_start);
> + if (instr->need_bufusage || instr->need_walusage)
> + {
> + /* Ensure that we always have a parent, even at the top most node */
> + Assert(instr->stack.previous != NULL);
> + pgInstrStack = instr->stack.previous;
> + }
>
> /* Is this the first tuple of this cycle? */
> if (!instr->running)
> @@ -253,10 +348,20 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
>
> /* Add delta of buffer usage since entry to node's totals */
> if (dst->need_bufusage)
> - BufferUsageAdd(&dst->bufusage, &add->bufusage);
> + BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage);
>
> if (dst->need_walusage)
> - WalUsageAdd(&dst->walusage, &add->walusage);
> + WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
> +}
> +
> +void
> +InstrStackAdd(InstrStack * dst, InstrStack * add)
> +{
> + Assert(dst != NULL);
> + Assert(add != NULL);
> +
> + BufferUsageAdd(&dst->bufusage, &add->bufusage);
> + WalUsageAdd(&dst->walusage, &add->walusage);
> }
>
Do we want to do BufferUsageAdd() etc even if we are not tracking buffer
usage? Those operations aren't cheap...
> /* note current values during parallel executor startup */
> @@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> void
> InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> {
> + if (pgInstrStack != NULL)
> + {
> + InstrStack *dst = pgInstrStack;
> +
> + BufferUsageAdd(&dst->bufusage, bufusage);
> + WalUsageAdd(&dst->walusage, walusage);
> + }
> +
> BufferUsageAdd(&pgBufferUsage, bufusage);
> WalUsageAdd(&pgWalUsage, walusage);
> }
Is the pgInstrStack == NULL case actually reachable?
> diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
> index 78d3653997b..d04607ce40c 100644
> --- a/src/include/executor/instrument.h
> +++ b/src/include/executor/instrument.h
> @@ -14,6 +14,7 @@
> #define INSTRUMENT_H
>
> #include "portability/instr_time.h"
> +#include "utils/resowner.h"
I'd probably not include resowner here but just forward declare the typedef.
> @@ -146,26 +161,46 @@ extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
> extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
> extern void BufferUsageAccumDiff(BufferUsage *dst,
> const BufferUsage *add, const BufferUsage *sub);
> +extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
> extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
> const WalUsage *sub);
>
> +#define INSTR_GET_BUFUSAGE(instr) \
> + instr->stack.bufusage
> +
> +#define INSTR_GET_WALUSAGE(instr) \
> + instr->stack.walusage
Not convinced that having these macros is worthwhile.
At this point I reached return -ENEEDCOFFEE :)
Greetings,
Andres Freund