Обсуждение: Stack-based tracking of per-node WAL/buffer usage

Поиск

Список

Период

Сортировка

Stack-based tracking of per-node WAL/buffer usage

От

Lukas Fittl

Дата:

01 сентября, 02:57:01

Hi,

Please find attached a patch series that introduces a new paradigm for how per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated distinct buffer hits [0].

Currently we utilize pgWalUsage/pgBufferUsage as global counters, and in InstrStopNode we call the rather expensive BufferUsageAccumDiff/WalUsageAccumDiff to know how much activity happened within a given node cycle.

This proposal instead uses a stack, where each time we enter a node (InstrStartNode) we point a new global (pgInstrStack) to the current stack entry. Whilst we're in that node we increment buffer/WAL usage statistics to the stack entry. On exit (InstrStopNode) we restore the previous entry.

This change provides about a 10% performance benefit for EXPLAIN ANALYZE on paths that repeatedly enter InstrStopNode, e.g. SELECT COUNT(*):

CREATE TABLE test(id int);
INSERT INTO test SELECT * FROM generate_series(0, 1000000);

master (124ms, best out of 3):

postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual time=124.910..124.910 rows=1.00 loops=1)
Buffers: shared hit=752 read=3673
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.201..62.228 rows=1000001.00 loops=1)
Buffers: shared hit=752 read=3673
Planning Time: 0.116 ms
Execution Time: 124.961 ms

patched (109ms, best out of 3):

postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=16925.01..16925.02 rows=1 width=8) (actual time=109.788..109.788 rows=1.00 loops=1)
Buffers: shared hit=940 read=3485
-> Seq Scan on test (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.153..69.368 rows=1000001.00 loops=1)
Buffers: shared hit=940 read=3485
Planning Time: 0.134 ms
Execution Time: 109.837 ms
(6 rows)

I have also prototyped a more ambitious approach that completely removes pgWalUsage/pgBufferUsage (utilizing the stack-collected data for e.g. pg_stat_statements), but for now this patch set does not include that change, but instead keeps adding to these legacy globals as well.

Patches attached:

0001: Separate node instrumentation from other use of Instrumentation struct

Previously different places (e.g. query "total time") were repurposing the per-node Instrumentation struct. Instead, simplify the Instrumentation struct to only track time, WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop outside of per-node instrumentation. Introduce the NodeInstrumentation struct to carry forward the per-node instrumentation information.

0002: Replace direct changes of pgBufferUsage/pgWalUsage with INSTR_* macros

0003: Introduce stack for tracking per-node WAL/buffer usage

Feedback/thoughts welcome!

CCing Andres since he had expressed interest in this off-list.

[0]: See lightning talk slides from PGConf.Dev discussing an HLL-based EXPLAIN (BUFFERS DISTINCT): https://resources.pganalyze.com/pganalyze_PGConf.dev_2025_shared_blks_hit_distinct.pdf

Thanks,

Lukas

Lukas Fittl

Вложения

Re: Stack-based tracking of per-node WAL/buffer usage

От

Andres Freund

Дата:

04 сентября, 23:23:06

Hi,

On 2025-08-31 16:57:01 -0700, Lukas Fittl wrote:
> Please find attached a patch series that introduces a new paradigm for how
> per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
> overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
> distinct buffer hits [0].

I like this for a third reason: To separate out buffer access statistics for
the index and the table in index scans. Right now it's very hard to figure out
if a query is slow because of the index lookups or finding the tuples in the
table.


> 0001: Separate node instrumentation from other use of Instrumentation struct
> 
>     Previously different places (e.g. query "total time") were repurposing
> the per-node Instrumentation struct. Instead, simplify the Instrumentation
> struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
> drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
> the NodeInstrumentation struct to carry forward the per-node
> instrumentation information.

It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).


> 0003: Introduce stack for tracking per-node WAL/buffer usage


> From 4375fcb4141f18d6cd927659970518553aa3fe94 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v1 3/3] Introduce stack for tracking per-node WAL/buffer usage


> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index b83ced9a57a..1c2268bc608 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -312,6 +312,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>      DestReceiver *dest;
>      bool        sendTuples;
>      MemoryContext oldcontext;
> +    InstrStackResource *res;
>  
>      /* sanity checks */
>      Assert(queryDesc != NULL);
> @@ -333,6 +334,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>      if (queryDesc->totaltime)
>          InstrStart(queryDesc->totaltime);
>  
> +    /* Start up per-query node level instrumentation */
> +    res = InstrStartQuery();
> +
>      /*
>       * extract information from the query descriptor and the query feature.
>       */
> @@ -382,6 +386,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>      if (sendTuples)
>          dest->rShutdown(dest);
>  
> +    /* Shut down per-query node level instrumentation */
> +    InstrShutdownQuery(res);
> +
>      if (queryDesc->totaltime)
>          InstrStop(queryDesc->totaltime, estate->es_processed);


Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.


> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d286471254b..7436f307994 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
>  
>      /* Stop the node if we started it above, reporting 0 tuples. */
>      if (node->instrument && node->instrument->running)
> +    {
>          InstrStopNode(node->instrument, 0);
>  
> +        /*
> +         * Propagate WAL/buffer stats to the parent node on the
> +         * instrumentation stack (which is where InstrStopNode returned us
> +         * to).
> +         */
> +        InstrNodeAddToCurrent(&node->instrument->stack);
> +    }
> +
>      return false;
>  }

Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:

    /*
     * If we know we won't need to back up, we can release resources at this
     * point.
     */
    if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
        ExecShutdownNode(planstate);


> +static void
> +ResOwnerReleaseInstrStack(Datum res)
> +{
> +    /*
> +     * XXX: Registered resources are *not* called in reverse order, i.e. we'll
> +     * get what was first registered first at shutdown. To avoid handling
> +     * that, we are resetting the stack here on abort (instead of recovering
> +     * to previous).
> +     */
> +    pgInstrStack = NULL;
> +}

Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?

I wonder if the solution could be to walk the stack and search for the
to-be-released element.

Greetings,

Andres Freund

Re: Stack-based tracking of per-node WAL/buffer usage

От

Lukas Fittl

Дата:

09 сентября, 22:35:43

Hi Andres,

Thanks for the review!

Attached an updated patch set that addresses the feedback, and also adds the complete removal of the global pgBufferUsage variable in later patches (0005-0007), to avoid counting both the stack and the variable.

FWIW, pgWalUsage would also be nice to remove, but it has some interesting interactions with the cumulative statistics system and heap_page_prune_and_freeze, and seems less performance critical.

On Thu, Sep 4, 2025 at 1:23 PM Andres Freund <andres@anarazel.de> wrote:

> 0001: Separate node instrumentation from other use of Instrumentation struct
...
It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).

Yeah, agreed. I added in a new 0001 patch that changes this to instr_time consistently. I don't see a good reason to keep the transformation to seconds in the Instrumentation logic, since all in-tree callers convert it to milliseconds anyway.

> 0003: Introduce stack for tracking per-node WAL/buffer usage

Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.

Makes sense, I've adjusted that to be conditional (in the now renamed 0004 patch).

In the updated patch I've also decided to piggyback on QueryDesc totaltime as the "owning" Instrumentation here for the query's lifetime. It seems simpler that way and avoids having special purpose methods. To go along with that I've changed the general purpose Instrumentation struct to use stack-based instrumentation at the same time.

> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d286471254b..7436f307994 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
...
> + InstrNodeAddToCurrent(&node->instrument->stack);
...

Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:

You are of course correct, I didn't consider cursors correctly here.

It seems there isn't an existing executor node walk that could be repurposed, so I added a new one in ExecutorFinish ("ExecAccumNodeInstrumentation"). From my read of the code there are no use cases where we need aggregated instrumentation data before ExecutorFinish.

> +static void
> +ResOwnerReleaseInstrStack(Datum res)
> +{
> + /*
> + * XXX: Registered resources are *not* called in reverse order, i.e. we'll
> + * get what was first registered first at shutdown. To avoid handling
> + * that, we are resetting the stack here on abort (instead of recovering
> + * to previous).
> + */
> + pgInstrStack = NULL;
> +}

Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?

I wonder if the solution could be to walk the stack and search for the
to-be-released element.

Yes, good point, I did not consider that case. I've addressed this by only updating the current stack if its not already a parent of the element being released. We are also always adding the element's statistics to the (updated) active stack element at abort.

Thanks,

Lukas

Lukas Fittl

Вложения

Re: Stack-based tracking of per-node WAL/buffer usage

От

Lukas Fittl

Дата:

22 октября, 14:28:24

On Tue, Sep 9, 2025 at 10:35 PM Lukas Fittl <lukas@fittl.com> wrote:

Attached an updated patch set that addresses the feedback, and also adds the complete removal of the global pgBufferUsage variable in later patches (0005-0007), to avoid counting both the stack and the variable.

See attached the same patch set rebased on latest master.

Thanks,

Lukas

Lukas Fittl

Вложения

Re: Stack-based tracking of per-node WAL/buffer usage

От

Andres Freund

Дата:

22 октября, 15:59:02

On 2025-10-22 14:28:24 +0300, Lukas Fittl wrote:
> On Tue, Sep 9, 2025 at 10:35 PM Lukas Fittl <lukas@fittl.com> wrote:
>
> > Attached an updated patch set that addresses the feedback, and also adds
> > the complete removal of the global pgBufferUsage variable in later patches
> > (0005-0007), to avoid counting both the stack and the variable.
> >
>
> See attached the same patch set rebased on latest master.

> From d40f69cce15dfa10479c8be31917b33a49d01477 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v3 1/7] Instrumentation: Keep time fields as instrtime,
>  require caller to convert
>
> Previously the Instrumentation logic always converted to seconds, only for many
> of the callers to do unnecessary division to get to milliseconds. Since an upcoming
> refactoring will split the Instrumentation struct, utilize instrtime always to
> keep things simpler.

LGTM, think we should apply this regardless of the rest of the patches.


> @@ -173,14 +169,14 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
>          dst->running = true;
>          dst->firsttuple = add->firsttuple;
>      }
> -    else if (dst->running && add->running && dst->firsttuple > add->firsttuple)
> +    else if (dst->running && add->running && INSTR_TIME_CMP_LT(dst->firsttuple, add->firsttuple))
>          dst->firsttuple = add->firsttuple;

This isn't due to this patch, but it seems a bit odd that we use the minimum
time for the first tuple, but the average time for the node's completion...


> diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
> index f71a851b18d..646934020d1 100644
> --- a/src/include/portability/instr_time.h
> +++ b/src/include/portability/instr_time.h
> @@ -184,6 +184,8 @@ GetTimerFrequency(void)
>  #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
>      ((x).ticks += (y).ticks - (z).ticks)
>
> +#define INSTR_TIME_CMP_LT(x,y) \
> +    ((x).ticks > (y).ticks)
>
>  #define INSTR_TIME_GET_DOUBLE(t) \
>      ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
> --
> 2.47.1

Any reason to actually have _CMP_ in the name? Other operations like _ADD
don't have such an additional verb in the name.


> From 7546f855d138d0dac0d8c22ea5915314810f13e5 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Sat, 1 Mar 2025 19:31:30 -0800
> Subject: [PATCH v3 2/7] Separate node instrumentation from other use of
>  Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the Instrumentation struct initially introduced for capturing per-node
> statistics during execution. This dual use of the struct is confusing,
> e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
> code paths, and prevents future refactorings.
>
> Instead, simplify the Instrumentation struct to only track time,
> WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop
> outside of per-node instrumentation. Introduce the NodeInstrumentation
> struct to carry forward the per-node instrumentation information.


> @@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
>           */
>          oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
>
> -        /*
> -         * Make sure stats accumulation is done.  (Note: it's okay if several
> -         * levels of hook all do this.)
> -         */
> -        InstrEndLoop(queryDesc->totaltime);
> -
>          /* Log plan if duration is exceeded. */
>          msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
>          if (msec >= auto_explain_log_min_duration)

Maybe add a comment about the removal of these InstrEndLoop() calls to the
commit message? If I understand correctly they were superfluous before, but
that's not entirely obvious when just looking at the patch.



> +/*
> + * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
> + *
> + * Initialized through InstrAlloc, followed by one or more calls to a pair of
> + * InstrStart/InstrStop (activity is measured inbetween).
> + */
>  typedef struct Instrumentation
> +{
> +    /* Parameters set at creation: */
> +    bool        need_timer;        /* true if we need timer data */
> +    bool        need_bufusage;    /* true if we need buffer usage data */
> +    bool        need_walusage;    /* true if we need WAL usage data */
> +    /* Internal state keeping: */
> +    instr_time    starttime;        /* start time of last InstrStart */
> +    BufferUsage bufusage_start; /* buffer usage at start */
> +    WalUsage    walusage_start; /* WAL usage at start */
> +    /* Accumulated statistics: */
> +    instr_time    total;            /* total runtime */
> +    double        ntuples;        /* total tuples counted in InstrStop */
> +    BufferUsage bufusage;        /* total buffer usage */
> +    WalUsage    walusage;        /* total WAL usage */
> +} Instrumentation;

Maybe add a comment explaining why ntuples is in here?



> From aa1acccb3dfa6a5d81a9a049d8cb63762a3d7cf7 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lukas@fittl.com>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v3 4/7] Introduce stack for tracking per-node WAL/buffer usage

Could use a commit message :)


> ---
>  .../pg_stat_statements/pg_stat_statements.c   |   4 +-
>  src/backend/commands/explain.c                |   8 +-
>  src/backend/commands/trigger.c                |   4 +-
>  src/backend/executor/execMain.c               |  25 ++-
>  src/backend/executor/execProcnode.c           |  29 +++
>  src/backend/executor/instrument.c             | 199 ++++++++++++++----
>  src/include/executor/executor.h               |   1 +
>  src/include/executor/instrument.h             |  53 ++++-
>  8 files changed, 260 insertions(+), 63 deletions(-)
>
> diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
> index f43a33b3787..eeabd820d8e 100644
> --- a/contrib/pg_stat_statements/pg_stat_statements.c
> +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> @@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
>                     PGSS_EXEC,
>                     INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
>                     queryDesc->estate->es_total_processed,
> -                   &queryDesc->totaltime->bufusage,
> -                   &queryDesc->totaltime->walusage,
> +                   &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
> +                   &INSTR_GET_WALUSAGE(queryDesc->totaltime),

Getting a pointer to something returned by a macro is a bit ugly... Perhaps
it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not
easily possible given how pgss_planner() currently tracks things :(

Maybe it's worth refactoring this a bit in a precursor patch?


> @@ -1266,7 +1280,12 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
>          resultRelInfo->ri_TrigWhenExprs = (ExprState **)
>              palloc0(n * sizeof(ExprState *));
>          if (instrument_options)
> -            resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
> +        {
> +            if ((instrument_options & INSTRUMENT_TIMER) != 0)
> +                resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_TIMER);
> +            else
> +                resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_ROWS);
> +        }
>      }
>      else
>      {

I'd not duplicate the InstrAlloc(), but compute the flags separately.


>  /* ------------------------------------------------------------------------
> @@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context)
>      return false;
>  }
>
> +/*
> + * ExecAccumNodeInstrumentation
> + *
> + * Accumulate instrumentation stats from all execution nodes to their respective
> + * parents (or the original parent instrumentation stack).
> + */
> +void
> +ExecAccumNodeInstrumentation(PlanState *node)
> +{
> +    (void) ExecAccumNodeInstrumentation_walker(node, NULL);
> +}

I wonder if this is too narrow a name. There might be other uses of a pass
across the node tree at that point. OTOH, it's probably better to just rename
it at that later point.


> +static bool
> +ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
> +{
> +    if (node == NULL)
> +        return false;
> +
> +    check_stack_depth();
> +
> +    planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);

There already is a check_stack_depth() in planstate_tree_walker().


> +    if (node->instrument && node->instrument->stack.previous)
> +        InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
> +
> +    return false;
> +}

E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then also
have special handling for a few node types. Do we need something like that
here too?  It probably is ok, but it's worth explicitly checking and adding a
comment.



> +/*
> + * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
> + */
> +static void ResOwnerReleaseInstrumentation(Datum res);
> +static const ResourceOwnerDesc instrumentation_resowner_desc =
> +{
> +    .name = "instrumentation",
> +    .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
> +    .release_priority = RELEASE_PRIO_FIRST,
> +    .ReleaseResource = ResOwnerReleaseInstrumentation,
> +    .DebugPrint = NULL,            /* default message is fine */
> +};

Is there a reason to do the release here before the lock release? And why
_FIRST?


> +static void
> +ResOwnerReleaseInstrumentation(Datum res)
> +{
> +    Instrumentation *instr = (Instrumentation *) DatumGetPointer(res);
> +
> +    /*
> +     * Because registered resources are *not* called in reverse order, we'll
> +     * get what was first registered first at shutdown. Thus, on any later
> +     * resources we need to not change the stack, which was already set to the
> +     * correct previous entry.
> +     */

FWIW, the release order is not guaranteed to be in that order either,
e.g. once resowner switches to hashing, it'll essentially be random.


> +    if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
> +        pgInstrStack = instr->stack.previous;

Hm - this is effectively O(stack-depth^2), right?  It's probably fine, given
that we have fairly limited nesting (explain + pg_stat_statements +
auto_explain is probably the current max), but seems worth noting in a
comment?


> +    /*
> +     * Always accumulate all collected stats before the abort, even if we
> +     * already walked up the stack with an earlier resource.
> +     */
> +    if (pgInstrStack)
> +        InstrStackAdd(pgInstrStack, &instr->stack);

Why are we accumulating stats in case of errors? It's probably fine, but doing
less as part of cleanup is pre ferrable...


>  /* General purpose instrumentation handling */
>  Instrumentation *
>  InstrAlloc(int n, int instrument_options)
>  {
> -    Instrumentation *instr;
> +    Instrumentation *instr = NULL;
> +    bool        need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> +    bool        need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> +    bool        need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> +    int            i;
> +
> +    /*
> +     * If resource owner will be used, we must allocate in the transaction
> +     * context (not the calling context, usually a lower context), because the
> +     * memory might otherwise be freed too early in an abort situation.
> +     */
> +    if (need_buffers || need_wal)
> +        instr = MemoryContextAllocZero(CurTransactionContext, n * sizeof(Instrumentation));
> +    else
> +        instr = palloc0(n * sizeof(Instrumentation));

Is this long-lived enough? I'm e.g. wondering about utility statements that
internally starting transactions, wouldn't that cause problems with a user
like pgss tracking something like CIC?


> -    /* initialize all fields to zeroes, then modify as needed */
> -    instr = palloc0(n * sizeof(Instrumentation));
> -    if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
> +    for (i = 0; i < n; i++)
>      {
> -        bool        need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> -        bool        need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> -        bool        need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> -        int            i;
> -
> -        for (i = 0; i < n; i++)
> -        {
> -            instr[i].need_bufusage = need_buffers;
> -            instr[i].need_walusage = need_wal;
> -            instr[i].need_timer = need_timer;
> -        }
> +        instr[i].need_bufusage = need_buffers;
> +        instr[i].need_walusage = need_wal;
> +        instr[i].need_timer = need_timer;
>      }
>
>      return instr;
>  }
> +
>  void
>  InstrStart(Instrumentation *instr)
>  {
> +    Assert(!instr->finalized);
> +
>      if (instr->need_timer &&
>          !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
>          elog(ERROR, "InstrStart called twice in a row");
>
> -    if (instr->need_bufusage)
> -        instr->bufusage_start = pgBufferUsage;
> -
> -    if (instr->need_walusage)
> -        instr->walusage_start = pgWalUsage;
> +    if (instr->need_bufusage || instr->need_walusage)
> +        InstrPushStackResource(instr);
>  }
> +
>  void
> -InstrStop(Instrumentation *instr, double nTuples)
> +InstrStop(Instrumentation *instr, double nTuples, bool finalize)
>  {
>      instr_time    endtime;
>
> @@ -84,14 +178,15 @@ InstrStop(Instrumentation *instr, double nTuples)
>          INSTR_TIME_SET_ZERO(instr->starttime);
>      }
>
> -    /* Add delta of buffer usage since entry to node's totals */
> -    if (instr->need_bufusage)
> -        BufferUsageAccumDiff(&instr->bufusage,
> -                             &pgBufferUsage, &instr->bufusage_start);
> +    if (instr->need_bufusage || instr->need_walusage)
> +        InstrPopStackResource(instr);
>
> -    if (instr->need_walusage)
> -        WalUsageAccumDiff(&instr->walusage,
> -                          &pgWalUsage, &instr->walusage_start);
> +    if (finalize)
> +    {
> +        instr->finalized = true;
> +        if (pgInstrStack)
> +            InstrStackAdd(pgInstrStack, &instr->stack);
> +    }
>  }
>
>  /* Allocate new node instrumentation structure(s) */
> @@ -139,12 +234,14 @@ InstrStartNode(NodeInstrumentation * instr)
>          !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
>          elog(ERROR, "InstrStartNode called twice in a row");
>
> -    /* save buffer usage totals at node entry, if needed */
> -    if (instr->need_bufusage)
> -        instr->bufusage_start = pgBufferUsage;
> +    if (instr->need_bufusage || instr->need_walusage)
> +    {
> +        /* Ensure that we always have a parent, even at the top most node */
> +        Assert(pgInstrStack != NULL);
>
> -    if (instr->need_walusage)
> -        instr->walusage_start = pgWalUsage;
> +        instr->stack.previous = pgInstrStack;
> +        pgInstrStack = &instr->stack;
> +    }
>  }
>
>  /* Exit from a plan node */
> @@ -169,14 +266,12 @@ InstrStopNode(NodeInstrumentation * instr, double nTuples)
>          INSTR_TIME_SET_ZERO(instr->starttime);
>      }
>
> -    /* Add delta of buffer usage since entry to node's totals */
> -    if (instr->need_bufusage)
> -        BufferUsageAccumDiff(&instr->bufusage,
> -                             &pgBufferUsage, &instr->bufusage_start);
> -
> -    if (instr->need_walusage)
> -        WalUsageAccumDiff(&instr->walusage,
> -                          &pgWalUsage, &instr->walusage_start);
> +    if (instr->need_bufusage || instr->need_walusage)
> +    {
> +        /* Ensure that we always have a parent, even at the top most node */
> +        Assert(instr->stack.previous != NULL);
> +        pgInstrStack = instr->stack.previous;
> +    }
>
>      /* Is this the first tuple of this cycle? */
>      if (!instr->running)
> @@ -253,10 +348,20 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
>
>      /* Add delta of buffer usage since entry to node's totals */
>      if (dst->need_bufusage)
> -        BufferUsageAdd(&dst->bufusage, &add->bufusage);
> +        BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage);
>
>      if (dst->need_walusage)
> -        WalUsageAdd(&dst->walusage, &add->walusage);
> +        WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
> +}
> +
> +void
> +InstrStackAdd(InstrStack * dst, InstrStack * add)
> +{
> +    Assert(dst != NULL);
> +    Assert(add != NULL);
> +
> +    BufferUsageAdd(&dst->bufusage, &add->bufusage);
> +    WalUsageAdd(&dst->walusage, &add->walusage);
>  }
>

Do we want to do BufferUsageAdd() etc even if we are not tracking buffer
usage? Those operations aren't cheap...


>  /* note current values during parallel executor startup */
> @@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
>  void
>  InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
>  {
> +    if (pgInstrStack != NULL)
> +    {
> +        InstrStack *dst = pgInstrStack;
> +
> +        BufferUsageAdd(&dst->bufusage, bufusage);
> +        WalUsageAdd(&dst->walusage, walusage);
> +    }
> +
>      BufferUsageAdd(&pgBufferUsage, bufusage);
>      WalUsageAdd(&pgWalUsage, walusage);
>  }

Is the pgInstrStack == NULL case actually reachable?


> diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
> index 78d3653997b..d04607ce40c 100644
> --- a/src/include/executor/instrument.h
> +++ b/src/include/executor/instrument.h
> @@ -14,6 +14,7 @@
>  #define INSTRUMENT_H
>
>  #include "portability/instr_time.h"
> +#include "utils/resowner.h"

I'd probably not include resowner here but just forward declare the typedef.

> @@ -146,26 +161,46 @@ extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
>  extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
>  extern void BufferUsageAccumDiff(BufferUsage *dst,
>                                   const BufferUsage *add, const BufferUsage *sub);
> +extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
>  extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
>                                const WalUsage *sub);
>
> +#define INSTR_GET_BUFUSAGE(instr) \
> +    instr->stack.bufusage
> +
> +#define INSTR_GET_WALUSAGE(instr) \
> +    instr->stack.walusage

Not convinced that having these macros is worthwhile.


At this point I reached return -ENEEDCOFFEE :)

Greetings,

Andres Freund

Re: Stack-based tracking of per-node WAL/buffer usage

От

Lukas Fittl

Дата:

31 октября, 10:18:04

Hi Andres,

Thanks for the detailed review!

Attached v4 patchset that addresses feedback (unless otherwise noted below) and is rebased on master. Other changes:

- Ensured each patch is individually pgindent clean (and compiles)

- Refactored 0003 a bit to consistently use InstrPushStack/InstrPopStack helpers for modifying the active stack entry

- Building on that refactoring, merged v3/0006 "Introduce alternate Instrumentation stack mechanism relying on PG_FINALLY" into the main commit introducing the stack mechanism (the "alternate" mechanism is just using these helpers directly and making sure InstrPopStack is called via PG_FINALLY, instead of using resource owners)

- Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that illustrates how the stack mechanism can be used to separate index and table buffer accesses in the EXPLAIN for Index Scans

On Wed, Oct 22, 2025 at 5:59 AM Andres Freund <andres@anarazel.de> wrote:

> +/*
> + * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
> + *
> + * Initialized through InstrAlloc, followed by one or more calls to a pair of
> + * InstrStart/InstrStop (activity is measured inbetween).
> + */
> typedef struct Instrumentation
> +{
> + /* Parameters set at creation: */
> + bool need_timer; /* true if we need timer data */
> + bool need_bufusage; /* true if we need buffer usage data */
> + bool need_walusage; /* true if we need WAL usage data */
> + /* Internal state keeping: */
> + instr_time starttime; /* start time of last InstrStart */
> + BufferUsage bufusage_start; /* buffer usage at start */
> + WalUsage walusage_start; /* WAL usage at start */
> + /* Accumulated statistics: */
> + instr_time total; /* total runtime */
> + double ntuples; /* total tuples counted in InstrStop */
> + BufferUsage bufusage; /* total buffer usage */
> + WalUsage walusage; /* total WAL usage */
> +} Instrumentation;

Maybe add a comment explaining why ntuples is in here?

After thinking about this some more, I'd think we should just go ahead and special case trigger instrumentation, and specifically count firings of the trigger (which was counted in "ntuples" before).

I've adjusted the 0002 patch accordingly to split out both node and trigger instrumentation.

> ---
> .../pg_stat_statements/pg_stat_statements.c | 4 +-
> src/backend/commands/explain.c | 8 +-
> src/backend/commands/trigger.c | 4 +-
> src/backend/executor/execMain.c | 25 ++-
> src/backend/executor/execProcnode.c | 29 +++
> src/backend/executor/instrument.c | 199 ++++++++++++++----
> src/include/executor/executor.h | 1 +
> src/include/executor/instrument.h | 53 ++++-
> 8 files changed, 260 insertions(+), 63 deletions(-)
>
> diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
> index f43a33b3787..eeabd820d8e 100644
> --- a/contrib/pg_stat_statements/pg_stat_statements.c
> +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> @@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
> PGSS_EXEC,
> INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
> queryDesc->estate->es_total_processed,
> - &queryDesc->totaltime->bufusage,
> - &queryDesc->totaltime->walusage,
> + &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
> + &INSTR_GET_WALUSAGE(queryDesc->totaltime),

Getting a pointer to something returned by a macro is a bit ugly... Perhaps
it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not
easily possible given how pgss_planner() currently tracks things :(

Maybe it's worth refactoring this a bit in a precursor patch?

I've dropped the macro (its just one additional indirection after all) - I think we could refactor this further (i.e. pass the stack), but that doesn't seem strictly necessary.

> /* ------------------------------------------------------------------------
> @@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context)
> return false;
> }
>
> +/*
> + * ExecAccumNodeInstrumentation
> + *
> + * Accumulate instrumentation stats from all execution nodes to their respective
> + * parents (or the original parent instrumentation stack).
> + */
> +void
> +ExecAccumNodeInstrumentation(PlanState *node)
> +{
> + (void) ExecAccumNodeInstrumentation_walker(node, NULL);
> +}

I wonder if this is too narrow a name. There might be other uses of a pass
across the node tree at that point. OTOH, it's probably better to just rename
it at that later point.

Yeah, I can't think of a better name, so I've left this the same for now.

> + if (node->instrument && node->instrument->stack.previous)
> + InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
> +
> + return false;
> +}

E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then also
have special handling for a few node types. Do we need something like that
here too? It probably is ok, but it's worth explicitly checking and adding a
comment.

I went through and double checked - we don't need to special case these node types in my understanding, and I couldn't see any specific cases where intermediary nodes would be removed in shutdown either (o.e.

causing dropped stats since a node was removed with its associated stats not being added to the parent yet).

I added a comment to make it clear this must run after ExecShutdownNode.

> +/*
> + * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
> + */
> +static void ResOwnerReleaseInstrumentation(Datum res);
> +static const ResourceOwnerDesc instrumentation_resowner_desc =
> +{
> + .name = "instrumentation",
> + .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
> + .release_priority = RELEASE_PRIO_FIRST,
> + .ReleaseResource = ResOwnerReleaseInstrumentation,
> + .DebugPrint = NULL, /* default message is fine */
> +};

Is there a reason to do the release here before the lock release? And why
_FIRST?

Adjusted to have its own phase, and after lock release.

> + if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
> + pgInstrStack = instr->stack.previous;

Hm - this is effectively O(stack-depth^2), right? It's probably fine, given
that we have fairly limited nesting (explain + pg_stat_statements +
auto_explain is probably the current max), but seems worth noting in a
comment?

Yeah, I added a comment - I don't see a case where this is a bottleneck today given the limited nesting of resource owner using stacks.

> + /*
> + * Always accumulate all collected stats before the abort, even if we
> + * already walked up the stack with an earlier resource.
> + */
> + if (pgInstrStack)
> + InstrStackAdd(pgInstrStack, &instr->stack);

Why are we accumulating stats in case of errors? It's probably fine, but doing
less as part of cleanup is pre ferrable...

In my understanding, we need to do this in case of functions called in a query that catch a rollback/error, since we'd otherwise not account for that function's activity as part of the top-level query.

> /* General purpose instrumentation handling */
> Instrumentation *
> InstrAlloc(int n, int instrument_options)
> {
...
> + if (need_buffers || need_wal)
> + instr = MemoryContextAllocZero(CurTransactionContext, n * sizeof(Instrumentation));
> + else
> + instr = palloc0(n * sizeof(Instrumentation));

Is this long-lived enough? I'm e.g. wondering about utility statements that
internally starting transactions, wouldn't that cause problems with a user
like pgss tracking something like CIC?

I ended up refactoring this a bit, since it seemed useful to do an explicit pfree at InstrStop or when aborting, both to avoid leaks, and to theoretically support using TopMemoryContext here.

That said, from my testing I think CurTransactionContext is sufficient, because we just need something that lives long enough during resource owner abort situations (e.g. per-query context doesn't work, since the abort frees it before we do our resource owner handling).

The pgss+CIC case isn't relevant here (I think), because utility statements don't use the resource owner mechanism at all (with the exception of EXPLAIN which calls into ExecutorStart), instead we use a PG_TRY/PG_FINALLY in pg_stat_statements to pop the stack.

> +void
> +InstrStackAdd(InstrStack * dst, InstrStack * add)
> +{
> + Assert(dst != NULL);
> + Assert(add != NULL);
> +
> + BufferUsageAdd(&dst->bufusage, &add->bufusage);
> + WalUsageAdd(&dst->walusage, &add->walusage);
> }
>

Do we want to do BufferUsageAdd() etc even if we are not tracking buffer
usage? Those operations aren't cheap...

I briefly considered whether we could add this to the InstrStack itself (i.e. whether we actually care about buffer, WAL usage or both), but I think where it gets messy is that we can have indirect requirements to track this - you might have pg_stat_statements capturing both, but e.g. the utility statement being executed only caring about emitting WAL usage in the log.

I'm also not familiar with an in-core use case today where only WAL (but not buffers) is needed, short of doing something like "EXPLAIN (ANALYZE, BUFFERS OFF, WAL ON)" without having pg_stat_statements or similar enabled. Do you have a specific example where this could help?

> /* note current values during parallel executor startup */
> @@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> void
> InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> {
> + if (pgInstrStack != NULL)
> + {
> + InstrStack *dst = pgInstrStack;
> +
> + BufferUsageAdd(&dst->bufusage, bufusage);
> + WalUsageAdd(&dst->walusage, walusage);
> + }
> +
> BufferUsageAdd(&pgBufferUsage, bufusage);
> WalUsageAdd(&pgWalUsage, walusage);
> }

Is the pgInstrStack == NULL case actually reachable?

In my reading of the code, its necessary because we unconditionally track WAL/buffer usage in parallel workers, even if the leader doesn't actually need it.

We could be smarter about this (i.e. tell the workers not to collect the information in the first place), but for now it seemed easiest to just discard it.

Thanks,

Lukas

Lukas Fittl

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Stack-based tracking of per-node WAL/buffer usage

Stack-based tracking of per-node WAL/buffer usage

Вложения

Re: Stack-based tracking of per-node WAL/buffer usage

Re: Stack-based tracking of per-node WAL/buffer usage

Вложения

Re: Stack-based tracking of per-node WAL/buffer usage

Вложения

Re: Stack-based tracking of per-node WAL/buffer usage

Re: Stack-based tracking of per-node WAL/buffer usage

Вложения