Thread: Batching in executor


Batching in executor

From: Amit Langote
Date:
At PGConf.dev this year we had an unconference session [1] on whether
the community can support an additional batch executor. The discussion
there led me to start hacking on $subject. I have also had off-list
discussions on this topic in recent months with Andres and David, who
have offered useful thoughts.

This patch series is an early attempt to make executor nodes pass
around batches of tuples instead of tuple-at-a-time slots. The main
motivation is to enable expression evaluation in batch form, which can
substantially reduce per-tuple overhead (mainly from function calls)
and open the door to further optimizations such as SIMD usage in
aggregate transition functions. We could even change algorithms of
some plan nodes to operate on batches when, for example, a child node
can return batches.

The expression evaluation changes are still exploratory, but before
moving to make them ready for serious review, we first need a way for
scan nodes to produce tuples in batches and an executor API that
allows upper nodes to consume them. The series includes both the
foundational work to let scan nodes produce batches and an executor
API to pass them around, and a set of follow-on patches that
experiment with batch-aware expression evaluation.

The patch set is structured in two parts. The first three patches lay
the groundwork in the executor and table AM, and the later patches
prototype batch-aware expression evaluation.

Patches 0001-0003 introduce a new batch table AM API and an initial
heapam implementation that can return multiple tuples per call.
SeqScan is adapted to use this interface, with new ExecSeqScanBatch*
routines that fetch tuples in bulk but can still return one
TupleTableSlot at a time to preserve compatibility. On the executor
side, ExecProcNodeBatch() is added alongside ExecProcNode(), with
TupleBatch as the new container for passing groups of tuples. ExecScan
has batch-aware variants that use the AM API internally, but can fall
back to row-at-a-time behavior when required. Plan shapes and EXPLAIN
output remain unchanged; the differences here are executor-internal.
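
To make the consumer side concrete, here is a rough sketch of how an upper
node would pull batches; the TupleBatch layout and the exact
ExecProcNodeBatch() calling convention shown here are simplified for
illustration rather than copied from the patch:

/*
 * Illustrative sketch only.  The real TupleBatch also carries AM-opaque
 * data plus a TupleBatchOps pointer, and the actual calling convention
 * of ExecProcNodeBatch() may differ in detail.
 */
typedef struct TupleBatch
{
    int             nrows;              /* valid rows in this batch */
    TupleTableSlot *slots[64];          /* up to EXEC_BATCH_ROWS rows */
} TupleBatch;

static void
consume_child_batches(PlanState *child)
{
    for (;;)
    {
        TupleBatch *batch = ExecProcNodeBatch(child);

        if (batch == NULL || batch->nrows == 0)
            break;                      /* child is exhausted */

        for (int i = 0; i < batch->nrows; i++)
            process_row(batch->slots[i]);       /* hypothetical helper */
    }
}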

At present, heapam batches are restricted to tuples from a single
page, which means they may not always fill EXEC_BATCH_ROWS (currently
64). That limits how much upper executor nodes can leverage batching,
especially with selective quals where batches may end up sparsely
populated. A future improvement would be to allow batches to span
pages or to let the scan node request more tuples when its buffer is
not yet full, so it avoids passing mostly empty TupleBatch to upper
nodes.
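
For reference, the reason a batch can come back partially filled looks
roughly like the following simplified sketch (not the actual heapam code;
visibility checks and tid bookkeeping are omitted):

static int
fill_batch_from_page(Page page, OffsetNumber *lineoff,
                     HeapTupleData *rows, int max_rows)
{
    OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
    int          nrows = 0;

    while (nrows < max_rows && *lineoff <= maxoff)
    {
        ItemId      itemid = PageGetItemId(page, *lineoff);

        (*lineoff)++;
        if (!ItemIdIsNormal(itemid))
            continue;           /* skip unused/dead line pointers */

        rows[nrows].t_data = (HeapTupleHeader) PageGetItem(page, itemid);
        rows[nrows].t_len = ItemIdGetLength(itemid);
        nrows++;
    }

    /* nrows can be well below max_rows here: the single-page restriction */
    return nrows;
}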

It might also be worth adding some lightweight instrumentation to make
it easier to reason about batch behavior. For example, counters for
average rows per batch, reasons why a batch ended (capacity reached,
page boundary, end of scan), or batches per million rows could help
confirm whether limitations like the single-page restriction or
EXEC_BATCH_ROWS size are showing up in benchmarks. Suggestions from
others on which forms of instrumentation would be most useful are
welcome.
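
Concretely, something like the following per-scan counters would be enough
to answer those questions (field names purely illustrative):

/* Illustrative only: counters that could back such instrumentation. */
typedef struct BatchScanCounters
{
    uint64      batches;        /* batches produced by the scan */
    uint64      rows;           /* total rows across all batches */
    uint64      ended_full;     /* batches that reached EXEC_BATCH_ROWS */
    uint64      ended_page;     /* batches cut short at a page boundary */
    uint64      ended_eos;      /* batches ended by end of scan */
} BatchScanCounters;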

Patches 0004 onwards start experimenting with making expression
evaluation batch-aware, first in the aggregate node. These patches add
new EEOPs (ExprEvalOps and ExprEvalSteps) to fetch attributes into
TupleBatch vectors, evaluate quals across a batch, and run aggregate
transitions over multiple rows at once. Agg is extended to pull
TupleBatch from its child via ExecProcNodeBatch(), with two prototype
paths: one that loops inside the interpreter and another that calls
the transition function once per batch using AggBulkArgs. These are
still PoCs, but with scan nodes and the executor capable of moving
batches around, they provide a base from which the work can be refined
into something potentially committable after the usual polish,
testing, and review.
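
As a rough illustration of the per-batch fmgr path, the transition call
conceptually collapses to a single invocation per batch, along these lines
(simplified; null handling, memory contexts, and the actual AggBulkArgs
layout are omitted or assumed):

/*
 * Simplified sketch, not the actual interpreter step.  AggBulkArgs is
 * assumed to carry a whole batch worth of input values for the
 * transition function.
 */
static void
advance_transition_batch(AggStatePerTrans pertrans,
                         AggStatePerGroup pergroup,
                         AggBulkArgs *bulk)
{
    FunctionCallInfo fcinfo = pertrans->transfn_fcinfo;

    /* one fmgr call for the whole batch instead of one per row */
    fcinfo->args[0].value = pergroup->transValue;
    fcinfo->args[0].isnull = pergroup->transValueIsNull;
    fcinfo->args[1].value = PointerGetDatum(bulk);
    fcinfo->args[1].isnull = false;

    pergroup->transValue = FunctionCallInvoke(fcinfo);
    pergroup->transValueIsNull = fcinfo->isnull;
}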

One area that needs more thought is how TupleBatch interacts with
ExprContext. At present the patches extend ExprContext with
scan_batch, inner_batch, and outer_batch fields, but per-batch
evaluation still spills into ecxt_per_tuple_memory, effectively
reusing the per-tuple context for per-batch work. That’s arguably an
abuse of the contract described in ExecEvalExprSwitchContext(), and it
will need a cleaner definition of how batch-scoped memory should be
managed. Feedback on how best to structure that would be particularly
helpful.
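
One direction I am considering is a separate batch-scoped context that gets
reset once per batch, analogous to ResetExprContext() per tuple; sketched
below with placeholder names (ecxt_per_batch_memory and the helper do not
exist in the current patch):

static void
eval_quals_for_batch(ExprState *qual, struct TupleBatch *batch,
                     ExprContext *econtext)
{
    MemoryContext oldcxt;

    /* hypothetical batch-scoped context, reset once per batch */
    MemoryContextReset(econtext->ecxt_per_batch_memory);
    oldcxt = MemoryContextSwitchTo(econtext->ecxt_per_batch_memory);

    evaluate_batch_quals(qual, batch, econtext);    /* hypothetical helper */

    MemoryContextSwitchTo(oldcxt);
}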

To evaluate the overheads and benefits, I ran microbenchmarks with
single and multi-aggregate queries on a single table, with and without
WHERE clauses. Tables were fully VACUUMed so visibility maps are set
and IO costs are minimal. shared_buffers was large enough to fit the
whole table (up to 10M rows, ~43 on each page), and all pages were
prewarmed into cache before tests. Table schema/script is at [2].

Observations from benchmarking (Detailed benchmark tables are at [3];
below is just a high-level summary of the main patterns):

* Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
sum(a) FROM bar_N): batching scan output alone improved latency by
~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
especially once fmgr overhead was paid per batch instead of per row.

* Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
qual interpreter gave a big step up, with latencies dropping by
~30-40% compared to batching=off.

* Five aggregates, no WHERE: batching input from the child scan cut
~15% off runtime. Adding batched transition evaluation increased
improvements to ~30%.

* Five aggregates, with WHERE: modest gains from scan/input batching,
but per-batch transition evaluation and batched quals brought ~20-30%
improvement.

* Across all cases, executor overheads became visible only after IO
was minimized. Once executor cost dominated, batching consistently
reduced CPU time, with the largest benefits coming from avoiding
per-row fmgr calls and evaluating quals across batches.

I would appreciate it if others could try these patches with their own
microbenchmarks or workloads and see if they can reproduce numbers
similar to mine. Feedback on both the general direction and the
details of the patches would be very helpful. In particular, patches
0001-0003, which add the basic batch APIs and integrate them into
SeqScan, are intended to be the first candidates for review and
eventual commit. Comments on the later, more experimental patches
(aggregate input batching and expression evaluation (qual, aggregate
transition) batching) are also welcome.

--
Thanks, Amit Langote

[1]
https://wiki.postgresql.org/wiki/PGConf.dev_2025_Developer_Unconference#Can_the_Community_Support_an_Additional_Batch_Executor

[2] Tables:
cat create_tables.sh
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
  psql -c "drop table if exists bar_$i; create table bar_$i (a int, b int, c int, d int, e int, f int, g int, h int, i text, j int, k int, l int, m int, n int, o int);" 2>&1 > /dev/null
  psql -c "insert into bar_$i select i, i, i, i, i, i, i, i, repeat('x', 100), i, i, i, i, i, i from generate_series(1, $i) i;" 2>&1 > /dev/null
  echo "bar_$i created."
done

[3] Benchmark result tables

All timings are in milliseconds. off = executor_batching off, on =
executor_batching on.  Negative %diff means on is better than off.

Single aggregate, no WHERE
(~20% faster with scan batching only; ~40%+ faster with batched transitions)

With only batched-seqscan (0001-0003):
Rows    off       on       %diff
1M      10.448    8.147    -22.0
2M      18.442    14.552   -21.1
3M      25.296    22.195   -12.3
4M      36.285    33.383   -8.0
5M      44.441    39.894   -10.2
10M     93.110    82.744   -11.1

With batched-agg on top (0001-0007):
Rows    off       on       %diff
1M      9.891     5.579    -43.6
2M      17.648    9.653    -45.3
3M      27.451    13.919   -49.3
4M      36.394    24.269   -33.3
5M      44.665    29.260   -34.5
10M     87.898    56.221   -36.0

Single aggregate, with WHERE
(~30–40% faster once quals + transitions are batched)

With only batched-seqscan (0001-0003):
Rows    off       on       %diff
1M      18.485    17.749   -4.0
2M      34.696    33.033   -4.8
3M      49.582    46.155   -6.9
4M      70.270    67.036   -4.6
5M      84.616    81.013   -4.3
10M     174.649   164.611  -5.7

With batched-agg and batched-qual on top (0001-0008):
Rows    off       on       %diff
1M      18.887    12.367   -34.5
2M      35.706    22.457   -37.1
3M      51.626    30.902   -40.1
4M      72.694    48.214   -33.7
5M      88.103    57.623   -34.6
10M     181.350   124.278  -31.5

Five aggregates, no WHERE
(~15% faster with scan/input batching; ~30% with batched transitions)

Agg input batching only (0001-0004):
Rows    off       on       %diff
1M      23.193    19.196   -17.2
2M      42.177    35.862   -15.0
3M      62.192    51.121   -17.8
4M      83.215    74.665   -10.3
5M      99.426    91.904   -7.6
10M     213.794   184.263  -13.8

Batched transition eval, per-row fmgr (0001-0006):
Rows    off       on       %diff
1M      23.501    19.672   -16.3
2M      44.128    36.603   -17.0
3M      64.466    53.079   -17.7
5M      103.442   97.623   -5.6
10M     219.120   190.354  -13.1

Batched transition eval, per-batch fmgr (0001-0007):
Rows    off       on       %diff
1M      24.238    16.806   -30.7
2M      43.056    30.939   -28.1
3M      62.938    43.295   -31.2
4M      83.346    63.357   -24.0
5M      100.772   78.351   -22.2
10M     213.755   162.203  -24.1

Five aggregates, with WHERE
(~10–15% faster with scan/input batching; ~30% with batched transitions + quals)

Agg input batching only (0001-0004):
Rows    off       on       %diff
1M      24.261    22.744   -6.3
2M      45.802    41.712   -8.9
3M      79.311    72.732   -8.3
4M      107.189   93.870   -12.4
5M      129.172   115.300  -10.7
10M     278.785   236.275  -15.2

Batched transition eval, per-batch fmgr (0001-0007):
Rows    off       on       %diff
1M      24.354    19.409   -20.3
2M      46.888    36.687   -21.8
3M      82.147    57.683   -29.8
4M      109.616   76.471   -30.2
5M      133.777   94.776   -29.2
10M     282.514   194.954  -31.0

Batched transition eval + batched qual (0001-0008):
Rows    off       on       %diff
1M      24.691    20.193   -18.2
2M      47.182    36.530   -22.6
3M      82.030    58.663   -28.5
4M      110.573   76.500   -30.8
5M      136.701   93.299   -31.7
10M     280.551   191.021  -31.9

Attachments

Re: Batching in executor

From: Bruce Momjian
Date:
On Fri, Sep 26, 2025 at 10:28:33PM +0900, Amit Langote wrote:
> At PGConf.dev this year we had an unconference session [1] on whether
> the community can support an additional batch executor. The discussion
> there led me to start hacking on $subject. I have also had off-list
> discussions on this topic in recent months with Andres and David, who
> have offered useful thoughts.
> 
> This patch series is an early attempt to make executor nodes pass
> around batches of tuples instead of tuple-at-a-time slots. The main
> motivation is to enable expression evaluation in batch form, which can
> substantially reduce per-tuple overhead (mainly from function calls)
> and open the door to further optimizations such as SIMD usage in
> aggregate transition functions. We could even change algorithms of
> some plan nodes to operate on batches when, for example, a child node
> can return batches.

For background, people might want to watch these two videos from POSETTE
2025.  The first video explains how data warehouse query needs are
different from OLTP needs:

    Building a PostgreSQL data warehouse
    https://www.youtube.com/watch?v=tpq4nfEoioE

and the second one explains the executor optimizations done in PG 18:

    Hacking Postgres Executor For Performance
    https://www.youtube.com/watch?v=D3Ye9UlcR5Y

I learned from these two videos that to handle new workloads, I need to
think about the query demands differently; the open question, of course, is
whether this can be accomplished without hampering OLTP workloads.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.



Re: Batching in executor

From: Tomas Vondra
Date:
Hi Amit,

Thanks for the patch. I took a look over the weekend and did a couple of
experiments / benchmarks, so let me share some initial feedback (or
rather a bunch of questions I came up with).

I'll start with some general thoughts, before going into some nitpicky
comments about patches / code and perf results.

I think the general goal of the patch - reducing the per-tuple overhead
and making the executor more efficient for OLAP workloads - is very
desirable. I believe the limitations of per-row executor are one of the
reasons why attempts to implement a columnar TAM mostly failed. The
compression is nice, but it's hard to be competitive without an executor
that leverages that too. So starting with an executor, in a way that
helps even heap, seems like a good plan. So +1 to this.

While looking at the patch, I couldn't help but think about the index
prefetching stuff that I work on. It also introduces the concept of a
"batch", for passing data between an index AM and the executor. It's
interesting how different the designs are in some respects. I'm not
saying one of those designs is wrong; it's more due to different goals.

For example, the index prefetching patch establishes a "shared" batch
struct, and the index AM is expected to fill it with data. After that,
the batch is managed entirely by indexam.c, with no AM calls. The only
AM-specific bit in the batch is "position", but that's used only when
advancing to the next page, etc.

This patch does things differently. IIUC, each TAM may produce its own
"batch", which is then wrapped in a generic one. For example, heap
produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
is fine. In the prefetching we chose to move all this code (walking the
batch items) from the AMs into the layer above, and make it AM agnostic.

But for the batching, we want to retain the custom format as long as
possible. Presumably, the various advantages of the TAMs are tied to the
custom/columnar storage format. Memory efficiency thanks to compression,
execution on compressed data, etc. Keeping the custom format as long as
possible is the whole point of "late materialization" (and materializing
as late as possible is one of the important details in column stores).

How far ahead have you thought about these capabilities? I was wondering
about two things in particular. First, at which point do we have to
"materialize" the TupleBatch into some generic format (e.g. TupleSlots).
I get it that you want to enable passing batches between nodes, but
would those use the same "format" as the underlying scan node, or some
generic one? Second, will it be possible to execute expressions on the
custom batches (i.e. on "compressed data")? Or is it necessary to
"materialize" the batch into regular tuple slots? I realize those may
not be there "now" but maybe it'd be nice to plan for the future.

It might be worth exploring some columnar formats, and see if this
design would be a good fit. Let's say we want to process data read from
a parquet file. Would we be able to leverage the format, or would we
need to "materialize" into slots too early? Or maybe it'd be good to
look at the VCI extension [1], discussed in a nearby thread. AFAICS
that's still based on an index AM, but there were suggestions to use TAM
instead (and maybe that'd be a better choice).

The other option would be to "create batches" during execution, say by
having a new node that accumulates tuples, builds a batch and sends it
to the node above. This would help both in cases when either the lower
node does not produce batches at all, or the batches are too small (due
to filtering, aggregation, ...). Of course, it'd only win if this
increases efficiency of the upper part of the plan enough to pay for
building the batches. That can be a hard decision.

You also mentioned we could make batches larger by letting them span
multiple pages, etc. I'm not sure that's worth it - wouldn't that
substantially complicate the TAM code, which would need to pin+track
multiple buffers for each batch, etc.? Possible, but is it worth it?

I'm not sure allowing multi-page batches would actually solve the issue.
It'd help with batches at the "scan level", but presumably the batch
size in the upper nodes matters just as much. Large scan batches may
help, but hard to predict.

In the index prefetching patch we chose to keep batches 1:1 with leaf
pages, at least for now. Instead we allowed having multiple batches at
once. I'm not sure that'd be necessary for TAMs, though.

This also reminds me of LIMIT queries. The way I imagine a "batchified"
executor to work is that batches are essentially "units of work". For
example, a nested loop would grab a batch of tuples from the outer
relation, lookup inner tuples for the whole batch, and only then pass
the result batch. (I'm ignoring the cases when the batch explodes due to
duplicates.)

But what if there's a LIMIT 1 on top? Maybe it'd be enough to process
just the first tuple, and the rest of the batch is wasted work? Plenty
of (very expensive) OLAP queries have that, and many would likely benefit from
batching, so just disabling batching if there's LIMIT seems way too
heavy handed.

Perhaps it'd be good to gradually ramp up the batch size? Start with
small batches, and then make them larger. The index prefetching does
that too, indirectly - it reads the whole leaf page as a batch, but then
gradually ramps up the prefetch distance (well, read_stream does that).
Maybe the batching should have similar thing ...

In fact, how shall the optimizer decide whether to use batching? It's
one thing to decide whether a node can produce/consume batches, but
another thing is "should it"? With a node that "builds" a batch, this
decision would apply to even more plans, I guess.

I don't have a great answer to this, it seems like an incredibly tricky
costing issue. I'm a bit worried we might end up with something too
coarse, like "jit=on" which we know is causing problems (admittedly,
mostly due to a lot of the LLVM work being unpredictable/external). But
having some "adaptive" heuristics (like the gradual ramp up) might make
it less risky.

FWIW the current batch size limit (64 tuples) seems rather low, but it's
hard to say. It'd be good to be able to experiment with different
values, so I suggest we make this a GUC and not a hard-coded constant.

As for what to add to explain, I'd start by adding info about which
nodes are "batched" (consuming/producing batches), and some info about
the batch sizes. An average size, maybe a histogram if you want to be a
bit fancy.

I have no thoughts about the expression patches, at least not beyond
what I already wrote above. I don't know enough about that part.

[1]

https://www.postgresql.org/message-id/OS7PR01MB119648CA4E8502FE89056E56EEA7D2%40OS7PR01MB11964.jpnprd01.prod.outlook.com


Now, numbers from some microbenchmarks:

On 9/26/25 15:28, Amit Langote wrote:
> 
> To evaluate the overheads and benefits, I ran microbenchmarks with
> single and multi-aggregate queries on a single table, with and without
> WHERE clauses. Tables were fully VACUUMed so visibility maps are set
> and IO costs are minimal. shared_buffers was large enough to fit the
> whole table (up to 10M rows, ~43 on each page), and all pages were
> prewarmed into cache before tests. Table schema/script is at [2].
> 
> Observations from benchmarking (Detailed benchmark tables are at [3];
> below is just a high-level summary of the main patterns):
> 
> * Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
> sum(a) FROM bar_N): batching scan output alone improved latency by
> ~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
> especially once fmgr overhead was paid per batch instead of per row.
> 
> * Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
> qual interpreter gave a big step up, with latencies dropping by
> ~30-40% compared to batching=off.
> 
> * Five aggregates, no WHERE: batching input from the child scan cut
> ~15% off runtime. Adding batched transition evaluation increased
> improvements to ~30%.
> 
> * Five aggregates, with WHERE: modest gains from scan/input batching,
> but per-batch transition evaluation and batched quals brought ~20-30%
> improvement.
> 
> * Across all cases, executor overheads became visible only after IO
> was minimized. Once executor cost dominated, batching consistently
> reduced CPU time, with the largest benefits coming from avoiding
> per-row fmgr calls and evaluating quals across batches.
> 
> I would appreciate if others could try these patches with their own
> microbenchmarks or workloads and see if they can reproduce numbers
> similar to mine. Feedback on both the general direction and the
> details of the patches would be very helpful. In particular, patches
> 0001-0003, which add the basic batch APIs and integrate them into
> SeqScan, are intended to be the first candidates for review and
> eventual commit. Comments on the later, more experimental patches
> (aggregate input batching and expression evaluation (qual, aggregate
> transition) batching) are also welcome.
> 

I tried to replicate the results, but the numbers I see are not this
good. In fact, I see a fair number of regressions (and some are not
negligible).

I'm attaching the scripts I used to build the tables / run the test. I
used the same table structure, and tried to follow the same query
pattern with 1 or 5 aggregates (I used "avg"), [0, 1, 5] where
conditions (with 100% selectivity).

I measured master vs. 0001-0003 vs. 0001-0007 (with batching on/off).
And I did that on my (relatively) new ryzen machine, and an old xeon. The
behavior is quite different for the two machines, but none of them shows
such improvements. I used clang 19.0, and --with-llvm.

See the attached PDFs with a summary of the results, comparing the
results for master and the two batching branches.

The ryzen is much "smoother" - it shows almost no difference with
batching "off" (as expected). The "scan" branch (with 0001-0003) shows
an improvement of 5-10% - it's consistent, but much less than the 10-20%
you report. For the "agg" branch the benefits are much larger, but
there's also a significant regression for the largest table with 100M
rows (which is ~18GB on disk).

For xeon, the results are a bit more variable, but it affects runs both
with batching "on" and "off". The machine is just more noisy. There
seems to be a small benefit of "scan" batching (in most cases much less
than the 10-20%). The "agg" is a clear win, with up to 30-40% speedup,
and no regression similar to the ryzen.

Perhaps I did something wrong. It does not surprise me this is somewhat
CPU dependent. It's a bit sad the improvements are smaller for the newer
CPU, though.

I also tried running TPC-H. I don't have useful numbers yet, but I ran
into a segfault - see the attached backtrace. It only happens with the
batching, and only on Q22 for some reason. I initially thought it's a
bug in clang, because I saw it with clang-22 built from git, and not
with clang-14 or gcc. But since then I reproduced it with clang-19 (on
debian 13). Still could be a clang bug, of course. I've seen ~20 of
those segfaults so far, and the backtraces look exactly the same.


regards

-- 
Tomas Vondra

Attachments

Re: Batching in executor

From: Amit Langote
Date:
Hi Tomas,

Thanks a lot for your comments and benchmarking.

I plan to reply to your detailed comments and benchmark results, but I
just realized I had forgotten to attach patch 0008 (oops!) in my last
email. That patch adds batched qual evaluation.

I also noticed that the batched path was unnecessarily doing early
“batch-materialization” in cases like SELECT count(*) FROM bar. I’ve
fixed that as well. It was originally designed to avoid such
materialization, but I must have broken it while refactoring.

Attachments

Re: Batching in executor

From: Amit Langote
Date:
Hi Bruce,

On Fri, Sep 26, 2025 at 10:49 PM Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Sep 26, 2025 at 10:28:33PM +0900, Amit Langote wrote:
> > At PGConf.dev this year we had an unconference session [1] on whether
> > the community can support an additional batch executor. The discussion
> > there led me to start hacking on $subject. I have also had off-list
> > discussions on this topic in recent months with Andres and David, who
> > have offered useful thoughts.
> >
> > This patch series is an early attempt to make executor nodes pass
> > around batches of tuples instead of tuple-at-a-time slots. The main
> > motivation is to enable expression evaluation in batch form, which can
> > substantially reduce per-tuple overhead (mainly from function calls)
> > and open the door to further optimizations such as SIMD usage in
> > aggregate transition functions. We could even change algorithms of
> > some plan nodes to operate on batches when, for example, a child node
> > can return batches.
>
> For background, people might want to watch these two videos from POSETTE
> 2025.  The first video explains how data warehouse query needs are
> different from OLTP needs:
>
>         Building a PostgreSQL data warehouse
>         https://www.youtube.com/watch?v=tpq4nfEoioE
>
> and the second one explains the executor optimizations done in PG 18:
>
>         Hacking Postgres Executor For Performance
>         https://www.youtube.com/watch?v=D3Ye9UlcR5Y
>
> I learned from these two videos that to handle new workloads, I need to
> think of the query demands differently, and of course can this be
> accomplished without hampering OLTP workloads?

Thanks for pointing to those talks -- I gave the second one. :-)

Yes, the idea here is to introduce batching without adding much
overhead or new code into the OLTP path.

--
Thanks, Amit Langote



Re: Batching in executor

From: Amit Langote
Date:
On Tue, Sep 30, 2025 at 11:11 AM Amit Langote <amitlangote09@gmail.com> wrote:
> Hi Tomas,
>
> Thanks a lot for your comments and benchmarking.
>
> I plan to reply to your detailed comments and benchmark results

For now, I reran a few benchmarks with the master branch as an
explicit baseline, since Tomas reported possible regressions with
executor_batching=off. I can reproduce that on my side:

5 aggregates, no where:
select avg(a), avg(b), avg(c), avg(d), avg(e) from bar;

parallel_workers=0, jit=off
Rows    master    batching off    batching on    master vs off    master vs on
1M      47.118    48.545          39.531         +3.0%            -16.1%
2M      95.098    97.241          80.189         +2.3%            -15.7%
3M      141.821   148.540         122.005        +4.7%            -14.0%
4M      188.969   197.056         163.779        +4.3%            -13.3%
5M      240.113   245.902         213.645        +2.4%            -11.0%
10M     556.738   564.120         486.359        +1.3%            -12.6%

parallel_workers=2, jit=on
Rows    master    batching off    batching on    master vs off    master vs on
1M      21.147    22.278          20.737         +5.3%            -1.9%
2M      40.319    41.509          37.851         +3.0%            -6.1%
3M      61.582    63.026          55.927         +2.3%            -9.2%
4M      96.363    95.245          78.494         -1.2%            -18.5%
5M      117.226   117.649         97.968         +0.4%            -16.4%
10M     245.503   246.896         196.335        +0.6%            -20.0%

1 aggregate, no where:
select count(*) from bar;

parallel_workers=0, jit=off
Rows    master    batching off    batching on    master vs off    master vs on
1M      17.071    20.135          6.698          +17.9%           -60.8%
2M      36.905    41.522          15.188         +12.5%           -58.9%
3M      56.094    63.110          23.485         +12.5%           -58.1%
4M      74.299    83.912          32.950         +12.9%           -55.7%
5M      94.229    108.621         41.338         +15.2%           -56.1%
10M     234.425   261.490         117.833        +11.6%           -49.7%

parallel_workers=2, jit=on
Rows    master    batching off    batching on    master vs off    master vs on
1M      8.820     9.832           5.324          +11.5%           -39.6%
2M      16.368    18.001          9.526          +10.0%           -41.8%
3M      24.810    28.193          14.482         +13.6%           -41.6%
4M      34.369    35.741          23.212         +4.0%            -32.5%
5M      41.595    45.103          27.918         +8.4%            -32.9%
10M     99.494    112.226         94.081         +12.8%           -5.4%

The regression is more noticeable in the single aggregate case, where
more time is spent in scanning.

Looking into it.

--
Thanks, Amit Langote



Re: Batching in executor

From: Amit Langote
Date:
Hi,

On Mon, Sep 29, 2025 at 8:01 PM Tomas Vondra <tomas@vondra.me> wrote:
> I also tried running TPC-H. I don't have useful numbers yet, but I ran
> into a segfault - see the attached backtrace. It only happens with the
> batching, and only on Q22 for some reason. I initially thought it's a
> bug in clang, because I saw it with clang-22 built from git, and not
> with clang-14 or gcc. But since then I reproduced it with clang-19 (on
> debian 13). Still could be a clang bug, of course. I've seen ~20 of
> those segfaults so far, and the backtraces look exactly the same.

I can reproduce the Q22 segfault with clang-17 on macOS and the
attached patch 0009 fixes it.

The issue I observed is that two EEOPs both called the same helper,
and that helper re-peeked ExecExprEvalOp(op) to choose its path. In
this particular build, the two EEOP cases in ExecInterpExpr() compiled
to identical code, so their dispatch labels ended up with the same
address (reverse_dispatch_table logging in ExecInitInterpreter() showed
the duplicate). Because ExecEvalStepOp() maps by label address, the
reverse lookup could yield the other EEOP: I saw ExprInit select the
ROWLOOP EEOP while the ExprExec-time helper observed the DIRECT EEOP
and ran code for it, which then crashed.

In 0009 (the fix), I split the helper into two functions, one per
EEOP, so the helper does not re-derive the opcode; with that change I
cannot reproduce the crash on macOS clang-17.
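
For the record, the label aliasing is easy to reproduce outside the
executor. In the following standalone sketch (GCC/clang computed goto,
nothing from the patch), an optimizing build may fold the two identical
cases so that both label addresses compare equal, which is exactly what
breaks an address-keyed reverse lookup:

/*
 * Standalone sketch: two computed-goto cases whose bodies are identical
 * -- here because the shared helper re-derives its behavior from "op" --
 * may be folded by the optimizer, leaving both labels at the same
 * address.  An address-keyed reverse lookup then cannot tell them apart.
 */
#include <stdio.h>

static int
shared_helper(int op)
{
    return op * 2;              /* stand-in for the shared EEOP helper */
}

static int
dispatch_op(int op, int print_labels)
{
    static const void *const dispatch[] = { &&CASE_DIRECT, &&CASE_ROWLOOP };

    if (print_labels)
    {
        printf("DIRECT=%p ROWLOOP=%p\n",
               (void *) dispatch[0], (void *) dispatch[1]);
        return -1;
    }

    goto *dispatch[op];

CASE_DIRECT:
    return shared_helper(op);   /* identical body ... */
CASE_ROWLOOP:
    return shared_helper(op);   /* ... may fold to the same label address */
}

int
main(void)
{
    dispatch_op(0, 1);          /* print the label addresses */
    return dispatch_op(1, 0);
}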

--
Thanks, Amit Langote

Attachments

Re: Batching in executor

From: Amit Langote
Date:
Hi Tomas,

On Mon, Sep 29, 2025 at 8:01 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi Amit,
>
> Thanks for the patch. I took a look over the weekend, and done a couple
> experiments / benchmarks, so let me share some initial feedback (or
> rather a bunch of questions I came up with).

Thank you for reviewing the patch and taking the time to run those
experiments. I appreciate the detailed feedback and questions.  I also
apologize for my late reply, I spent perhaps way too much time going
over your index prefetching thread trying to understand the notion of
batching that it uses and getting sidelined by other things while
writing this reply.

> I'll start with some general thoughts, before going into some nitpicky
> comments about patches / code and perf results.
>
> I think the general goal of the patch - reducing the per-tuple overhead
> and making the executor more efficient for OLAP workloads - is very
> desirable. I believe the limitations of per-row executor are one of the
> reasons why attempts to implement a columnar TAM mostly failed. The
> compression is nice, but it's hard to be competitive without an executor
> that leverages that too. So starting with an executor, in a way that
> helps even heap, seems like a good plan. So +1 to this.

I'm happy to hear that you find the overall direction worthwhile.

> While looking at the patch, I couldn't help but think about the index
> prefetching stuff that I work on. It also introduces the concept of a
> "batch", for passing data between an index AM and the executor. It's
> interesting how different the designs are in some respects. I'm not
> saying one of those designs is wrong, it's more due different goals.
>
> For example, the index prefetching patch establishes a "shared" batch
> struct, and the index AM is expected to fill it with data. After that,
> the batch is managed entirely by indexam.c, with no AM calls. The only
> AM-specific bit in the batch is "position", but that's used only when
> advancing to the next page, etc.
>
> This patch does things differently. IIUC, each TAM may produce it's own
> "batch", which is then wrapped in a generic one. For example, heap
> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> is fine. In the prefetching we chose to move all this code (walking the
> batch items) from the AMs into the layer above, and make it AM agnostic.

Yes, the design of this patch does differ from the index prefetching
approach, and that’s largely due to the differing goals as you say.
AIUI, the index prefetching patch uses a shared batch structure
managed mostly by indexam.c and populated by the index AM.  In my
patch, each table AM produces its own batch format that gets wrapped
in a generic TupleBatch which contains the AM-specified TupleBatchOps
for operations on the AM's opaque data. This was a conscious choice:
in prefetching, the aim seems to be to make indexam.c manage batches
and operations based on it in a mostly AM-agnostic manner.  But for
executor batching, the aim is to retain TAM-specific optimizations as
much as possible and rely on the TAM for most operations on the batch
contents. Both designs have their merits given their respective use
cases, but I guess you understand that very well.

> But for the batching, we want to retain the custom format as long as
> possible. Presumably, the various advantages of the TAMs are tied to the
> custom/columnar storage format. Memory efficiency thanks to compression,
> execution on compressed data, etc. Keeping the custom format as long as
> possible is the whole point of "late materialization" (and materializing
> as late as possible is one of the important details in column stores).

Exactly -- keeping the TAM-specific batch format as long as possible
is a key goal here. As you noted, the benefits of a custom storage
format (compression, operating on compressed data, etc.) are best
realized when we delay materialization until absolutely necessary.  I
want to design this patch so that each TAM can produce and use its own
batch representation internally, wrapping it only when interfacing
with the executor in a generic way.  I admit that's not entirely true
of the patch as it stands, as I explain below.

> How far ahead have you though about these capabilities? I was wondering
> about two things in particular. First, at which point do we have to
> "materialize" the TupleBatch into some generic format (e.g. TupleSlots).
> I get it that you want to enable passing batches between nodes, but
> would those use the same "format" as the underlying scan node, or some
> generic one? Second, will it be possible to execute expressions on the
> custom batches (i.e. on "compressed data")? Or is it necessary to
> "materialize" the batch into regular tuple slots? I realize those may
> not be there "now" but maybe it'd be nice to plan for the future.

I have been thinking about those future capabilities. Currently, the
patch keeps tuples in the TAM-specific batch format up until they need
to be consumed by a node that doesn’t understand that format or has
not been modified to invoke the TAM callbacks to decode it.  In the
current patch, that means we materialize to regular TupleTableSlots at
nodes that require it (for example, the scan node reading from TAM
needing to evaluate quals, etc.). However, the intention is to allow
batches to be passed through as many nodes as possible without
materialization, ideally using the same format produced by the scan
node all the way up until reaching a node that can only work with
tuples in TupleTableSlots.

As for executing expressions directly on the custom batch data: that’s
something I would like to enable in the future. Right now, expressions
(quals, projections, etc.) are evaluated after materializing into
normal tuples in TupleTableSlots stored in TupleBatch, because the
expression evaluation code isn’t yet totally batch-aware or is very
from doing things like operate on compressed data in its native form.
Patches 0004-0008 do try to add batch-aware expression evaluation but
that's just a prototype.  In the long term, the goal is to allow
expression evaluation on batch data (for example, applying a WHERE
clause or aggregate transition directly on a columnar batch without
converting it to heap tuples first). This will require significant new
infrastructure (perhaps specialized batch-aware expression operators
and functions), so it's not in the current patch, but I agree it's
important to plan for it. The current design doesn’t preclude it, it
lays some groundwork by introducing the batch abstraction -- but fully
supporting that will be future work.

That said, one area I’d like to mention while at it, especially to
enable native execution on compressed or columnar batches, is giving
the table AM more control over how expression evaluation is performed
on its batch data. In the current patch, the AM can provide a
materialize function via TupleBatchOps, but that always produces an
array of TupleTableSlots stored in the TupleBatch, not an opaque
representation that remains under AM control. Maybe that's not bad for
a v1 patch.  When evaluating expressions over a batch, a BatchVector
is built by looping over these slots and invoking the standard
per-tuple getsomeattrs() to "deform" a tuple into needed columns.
While that enables batch-style EEOPs for qual evaluation and aggregate
transition (and is already a gain over per-row evaluation), it misses
the opportunity to leverage any batch-specific optimizations the AM
could offer, such as vectorized decoding or filtering over compressed
data, and other AM optimizations for getting only the necessary
columns out, possibly in a vector format.

I’m considering extending TupleTableSlotOps with a batch-aware variant
of getsomeattrs(), something like slot_getsomeattrs_batch(), so that
AMs can populate column vectors (e.g., BatchVector) directly from
their native format. That would allow bypassing slot materialization
entirely and plug AM-provided decoding logic directly into the
executor’s batch expression paths. This isn’t implemented yet, but I
see it as a necessary step toward supporting fully native expression
evaluation over compressed or columnar formats. I’m not yet sure if
TupleTableSlotOps is the right place for such a hook, it might belong
elsewhere in the abstraction, but exposing a batch-aware interface for
this purpose seems like the right direction.
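
To make that concrete, the hook I have in mind would look something like
the following; this is purely a sketch of a possible signature, with an
illustrative BatchVector layout that need not match what the patch
currently builds:

/* Illustrative column-vector layout; not necessarily the patch's BatchVector. */
typedef struct BatchVector
{
    int         nrows;
    Datum      *values;         /* one datum per row for a given attribute */
    bool       *isnull;
} BatchVector;

/*
 * Hypothetical batch-aware counterpart of getsomeattrs(): the AM deforms
 * attributes 1..natts for every row in the batch directly from its native
 * (possibly compressed) representation into per-column vectors.
 */
typedef void (*SlotGetSomeAttrsBatch) (struct TupleBatch *batch,
                                       int natts,
                                       BatchVector *columns);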

> It might be worth exploring some columnar formats, and see if this
> design would be a good fit. Let's say we want to process data read from
> a parquet file. Would we be able to leverage the format, or would we
> need to "materialize" into slots too early? Or maybe it'd be good to
> look at the VCI extension [1], discussed in a nearby thread. AFAICS
> that's still based on an index AM, but there were suggestions to use TAM
> instead (and maybe that'd be a better choice).

Yeah, looking at columnar TAMs or FDWs is on my list. I do think the
design should be able to accommodate true columnar formats like
Parquet. If we had a table AM (or FDW) that reads Parquet files into a
columnar batch structure, the executor batching framework should
ideally allow us to pass that batch along without immediately
materializing to tuples.  As mentioned before, we might have to adjust
or extend the TupleBatch abstraction to handle a wider variety of
batch formats, but conceptually it fits -- the goal is to avoid
forcing early materialization. I will definitely keep the Parquet
use-case in mind and perhaps do some experiments with a columnar
source to ensure we aren’t baking in any unnecessary materialization.
Also, thanks for the reference to the VCI extension thread; I'll take
a look at that.

> The other option would be to "create batches" during execution, say by
> having a new node that accumulates tuples, builds a batch and sends it
> to the node above. This would help both in cases when either the lower
> node does not produce batches at all, or the batches are too small (due
> to filtering, aggregation, ...). Or course, it'd only win if this
> increases efficiency of the upper part of the plan enough to pay for
> building the batches. That can be a hard decision.

Yes, introducing a dedicated executor node to accumulate and form
batches on the fly is an interesting idea, I have thought about it and
even mentioned it in passing in the pgconf.dev unconference. This
could indeed cover scenarios where the data source (a node) doesn't
produce batches (e.g., a non-batching node feeding into a
batching-aware upper node) or where batches coming from below are too
small to be efficient. The current patch set doesn’t implement such a
node; I focused on enabling batching at the scan/TAM level first. The
cost/benefit decision for a batch-aggregator node is tricky, as you
said. We’d need a way to decide when the overhead of gathering tuples
into a batch is outweighed by the benefits to the upper node. This
likely ties into costing or adaptive execution decisions. It's
something I’m open to exploring in a future iteration, perhaps once we
have more feedback on how the existing batching performs in various
scenarios. It might also require some planner or executor smarts
(maybe the executor can decide to batch on the fly if it sees a
pattern of use, or the planner could insert such nodes when
beneficial).

> You also mentioned we could make batches larger by letting them span
> multiple pages, etc. I'm not sure that's worth it - wouldn't that
> substantially complicate the TAM code, which would need to pin+track
> multiple buffers for each batch, etc.? Possible, but is it worth it?
>
> I'm not sure allowing multi-page batches would actually solve the issue.
> It'd help with batches at the "scan level", but presumably the batch
> size in the upper nodes matters just as much. Large scan batches may
> help, but hard to predict.
>
> In the index prefetching patch we chose to keep batches 1:1 with leaf
> pages, at least for now. Instead we allowed having multiple batches at
> once. I'm not sure that'd be necessary for TAMs, though.

I tend to agree with you here. Allowing a single batch to span
multiple pages would add quite a bit of complexity to the table AM
implementations (managing multiple buffer pins per batch, tracking
page boundaries, etc.), and it's unclear if the benefit would justify
that complexity. For now, I'm inclined not to pursue multi-page
batches at the scan level in this patch. We can keep the batch
page-local (e.g., for heap, one batch corresponds to max one page, as
it does now). If we need larger batch sizes overall, we might address
that by other means -- for example, by the above-mentioned idea of a
higher-level batching node or by simply producing multiple batches in
quick succession.

You’re right that even if we made scan batches larger, it doesn’t
necessarily solve everything, since the effective batch size at
higher-level nodes could still be constrained by other factors. So
rather than complicating the low-level TAM code with multi-page
batches, I'd prefer to first see if the current approach (with
one-page batches) yields good benefits and then consider alternatives.
We could also consider letting a scan node produce multiple batches
before yielding to the upper node (similar to how the index
prefetching patch can have multiple leaf page batches in flight) if
needed, but as you note, it might not be necessary for TAMs yet. So at
this stage, I'll keep it simple.

> This also reminds me of LIMIT queries. The way I imagine a "batchified"
> executor to work is that batches are essentially "units of work". For
> example, a nested loop would grab a batch of tuples from the outer
> relation, lookup inner tuples for the whole batch, and only then pass
> the result batch. (I'm ignoring the cases when the batch explodes due to
> duplicates.)
>
> But what if there's a LIMIT 1 on top? Maybe it'd be enough to process
> just the first tuple, and the rest of the batch is wasted work? Plenty
> of (very expensive) OLAP have that, and many would likely benefit from
> batching, so just disabling batching if there's LIMIT seems way too
> heavy handed.

Yeah, LIMIT does complicate downstream batching decisions. If we
always use a full-size batch (say 64 tuples) for every operation, a
query with LIMIT 1 could end up doing a lot of unnecessary work
fetching and processing 63 tuples that never get used. Disabling
batching entirely for queries with LIMIT would indeed be overkill and
lose benefits for cases where the limit is not extremely selective.

> Perhaps it'd be good to gradually ramp up the batch size? Start with
> small batches, and then make them larger. The index prefetching does
> that too, indirectly - it reads the whole leaf page as a batch, but then
> gradually ramps up the prefetch distance (well, read_stream does that).
> Maybe the batching should have similar thing ...

An adaptive batch size that ramps up makes a lot of sense as a
solution. We could start with a very small batch (say 4 tuples) and if
we detect that the query needs more (e.g., the LIMIT wasn’t satisfied
yet or more output is still being consumed), then increase the batch
size for subsequent operations. This way, a query that stops early
doesn’t incur the full batching overhead, whereas a query that does
process lots of tuples will gradually get to a larger batch size to
gain efficiency. This is analogous to how the index prefetching ramps
up prefetch distance, as you mentioned.

Implementing that will require some careful thought. It could be done
either in the planner (choose initial batch sizes based on context
like LIMIT) or more dynamically in the executor (adjust on the fly). I
lean towards a runtime heuristic because it’s hard for the planner to
predict exactly how a LIMIT will play out, especially in complex
plans. In any case, I agree that a gradual ramp-up or other adaptive
approach would make batching more robust in the presence of query
execution variability. I will definitely consider adding such logic,
perhaps as an improvement once the basic framework is in.
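
A minimal version of that heuristic could look like the sketch below
(names and the doubling policy are placeholders, not part of the patch):
start small, and grow the batch only while the consumer keeps draining
full batches.

/* Illustrative ramp-up policy; Min() is the usual c.h macro. */
static inline int
next_batch_size(int current, int rows_consumed, int max_batch_size)
{
    /* the whole batch was consumed: assume growing is worthwhile */
    if (rows_consumed >= current && current < max_batch_size)
        return Min(current * 2, max_batch_size);

    /* consumer stopped early (e.g. LIMIT was satisfied): stay small */
    return current;
}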

> In fact, how shall the optimizer decide whether to use batching? It's
> one thing to decide whether a node can produce/consume batches, but
> another thing is "should it"? With a node that "builds" a batch, this
> decision would apply to even more plans, I guess.
>
> I don't have a great answer to this, it seems like an incredibly tricky
> costing issue. I'm a bit worried we might end up with something too
> coarse, like "jit=on" which we know is causing problems (admittedly,
> mostly due to a lot of the LLVM work being unpredictable/external). But
> having some "adaptive" heuristics (like the gradual ramp up) might make
> it less risky.

I agree that deciding when to use batching is tricky. So far, the
patch takes a fairly simplistic approach: if a node (particularly a
scan node) supports batching, it just does it, and other parts of the
plan will consume batches if they are capable. There isn’t yet a
nuanced cost-based decision in the planner for enabling batching. This
is indeed something we’ll have to refine. We don’t want to end up with
a blunt on/off GUC that could cause regressions in some cases.

One idea is to introduce costing for batching: for example, estimate
the per-tuple savings from batching vs the overhead of materialization
or batch setup. However, developing a reliable cost model for that
will take time and experimentation, especially with the possibility of
variable batch sizes or adaptive behavior. Not to mention that it would
add one more dimension to the planner's costing model, making planning
more expensive and less predictable.  In the near term, I’m fine
with relying on feedback and perhaps manual tuning (GUCs, etc.) to
decide on batching, but that’s perhaps not a long-term solution.

I share your inclination that adaptive heuristics might be the safer
path initially. Perhaps the executor can decide to batch or not batch
based on runtime conditions. The gradual ramp-up of batch size is one
such adaptive approach. We could also consider things like monitoring
how effective batching is (are we actually processing full batches or
frequently getting cut off?) and adjust behavior. These are somewhat
speculative ideas at the moment, but the bottom line is I’m aware we
need a smarter strategy than a simple switch. This will likely evolve
as we test the patch in more scenarios.

> FWIW the current batch size limit (64 tuples) seems rather low, but it's
> hard to say. It'd be good to be able to experiment with different
> values, so I suggest we make this a GUC and not a hard-coded constant.

Yeah, I was thinking the same while testing -- the optimal batch size
might vary by workload or hardware, and 64 was a somewhat arbitrary
starting point. I will make the batch size limit configurable
(probably as a GUC executor_batch_tuples, maybe only developer-focused
at first). That will let us and others experiment easily with
different batch sizes to see how it affects performance. It should
also help with your earlier point: for example, on a machine where 64
is too low or too high, we can adjust it without recompiling. So yes,
I'll add a GUC for the batch size in the next version of the patch.
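
For reference, wiring that up would be a small entry in guc_tables.c along
these lines (the category, bounds, and flags below are placeholders I
haven't settled on):

{
    {"executor_batch_tuples", PGC_USERSET, DEVELOPER_OPTIONS,
        gettext_noop("Sets the maximum number of tuples per executor batch."),
        NULL,
        GUC_NOT_IN_SAMPLE
    },
    &executor_batch_tuples,
    64, 1, 8192,
    NULL, NULL, NULL
},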

> As for what to add to explain, I'd start by adding info about which
> nodes are "batched" (consuming/producing batches), and some info about
> the batch sizes. An average size, maybe a histogram if you want to be a
> bit fancy.

Adding more information to EXPLAIN is a good idea. In the current
patch, EXPLAIN does not show anything about batching, but it would be
very helpful for debugging and user transparency to indicate which
nodes are operating in batch mode.  I will update EXPLAIN to mark
nodes that produce or consume batches. Likely I’ll start with
something simple like an extra line or tag for a node, e.g., "Batch:
true (avg batch size 64)" or something along those lines. An average
batch size could be computed if we have instrumentation, which would
be useful to see if, say, the batch sizes ended up smaller due to
LIMIT or other factors. A full histogram might be more detail than
most users need, but I agree even just knowing average or maximum
batch size per node could be useful for performance analysis. I'll
implement at least the basics for now, and we can refine it (maybe add
more stats) if needed.
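
Roughly, per node it could boil down to a couple of ExplainProperty*
calls like the following (the instrumentation fields nbatches and
nbatchrows are placeholders; only the explain.c helpers are existing API):

if (es->analyze && nbatches > 0)
{
    ExplainPropertyBool("Batched", true, es);
    ExplainPropertyFloat("Avg Batch Size", NULL,
                         (double) nbatchrows / nbatches, 1, es);
}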

(I had added a flag in the EXPLAIN output at one point, but removed it
due to finding the regression output churn too noisy, though I
understand I'll have to bite the bullet at some point.)

> Now, numbers from some microbenchmarks:
>
> On 9/26/25 15:28, Amit Langote wrote:
> > To evaluate the overheads and benefits, I ran microbenchmarks with
> > single and multi-aggregate queries on a single table, with and without
> > WHERE clauses. Tables were fully VACUUMed so visibility maps are set
> > and IO costs are minimal. shared_buffers was large enough to fit the
> > whole table (up to 10M rows, ~43 on each page), and all pages were
> > prewarmed into cache before tests. Table schema/script is at [2].
> >
> > Observations from benchmarking (Detailed benchmark tables are at [3];
> > below is just a high-level summary of the main patterns):
> >
> > * Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
> > sum(a) FROM bar_N): batching scan output alone improved latency by
> > ~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
> > especially once fmgr overhead was paid per batch instead of per row.
> >
> > * Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
> > qual interpreter gave a big step up, with latencies dropping by
> > ~30-40% compared to batching=off.
> >
> > * Five aggregates, no WHERE: batching input from the child scan cut
> > ~15% off runtime. Adding batched transition evaluation increased
> > improvements to ~30%.
> >
> > * Five aggregates, with WHERE: modest gains from scan/input batching,
> > but per-batch transition evaluation and batched quals brought ~20-30%
> > improvement.
> >
> > * Across all cases, executor overheads became visible only after IO
> > was minimized. Once executor cost dominated, batching consistently
> > reduced CPU time, with the largest benefits coming from avoiding
> > per-row fmgr calls and evaluating quals across batches.
> >
> > I would appreciate if others could try these patches with their own
> > microbenchmarks or workloads and see if they can reproduce numbers
> > similar to mine. Feedback on both the general direction and the
> > details of the patches would be very helpful. In particular, patches
> > 0001-0003, which add the basic batch APIs and integrate them into
> > SeqScan, are intended to be the first candidates for review and
> > eventual commit. Comments on the later, more experimental patches
> > (aggregate input batching and expression evaluation (qual, aggregate
> > transition) batching) are also welcome.
> >
>
> I tried to replicate the results, but the numbers I see are not this
> good. In fact, I see a fair number of regressions (and some are not
> negligible).
>
> I'm attaching the scripts I used to build the tables / run the test. I
> used the same table structure, and tried to follow the same query
> pattern with 1 or 5 aggregates (I used "avg"), [0, 1, 5] where
> conditions (with 100% selectivity).
>
> I measured master vs. 0001-0003 vs. 0001-0007 (with batching on/off).
> And I did that on my (relatively) new ryzen machine, and old xeon. The
> behavior is quite different for the two machines, but none of them shows
> such improvements. I used clang 19.0, and --with-llvm.
>
> See the attached PDFs with a summary of the results, comparing the
> results for master and the two batching branches.
>
> The ryzen is much "smoother" - it shows almost no difference with
> batching "off" (as expected). The "scan" branch (with 0001-0003) shows
> an improvement of 5-10% - it's consistent, but much less than the 10-20%
> you report. For the "agg" branch the benefits are much larger, but
> there's also a significant regression for the largest table with 100M
> rows (which is ~18GB on disk).
>
> For xeon, the results are a bit more variable, but it affects runs both
> with batching "on" and "off". The machine is just more noisy. There
> seems to be a small benefit of "scan" batching (in most cases much less
> than the 10-20%). The "agg" is a clear win, with up to 30-40% speedup,
> and no regression similar to the ryzen.
>
> Perhaps I did something wrong. It does not surprise me this is somewhat
> CPU dependent. It's a bit sad the improvements are smaller for the newer
> CPU, though.

Thanks for sharing your benchmark results -- that’s very useful data.
I haven’t yet finished investigating why there's a regression relative
to master when executor_batching is turned off. I re-ran my benchmarks
to include comparisons with master and did observe some regressions in
a few cases too, but I didn't see anything obvious in profiles that
explained the slowdown. I initially assumed it might be noise, but now
I suspect it could be related to structural changes in the scan code
-- for example, I added a few new fields in the middle of
HeapScanDescData, and even though the batching logic is bypassed when
executor_batching is off, it’s possible that change alone affects
memory layout or cache behavior in a way that penalizes the unbatched
path. I haven’t confirmed that yet, but it’s on my list to look into
more closely.

Your observation that newer CPUs like the Ryzen may see smaller
improvements makes sense -- perhaps they handle the per-tuple overhead
more efficiently to begin with. Still, I’d prefer not to see
regressions at all, even in the unbatched case, so I’ll focus on
understanding and fixing that part before drawing conclusions from the
performance data.

Thanks again for the scripts -- those will help a lot in narrowing things down.

> I also tried running TPC-H. I don't have useful numbers yet, but I ran
> into a segfault - see the attached backtrace. It only happens with the
> batching, and only on Q22 for some reason. I initially thought it's a
> bug in clang, because I saw it with clang-22 built from git, and not
> with clang-14 or gcc. But since then I reproduced it with clang-19 (on
> debian 13). Still could be a clang bug, of course. I've seen ~20 of
> those segfaults so far, and the backtraces look exactly the same.

The v3 I posted fixes a tricky bug in the new EEOPs for batched-agg
evaluation that I suspect is also causing the crash you saw.

I'll try to post a v4 in a couple of weeks with some of the things I
mentioned above.

--
Thanks, Amit Langote



Re: Batching in executor

От
Tomas Vondra
Дата:
On 10/27/25 08:24, Amit Langote wrote:
> Hi Tomas,
> 
> On Mon, Sep 29, 2025 at 8:01 PM Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hi Amit,
>>
>> Thanks for the patch. I took a look over the weekend, and done a couple
>> experiments / benchmarks, so let me share some initial feedback (or
>> rather a bunch of questions I came up with).
> 
> Thank you for reviewing the patch and taking the time to run those
> experiments. I appreciate the detailed feedback and questions.  I also
> apologize for my late reply, I spent perhaps way too much time going
> over your index prefetching thread trying to understand the notion of
> batching that it uses and getting sidelined by other things while
> writing this reply.
> 

Cool! Now you can do a review of the index prefetch patch ;-)

>> I'll start with some general thoughts, before going into some nitpicky
>> comments about patches / code and perf results.
>>
>> I think the general goal of the patch - reducing the per-tuple overhead
>> and making the executor more efficient for OLAP workloads - is very
>> desirable. I believe the limitations of per-row executor are one of the
>> reasons why attempts to implement a columnar TAM mostly failed. The
>> compression is nice, but it's hard to be competitive without an executor
>> that leverages that too. So starting with an executor, in a way that
>> helps even heap, seems like a good plan. So +1 to this.
> 
> I'm happy to hear that you find the overall direction worthwhile.
> 
>> While looking at the patch, I couldn't help but think about the index
>> prefetching stuff that I work on. It also introduces the concept of a
>> "batch", for passing data between an index AM and the executor. It's
>> interesting how different the designs are in some respects. I'm not
>> saying one of those designs is wrong, it's more due to different goals.
>>
>> For example, the index prefetching patch establishes a "shared" batch
>> struct, and the index AM is expected to fill it with data. After that,
>> the batch is managed entirely by indexam.c, with no AM calls. The only
>> AM-specific bit in the batch is "position", but that's used only when
>> advancing to the next page, etc.
>>
>> This patch does things differently. IIUC, each TAM may produce its own
>> "batch", which is then wrapped in a generic one. For example, heap
>> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
>> is fine. In the prefetching we chose to move all this code (walking the
>> batch items) from the AMs into the layer above, and make it AM agnostic.
> 
> ...
> 
>> But for the batching, we want to retain the custom format as long as
>> possible. Presumably, the various advantages of the TAMs are tied to the
>> custom/columnar storage format. Memory efficiency thanks to compression,
>> execution on compressed data, etc. Keeping the custom format as long as
>> possible is the whole point of "late materialization" (and materializing
>> as late as possible is one of the important details in column stores).
> 
> Exactly -- keeping the TAM-specific batch format as long as possible
> is a key goal here. As you noted, the benefits of a custom storage
> format (compression, operating on compressed data, etc.) are best
> realized when we delay materialization until absolutely necessary.  I
> want to design this patch so that each TAM can produce and use its own
> batch representation internally, only wrapping it when interfacing
> with the executor in a generic way.  I admit that's not entirely true
> of the patch as it stands, as I explain below.
> 

Understood. Makes sense in general.

>> How far ahead have you thought about these capabilities? I was wondering
>> about two things in particular. First, at which point do we have to
>> "materialize" the TupleBatch into some generic format (e.g. TupleSlots).
>> I get it that you want to enable passing batches between nodes, but
>> would those use the same "format" as the underlying scan node, or some
>> generic one? Second, will it be possible to execute expressions on the
>> custom batches (i.e. on "compressed data")? Or is it necessary to
>> "materialize" the batch into regular tuple slots? I realize those may
>> not be there "now" but maybe it'd be nice to plan for the future.
> 
> I have been thinking about those future capabilities. Currently, the
> patch keeps tuples in the TAM-specific batch format up until they need
> to be consumed by a node that doesn’t understand that format or has
> not been modified to invoke the TAM callbacks to decode it.  In the
> current patch, that means we materialize to regular TupleTableSlots at
> nodes that require it (for example, the scan node reading from TAM
> needing to evaluate quals, etc.). However, the intention is to allow
> batches to be passed through as many nodes as possible without
> materialization, ideally using the same format produced by the scan
> node all the way up until reaching a node that can only work with
> tuples in TupleTableSlots.
> 
> As for executing expressions directly on the custom batch data: that’s
> something I would like to enable in the future. Right now, expressions
> (quals, projections, etc.) are evaluated after materializing into
> normal tuples in TupleTableSlots stored in TupleBatch, because the
> expression evaluation code isn’t yet totally batch-aware and is very far
> from being able to do things like operate on compressed data in its
> native form.
> Patches 0004-0008 do try to add batch-aware expression evaluation but
> that's just a prototype.  In the long term, the goal is to allow
> expression evaluation on batch data (for example, applying a WHERE
> clause or aggregate transition directly on a columnar batch without
> converting it to heap tuples first). This will require significant new
> infrastructure (perhaps specialized batch-aware expression operators
> and functions), so it's not in the current patch, but I agree it's
> important to plan for it. The current design doesn’t preclude it, it
> lays some groundwork by introducing the batch abstraction -- but fully
> supporting that will be future work.
> 
> That said, one area I’d like to mention while at it, especially to
> enable native execution on compressed or columnar batches, is giving
> the table AM more control over how expression evaluation is performed
> on its batch data. In the current patch, the AM can provide a
> materialize function via TupleBatchOps, but that always produces an
> array of TupleTableSlots stored in the TupleBatch, not an opaque
> representation that remains under AM control. Maybe that's not bad for
> a v1 patch.

I think materializing into a batch of TupleTableSlots (and then doing
the regular expression evaluation) seems perfectly fine for v1. It's the
simplest fallback possible, and we'll need it anyway if overriding the
expression evaluation will be optional (which I assume it will be?).

> When evaluating expressions over a batch, a BatchVector
> is built by looping over these slots and invoking the standard
> per-tuple getsomeattrs() to "deform" a tuple into needed columns.
> While that enables batch-style EEOPs for qual evaluation and aggregate
> transition (and is already a gain over per-row evaluation), it misses
> the opportunity to leverage any batch-specific optimizations the AM
> could offer, such as vectorized decoding or filtering over compressed
> data, and other AM optimizations for getting only the necessary
> columns out possibly in a vector format.
> 

I'm not sure about this BatchVector thing. I haven't looked into that
very much, I'd expect the construction to be more expensive than the
benefits (compared to just doing the materialize + regular evaluation),
but maybe I'm completely wrong. Or maybe we could keep the vector
representation for multiple operations? No idea.

But it seems like a great area for experimenting ...

> I’m considering extending TupleTableSlotOps with a batch-aware variant
> of getsomeattrs(), something like slot_getsomeattrs_batch(), so that
> AMs can populate column vectors (e.g., BatchVector) directly from
> their native format. That would allow bypassing slot materialization
> entirely and plug AM-provided decoding logic directly into the
> executor’s batch expression paths. This isn’t implemented yet, but I
> see it as a necessary step toward supporting fully native expression
> evaluation over compressed or columnar formats. I’m not yet sure if
> TupleTableSlotOps is the right place for such a hook, it might belong
> elsewhere in the abstraction, but exposing a batch-aware interface for
> this purpose seems like the right direction.
> 

No opinion. I don't see it as a necessary prerequisite for the other
parts of the patch series, but maybe the BatchVector really helps, and
then this would make perfect sense. I'm not sure there's a single
"correct" sequence in which to do these improvements, it's always a
matter of opinion.

>> It might be worth exploring some columnar formats, and see if this
>> design would be a good fit. Let's say we want to process data read from
>> a parquet file. Would we be able to leverage the format, or would we
>> need to "materialize" into slots too early? Or maybe it'd be good to
>> look at the VCI extension [1], discussed in a nearby thread. AFAICS
>> that's still based on an index AM, but there were suggestions to use TAM
>> instead (and maybe that'd be a better choice).
> 
> Yeah, looking at columnar TAMs or FDWs is on my list. I do think the
> design should be able to accommodate true columnar formats like
> Parquet. If we had a table AM (or FDW) that reads Parquet files into a
> columnar batch structure, the executor batching framework should
> ideally allow us to pass that batch along without immediately
> materializing to tuples.  As mentioned before, we might have to adjust
> or extend the TupleBatch abstraction to handle a wider variety of
> batch formats, but conceptually it fits -- the goal is to avoid
> forcing early materialization. I will definitely keep the Parquet
> use-case in mind and perhaps do some experiments with a columnar
> source to ensure we aren’t baking in any unnecessary materialization.
> Also, thanks for the reference to the VCI extension thread; I'll take
> a look at that.
> 

+1 I think having a TAM/FDW reading those established and common formats
is a good way to validate the overall design.

>> The other option would be to "create batches" during execution, say by
>> having a new node that accumulates tuples, builds a batch and sends it
>> to the node above. This would help both in cases when either the lower
>> node does not produce batches at all, or the batches are too small (due
>> to filtering, aggregation, ...). Of course, it'd only win if this
>> increases efficiency of the upper part of the plan enough to pay for
>> building the batches. That can be a hard decision.
> 
> Yes, introducing a dedicated executor node to accumulate and form
> batches on the fly is an interesting idea, I have thought about it and
> even mentioned it in passing in the pgconf.dev unconference. This
> could indeed cover scenarios where the data source (a node) doesn't
> produce batches (e.g., a non-batching node feeding into a
> batching-aware upper node) or where batches coming from below are too
> small to be efficient. The current patch set doesn’t implement such a
> node; I focused on enabling batching at the scan/TAM level first. The
> cost/benefit decision for a batch-aggregator node is tricky, as you
> said. We’d need a way to decide when the overhead of gathering tuples
> into a batch is outweighed by the benefits to the upper node. This
> likely ties into costing or adaptive execution decisions. It's
> something I’m open to exploring in a future iteration, perhaps once we
> have more feedback on how the existing batching performs in various
> scenarios. It might also require some planner or executor smarts
> (maybe the executor can decide to batch on the fly if it sees a
> pattern of use, or the planner could insert such nodes when
> beneficial).
> 

Yeah, those are good questions. I don't have a clear idea how we should
decide when to do this batching. Costing during planning is the
"traditional" option, with all the issues (e.g. it requires a reasonably
good cost model). Another option would be some sort of execution-time
heuristics - but then which node would be responsible for building the
batches (if we didn't create them during planning)?

I agree it makes sense to focus on batching at the TAM/scan level for
now. That's a pretty big project already.

>> You also mentioned we could make batches larger by letting them span
>> multiple pages, etc. I'm not sure that's worth it - wouldn't that
>> substantially complicate the TAM code, which would need to pin+track
>> multiple buffers for each batch, etc.? Possible, but is it worth it?
>>
>> I'm not sure allowing multi-page batches would actually solve the issue.
>> It'd help with batches at the "scan level", but presumably the batch
>> size in the upper nodes matters just as much. Large scan batches may
>> help, but hard to predict.
>>
>> In the index prefetching patch we chose to keep batches 1:1 with leaf
>> pages, at least for now. Instead we allowed having multiple batches at
>> once. I'm not sure that'd be necessary for TAMs, though.
> 
> I tend to agree with you here. Allowing a single batch to span
> multiple pages would add quite a bit of complexity to the table AM
> implementations (managing multiple buffer pins per batch, tracking
> page boundaries, etc.), and it's unclear if the benefit would justify
> that complexity. For now, I'm inclined not to pursue multi-page
> batches at the scan level in this patch. We can keep the batch
> page-local (e.g., for heap, one batch corresponds to max one page, as
> it does now). If we need larger batch sizes overall, we might address
> that by other means -- for example, by the above-mentioned idea of a
> higher-level batching node or by simply producing multiple batches in
> quick succession.
> 

+1

> You’re right that even if we made scan batches larger, it doesn’t
> necessarily solve everything, since the effective batch size at
> higher-level nodes could still be constrained by other factors. So
> rather than complicating the low-level TAM code with multi-page
> batches, I'd prefer to first see if the current approach (with
> one-page batches) yields good benefits and then consider alternatives.
> We could also consider letting a scan node produce multiple batches
> before yielding to the upper node (similar to how the index
> prefetching patch can have multiple leaf page batches in flight) if
> needed, but as you note, it might not be necessary for TAMs yet. So at
> this stage, I'll keep it simple.
> 

+1

>> This also reminds me of LIMIT queries. The way I imagine a "batchified"
>> executor to work is that batches are essentially "units of work". For
>> example, a nested loop would grab a batch of tuples from the outer
>> relation, lookup inner tuples for the whole batch, and only then pass
>> the result batch. (I'm ignoring the cases when the batch explodes due to
>> duplicates.)
>>
>> But what if there's a LIMIT 1 on top? Maybe it'd be enough to process
>> just the first tuple, and the rest of the batch is wasted work? Plenty
>> of (very expensive) OLAP have that, and many would likely benefit from
>> batching, so just disabling batching if there's LIMIT seems way too
>> heavy handed.
> 
> Yeah, LIMIT does complicate downstream batching decisions. If we
> always use a full-size batch (say 64 tuples) for every operation, a
> query with LIMIT 1 could end up doing a lot of unnecessary work
> fetching and processing 63 tuples that never get used. Disabling
> batching entirely for queries with LIMIT would indeed be overkill and
> lose benefits for cases where the limit is not extremely selective.
> 
>> Perhaps it'd be good to gradually ramp up the batch size? Start with
>> small batches, and then make them larger. The index prefetching does
>> that too, indirectly - it reads the whole leaf page as a batch, but then
>> gradually ramps up the prefetch distance (well, read_stream does that).
>> Maybe the batching should have similar thing ...
> 
> An adaptive batch size that ramps up makes a lot of sense as a
> solution. We could start with a very small batch (say 4 tuples) and if
> we detect that the query needs more (e.g., the LIMIT wasn’t satisfied
> yet or more output is still being consumed), then increase the batch
> size for subsequent operations. This way, a query that stops early
> doesn’t incur the full batching overhead, whereas a query that does
> process lots of tuples will gradually get to a larger batch size to
> gain efficiency. This is analogous to how the index prefetching ramps
> up prefetch distance, as you mentioned.
> 
> Implementing that will require some careful thought. It could be done
> either in the planner (choose initial batch sizes based on context
> like LIMIT) or more dynamically in the executor (adjust on the fly). I
> lean towards a runtime heuristic because it’s hard for the planner to
> predict exactly how a LIMIT will play out, especially in complex
> plans. In any case, I agree that a gradual ramp-up or other adaptive
> approach would make batching more robust in the presence of query
> execution variability. I will definitely consider adding such logic,
> perhaps as an improvement once the basic framework is in.
> 

I agree a runtime heuristic is probably the right approach. After all,
a lot of the issues with LIMIT queries are due to the planner not knowing
the real data distribution, etc.
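
Just to make the idea a bit more concrete, I'd imagine something as
trivial as this in the scan node -- all names here are made up, Min()
is just the usual macro, and EXEC_BATCH_ROWS is the current cap:

    /* Sketch only: grow the requested batch size while batches keep coming
     * back full, so a LIMIT 1 query never pays for a 64-tuple batch. */
    static inline int
    next_batch_size(int cur_size, int max_size, bool batch_was_full)
    {
        if (batch_was_full && cur_size < max_size)
            return Min(cur_size * 2, max_size);
        return cur_size;
    }

    /* e.g. node->batch_size starts at 4 and is updated after each batch:
     *   node->batch_size = next_batch_size(node->batch_size,
     *                                      EXEC_BATCH_ROWS, full);
     */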

>> In fact, how shall the optimizer decide whether to use batching? It's
>> one thing to decide whether a node can produce/consume batches, but
>> another thing is "should it"? With a node that "builds" a batch, this
>> decision would apply to even more plans, I guess.
>>
>> I don't have a great answer to this, it seems like an incredibly tricky
>> costing issue. I'm a bit worried we might end up with something too
>> coarse, like "jit=on" which we know is causing problems (admittedly,
>> mostly due to a lot of the LLVM work being unpredictable/external). But
>> having some "adaptive" heuristics (like the gradual ramp up) might make
>> it less risky.
> 
> I agree that deciding when to use batching is tricky. So far, the
> patch takes a fairly simplistic approach: if a node (particularly a
> scan node) supports batching, it just does it, and other parts of the
> plan will consume batches if they are capable. There isn’t yet a
> nuanced cost-based decision in the planner for enabling batching. This
> is indeed something we’ll have to refine. We don’t want to end up with
> a blunt on/off GUC that could cause regressions in some cases.
> 
> One idea is to introduce costing for batching: for example, estimate
> the per-tuple savings from batching vs the overhead of materialization
> or batch setup. However, developing a reliable cost model for that
> will take time and experimentation, especially with the possibility of
> variable batch sizes or adaptive behavior. Not to mention, that will
> be adding one more dimension to the planner's costing model, making
> planning more expensive and unpredictable.  In the near term, I’m fine
> with relying on feedback and perhaps manual tuning (GUCs, etc.) to
> decide on batching, but that’s perhaps not a long-term solution.
> 

Yeah, the cost model is going to be hard, because this depends on so
many low-level plan/hardware details. Like, the TAM may allow execution
on compressed data / leverage vectorization, .... But maybe the CPU does
not do that efficiently? There's so many unknown unknowns ...

Considering we still haven't fixed the JIT cost model, maybe it's better
to not rely on it too much for this batching patch? Also, all those
details contradict the idea that cost models are a simplified model of
the reality.

> I share your inclination that adaptive heuristics might be the safer
> path initially. Perhaps the executor can decide to batch or not batch
> based on runtime conditions. The gradual ramp-up of batch size is one
> such adaptive approach. We could also consider things like monitoring
> how effective batching is (are we actually processing full batches or
> frequently getting cut off?) and adjust behavior. These are somewhat
> speculative ideas at the moment, but the bottom line is I’m aware we
> need a smarter strategy than a simple switch. This will likely evolve
> as we test the patch in more scenarios.
> 

I think the big question is how much can the batching change the
relative cost of two plans (I mean, actual cost, not just estimates).

Imagine plans P1 and P2, where

   cost(P1) < cost(P2) = cost(P1) + delta

where "delta" is small (so P1 is faster, but not much).  If we
"batchify" the plans into P1' and P2', can this happen?

  cost(P1') >> cost(P2')

That is, can the "slower" plan P2 benefit much more from the batching,
making it significantly faster?

If this is unlikely, we could entirely ignore batching during planning,
and only do that as post-processing on the selected plan, or perhaps
even just during execution.

OTOH that's what JIT does, and we know it's not perfect - but that's
mostly because JIT has rather unpredictable costs when enabled. Maybe
batching doesn't have that.

>> FWIW the current batch size limit (64 tuples) seems rather low, but it's
>> hard to say. It'd be good to be able to experiment with different
>> values, so I suggest we make this a GUC and not a hard-coded constant.
> 
> Yeah, I was thinking the same while testing -- the optimal batch size
> might vary by workload or hardware, and 64 was a somewhat arbitrary
> starting point. I will make the batch size limit configurable
> (probably as a GUC executor_batch_tuples, maybe only developer-focused
> at first). That will let us and others experiment easily with
> different batch sizes to see how it affects performance. It should
> also help with your earlier point: for example, on a machine where 64
> is too low or too high, we can adjust it without recompiling. So yes,
> I'll add a GUC for the batch size in the next version of the patch.
> 

+1 to have developer-only GUC for testing. But the goal should be to not
expect users to tune this.

>> As for what to add to explain, I'd start by adding info about which
>> nodes are "batched" (consuming/producing batches), and some info about
>> the batch sizes. An average size, maybe a histogram if you want to be a
>> bit fancy.
> 
> Adding more information to EXPLAIN is a good idea. In the current
> patch, EXPLAIN does not show anything about batching, but it would be
> very helpful for debugging and user transparency to indicate which
> nodes are operating in batch mode.  I will update EXPLAIN to mark
> nodes that produce or consume batches. Likely I’ll start with
> something simple like an extra line or tag for a node, e.g., "Batch:
> true (avg batch size 64)" or something along those lines. An average
> batch size could be computed if we have instrumentation, which would
> be useful to see if, say, the batch sizes ended up smaller due to
> LIMIT or other factors. A full histogram might be more detail than
> most users need, but I agree even just knowing average or maximum
> batch size per node could be useful for performance analysis. I'll
> implement at least the basics for now, and we can refine it (maybe add
> more stats) if needed.

+1 to start with something simple

> 
> (I had added a flag in the EXPLAIN output at one point, but removed it
> due to finding the regression output churn too noisy, though I
> understand I'll have to bite the bullet at some point.)
> 

Why would there be regression churn, if the option is disabled by default?

>> Now, numbers from some microbenchmarks:
>>
>> ...
>>>> Perhaps I did something wrong. It does not surprise me this is somewhat
>> CPU dependent. It's a bit sad the improvements are smaller for the newer
>> CPU, though.
> 
> Thanks for sharing your benchmark results -- that’s very useful data.
> I haven’t yet finished investigating why there's a regression relative
> to master when executor_batching is turned off. I re-ran my benchmarks
> to include comparisons with master and did observe some regressions in
> a few cases too, but I didn't see anything obvious in profiles that
> explained the slowdown. I initially assumed it might be noise, but now
> I suspect it could be related to structural changes in the scan code
> -- for example, I added a few new fields in the middle of
> HeapScanDescData, and even though the batching logic is bypassed when
> executor_batching is off, it’s possible that change alone affects
> memory layout or cache behavior in a way that penalizes the unbatched
> path. I haven’t confirmed that yet, but it’s on my list to look into
> more closely.
> 
> Your observation that newer CPUs like the Ryzen may see smaller
> improvements makes sense -- perhaps they handle the per-tuple overhead
> more efficiently to begin with. Still, I’d prefer not to see
> regressions at all, even in the unbatched case, so I’ll focus on
> understanding and fixing that part before drawing conclusions from the
> performance data.
> 
> Thanks again for the scripts -- those will help a lot in narrowing things down.
> 

If needed, I can rerun the tests and collect additional information
(e.g. maybe perf-stat or perf-diff would be interesting).

>> I also tried running TPC-H. I don't have useful numbers yet, but I ran
>> into a segfault - see the attached backtrace. It only happens with the
>> batching, and only on Q22 for some reason. I initially thought it's a
>> bug in clang, because I saw it with clang-22 built from git, and not
>> with clang-14 or gcc. But since then I reproduced it with clang-19 (on
>> debian 13). Still could be a clang bug, of course. I've seen ~20 of
>> those segfaults so far, and the backtraces look exactly the same.
> 
> The v3 I posted fixes a tricky bug in the new EEOPs for batched-agg
> evaluation that I suspect is also causing the crash you saw.
> 
> I'll try to post a v4 in a couple of weeks with some of the things I
> mentioned above.
> 

Sounds good. Thank you.


regards

-- 
Tomas Vondra




Re: Batching in executor

От
Peter Geoghegan
Дата:
On Mon, Sep 29, 2025 at 7:01 AM Tomas Vondra <tomas@vondra.me> wrote:
> While looking at the patch, I couldn't help but think about the index
> prefetching stuff that I work on. It also introduces the concept of a
> "batch", for passing data between an index AM and the executor. It's
> interesting how different the designs are in some respects. I'm not
> saying one of those designs is wrong, it's more due to different goals.

I've been working on a new prototype enhancement to the index
prefetching patch. The new spinoff patch has index scans batch up
calls to heap_hot_search_buffer for heap TIDs that the scan has yet to
return. This optimization is effective whenever an index scan returns
a contiguous group of TIDs that all point to the same heap page. We're
able to lock and unlock heap page buffers at the same point that
they're pinned and unpinned, which can dramatically decrease the
number of heap buffer locks acquired by index scans that return
contiguous TIDs (which is very common).

I find that speedups for pgbench SELECT variants with a predicate such
as "WHERE aid BETWEEN 1000 AND 1500" can have up to ~20% higher
throughput, at least in cases with low client counts (think 1 or 2
clients). These are cases where everything can fit in shared buffers,
so we're not getting any benefit from I/O prefetching (in spite of the
fact that this is built on top of the index prefetching patchset).

It makes sense to put this in scope for the index prefetching work
because that work will already give code outside of an index AM
visibility into which group of TIDs need to be read next. Right now
(on master) there is some trivial sense in which index AMs use their
own batches, but that's completely hidden from external callers.

> For example, the index prefetching patch establishes a "shared" batch
> struct, and the index AM is expected to fill it with data. After that,
> the batch is managed entirely by indexam.c, with no AM calls. The only
> AM-specific bit in the batch is "position", but that's used only when
> advancing to the next page, etc.

The major difficulty with my heap batching prototype is getting the
layering right (no surprises there). In some sense we're deliberately
sharing information across what we currently think of as
different layers of abstraction, in order to be able to "schedule" the
work more intelligently. There's a number of competing considerations.

I have invented a new concept of heap batch, that is orthogonal to the
existing concept of index batches. Right now these are just an array
of HeapTuple structs that relate to exactly one group of
contiguous heap TIDs (i.e. if the index scan returns TIDs even a
little out of order, which is fairly common, we cannot currently
reorder the work in the current prototype patch).

Once a batch is prepared, calls to heapam_index_fetch_tuple just
return the next TID from the batch (until the next time we have to
return a TID pointing to some distinct heap block). In the case of
pgbench queries like the one I mentioned, we only need to call
LockBuffer/heap_hot_search_buffer once for every 61 heap tuples
returned (not once per heap tuple returned).
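
To give a feel for the shape of it, here is a heavily simplified sketch
(the "batch" struct and its fields are made up, error handling and pin
management are omitted -- this is not the prototype's actual code):

    /* Prepare a batch for a run of TIDs that all point to the same heap
     * page: lock the buffer once, probe each HOT chain, then unlock. */
    Buffer  buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(&tids[0]));

    LockBuffer(buffer, BUFFER_LOCK_SHARE);
    for (int i = 0; i < ntids; i++)
    {
        bool    all_dead;

        batch->found[i] = heap_hot_search_buffer(&tids[i], relation, buffer,
                                                 snapshot,
                                                 &batch->tuples[i],
                                                 &all_dead,
                                                 true /* new TID each time */);
        batch->tuples[i].t_self = tids[i];
    }
    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

    /* heapam_index_fetch_tuple then hands out batch->tuples[] one by one,
     * without touching the buffer lock again. */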

Importantly, the new interface added by my new prototype spinoff patch
is higher level than the existing
table_index_fetch_tuple/heapam_index_fetch_tuple interface. The
executor asks the table AM "give me the next heap TID in the current
scan direction", rather than asking "give me this heap TID". The
general idea is that the table AM has a direct understanding of
ordered index scans.

The advantage of this higher-level interface is that it gives the
table AM maximum freedom to reorder work. As I said already, we won't
do things like merge together logically noncontiguous accesses to the
same heap page into one physical access right now. But I think that
that should at least be enabled by this interface.

The downside of this approach is that the table AM (not the executor
proper) is responsible for interfacing with the index AM layer. I
think that this can be generalized without very much code duplication
across table AMs. But it's hard.

> This patch does things differently. IIUC, each TAM may produce its own
> "batch", which is then wrapped in a generic one. For example, heap
> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> is fine. In the prefetching we chose to move all this code (walking the
> batch items) from the AMs into the layer above, and make it AM agnostic.

I think that the base index prefetching patch's current notion of
index-AM-wise batches can be kept quite separate from any table AM
batch concept that might be invented, either as part of what I'm
working on, or in Amit's patch.

It probably wouldn't be terribly difficult to get the new interface
I've described to return heap tuples in whatever batch format Amit
comes up with. That only has a benefit if it makes life easier for
expression evaluation in higher levels of the plan tree, but it might
just make sense to always do it that way. I doubt that adopting Amit's
batch format will make life much harder for the
heap_hot_search_buffer-batching mechanism (at least if it is generally
understood that its new index scan interface builds batches in
Amit's format on a best-effort basis).

--
Peter Geoghegan



Re: Batching in executor

От
Amit Langote
Дата:
Hi Peter,

Thanks for chiming in here.

On Tue, Oct 28, 2025 at 2:37 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Sep 29, 2025 at 7:01 AM Tomas Vondra <tomas@vondra.me> wrote:
> > While looking at the patch, I couldn't help but think about the index
> > prefetching stuff that I work on. It also introduces the concept of a
> > "batch", for passing data between an index AM and the executor. It's
> > interesting how different the designs are in some respects. I'm not
> > saying one of those designs is wrong, it's more due to different goals.
>
> I've been working on a new prototype enhancement to the index
> prefetching patch. The new spinoff patch has index scans batch up
> calls to heap_hot_search_buffer for heap TIDs that the scan has yet to
> return. This optimization is effective whenever an index scan returns
> a contiguous group of TIDs that all point to the same heap page. We're
> able to lock and unlock heap page buffers at the same point that
> they're pinned and unpinned, which can dramatically decrease the
> number of heap buffer locks acquired by index scans that return
> contiguous TIDs (which is very common).
>
> I find that speedups for pgbench SELECT variants with a predicate such
> as "WHERE aid BETWEEN 1000 AND 1500" can have up to ~20% higher
> throughput, at least in cases with low client counts (think 1 or 2
> clients). These are cases where everything can fit in shared buffers,
> so we're not getting any benefit from I/O prefetching (in spite of the
> fact that this is built on top of the index prefetching patchset).

I gathered from the index prefetching thread that it is mainly about
enabling I/O prefetching, so it's nice to see that kind of speedup
even for the in-memory case.

Is this spinoff patch separate from the one that adds amgetbatch() to
IndexAmRoutine which you posted on Oct 12? If so, where can I find it?

> It makes sense to put this in scope for the index prefetching work
> because that work will already give code outside of an index AM
> visibility into which group of TIDs need to be read next. Right now
> (on master) there is some trivial sense in which index AMs use their
> own batches, but that's completely hidden from external callers.

As you might know, heapam's TableAmRoutine.scan_* functions use a
"pagemode" in some cases, which fills a batch of tuples in
HeapScanDescData.rs_vistuples. However, that batch currently only stores
the tuples’ offset numbers. I started this work based on Andres’s
suggestion to propagate that batch up into the executor’s scan nodes.
The idea is to create a HeapTuple array sized according to the
executor’s batch size, and then populate it when the scan node calls
the new TableAmRoutine.scan_batch* variant. There might be some
overlap between our respective ideas.
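
Roughly, the heapam side of what I described above looks like the
following sketch (simplified, not the exact code in the patch; "batch"
and "maxitems" stand for whatever the wrapper passes down):

    /* Sketch: copy the current page's visible tuples, already collected by
     * pagemode in rs_vistuples[], into the caller-provided HeapTuple array. */
    Page    page = BufferGetPage(hscan->rs_cbuf);
    int     n = 0;

    while (n < maxitems && hscan->rs_cindex < hscan->rs_ntuples)
    {
        OffsetNumber offnum = hscan->rs_vistuples[hscan->rs_cindex++];
        ItemId       lpp = PageGetItemId(page, offnum);
        HeapTuple    htup = &batch->tuples[n++];

        htup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
        htup->t_len = ItemIdGetLength(lpp);
        htup->t_tableOid = RelationGetRelid(hscan->rs_base.rs_rd);
        ItemPointerSet(&htup->t_self, hscan->rs_cblock, offnum);
    }
    return n;    /* 0 means the page is exhausted; fetch the next one */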

> > For example, the index prefetching patch establishes a "shared" batch
> > struct, and the index AM is expected to fill it with data. After that,
> > the batch is managed entirely by indexam.c, with no AM calls. The only
> > AM-specific bit in the batch is "position", but that's used only when
> > advancing to the next page, etc.
>
> The major difficulty with my heap batching prototype is getting the
> layering right (no surprises there). In some sense we're deliberately
> sharing information across what we currently think of as
> different layers of abstraction, in order to be able to "schedule" the
> work more intelligently. There's a number of competing considerations.
>
> I have invented a new concept of heap batch, that is orthogonal to the
> existing concept of index batches. Right now these are just an array
> of HeapTuple structs that relate to exactly one group of
> contiguous heap TIDs (i.e. if the index scan returns TIDs even a
> little out of order, which is fairly common, we cannot currently
> reorder the work in the current prototype patch).
>
> Once a batch is prepared, calls to heapam_index_fetch_tuple just
> return the next TID from the batch (until the next time we have to
> return a TID pointing to some distinct heap block). In the case of
> pgbench queries like the one I mentioned, we only need to call
> LockBuffer/heap_hot_search_buffer once for every 61 heap tuples
> returned (not once per heap tuple returned).
>
> Importantly, the new interface added by my new prototype spinoff patch
> is higher level than the existing
> table_index_fetch_tuple/heapam_index_fetch_tuple interface. The
> executor asks the table AM "give me the next heap TID in the current
> scan direction", rather than asking "give me this heap TID". The
> general idea is that the table AM has a direct understanding of
> ordered index scans.
>
> The advantage of this higher-level interface is that it gives the
> table AM maximum freedom to reorder work. As I said already, we won't
> do things like merge together logically noncontiguous accesses to the
> same heap page into one physical access right now. But I think that
> that should at least be enabled by this interface.

Interesting. It sounds like you aim to replace the fetch_tuple
interface with a more generic one, is that right?

> The downside of this approach is that the table AM (not the executor
> proper) is responsible for interfacing with the index AM layer. I
> think that this can be generalized without very much code duplication
> across table AMs. But it's hard.

Seems so.

> > This patch does things differently. IIUC, each TAM may produce its own
> > "batch", which is then wrapped in a generic one. For example, heap
> > produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> > is fine. In the prefetching we chose to move all this code (walking the
> > batch items) from the AMs into the layer above, and make it AM agnostic.
>
> I think that the base index prefetching patch's current notion of
> index-AM-wise batches can be kept quite separate from any table AM
> batch concept that might be invented, either as part of what I'm
> working on, or in Amit's patch.
>
> It probably wouldn't be terribly difficult to get the new interface
> I've described to return heap tuples in whatever batch format Amit
> comes up with. That only has a benefit if it makes life easier for
> expression evaluation in higher levels of the plan tree, but it might
> just make sense to always do it that way. I doubt that adopting Amit's
> batch format will make life much harder for the
> heap_hot_search_buffer-batching mechanism (at least if it is generally
> understood that its new index scan interface builds batches in
> Amit's format on a best-effort basis).

In my implementation, the new TableAmRoutine.scan_getnextbatch()
returns a batch as an opaque table AM structure, which can then be
passed up to the upper levels of the plan. Patch 0001 in my series
adds the following to the TableAmRoutine API:

+    /* ------------------------------------------------------------------------
+     * Batched scan support
+     * ------------------------------------------------------------------------
+     */
+
+    void       *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+    int         (*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+                                     ScanDirection dir);
+    void        (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
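
For context, the consumption side in the scan node looks roughly like
this sketch (simplified, not the exact code; the table_scan_*() wrappers
are just the obvious thin wrappers around the callbacks above, and
TupleBatchSetAMBatch() is a stand-in name):

    /* Sketch of the per-call loop in a batch-aware scan node. */
    if (node->am_batch == NULL)
        node->am_batch = table_scan_begin_batch(scandesc, EXEC_BATCH_ROWS);

    nitems = table_scan_getnextbatch(scandesc, node->am_batch, direction);
    if (nitems == 0)
    {
        table_scan_end_batch(scandesc, node->am_batch);
        node->am_batch = NULL;
        return NULL;                /* scan exhausted */
    }

    /* wrap the AM-specific batch in the generic TupleBatch for upper nodes */
    TupleBatchSetAMBatch(node->tuplebatch, node->am_batch, nitems);
    return node->tuplebatch;        /* handed to ExecProcNodeBatch() callers */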

I haven't seen what your version looks like, but if it is compatible
with the above, I'd be happy to adopt a batch format that accommodates
multiple use cases.

--
Thanks, Amit Langote



Re: Batching in executor

От
Amit Langote
Дата:
On Tue, Oct 28, 2025 at 1:18 AM Tomas Vondra <tomas@vondra.me> wrote:
> On 10/27/25 08:24, Amit Langote wrote:
> > Thank you for reviewing the patch and taking the time to run those
> > experiments. I appreciate the detailed feedback and questions.  I also
> > apologize for my late reply, I spent perhaps way too much time going
> > over your index prefetching thread trying to understand the notion of
> > batching that it uses and getting sidelined by other things while
> > writing this reply.
>
> Cool! Now you can do a review of the index prefetch patch ;-)

Would love to and I'm adding that to my list.  :)

> >> How far ahead have you thought about these capabilities? I was wondering
> >> about two things in particular. First, at which point do we have to
> >> "materialize" the TupleBatch into some generic format (e.g. TupleSlots).
> >> I get it that you want to enable passing batches between nodes, but
> >> would those use the same "format" as the underlying scan node, or some
> >> generic one? Second, will it be possible to execute expressions on the
> >> custom batches (i.e. on "compressed data")? Or is it necessary to
> >> "materialize" the batch into regular tuple slots? I realize those may
> >> not be there "now" but maybe it'd be nice to plan for the future.
> >
> > I have been thinking about those future capabilities. Currently, the
> > patch keeps tuples in the TAM-specific batch format up until they need
> > to be consumed by a node that doesn’t understand that format or has
> > not been modified to invoke the TAM callbacks to decode it.  In the
> > current patch, that means we materialize to regular TupleTableSlots at
> > nodes that require it (for example, the scan node reading from TAM
> > needing to evaluate quals, etc.). However, the intention is to allow
> > batches to be passed through as many nodes as possible without
> > materialization, ideally using the same format produced by the scan
> > node all the way up until reaching a node that can only work with
> > tuples in TupleTableSlots.
> >
> > As for executing expressions directly on the custom batch data: that’s
> > something I would like to enable in the future. Right now, expressions
> > (quals, projections, etc.) are evaluated after materializing into
> > normal tuples in TupleTableSlots stored in TupleBatch, because the
> > expression evaluation code isn’t yet totally batch-aware and is very far
> > from being able to do things like operate on compressed data in its
> > native form.
> > Patches 0004-0008 do try to add batch-aware expression evaluation but
> > that's just a prototype.  In the long term, the goal is to allow
> > expression evaluation on batch data (for example, applying a WHERE
> > clause or aggregate transition directly on a columnar batch without
> > converting it to heap tuples first). This will require significant new
> > infrastructure (perhaps specialized batch-aware expression operators
> > and functions), so it's not in the current patch, but I agree it's
> > important to plan for it. The current design doesn’t preclude it, it
> > lays some groundwork by introducing the batch abstraction -- but fully
> > supporting that will be future work.
> >
> > That said, one area I’d like to mention while at it, especially to
> > enable native execution on compressed or columnar batches, is giving
> > the table AM more control over how expression evaluation is performed
> > on its batch data. In the current patch, the AM can provide a
> > materialize function via TupleBatchOps, but that always produces an
> > array of TupleTableSlots stored in the TupleBatch, not an opaque
> > representation that remains under AM control. Maybe that's not bad for
> > a v1 patch.
>
> I think materializing into a batch of TupleTableSlots (and then doing
> the regular expression evaluation) seems perfectly fine for v1. It's the
> simplest fallback possible, and we'll need it anyway if overriding the
> expression evaluation will be optional (which I assume it will be?).

Yes.  The ability to materialize into TupleTableSlots won't be
optional for the table AM's BatchOps.  Converting to other formats
would be.

> > When evaluating expressions over a batch, a BatchVector
> > is built by looping over these slots and invoking the standard
> > per-tuple getsomeattrs() to "deform" a tuple into needed columns.
> > While that enables batch-style EEOPs for qual evaluation and aggregate
> > transition (and is already a gain over per-row evaluation), it misses
> > the opportunity to leverage any batch-specific optimizations the AM
> > could offer, such as vectorized decoding or filtering over compressed
> > data, and other AM optimizations for getting only the necessary
> > columns out possibly in a vector format.
> >
>
> I'm not sure about this BatchVector thing. I haven't looked into that
> very much, I'd expect the construction to be more expensive than the
> benefits (compared to just doing the materialize + regular evaluation),
> but maybe I'm completely wrong. Or maybe we could keep the vector
> representation for multiple operations? No idea.

Constructing the BatchVector does require looping over the batch and
deforming each tuple, typically via getsomeattrs(). So yes, there’s an
up-front cost similar to materialization. But the goal is to amortize
that by enabling expression evaluation to run in a tight loop over
column vectors, avoiding repeated jumps into slot/AM code for each
tuple and each column. That can reduce branching and improve locality.
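
For concreteness, the construction plus one consumer looks roughly like
this (a simplified sketch; "vec", "attnums", and "max_attnum_needed" are
illustrative names, and the patch's actual BatchVector layout differs in
details):

    /* Deform each tuple once, gathering the needed columns into
     * per-column arrays. */
    for (int i = 0; i < batch->ntuples; i++)
    {
        TupleTableSlot *slot = batch->slots[i];

        slot_getsomeattrs(slot, max_attnum_needed);
        for (int c = 0; c < ncols; c++)
        {
            vec->values[c][i] = slot->tts_values[attnums[c] - 1];
            vec->isnull[c][i] = slot->tts_isnull[attnums[c] - 1];
        }
    }

    /* An int8 sum transition can then run as a tight loop over one column,
     * with no per-tuple function call or slot access. */
    for (int i = 0; i < batch->ntuples; i++)
        if (!vec->isnull[0][i])
            sum += DatumGetInt64(vec->values[0][i]);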

In its current form, the BatchVector is ephemeral -- it's built just
before expression evaluation and discarded after. But your idea of
reusing the same vector across multiple operations is interesting.
That would let us spread out the construction cost even further and
might be necessary to justify the overhead fully in some cases. I’ll
keep that in mind.

> But it seems like a great area for experimenting ...

Yep.

> > I’m considering extending TupleTableSlotOps with a batch-aware variant
> > of getsomeattrs(), something like slot_getsomeattrs_batch(), so that
> > AMs can populate column vectors (e.g., BatchVector) directly from
> > their native format. That would allow bypassing slot materialization
> > entirely and plug AM-provided decoding logic directly into the
> > executor’s batch expression paths. This isn’t implemented yet, but I
> > see it as a necessary step toward supporting fully native expression
> > evaluation over compressed or columnar formats. I’m not yet sure if
> > TupleTableSlotOps is the right place for such a hook, it might belong
> > elsewhere in the abstraction, but exposing a batch-aware interface for
> > this purpose seems like the right direction.
> >
>
> No opinion. I don't see it as a necessary prerequisite for the other
> parts of the patch series, but maybe the BatchVector really helps, and
> then this would make perfect sense. I'm not sure there's a single
> "correct" sequence in which to do these improvements, it's always a
> matter of opinion.

Yes, I think we can come back to this later.

> >> The other option would be to "create batches" during execution, say by
> >> having a new node that accumulates tuples, builds a batch and sends it
> >> to the node above. This would help both in cases when either the lower
> >> node does not produce batches at all, or the batches are too small (due
> >> to filtering, aggregation, ...). Of course, it'd only win if this
> >> increases efficiency of the upper part of the plan enough to pay for
> >> building the batches. That can be a hard decision.
> >
> > Yes, introducing a dedicated executor node to accumulate and form
> > batches on the fly is an interesting idea, I have thought about it and
> > even mentioned it in passing in the pgconf.dev unconference. This
> > could indeed cover scenarios where the data source (a node) doesn't
> > produce batches (e.g., a non-batching node feeding into a
> > batching-aware upper node) or where batches coming from below are too
> > small to be efficient. The current patch set doesn’t implement such a
> > node; I focused on enabling batching at the scan/TAM level first. The
> > cost/benefit decision for a batch-aggregator node is tricky, as you
> > said. We’d need a way to decide when the overhead of gathering tuples
> > into a batch is outweighed by the benefits to the upper node. This
> > likely ties into costing or adaptive execution decisions. It's
> > something I’m open to exploring in a future iteration, perhaps once we
> > have more feedback on how the existing batching performs in various
> > scenarios. It might also require some planner or executor smarts
> > (maybe the executor can decide to batch on the fly if it sees a
> > pattern of use, or the planner could insert such nodes when
> > beneficial).
> >
>
> Yeah, those are good questions. I don't have a clear idea how we should
> decide when to do this batching. Costing during planning is the
> "traditional" option, with all the issues (e.g. it requires a reasonably
> good cost model). Another option would be some sort of execution-time
> heuristics - but then which node would be responsible for building the
> batches (if we didn't create them during planning)?
>
> I agree it makes sense to focus on batching at the TAM/scan level for
> now. That's a pretty big project already.

Right -- batching at the TAM/scan level is already a sizable project,
especially given its interaction with prefetching work (maybe). I
think it's best to focus design effort there and on batched expression
evaluation first, and only revisit batch-creation nodes once that
groundwork is in place.

> >> In fact, how shall the optimizer decide whether to use batching? It's
> >> one thing to decide whether a node can produce/consume batches, but
> >> another thing is "should it"? With a node that "builds" a batch, this
> >> decision would apply to even more plans, I guess.
> >>
> >> I don't have a great answer to this, it seems like an incredibly tricky
> >> costing issue. I'm a bit worried we might end up with something too
> >> coarse, like "jit=on" which we know is causing problems (admittedly,
> >> mostly due to a lot of the LLVM work being unpredictable/external). But
> >> having some "adaptive" heuristics (like the gradual ramp up) might make
> >> it less risky.
> >
> > I agree that deciding when to use batching is tricky. So far, the
> > patch takes a fairly simplistic approach: if a node (particularly a
> > scan node) supports batching, it just does it, and other parts of the
> > plan will consume batches if they are capable. There isn’t yet a
> > nuanced cost-based decision in the planner for enabling batching. This
> > is indeed something we’ll have to refine. We don’t want to end up with
> > a blunt on/off GUC that could cause regressions in some cases.
> >
> > One idea is to introduce costing for batching: for example, estimate
> > the per-tuple savings from batching vs the overhead of materialization
> > or batch setup. However, developing a reliable cost model for that
> > will take time and experimentation, especially with the possibility of
> > variable batch sizes or adaptive behavior. Not to mention, that will
> > be adding one more dimension to the planner's costing model, making
> > planning more expensive and unpredictable.  In the near term, I’m fine
> > with relying on feedback and perhaps manual tuning (GUCs, etc.) to
> > decide on batching, but that’s perhaps not a long-term solution.
> >
>
> Yeah, the cost model is going to be hard, because this depends on so
> many low-level plan/hardware details. Like, the TAM may allow execution
> on compressed data / leverage vectorization, .... But maybe the CPU does
> not do that efficiently? There's so many unknown unknowns ...
>
> Considering we still haven't fixed the JIT cost model, maybe it's better
> to not rely on it too much for this batching patch? Also, all those
> details contradict the idea that cost models are a simplified model of
> the reality.

Yeah, totally agreed -- the complexity and unpredictability here are
real, and your point about JIT costing is a good reminder not to
over-index on planner models for now.

> > I share your inclination that adaptive heuristics might be the safer
> > path initially. Perhaps the executor can decide to batch or not batch
> > based on runtime conditions. The gradual ramp-up of batch size is one
> > such adaptive approach. We could also consider things like monitoring
> > how effective batching is (are we actually processing full batches or
> > frequently getting cut off?) and adjust behavior. These are somewhat
> > speculative ideas at the moment, but the bottom line is I’m aware we
> > need a smarter strategy than a simple switch. This will likely evolve
> > as we test the patch in more scenarios.
> >
>
> I think the big question is how much can the batching change the
> relative cost of two plans (I mean, actual cost, not just estimates).
>
> Imagine plans P1 and P2, where
>
>    cost(P1) < cost(P2) = cost(P1) + delta
>
> where "delta" is small (so P1 is faster, but not much).  If we
> "batchify" the plans into P1' and P2', can this happen?
>
>   cost(P1') >> cost(P2')
>
> That is, can the "slower" plan P2 benefit much more from the batching,
> making it significantly faster?
>
> If this is unlikely, we could entirely ignore batching during planning,
> and only do that as post-processing on the selected plan, or perhaps
> even just during execution.
>
> OTOH that's what JIT does, and we know it's not perfect - but that's
> mostly because JIT has rather unpredictable costs when enabled. Maybe
> batching doesn't have that.

That’s an interesting scenario. I suspect batching (even with SIMD)
won’t usually flip plan orderings that dramatically -- i.e., turning
the clearly slower plan into the faster one -- though I could be
wrong. But I agree with the conclusion: this supports treating
batching as an executor concern, at least initially. Might be worth
seeing if there’s any relevant guidance in systems literature too.

> >> FWIW the current batch size limit (64 tuples) seems rather low, but it's
> >> hard to say. It'd be good to be able to experiment with different
> >> values, so I suggest we make this a GUC and not a hard-coded constant.
> >
> > Yeah, I was thinking the same while testing -- the optimal batch size
> > might vary by workload or hardware, and 64 was a somewhat arbitrary
> > starting point. I will make the batch size limit configurable
> > (probably as a GUC executor_batch_tuples, maybe only developer-focused
> > at first). That will let us and others experiment easily with
> > different batch sizes to see how it affects performance. It should
> > also help with your earlier point: for example, on a machine where 64
> > is too low or too high, we can adjust it without recompiling. So yes,
> > I'll add a GUC for the batch size in the next version of the patch.
> >
>
> +1 to have developer-only GUC for testing. But the goal should be to not
> expect users to tune this.

Yes.
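
FWIW, the guc_tables.c entry I have in mind would be a plain developer
option, something along these lines (name, range, and default are all
still up for discussion):

    {
        {"executor_batch_tuples", PGC_USERSET, DEVELOPER_OPTIONS,
            gettext_noop("Sets the maximum number of tuples in an executor batch."),
            NULL,
            GUC_NOT_IN_SAMPLE
        },
        &executor_batch_tuples,
        64, 1, 1024,
        NULL, NULL, NULL
    },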

> >> As for what to add to explain, I'd start by adding info about which
> >> nodes are "batched" (consuming/producing batches), and some info about
> >> the batch sizes. An average size, maybe a histogram if you want to be a
> >> bit fancy.
> >
> > Adding more information to EXPLAIN is a good idea. In the current
> > patch, EXPLAIN does not show anything about batching, but it would be
> > very helpful for debugging and user transparency to indicate which
> > nodes are operating in batch mode.  I will update EXPLAIN to mark
> > nodes that produce or consume batches. Likely I’ll start with
> > something simple like an extra line or tag for a node, e.g., "Batch:
> > true (avg batch size 64)" or something along those lines. An average
> > batch size could be computed if we have instrumentation, which would
> > be useful to see if, say, the batch sizes ended up smaller due to
> > LIMIT or other factors. A full histogram might be more detail than
> > most users need, but I agree even just knowing average or maximum
> > batch size per node could be useful for performance analysis. I'll
> > implement at least the basics for now, and we can refine it (maybe add
> > more stats) if needed.
>
> +1 to start with something simple
>
> >
> > (I had added a flag in the EXPLAIN output at one point, but removed it
> > due to finding the regression output churn too noisy, though I
> > understand I'll have to bite the bullet at some point.)
> >
>
> Why would there be regression churn, if the option is disabled by default?

executor_batching is on by default in my patch, so a seq scan will
always use batching as long as no query features that prevent it are
present, which is true for a huge number of plans appearing in the
regression suite output.

> >> Now, numbers from some microbenchmarks:
> >>
> >> ...
> >>>> Perhaps I did something wrong. It does not surprise me this is somewhat
> >> CPU dependent. It's a bit sad the improvements are smaller for the newer
> >> CPU, though.
> >
> > Thanks for sharing your benchmark results -- that’s very useful data.
> > I haven’t yet finished investigating why there's a regression relative
> > to master when executor_batching is turned off. I re-ran my benchmarks
> > to include comparisons with master and did observe some regressions in
> > a few cases too, but I didn't see anything obvious in profiles that
> > explained the slowdown. I initially assumed it might be noise, but now
> > I suspect it could be related to structural changes in the scan code
> > -- for example, I added a few new fields in the middle of
> > HeapScanDescData, and even though the batching logic is bypassed when
> > executor_batching is off, it’s possible that change alone affects
> > memory layout or cache behavior in a way that penalizes the unbatched
> > path. I haven’t confirmed that yet, but it’s on my list to look into
> > more closely.
> >
> > Your observation that newer CPUs like the Ryzen may see smaller
> > improvements makes sense -- perhaps they handle the per-tuple overhead
> > more efficiently to begin with. Still, I’d prefer not to see
> > regressions at all, even in the unbatched case, so I’ll focus on
> > understanding and fixing that part before drawing conclusions from the
> > performance data.
> >
> > Thanks again for the scripts -- those will help a lot in narrowing things down.
>
> If needed, I can rerun the tests and collect additional information
> (e.g. maybe perf-stat or perf-diff would be interesting).

That would be nice to see if you have the time, but maybe after I post
a new version.

--
Thanks, Amit Langote



Re: Batching in executor

От
Daniil Davydov
Дата:
Hi,

As far as I understand, this work partially overlaps with what we did in the
thread [1] (in short - we introduce support for batching within the ModifyTable
node). Am I correct?

It's worth saying that the patch in that thread is already quite old --
I now have an improved implementation, tests, and performance
measurements for it. But the basic idea and design remain unchanged.

Maybe we can combine approaches? I haven't reviewed patches in this thread
yet, but I'll try to do it in the near future.

[1] https://www.postgresql.org/message-id/flat/CALj2ACVi9eTRYR%3Dgdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ%40mail.gmail.com

--
Best regards,
Daniil Davydov



Re: Batching in executor

От
Amit Langote
Дата:
Hi Daniil,

On Tue, Oct 28, 2025 at 11:32 PM Daniil Davydov <3danissimo@gmail.com> wrote:
>
> Hi,
>
> As far as I understand, this work partially overlaps with what we did in the
> thread [1] (in short - we introduce support for batching within the ModifyTable
> node). Am I correct?

There might be some relation, but not much overlap. The thread you
mention seems to focus on batching in the write path (for INSERT,
etc.), while this work targets batching in the read path via Table AM
scan callbacks. I think they can be developed independently, though
I'm happy to take a look.

--
Thanks, Amit Langote



Re: Batching in executor

От
Amit Langote
Дата:
On Tue, Oct 28, 2025 at 10:40 PM Amit Langote <amitlangote09@gmail.com> wrote:
> That would be nice to see if you have the time, but maybe after I post
> a new version.

I’ve created a CF entry marked WoA for this in the next CF under the
title “Batching in executor, part 1: add batch variant of table AM
scan API.” The idea is to track this piece separately so that later
parts can have their own entries and we don’t end up with a single
long-lived entry that never gets marked done. :-)

--
Thanks, Amit Langote



Re: Batching in executor

От
Daniil Davydov
Дата:
Hi,

On Wed, Oct 29, 2025 at 9:23 AM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Hi Daniil,
>
> On Tue, Oct 28, 2025 at 11:32 PM Daniil Davydov <3danissimo@gmail.com> wrote:
> >
> > Hi,
> >
> > As far as I understand, this work partially overlaps with what we did in the
> > thread [1] (in short - we introduce support for batching within the ModifyTable
> > node). Am I correct?
>
> There might be some relation, but not much overlap. The thread you
> mention seems to focus on batching in the write path (for INSERT,
> etc.), while this work targets batching in the read path via Table AM
> scan callbacks. I think they can be developed independently, though
> I'm happy to take a look.

Oh, I got it. Thanks!

I looked at the 0001-0003 patches and have some comments:
1)
I noticed that some nodes may set the SO_ALLOW_PAGEMODE flag to 'false'
during ExecReScan. heap_getnextslot handles this carefully -- it checks
whether pagemode is allowed on every call and falls back to
tuple-at-a-time mode if not. At the same time, heap_getnextbatch always
expects pagemode to be enabled. I didn't find any code path that can
make the assertion [1] fail. If such a code path is indeed unreachable,
maybe we should add a comment explaining why? (See also the sketch
after point 5.)

2)
heapgettup_pagemode_batch: do we really need to compute the lineindex
variable in this way?
***
            lineindex = scan->rs_cindex + dir;
            if (ScanDirectionIsForward(dir))
                linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
                    (scan->rs_ntuples - lineindex) : 0;
***

As far as I understand, this is enough:
***
        lineindex = scan->rs_cindex + dir;
        if (ScanDirectionIsForward(dir))
            linesleft = scan->rs_ntuples - lineindex;
***

3)
Is this code inside heapgettup_pagemode_batch necessary?
***
ScanDirectionIsForward(dir) ? 0 : 0
***

4)
heapgettup_pagemode has this change:
HeapTuple    tuple = &(scan->rs_ctup) ---> HeapTuple tuple = &scan->rs_ctup
I guess it was changed accidentally.

5)
I apologize for being tedious, but these braces are not in the
postgres style:
***
static const TupleBatchOps TupleBatchHeapOps = {
    .materialize_all = heap_materialize_batch_all
};
***
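
Regarding point 1, here is roughly what I had in mind -- either an
explicit guard like the sketch below, or simply a comment above the
Assert explaining why the case cannot arise (just an illustration; the
exact shape depends on your patch):
***
/*
 * Batch fetching relies on pagemode.  Callers that clear
 * SO_ALLOW_PAGEMODE (e.g. during a rescan) must either be kept off
 * this path or be sent through the tuple-at-a-time code, as
 * heap_getnextslot does.
 */
if (!(sscan->rs_flags & SO_ALLOW_PAGEMODE))
    elog(ERROR, "heap_getnextbatch called with pagemode disabled");
***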

[1] heap_getnextbatch : Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE)

--
Best regards,
Daniil Davydov