Discussion: unnecessary executor overheads around seqscans

unnecessary executor overheads around seqscans

From
Andres Freund
Date:
Hi,

In [1] I was looking at the profile of a seqscan with a where clause that
doesn't match any of the many rows.  I was a bit saddened by where we were
spending time.


- The fetching of variables, as well as the null check of scandesc, in
  SeqNext() is repeated in every loop iteration of ExecScanExtended(), despite
  that obviously not being required after the first iteration.

  We could perhaps address this by moving the check to the callers of
  ExecScanExtended() or by extending ExecScanExtended() to have an explicit
  beginscan callback that it calls once, up front.

  Or perhaps we could just make it so that the entire if (scandesc == NULL)
  branch isn't needed?


- The checkXidAlive checks that have been added to table_scan_getnextslot()
  show up noticeably and in every loop iteration, despite afaict never being reachable

  It's not obvious to me that this should
  a) be in table_scan_getnextslot(), rather than in beginscan - how could it
     change in the middle of a scan? That would require a wrapper around
     rd_tableam->scan_begin(), but that seems like it might be good anyway.
  b) not just be an assertion?


- The TupIsNull(slot) check in ExecScanExtended() is redundant with the return
  value of table_scan_getnextslot(), but the compiler doesn't grok that.

  We can use a pg_assume() in table_scan_getnextslot() to make the compiler
  understand.
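
  For reference, pg_assume() essentially promises a condition to the
  optimizer; it boils down to roughly the following shape (a sketch, see
  c.h for the real definition):

      /* rough sketch of pg_assume(), not the exact definition */
      #ifdef USE_ASSERT_CHECKING
      #define pg_assume(expr) Assert(expr)
      #else
      #define pg_assume(expr) \
          do { \
              if (!(expr)) \
                  pg_unreachable(); \
          } while (0)
      #endif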


- We repeatedly store the table oid in the slot, in table_scan_getnextslot()
  and then again in ExecStoreBufferHeapTuple(). This shows up in the profile.

  I wish we had made the slot a property of the scan, that way the scan could
  assume the slot already has the oid set...


- heap_getnextslot() calls ExecStoreBufferHeapTuple() and then returns
  true. That prevents the sibling call optimization.

  We should change ExecStoreBufferHeapTuple() to return true. Nobody uses the
  current return value. Alternatively we should consider just moving it to
  somewhere heapam.c/heapam_handler.c can see the implementations, they're the
  only ones that should use it anyway.
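
  For illustration, with a true-returning ExecStoreBufferHeapTuple() the
  tail of heap_getnextslot() could become a direct tail call (sketch,
  details elided):

      if (scan->rs_ctup.t_data == NULL)
      {
          ExecClearTuple(slot);
          return false;
      }

      pgstat_count_heap_getnext(scan->rs_base.rs_rd);

      /* tail position - the compiler can emit a jmp instead of call+ret */
      return ExecStoreBufferHeapTuple(&scan->rs_ctup, slot, scan->rs_cbuf);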


There's more, but these already provide a decent improvement.


Greetings,

Andres Freund

[1] https://postgr.es/m/rvlc7pb6zn4kydqovcqh72lf2qfcgs3qkj2seq7tcpvxyqwtqt%40nrvv6lpehwwa



Re: unnecessary executor overheads around seqscans

From
Amit Langote
Date:
Hi,

On Sat, Jan 24, 2026 at 5:16 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> In [1] I was looking at the profile of a seqscan with a where clause that
> doesn't match any of the many rows.  I was a bit saddened by where we were
> spending time.
>
>
> - The fetching of variables, as well as the null check of scandesc, in
>   SeqNext() is repeated in every loop iteration of ExecScanExtended, despite
>   that obviously not being required after the first iteration
>
>   We could perhaps address this by moving the check to the callers of
>   ExecScanExtended() or by extending ExecScanExtended to have an explicit
>   beginscan callback that it calls after.
>
>   Or perhaps we could just make it so that the entire if (scandesc == NULL)
>   branch isn't needed?

Kind of like ExecProcNodeFirst(), what if we replace the variant
selection in ExecInitSeqScan() with just:

    scanstate->ss.ps.ExecProcNode = ExecSeqScanFirst;

ExecSeqScanFirst() would:

- do the table_beginscan() call that's currently inside the if
(scandesc == NULL) block in SeqNext()

- select and install the appropriate ExecSeqScan variant based on
qual/projection/EPQ

- call that variant to fetch and return the first tuple.

Then we can just remove the if (scandesc == NULL) block from SeqNext() entirely.
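
In sketch form, with ChooseSeqScanVariant() standing in for the existing
variant selection logic (a made-up name):

    static TupleTableSlot *
    ExecSeqScanFirst(PlanState *pstate)
    {
        SeqScanState *node = castNode(SeqScanState, pstate);
        EState     *estate = node->ss.ps.state;

        /* what the if (scandesc == NULL) block in SeqNext() does today */
        node->ss.ss_currentScanDesc =
            table_beginscan(node->ss.ss_currentRelation,
                            estate->es_snapshot, 0, NULL);

        /* install the variant chosen based on qual/projection/EPQ ... */
        pstate->ExecProcNode = ChooseSeqScanVariant(node);

        /* ... and fetch the first tuple through it */
        return pstate->ExecProcNode(pstate);
    }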

> - The checkXidAlive checks that have been added to table_scan_getnextslot()
>   show up noticeably and in every loop iteration, despite afaict never being reachable
>
>   It's not obvious to me that this should
>   a) be in table_scan_getnextslot(), rather than in beginscan - how could it
>      change in the middle of a scan? That would require a wrapper around
>      rd_tableam->scan_begin(), but that seems like it might be good anyway.
>   b) not just be an assertion?

Haven't thought about this.

> - The TupIsNull(slot) check in ExecScanExtended() is redundant with the return
>   value of table_scan_getnextslot(), but the compiler doesn't grok that.
>
>   We can use a pg_assume() in table_scan_getnextslot() to make the compiler
>   understand.

Something like this?

    result = sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
    pg_assume(result == !TupIsNull(slot));
    return result;

I assume this relies on table_scan_getnextslot() being inlined into
ExecScanExtended()?

> - We repeatedly store the table oid in the slot, table_scan_getnextslot() and
>   then again in ExecStoreBufferHeapTuple(). This shows up in the profile.
>
>   I wish we had made the slot a property of the scan, that way the scan could
>   assume the slot already has the oid set...

I've noticed this when working on my batching patch. I set
tts_tableOid when creating the slots used in the batch themselves, so
the per-tuple assignment isn't needed.

> - heap_getnextslot() calls ExecStoreBufferHeapTuple() and then returns
>   true. That prevents the sibling call optimization.
>
>   We should change ExecStoreBufferHeapTuple() to return true. Nobody uses the
>   current return value. Alternatively we should consider just moving it to
>   somewhere heapam.c/heapam_handler.c can see the implementations, they're the
>   only ones that should use it anyway.

Makes sense. Changing ExecStoreBufferHeapTuple() to return true seems
like the simpler option, unless I misunderstood.

--
Thanks, Amit Langote



Re: unnecessary executor overheads around seqscans

From
David Rowley
Date:
On Sat, 24 Jan 2026 at 19:21, Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Jan 24, 2026 at 5:16 AM Andres Freund <andres@anarazel.de> wrote:
> >   Or perhaps we could just make it so that the entire if (scandesc == NULL)
> >   branch isn't needed?
>
> Kind of like ExecProcNodeFirst(), what if we replace the variant
> selection in ExecInitSeqScan() with just:

I imagined moving it to ExecInitSeqScan() and just avoiding doing it when
we're doing EXPLAIN or a parallel scan. Something like the
attached, which is giving me a 4% speedup selecting from a million row
table with a single int column, running a seqscan query with a WHERE
clause matching no rows.
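
Roughly along these lines (a sketch; the exact conditions are in the
attached patch):

    /* in ExecInitSeqScan(), after opening the relation */
    if (!(eflags & EXEC_FLAG_EXPLAIN_ONLY) &&
        !node->scan.plan.parallel_aware)
        scanstate->ss.ss_currentScanDesc =
            table_beginscan(scanstate->ss.ss_currentRelation,
                            estate->es_snapshot, 0, NULL);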

> >   We should change ExecStoreBufferHeapTuple() to return true. Nobody uses the
> >   current return value. Alternatively we should consider just moving it to
> >   somewhere heapam.c/heapam_handler.c can see the implementations, they're the
> >   only ones that should use it anyway.
>
> Makes sense. Changing ExecStoreBufferHeapTuple() to return true seems
> like the simpler option, unless I misunderstood.

It's probably too late to change it now, but wouldn't it have been
better if scan_getnextslot had been coded to return the TupleTableSlot
rather than bool? That way you could get the sibling call in
ExecStoreBufferHeapTuple() and in SeqNext().
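
I.e., hypothetically (a sketch, pagemode path only, relying on
ExecStoreBufferHeapTuple() already returning the slot today):

    static TupleTableSlot *
    heap_getnextslot(TableScanDesc sscan, ScanDirection direction,
                     TupleTableSlot *slot)
    {
        HeapScanDesc scan = (HeapScanDesc) sscan;

        heapgettup_pagemode(scan, direction, sscan->rs_nkeys, sscan->rs_key);

        if (scan->rs_ctup.t_data == NULL)
            return ExecClearTuple(slot);    /* empty slot, TupIsNull() is true */

        /* sibling call - the return types now match */
        return ExecStoreBufferHeapTuple(&scan->rs_ctup, slot, scan->rs_cbuf);
    }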

I also noticed my compiler does not inline SeqNext(). Adding a
pg_attribute_always_inline results in it getting inlined and gives a
small speedup.

David

Attachments

Re: unnecessary executor overheads around seqscans

From
Amit Kapila
Date:
On Sat, Jan 24, 2026 at 1:46 AM Andres Freund <andres@anarazel.de> wrote:
>
> - The checkXidAlive checks that have been added to table_scan_getnextslot()
>   show up noticeably and in every loop iteration, despite afaict never being reachable
>
>   It's not obvious to me that this should
>   a) be in table_scan_getnextslot(), rather than in beginscan - how could it
>      change in the middle of a scan? That would require a wrapper around
>      rd_tableam->scan_begin(), but that seems like it might be good anyway.
>   b) not just be an assertion?
>

IIRC, the main reason for having this precautionary check in the API
is to ensure that during logical decoding we never access the table AM
or heap APIs directly when scanning catalog tables. This restriction
exists because we only check for concurrent aborts inside the
systable_* APIs.

In practice, the core code should never hit this precautionary check,
so it could likely be converted into an assert. The reason it may not
have been an assert originally is to give an ERROR if an output plugin
incorrectly invokes these APIs during decoding to access catalogs as
noted in docs [1] (Note that access to user catalog tables or regular
system catalog tables in the output plugins has to be done via the
systable_* scan APIs only. Access via the heap_* scan APIs will error
out.).

I'll think some more about it.

[1] - https://www.postgresql.org/docs/devel/logicaldecoding-output-plugin-writing.html

--
With Regards,
Amit Kapila.



Re: unnecessary executor overheads around seqscans

From
Andres Freund
Date:
Hi,

On 2026-01-24 15:23:44 +0530, Amit Kapila wrote:
> On Sat, Jan 24, 2026 at 1:46 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > - The checkXidAlive checks that have been added to table_scan_getnextslot()
> >   show up noticeably and in every loop iteration, despite afaict never being reachable
> >
> >   It's not obvious to me that this should
> >   a) be in table_scan_getnextslot(), rather than in beginscan - how could it
> >      change in the middle of a scan? That would require a wrapper around
> >      rd_tableam->scan_begin(), but that seems like it might be good anyway.
> >   b) not just be an assertion?
> >
> 
> IIRC, the main reason for having this precautionary check in the API
> is to ensure that during logical decoding we never access the table AM
> or heap APIs directly when scanning catalog tables. This restriction
> exists because we only check for concurrent aborts inside the
> systable_* APIs.

I know why the check exists - but why does it have to be in
table_scan_getnextslot(), which is executed very frequently, rather than
table_beginscan*(), which is executed much less frequently.

Greetings,

Andres Freund



Re: unnecessary executor overheads around seqscans

From
Andres Freund
Date:
Hi,

On 2026-01-24 19:36:08 +1300, David Rowley wrote:
> On Sat, 24 Jan 2026 at 19:21, Amit Langote <amitlangote09@gmail.com> wrote:
> > On Sat, Jan 24, 2026 at 5:16 AM Andres Freund <andres@anarazel.de> wrote:
> > >   Or perhaps we could just make it so that the entire if (scandesc == NULL)
> > >   branch isn't needed?
> >
> > Kind of like ExecProcNodeFirst(), what if we replace the variant
> > selection in ExecInitSeqScan() with just:
> 
> I imagined moving it to ExecInitSeqScan() and just avoid doing it when
> we're doing EXPLAIN or we're doing a parallel scan. Something like the
> attached, which is giving me a 4% speedup selecting from a million row
> table with a single int column running a seqscan query with a WHERE
> clause matching no rows.

Yes, i think that'd be better. Nice!


> > >   We should change ExecStoreBufferHeapTuple() to return true. Nobody uses the
> > >   current return value. Alternatively we should consider just moving it to
> > >   somewhere heapam.c/heapam_handler.c can see the implementations, they're the
> > >   only ones that should use it anyway.
> >
> > Makes sense. Changing ExecStoreBufferHeapTuple() to return true seems
> > like the simpler option, unless I misunderstood.
> 
> It's probably too late to change it now, but wouldn't it have been
> better if scan_getnextslot had been coded to return the TupleTableSlot
> rather than bool? That way you could get the sibling call in
> ExecStoreBufferHeapTuple() and in SeqNext().

TupIsNull() is actually more expensive than a bool return value, because it
needs a memory fetch to complete... Except of course if we end up doing both.

SeqNext() shouldn't need a sibling call because it ought to be inlined. But:

> I also noticed my compiler does not inline SeqNext(). Adding a
> pg_attribute_always_inline results in it getting inlined and gives a
> small speedup.

Oh, that's not good. I think we really had assumed that it would with the v18
changes around this. It does here, but that's probably because I use -O3.


> diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
> index b8119face43..87420e60dc9 100644
> --- a/src/backend/executor/nodeSeqscan.c
> +++ b/src/backend/executor/nodeSeqscan.c
> @@ -63,17 +63,6 @@ SeqNext(SeqScanState *node)
>      direction = estate->es_direction;
>      slot = node->ss.ss_ScanTupleSlot;
>  
> -    if (scandesc == NULL)
> -    {
> -        /*
> -         * We reach here if the scan is not parallel, or if we're serially
> -         * executing a scan that was planned to be parallel.
> -         */
> -        scandesc = table_beginscan(node->ss.ss_currentRelation,
> -                                   estate->es_snapshot,
> -                                   0, NULL);
> -        node->ss.ss_currentScanDesc = scandesc;
> -    }

What about the "if we're serially executing a scan that was planned to be
parallel." part? I frankly don't really know what precisely was referencing...


Greetings,

Andres Freund



Re: unnecessary executor overheads around seqscans

From
Andres Freund
Date:
Hi,

On 2026-01-24 15:21:22 +0900, Amit Langote wrote:
> On Sat, Jan 24, 2026 at 5:16 AM Andres Freund <andres@anarazel.de> wrote:
> > - The TupIsNull(slot) check in ExecScanExtended() is redundant with the return
> >   value of table_scan_getnextslot(), but the compiler doesn't grok that.
> >
> >   We can use a pg_assume() in table_scan_getnextslot() to make the compiler
> >   understand.
> 
> Something like this?
> 
>     result = sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
>     pg_assume(result == !TupIsNull(slot));
>     return result;

Yep, that seems to clue at least gcc into understanding the situation. I only
tried a pg_assume(!found || !TupIsNull(slot)), but yours should probably
also work.


> I assume this relies on table_scan_getnextslot() being inlined into
> ExecScanExtended()?

Yes. But I think that that's a good bet, given that it's an inline function
and only used once in nodeSeqscan.c.


> > - heap_getnextslot() calls ExecStoreBufferHeapTuple() and then returns
> >   true. That prevents the sibling call optimization.
> >
> >   We should change ExecStoreBufferHeapTuple() to return true. Nobody uses the
> >   current return value. Alternatively we should consider just moving it to
> >   somewhere heapam.c/heapam_handler.c can see the implementations, they're the
> >   only ones that should use it anyway.
> 
> Makes sense. Changing ExecStoreBufferHeapTuple() to return true seems
> like the simpler option, unless I misunderstood.

Yea. There's probably more to be gained by the other approach, but it's also
somewhat painful due to some functionality being private to execTuples.c right
now.

Greetings,

Andres Freund



Re: unnecessary executor overheads around seqscans

From
David Rowley
Date:
On Sun, 25 Jan 2026 at 04:36, Andres Freund <andres@anarazel.de> wrote:
> On 2026-01-24 19:36:08 +1300, David Rowley wrote:
> > diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
> > index b8119face43..87420e60dc9 100644
> > --- a/src/backend/executor/nodeSeqscan.c
> > +++ b/src/backend/executor/nodeSeqscan.c
> > @@ -63,17 +63,6 @@ SeqNext(SeqScanState *node)
> >       direction = estate->es_direction;
> >       slot = node->ss.ss_ScanTupleSlot;
> >
> > -     if (scandesc == NULL)
> > -     {
> > -             /*
> > -              * We reach here if the scan is not parallel, or if we're serially
> > -              * executing a scan that was planned to be parallel.
> > -              */
> > -             scandesc = table_beginscan(node->ss.ss_currentRelation,
> > -                                                                estate->es_snapshot,
> > -                                                                0, NULL);
> > -             node->ss.ss_currentScanDesc = scandesc;
> > -     }
>
> What about the "if we're serially executing a scan that was planned to be
> parallel." part? I frankly don't really know what precisely was referencing...

I think that comment is wrong and that maybe it worked that way at
some point when Parallel Seq Scan was in development, but it doesn't
seem to work that way today, and does not appear to have when Parallel
Seq Scan was committed. It's pretty easy to see that code isn't
triggered just by setting max_parallel_workers to 0 and running a query
which triggers a Parallel Seq Scan. I think that's what "serially
executing a scan that was planned to be parallel" means, right? The
debug_parallel_query = on case uses a non-parallel Seq Scan.

One user-visible side effect of initialising the TableScanDesc during
plan initialisation rather than on the first row is that the stats
will record the Seq Scan even if it's never executed in the plan.
Currently what we do there is somewhat inconsistent as with a Parallel
Seq Scan we'll count the seq scan stat if the Gather node is executed,
regardless of if the Seq Scan node gets executed: ExecGather ->
ExecInitParallelPlan -> ExecParallelInitializeDSM ->
ExecSeqScanInitializeDSM -> table_beginscan_parallel. But with a
non-parallel Seq Scan, the scan will be tracked after fetching the
first row.

Added Robert to see if he remembers about the comment or can shed some
light on it.

David



Re: unnecessary executor overheads around seqscans

From
David Rowley
Date:
On Sun, 25 Jan 2026 at 04:36, Andres Freund <andres@anarazel.de> wrote:
> On 2026-01-24 19:36:08 +1300, David Rowley wrote:
> > I also noticed my compiler does not inline SeqNext(). Adding a
> > pg_attribute_always_inline results in it getting inlined and gives a
> > small speedup.
>
> Oh, that's not good. I think we really had assumed that it would with the v18
> changes around this. It does here, but that's probably because I use -O3.

To reduce the variables here, I've pushed a fix for that after a quick
test showed a 3.9% speedup on a 1 million row table with a single int4
column filtering out all rows. I noticed that clang also didn't inline
with -O2. It does now.

David



Re: unnecessary executor overheads around seqscans

From
David Rowley
Date:
On Sat, 24 Jan 2026 at 19:21, Amit Langote <amitlangote09@gmail.com> wrote:
> On Sat, Jan 24, 2026 at 5:16 AM Andres Freund <andres@anarazel.de> wrote:
> >   We can use a pg_assume() in table_scan_getnextslot() to make the compiler
> >   understand.
>
> Something like this?
>
>     result = sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
>     pg_assume(result == !TupIsNull(slot));
>     return result;
>
> I assume this relies on table_scan_getnextslot() being inlined into
> ExecScanExtended()?

I looked at the objdump of this, and it does seem to get rid of the extra check.

I did:

cd src/backend/executor
gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wimplicit-fallthrough=3
-Wcast-function-type -Wshadow=compatible-local -Wformat-security
-Wmissing-variable-declarations -fno-strict-aliasing -fwrapv
-fexcess-precision=standard -Wno-format-truncation
-Wno-stringop-truncation -O2 -I../../../src/include -D_GNU_SOURCE -g
-c nodeSeqscan.c
objdump -d -M intel -S nodeSeqscan.o > nodeSeqscan_pg_assume.s

Looking at ExecSeqScanWithQual, I see:

master:
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 22c: 48 8b 07              mov    rax,QWORD PTR [rdi]
 22f: 4c 89 f2              mov    rdx,r14
 232: 44 89 fe              mov    esi,r15d
 235: 48 8b 80 40 01 00 00 mov    rax,QWORD PTR [rax+0x140]
 23c: ff 50 28              call   QWORD PTR [rax+0x28]
if (table_scan_getnextslot(scandesc, direction, slot))
 23f: 84 c0                test   al,al <-- *** this test and the
subsequent jump-if-equal are removed
 241: 74 6d                je     2b0 <ExecSeqScanWithQual+0x100>
if (TupIsNull(slot))
 243: 41 f6 46 04 02        test   BYTE PTR [r14+0x4],0x2
 248: 75 69                jne    2b3 <ExecSeqScanWithQual+0x103>
 24a: 48 8b 45 28          mov    rax,QWORD PTR [rbp+0x28]

And with your pg_assume added:
result = sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 22c: 48 8b 07              mov    rax,QWORD PTR [rdi]
 22f: 4c 89 f2              mov    rdx,r14
 232: 44 89 fe              mov    esi,r15d
 235: 48 8b 80 40 01 00 00 mov    rax,QWORD PTR [rax+0x140]
 23c: ff 50 28              call   QWORD PTR [rax+0x28]
pg_assume(result == !TupIsNull(slot));
 23f: 41 f6 46 04 02        test   BYTE PTR [r14+0x4],0x2
 244: 75 62                jne    2a8 <ExecSeqScanWithQual+0xf8>
 246: 48 8b 45 28          mov    rax,QWORD PTR [rbp+0x28]

I didn't test the performance.

David



Re: unnecessary executor overheads around seqscans

From
Amit Langote
Date:
On Mon, Jan 26, 2026 at 12:03 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Sat, 24 Jan 2026 at 19:21, Amit Langote <amitlangote09@gmail.com> wrote:
> > On Sat, Jan 24, 2026 at 5:16 AM Andres Freund <andres@anarazel.de> wrote:
> > >   We can use a pg_assume() in table_scan_getnextslot() to make the compiler
> > >   understand.
> >
> > Something like this?
> >
> >     result = sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
> >     pg_assume(result == !TupIsNull(slot));
> >     return result;
> >
> > I assume this relies on table_scan_getnextslot() being inlined into
> > ExecScanExtended()?
>
> I looked at the objdump of this, and it does seem to get rid of the extra check.
>
> I did:
>
> cd src/backend/executor
> gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Werror=vla -Wendif-labels
> -Wmissing-format-attribute -Wimplicit-fallthrough=3
> -Wcast-function-type -Wshadow=compatible-local -Wformat-security
> -Wmissing-variable-declarations -fno-strict-aliasing -fwrapv
> -fexcess-precision=standard -Wno-format-truncation
> -Wno-stringop-truncation -O2 -I../../../src/include -D_GNU_SOURCE -g
> -c nodeSeqscan.c
> objdump -d -M intel -S nodeSeqscan.o > nodeSeqscan_pg_assume.s
>
> Looking at ExecSeqScanWithQual, I see:
>
> master:
> return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
>  22c: 48 8b 07              mov    rax,QWORD PTR [rdi]
>  22f: 4c 89 f2              mov    rdx,r14
>  232: 44 89 fe              mov    esi,r15d
>  235: 48 8b 80 40 01 00 00 mov    rax,QWORD PTR [rax+0x140]
>  23c: ff 50 28              call   QWORD PTR [rax+0x28]
> if (table_scan_getnextslot(scandesc, direction, slot))
>  23f: 84 c0                test   al,al <-- *** this test and the
> subsequent jump-if-equal are removed
>  241: 74 6d                je     2b0 <ExecSeqScanWithQual+0x100>
> if (TupIsNull(slot))
>  243: 41 f6 46 04 02        test   BYTE PTR [r14+0x4],0x2
>  248: 75 69                jne    2b3 <ExecSeqScanWithQual+0x103>
>  24a: 48 8b 45 28          mov    rax,QWORD PTR [rbp+0x28]
>
> And with your pg_assume added:
> result = sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
>  22c: 48 8b 07              mov    rax,QWORD PTR [rdi]
>  22f: 4c 89 f2              mov    rdx,r14
>  232: 44 89 fe              mov    esi,r15d
>  235: 48 8b 80 40 01 00 00 mov    rax,QWORD PTR [rax+0x140]
>  23c: ff 50 28              call   QWORD PTR [rax+0x28]
> pg_assume(result == !TupIsNull(slot));
>  23f: 41 f6 46 04 02        test   BYTE PTR [r14+0x4],0x2
>  244: 75 62                jne    2a8 <ExecSeqScanWithQual+0xf8>
>  246: 48 8b 45 28          mov    rax,QWORD PTR [rbp+0x28]
>
> I didn't test the performance.

Thanks for sharing that.

I tried my patch over your committed SeqNext inlining patch and ran
the following benchmark but didn't notice any material difference:

CREATE TABLE t (a int);
INSERT INTO t SELECT generate_series(1, 1000000);
ANALYZE t;
SET max_parallel_workers_per_gather = 0;
SELECT * FROM t WHERE a = -1;

Perhaps not too surprising given it's just eliminating a couple of
instructions per row that the branch predictor probably handles well
anyway? Still seems worth having for code hygiene if nothing else.

Same result (no diff in perf) when I apply it over your patch to move
the scandesc == NULL check.

--
Thanks, Amit Langote



Re: unnecessary executor overheads around seqscans

From
Amit Kapila
Date:
On Sat, Jan 24, 2026 at 9:01 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2026-01-24 15:23:44 +0530, Amit Kapila wrote:
> > On Sat, Jan 24, 2026 at 1:46 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > - The checkXidAlive checks that have been added to table_scan_getnextslot()
> > >   show up noticeably and in every loop iteration, despite afaict never being reachable
> > >
> > >   It's not obvious to me that this should
> > >   a) be in table_scan_getnextslot(), rather than in beginscan - how could it
> > >      change in the middle of a scan? That would require a wrapper around
> > >      rd_tableam->scan_begin(), but that seems like it might be good anyway.
> > >   b) not just be an assertion?
> > >
> >
> > IIRC, the main reason for having this precautionary check in the API
> > is to ensure that during logical decoding we never access the table AM
> > or heap APIs directly when scanning catalog tables. This restriction
> > exists because we only check for concurrent aborts inside the
> > systable_* APIs.
>
> I know why the check exists - but why does it have to be in
> table_scan_getnextslot(), which is executed very frequently, rather than
> table_beginscan*(), which is executed much less frequently.
>

I thought about this point and couldn't think of any reason why this
check can't be in table_beginscan*(). I think your idea of having a
wrapper around scan_begin() to handle this check is a good one.

--
With Regards,
Amit Kapila.



Re: unnecessary executor overheads around seqscans

From
Heikki Linnakangas
Date:
On 23/01/2026 22:16, Andres Freund wrote:
> Hi,
> 
> In [1] I was looking at the profile of a seqscan with a where clause that
> doesn't match any of the many rows.  I was a bit saddened by where we were
> spending time.
> 
> 
> - The fetching of variables, as well as the null check of scandesc, in
>    SeqNext() is repeated in every loop iteration of ExecScanExtended, despite
>    that obviously not being required after the first iteration
> 
>    We could perhaps address this by moving the check to the callers of
>    ExecScanExtended() or by extending ExecScanExtended to have an explicit
>    beginscan callback that it calls after.

For context, we're talking about this in SeqNext:

>     /*
>      * get information from the estate and scan state
>      */
>     scandesc = node->ss.ss_currentScanDesc;
>     estate = node->ss.ps.state;
>     direction = estate->es_direction;
>     slot = node->ss.ss_ScanTupleSlot;

Hmm. I guess the compiler doesn't know that the variables don't change 
between calls, so it has to fetch them on every iteration. Passing them 
through a 'const' pointer might give it a clue, but I'm not sure how to
shoehorn that here.

Perhaps we should turn the ExecScanExtended() function inside out. 
Instead of passing SeqNext as a callback to ExecScanExtended(), we would 
have a function like this (for illustration purposes only, doesn't compile):

> /* ----------------------------------------------------------------
>  *        ExecSeqScanInline
>  *
>  *        This is a workhorse for ExecSeqScan. It's inlined into
>  *        specialized implementations for cases where epqstate, qual,
>  *        projInfo are NULL or not.
>  * ----------------------------------------------------------------
>  */
> static pg_attribute_always_inline TupleTableSlot *
> ExecSeqScanInline(SeqScanState *node,
>                   EPQState *epqstate,
>                   ExprState *qual,
>                   ProjectionInfo *projInfo)
> {
>     /*
>      * get information from the estate and scan state
>      */
>     scandesc = node->ss.ss_currentScanDesc;
>     estate = node->ss.ps.state;
>     direction = estate->es_direction;
>     slot = node->ss.ss_ScanTupleSlot;
> 
>     if (scandesc == NULL)
>     {
>         /*
>          * We reach here if the scan is not parallel, or if we're serially
>          * executing a scan that was planned to be parallel.
>          */
>         scandesc = table_beginscan(node->ss.ss_currentRelation,
>                                    estate->es_snapshot,
>                                    0, NULL);
>         node->ss.ss_currentScanDesc = scandesc;
>     }
> 
>     for (;;)
>     {
>         /* do ExecScanFetch(), except for the call to SeqScanNext */
>         slot = ExecScanFetchEPQ(...);
> 
>         /*
>          * get the next tuple from the table
>          */
>         if (slot == NULL)
>             slot = table_scan_getnextslot(scandesc, direction, slot);
> 
>         if (slot == NULL)
>             break;
> 
>         /*
>          * Does it pass the quals? (This does the parts of ExecScanExtended()
>          * after the ExecScanFetch() call)
>          */
>         slot = ExecScanProjectAndFilter(&node->ss, slot, qual, projInfo);
>         if (slot)
>             return slot;
>     }
>     return NULL;
> }

- Heikki



Re: unnecessary executor overheads around seqscans

From
Andres Freund
Date:
On 2026-01-26 14:32:50 +1300, David Rowley wrote:
> On Sun, 25 Jan 2026 at 04:36, Andres Freund <andres@anarazel.de> wrote:
> > On 2026-01-24 19:36:08 +1300, David Rowley wrote:
> > > I also noticed my compiler does not inline SeqNext(). Adding a
> > > pg_attribute_always_inline results in it getting inlined and gives a
> > > small speedup.
> >
> > Oh, that's not good. I think we really had assumed that it would with the v18
> > changes around this. It does here, but that's probably because I use -O3.
> 
> To reduce the variables here, I've pushed a fix for that after a quick
> test showed a 3.9% speedup on a 1 million row table with a single int4
> column filtering out all rows. I noticed that clang also didn't inline
> with -O2. It does now.

Thanks!



Re: unnecessary executor overheads around seqscans

From
Andres Freund
Date:
Hi,

On 2026-01-26 16:47:31 +0900, Amit Langote wrote:
> I tried my patch over your committed SeqNext inlining patch and ran
> the following benchmark but didn't notice any material difference:
> 
> CREATE TABLE t (a int);
> INSERT INTO t SELECT generate_series(1, 1000000);
> ANALYZE t;

Because the table isn't frozen, visibility checks will probably add enough
per-row overhead to make any per-row micro-optimization harder to see.  On my
somewhat older workstation freezing is a 17% improvement.


> SET max_parallel_workers_per_gather = 0;
> SELECT * FROM t WHERE a = -1;
> 
> Perhaps not too surprising given it's just eliminating a couple of
> instructions per row that the branch predictor probably handles well
> anyway? Still seems worth having for code hygiene if nothing else.
> 
> Same result (no diff in perf) when I apply it over your patch to move
> the scandesc == NULL check.

FWIW, on my Cascade Lake workstation it's a surprisingly large 3.5%, after
freezing. Without freezing there may still be a difference, but it's very
close to the noise floor.

Greetings,

Andres Freund



Re: unnecessary executor overheads around seqscans

From
Andres Freund
Date:
Hi,

On 2026-01-26 17:23:16 +0200, Heikki Linnakangas wrote:
> On 23/01/2026 22:16, Andres Freund wrote:
> > Hi,
> >
> > In [1] I was looking at the profile of a seqscan with a where clause that
> > doesn't match any of the many rows.  I was a bit saddened by where we were
> > spending time.
> >
> >
> > - The fetching of variables, as well as the null check of scandesc, in
> >    SeqNext() is repeated in every loop iteration of ExecScanExtended, despite
> >    that obviously not being required after the first iteration
> >
> >    We could perhaps address this by moving the check to the callers of
> >    ExecScanExtended() or by extending ExecScanExtended to have an explicit
> >    beginscan callback that it calls after.
>
> For context, we're talking about this in SeqNext:
>
> >     /*
> >      * get information from the estate and scan state
> >      */
> >     scandesc = node->ss.ss_currentScanDesc;
> >     estate = node->ss.ps.state;
> >     direction = estate->es_direction;
> >     slot = node->ss.ss_ScanTupleSlot;
>
> Hmm. I guess the compiler doesn't know that the variables don't change
> between calls, so it has to fetch them on every iteration. Passing them
> through a 'const' pointer might give it clue, but I'm not sure how to
> shoehorn that here.

My understanding is that compilers very rarely utilize const on pointers for
optimization. The problem is that just because the current pointer is const
doesn't mean there aren't other *non-const* pointers. And since there are a
lot of external function calls involved here, there's no way the compiler
could prove that that isn't the case :(.


> Perhaps we should turn the ExecScanExtended() function inside out. Instead
> of passing SeqNext as a callback to ExecScanExtended(), we would have a
> function like this (for illustration purposes only, doesn't compile):

That would be one approach, would require structural changes in a fair number
of places though :/.

A slightly simpler approach could be for ExecScanExtended to pass in these
parameters as arguments to the callbacks. For things like estate, direction
and scanslot, that makes plenty of sense. It's a bit more problematic for the
scan descriptor, due to the "lazy start" we have in a few places.

I very briefly prototyped that (relying on the fact that all callers cast to
the callback type and that passing unused arguments just works even if the
function definition doesn't expect them), and that seems to do the trick.
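
For illustration, the widened callback could look roughly like this (the
argument selection is illustrative):

    /* hypothetical callback type with the hot variables passed in */
    typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node,
                                                  EState *estate,
                                                  ScanDirection direction,
                                                  TupleTableSlot *slot);

    /* SeqNext() then only re-fetches the scan descriptor from the node,
     * which is needed due to the lazy start */
    static TupleTableSlot *
    SeqNext(SeqScanState *node, EState *estate,
            ScanDirection direction, TupleTableSlot *slot)
    {
        if (table_scan_getnextslot(node->ss.ss_currentScanDesc,
                                   direction, slot))
            return slot;
        return NULL;
    }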

What shows up more visibly afterwards is that ExecScanExtended() sets
econtext->ecxt_scantuple in every iteration, despite that not changing in
almost all cases (I think for FDWs it could; fdwhandler.sgml just says that
the scanslot "should" be used).

Greetings,

Andres Freund



Re: unnecessary executor overheads around seqscans

From
Amit Langote
Date:
On Tue, Jan 27, 2026 at 3:08 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-01-26 16:47:31 +0900, Amit Langote wrote:
> > I tried my patch over your committed SeqNext inlining patch and ran
> > the following benchmark but didn't notice any material difference:
> >
> > CREATE TABLE t (a int);
> > INSERT INTO t SELECT generate_series(1, 1000000);
> > ANALYZE t;
>
> Because the table isn't frozen, visibility checks will probably add enough
> per-row overhead to make any per-row micro-optimization harder to see.  On my
> somewhat older workstation freezing is a 17% improvement.
>
> > SET max_parallel_workers_per_gather = 0;
> > SELECT * FROM t WHERE a = -1;
> >
> > Perhaps not too surprising given it's just eliminating a couple of
> > instructions per row that the branch predictor probably handles well
> > anyway? Still seems worth having for code hygiene if nothing else.
> >
> > Same result (no diff in perf) when I apply it over your patch to move
> > the scandesc == NULL check.
>
> FWIW, on my cascade lake workstation it's a, surprisingly large, 3.5%, after
> freezing. Without freezing there maybe still is a difference, but it's very
> close to the noise floor.

I did freeze but still don't see a measurable difference.  Though, I
tested on a VM, so the noise floor is probably higher than on your
bare metal workstation.  I'll try later on bare metal.

--
Thanks, Amit Langote



Re: unnecessary executor overheads around seqscans

From
Heikki Linnakangas
Date:
On 27/01/2026 01:25, Andres Freund wrote:
> On 2026-01-26 17:23:16 +0200, Heikki Linnakangas wrote:
>> Perhaps we should turn the ExecScanExtended() function inside out. Instead
>> of passing SeqNext as a callback to ExecScanExtended(), we would have a
>> function like this (for illustration purposes only, doesn't compile):
> 
> That would be one approach, would require structural changes in a fair number
> of places though :/.

ExecScanExtended is only used in execScan.c and nodeSeqscan.c, so not
that many places. Even replacing ExecScan() completely would be isolated 
to src/backend/executor.

> A slightly simpler approach could be for ExecScanExtended to pass in these
> parameters as arguments to the callbacks. For things like estate, direction
> and scanslot, that makes plenty sense. It's a bit more problematic for the
> scan descriptor, due to the "lazy start" we have in a few places.

Yep, it's less flexible. We have to know beforehand which variables the 
function might want to avoid re-fetching, and add them all as parameters.

> I very briefly prototyped that (relying on the fact that all callers cast to
> the callback type and that passing unused arguments just works even if the
> function definition doesn't expect them), and that seems to do the trick.

This can be a very hot codepath so if we can shave a few % off, I'm 
willing to hold my nose. But if we can do it more elegantly, even better...

> What shows up more visibly afterwards is that ExecScanExtended() sets
> econtext->ecxt_scantuple in every iteration, despite that not changing in
> almost all cases (I think for FDWs it could; fdwhandler.sgml just says that
> the scanslot "should" be used).

Huh, that's just a single store instruction. This really is a hot path. 
Or is it just that the first instruction after the function call causes 
some kind of a stall or data dependency, i.e. if that was removed, it'd 
just move to the next instruction instead?

I wonder about the access to (econtext)->ecxt_per_tuple_memory. That 
never changes, but we're re-fetching that too on every iteration.

InstrCountFiltered1() also fetches node->instrument, with a NULL check, 
on every iteration. Maybe keep a counter in a local variable and only 
call InstrCountFiltered1() when exiting the loop.
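
Something like this, sketched loosely against ExecScanExtended()'s loop
(names approximate):

    uint64      nfiltered = 0;

    for (;;)
    {
        TupleTableSlot *slot = ExecScanFetch(node, epqstate,
                                             accessMtd, recheckMtd);

        if (TupIsNull(slot))
            break;

        econtext->ecxt_scantuple = slot;
        if (qual == NULL || ExecQual(qual, econtext))
            break;              /* tuple passed, leave the loop */

        nfiltered++;            /* plain local increment, no NULL check */
        ResetExprContext(econtext);
    }

    /* flush the count (and test node->instrument) only once */
    InstrCountFiltered1(node, nfiltered);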

I guess these don't show up in a profiler or you would've latched on 
them already..

- Heikki




Re: unnecessary executor overheads around seqscans

From
Dilip Kumar
Date:
On Mon, Jan 26, 2026 at 5:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jan 24, 2026 at 9:01 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2026-01-24 15:23:44 +0530, Amit Kapila wrote:
> > > On Sat, Jan 24, 2026 at 1:46 AM Andres Freund <andres@anarazel.de> wrote:
> > > >
> > > > - The checkXidAlive checks that have been added to table_scan_getnextslot()
> > > >   show up noticeably and in every loop iteration, despite afaict never being reachable
> > > >
> > > >   It's not obvious to me that this should
> > > >   a) be in table_scan_getnextslot(), rather than in beginscan - how could it
> > > >      change in the middle of a scan? That would require a wrapper around
> > > >      rd_tableam->scan_begin(), but that seems like it might be good anyway.
> > > >   b) not just be an assertion?
> > > >
> > >
> > > IIRC, the main reason for having this precautionary check in the API
> > > is to ensure that during logical decoding we never access the table AM
> > > or
> > > heap APIs directly when scanning catalog tables. This restriction
> > > exists because we only check for concurrent aborts inside the
> > > systable_* APIs.
> >
> > I know why the check exists - but why does it have to be in
> > table_scan_getnextslot(), which is executed very frequently, rather than
> > table_beginscan*(), which is executed much less frequently.
> >
>
> I thought about this point and couldn't think of any reason why this
> check can't be in table_beginscan*(). I think your idea of having a
> wrapper around scan_begin() to handle this check is a good one.

Here is the patch. I've used table_scan_begin_wrapper() to wrap the
scan_begin() callback for now. If you have a better naming preference
to avoid the 'wrapper' suffix, please let me know.

--
Regards,
Dilip Kumar
Google

Attachments

Re: unnecessary executor overheads around seqscans

From
Andres Freund
Date:
Hi,

On 2026-01-27 12:16:15 +0200, Heikki Linnakangas wrote:
> On 27/01/2026 01:25, Andres Freund wrote:
> > On 2026-01-26 17:23:16 +0200, Heikki Linnakangas wrote:
> > > Perhaps we should turn the ExecScanExtended() function inside out. Instead
> > > of passing SeqNext as a callback to ExecScanExtended(), we would have a
> > > function like this (for illustration purposes only, doesn't compile):
> >
> > That would be one approach, would require structural changes in a fair number
> > of places though :/.
>
> ExecScanExtended is only used in execScan.c and nodeSeqScan.c, so not that
> many places. Even replacing ExecScan() completely would be isolated to
> src/backend/executor.

I was assuming we'd not want a totally different structure for nodeSeqscan.c
than for the rest. I think we really ought to use ExecScanExtended
everywhere. Probably not with multiple callsites (for no qual, no projection
etc), but to get rid of the indirect function calls to "accessMtd".

> > What shows up more visibly afterwards is that ExecScanExtended() sets
> > econtext->ecxt_scantuple in every iteration, despite that not changing in
> > almost all cases (I think for FDWs it could; fdwhandler.sgml just says that
> > the scanslot "should" be used).
>
> Huh, that's just a single store instruction. This really is a hot path. Or
> is it just that the first instruction after the function call causes some
> kind of a stall or data dependency, i.e. if that was removed, it'd just move
> to the next instruction instead?

I was a bit surprised too. It might be a case of some store forwarding issue,
because very shortly afterwards we end up reading ecxt_scantuple (at the start
of expression evaluation). Or we're just hitting limits on the number of
in-flight stores.


> I wonder about the access to (econtext)->ecxt_per_tuple_memory. That never
> changes, but we're re-fetching that too on every iteration.

The load itself didn't show up crazily, but there's something related that
does show up quite prominently: ResetExprContext().  I briefly hacked up an
inline version of MemoryContextReset(), and that does help noticeably.
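
Roughly along these lines (MemoryContextResetFast() being a made-up name):

    /* sketch: inline fast path for the common already-reset case */
    static inline void
    MemoryContextResetFast(MemoryContext context)
    {
        if (context->isReset && context->firstchild == NULL)
            return;             /* nothing allocated, no children: no call */
        MemoryContextReset(context);
    }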

Another context related thing that does show up is all the switcheroo around
CurrentMemoryContext.


> InstrCountFiltered1() also fetches node->instrument, with a NULL check, on
> every iteration. Maybe keep a counter in a local variable and only call
> InstrCountFiltered1() when exiting the loop.


> I guess these don't show up in a profiler or you would've latched on them
> already..

It does show up, but only really after addressing the other things... There
are plenty more things that show up, fwiw:

- Expression evaluation checks for NULL function args, even though we could
  know that they should not be NULL (this is complicated by the fact that
  sometimes we build "illegal" tuples, before constraint checking). That shows
  up surprisingly high.

  I think if we added information about whether NOT NULL is reliable in
  tupledescs it could solve this.

  While it'd allow optimizing some cases, it'd still be limited due to strict
  function calls not being guaranteed to return NOT NULL.


- heap_getnextslot()'s calls to heapgettup_pagemode() should be inlined


- we probably should eventually inline ExecStoreBufferHeapTuple(),
  that way we would only need to handle the "buffer has changed" paths when
  heapgettup_pagemode() switches pages


- Expression evaluation has too high a startup overhead. This is partially due
  to it preparing to access all types of slots, which most of the time aren't
  needed.

  This would be a lot easier if we didn't need to go through the indirection
  via ExprContext to find the right scan/outer/... slot during every
  evaluation. But unfortunately there are some callsites that do change the
  slot between executions.  Perhaps we could make it a per-expression
  operation to change the slot of an expression, which would update state
  inside the ExprState?


- The access to columns in slots is too expensive during expression
  evaluation. We go from the slot, via an indirect memory access, to the
  values array, then have to fetch the right element there. Then do the same
  for nulls.

  I think to really do better we'd need to do two things:

  - Use one NullableDatum array instead of two separate arrays for values and
    isnull (see the sketch after this item).

  - Store the array in place in the slot, as a flexible array member. That
    makes storing of per-slot-type data harder though.  We probably would need
    to embed the "common" part of the slot at the end of the custom part of
    the slot, which is more complicated...

  That's a lot of work, I tried it before... There's a fair bit of fallout due
  to all the places that do stuff like forming tuples based on
  tts_values/tts_isnull.

  I do wonder if it could already help to just store the offset to the values
  and isnull arrays inside the slot, instead of having to read pointers and
  then do address calculation based on those.
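
  To make the NullableDatum part concrete (the struct exists in postgres.h
  today; tts_datums as a flexible array member is the hypothetical part):

      typedef struct NullableDatum
      {
          Datum       value;
          bool        isnull;
      } NullableDatum;

      /* one indexed load per column, instead of two pointer chases
       * through tts_values and tts_isnull */
      NullableDatum att = slot->tts_datums[attnum - 1];

      if (!att.isnull)
          datum = att.value;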


- The function call interface is way too expensive for simple operators

  We need a simplified function call interface that doesn't shuffle everything
  through pointers...


Greetings,

Andres Freund



Re: unnecessary executor overheads around seqscans

From
Andres Freund
Date:
Hi,

On 2026-01-27 16:40:55 +0530, Dilip Kumar wrote:
> > > I know why the check exists - but why does it have to be in
> > > table_scan_getnextslot(), which is executed very frequently, rather than
> > > table_beginscan*(), which is executed much less frequently.
> > >
> >
> > I thought about this point and couldn't think of any reason why this
> > check can't be in table_beginscan*(). I think your idea of having a
> > wrapper around scan_begin() to handle this check is a good one.
> 
> Here is the patch. I've used table_scan_begin_wrapper() to wrap the
> scan_begin() callback for now. If you have a better naming preference
> to avoid the 'wrapper' suffix, please let me know.

I'd probably just go for _internal(), _impl() or such.



> diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
> index 87491796523..15a92c052d3 100644
> --- a/src/backend/access/table/tableam.c
> +++ b/src/backend/access/table/tableam.c
> [...]
> +/*
> + * table_scan_begin_wrapper
> + *
> + * A wrapper around the Table Access Method scan_begin callback. This handles
> + * centralized error checking—specifically ensuring we aren't performing
> + * table scan while CheckXidAlive is valid.  This state is reserved for
> + * specific logical decoding operations where direct relation scanning is
> + * prohibited.
> + */
> +TableScanDesc
> +table_scan_begin_wrapper(Relation rel, Snapshot snapshot, int nkeys,
> +                         ScanKeyData *key, ParallelTableScanDesc pscan,
> +                         uint32 flags)
> +{
> +    /*
> +     * We don't expect direct calls to this function with valid CheckXidAlive
> +     * for catalog or regular tables.  See detailed comments in xact.c where
> +     * these variables are declared.
> +     */
> +    if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
> +        elog(ERROR, "unexpected table_scan_begin_wrapper call during logical decoding");
> +
> +    return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, pscan, flags);
> +}

Given that some of the callers are in inline functions in tableam.h, I would
implement the wrapper there. I doubt it'll make a meaningful difference, but
it doesn't seem worth "risking" that.

Greetings,

Andres Freund



Re: unnecessary executor overheads around seqscans

From
Dilip Kumar
Date:
On Wed, Jan 28, 2026 at 12:54 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2026-01-27 16:40:55 +0530, Dilip Kumar wrote:
> > > > I know why the check exists - but why does it have to be in
> > > > table_scan_getnextslot(), which is executed very frequently, rather than
> > > > table_beginscan*(), which is executed much less frequently.
> > > >
> > >
> > > I thought about this point and couldn't think of any reason why this
> > > check can't be in table_beginscan*(). I think your idea of having a
> > > wrapper around scan_begin() to handle this check is a good one.
> >
> > Here is the patch. I've used table_scan_begin_wrapper() to wrap the
> > scan_begin() callback for now. If you have a better naming preference
> > to avoid the 'wrapper' suffix, please let me know.
>
> I'd probably just go for _internal(), _impl() or such.

Thanks

>
> > diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
> > index 87491796523..15a92c052d3 100644
> > --- a/src/backend/access/table/tableam.c
> > +++ b/src/backend/access/table/tableam.c
> > [...]
> > +/*
> > + * table_scan_begin_wrapper
> > + *
> > + * A wrapper around the Table Access Method scan_begin callback. This handles
> > + * centralized error checking - specifically ensuring we aren't performing
> > + * a table scan while CheckXidAlive is valid.  This state is reserved for
> > + * specific logical decoding operations where direct relation scanning is
> > + * prohibited.
> > + */
> > +TableScanDesc
> > +table_scan_begin_wrapper(Relation rel, Snapshot snapshot, int nkeys,
> > +                                              ScanKeyData *key, ParallelTableScanDesc pscan,
> > +                                              uint32 flags)
> > +{
> > +     /*
> > +      * We don't expect direct calls to this function with valid CheckXidAlive
> > +      * for catalog or regular tables.  See detailed comments in xact.c where
> > +      * these variables are declared.
> > +      */
> > +     if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
> > +             elog(ERROR, "unexpected table_scan_begin_wrapper call during logical decoding");
> > +
> > +     return rel->rd_tableam->scan_begin(rel, snapshot, nkeys, key, pscan, flags);
> > +}
>
> Given that some of the callers are in inline functions in tableam.h, I would
> implement the wrapper there. I doubt it'll make a meaningful difference, but
> it doesn't seem worth "risking" that.

Yeah, it makes sense, changed.

--
Regards,
Dilip Kumar
Google

Attachments

Re: unnecessary executor overheads around seqscans

From
Andres Freund
Date:
Hi,

Thanks for the updated patch! I found one issue below.  Unless somebody sees a
reason not to, I'm planning to apply this after that is fixed.


On 2026-01-28 09:53:48 +0530, Dilip Kumar wrote:
> From 5347d920fe590ad3b250624d9cb50cc685ccd6d9 Mon Sep 17 00:00:00 2001
> From: Dilip Kumar <dilipkumarb@google.com>
> Date: Tue, 27 Jan 2026 16:20:59 +0530
> Subject: [PATCH v2] Refactor: Move CheckXidAlive check to table_beginscan for
>  better performance
> 
> Previously, the CheckXidAlive validation was performed within each table_scan_*
> function. This caused the check to be executed repeatedly for every tuple
> fetched, creating unnecessary overhead.
> 
> Move the check to table_beginscan* so it is performed once per scan rather than
> once per row.
> 
> Author: Dilip Kumar
> Reported-by: Andres Freund
> Suggested-by: Andres Freund, Amit Kapila
> ---

Very minor suggestion for the future, to make it easier for committers: Our
project style these days is to include email addresses, as used by the people
on the thread, in these tags, and to only include one person per tag,
repeating the tag to represent multiple people.


> @@ -1219,14 +1235,6 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
>                          TupleTableSlot *slot,
>                          bool *call_again, bool *all_dead)
>  {
> -    /*
> -     * We don't expect direct calls to table_index_fetch_tuple with valid
> -     * CheckXidAlive for catalog or regular tables.  See detailed comments in
> -     * xact.c where these variables are declared.
> -     */
> -    if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
> -        elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
> -
>      return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
>                                                      slot, call_again,
>                                                      all_dead);
> @@ -1265,14 +1273,6 @@ table_tuple_fetch_row_version(Relation rel,
>                                Snapshot snapshot,
>                                TupleTableSlot *slot)
>  {
> -    /*
> -     * We don't expect direct calls to table_tuple_fetch_row_version with
> -     * valid CheckXidAlive for catalog or regular tables.  See detailed
> -     * comments in xact.c where these variables are declared.
> -     */
> -    if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
> -        elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
> -
>      return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
>  }
>

These two cases aren't covered by the CheckXidAlive check in
table_scan_begin_impl(), as they don't use TableScanDesc.

table_tuple_fetch_row_version() doesn't use a scan, so we probably can't get
rid of the check - I think that's ok, the callers seem unlikely to be
bottlenecked by the test.

For table_index_fetch_tuple(), the check should be moved to
table_index_fetch_begin(). That's worth doing, as table_index_fetch_tuple()
can be performance critical (e.g. in an ordered index scan).
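
I.e. roughly this shape, in tableam.h (sketch):

    static inline IndexFetchTableData *
    table_index_fetch_begin(Relation rel)
    {
        /* once per index-fetch setup, rather than once per fetched tuple */
        if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
            elog(ERROR, "unexpected table AM access during logical decoding");

        return rel->rd_tableam->index_fetch_begin(rel);
    }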

Greetings,

Andres Freund



Re: unnecessary executor overheads around seqscans

From
Dilip Kumar
Date:
On Wed, Jan 28, 2026 at 11:22 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> Thanks for the updated patch! I found one issue below.  Unless somebody sees a
> reason not to, I'm planning to apply this after that is fixed.
>
>
> On 2026-01-28 09:53:48 +0530, Dilip Kumar wrote:
> > From 5347d920fe590ad3b250624d9cb50cc685ccd6d9 Mon Sep 17 00:00:00 2001
> > From: Dilip Kumar <dilipkumarb@google.com>
> > Date: Tue, 27 Jan 2026 16:20:59 +0530
> > Subject: [PATCH v2] Refactor: Move CheckXidAlive check to table_beginscan for
> >  better performance
> >
> > Previously, the CheckXidAlive validation was performed within each table_scan_*
> > function. This caused the check to be executed repeatedly for every tuple
> > fetched, creating unnecessary overhead.
> >
> > Move the check to table_beginscan* so it is performed once per scan rather than
> > once per row.
> >
> > Author: Dilip Kumar
> > Reported-by: Andres Freund
> > Suggested-by: Andres Freund, Amit Kapila
> > ---
>
> Very minor suggestion for the future, to make it easier for committers: Our
> project style these days is to include email addresses, as used by the people
> on the thread, in these tags, and to only include one person per tag,
> repeating the tag to represent multiple people.

Okay, noted, I have changed it.

>
> > @@ -1219,14 +1235,6 @@ table_index_fetch_tuple(struct IndexFetchTableData *scan,
> >                                               TupleTableSlot *slot,
> >                                               bool *call_again, bool *all_dead)
> >  {
> > -     /*
> > -      * We don't expect direct calls to table_index_fetch_tuple with valid
> > -      * CheckXidAlive for catalog or regular tables.  See detailed comments in
> > -      * xact.c where these variables are declared.
> > -      */
> > -     if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
> > -             elog(ERROR, "unexpected table_index_fetch_tuple call during logical decoding");
> > -
> >       return scan->rel->rd_tableam->index_fetch_tuple(scan, tid, snapshot,
> >                                                       slot, call_again,
> >                                                       all_dead);
> > @@ -1265,14 +1273,6 @@ table_tuple_fetch_row_version(Relation rel,
> >                                                         Snapshot snapshot,
> >                                                         TupleTableSlot *slot)
> >  {
> > -     /*
> > -      * We don't expect direct calls to table_tuple_fetch_row_version with
> > -      * valid CheckXidAlive for catalog or regular tables.  See detailed
> > -      * comments in xact.c where these variables are declared.
> > -      */
> > -     if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan))
> > -             elog(ERROR, "unexpected table_tuple_fetch_row_version call during logical decoding");
> > -
> >       return rel->rd_tableam->tuple_fetch_row_version(rel, tid, snapshot, slot);
> >  }
> >
>
> These two cases aren't covered by the CheckXidAlive check in
> table_scan_begin_impl(), as they don't use TableScanDesc.
>
> table_tuple_fetch_row_version() doesn't use a scan, so we probably can't get
> rid of the check - I think that's ok, the callers seem unlikely to be
> bottlenecked by the test.

Yeah, I missed this, thanks for pointing this out, fixed.

> For table_index_fetch_tuple(), the check should be moved to
> table_index_fetch_begin(). That's worth doing, as table_index_fetch_tuple()
> can be performance critical (e.g. in an ordered index scan).

Yeah we can do that, fixed.


--
Regards,
Dilip Kumar
Google

Attachments

Re: unnecessary executor overheads around seqscans

From
Andres Freund
Date:
Hi,

I've edited the patch a bit more and pushed it:

- I ended up not liking _impl() that much, and changed it to _common(). That
  seems more in line with what it actually does.

- All the beginscan functions are called beginscan, so I renamed
  table_scan_begin_impl to match.

- We normally don't use function names in error message strings. I've
  rephrased the error.

- Some comment rephrasing

Greetings,

Andres