Thread: Qual push down to table AM


Qual push down to table AM

From:
Julien Tachoires
Date:
Hi,

Please find attached a patch set proposal intended to implement WHERE
clauses (qual) push down to the underlying table AM during
table/sequential scan execution.

The primary goal of this project is to convert quals to ScanKeys and
pass them to the table AMs. Table AMs are then allowed to apply early
tuple filtering during table (sequential) scans. Applying filtering at
the table storage level is necessary for non-row-oriented table storage
such as columnar storage. Index-organized tables are another kind of
table storage that would need qual push down.

AFAIK, CustomScan is currently the only way to have a table scan use
pushed-down quals, but each table AM must implement its own mechanism.
IMHO, having this feature available in core would help the development
of new table AMs. For heap, some performance testing (detailed at the
end of this message) shows between 45% and 60% improvement in seq scan
execution time when only one tuple is returned from the table.

Only a few expressions are supported: OpExpr (<key> <operator> <value>),
ScalarArrayOpExpr (<key> <operator> ANY|ALL(ARRAY[...])), and NullTest.
Row comparison is not yet supported as this part is still not clear to
me. On the right side of the expression, we support: constant, variable,
function call, and subquery (InitPlan only).
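For illustration, the conversion amounts to turning each supported qual into an (attribute number, comparison strategy, argument) triple. The following standalone sketch (simplified stand-in types and names such as MiniScanKey, not the patch's actual code) models how such keys are built and ANDed together:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's AttrNumber/StrategyNumber/Datum. */
typedef int16_t AttrNumber;
typedef uint16_t StrategyNumber;
typedef intptr_t Datum;

#define BTLessStrategyNumber    1
#define BTEqualStrategyNumber   3
#define BTGreaterStrategyNumber 5

/* A cut-down ScanKey: enough to model "<attno> <op> <constant>". */
typedef struct
{
    AttrNumber     sk_attno;    /* which column the qual applies to (1-based) */
    StrategyNumber sk_strategy; /* comparison operator class */
    Datum          sk_argument; /* right-hand constant */
} MiniScanKey;

/* Evaluate one key against an already-deformed integer column value. */
static bool
mini_key_test(const MiniScanKey *key, Datum value)
{
    switch (key->sk_strategy)
    {
        case BTLessStrategyNumber:
            return value < key->sk_argument;
        case BTEqualStrategyNumber:
            return value == key->sk_argument;
        case BTGreaterStrategyNumber:
            return value > key->sk_argument;
        default:
            return false;       /* unsupported strategy */
    }
}

/* A tuple matches only if every key accepts it (quals are implicitly ANDed). */
static bool
mini_keys_match(const MiniScanKey *keys, int nkeys, const Datum *row)
{
    for (int i = 0; i < nkeys; i++)
        if (!mini_key_test(&keys[i], row[keys[i].sk_attno - 1]))
            return false;
    return true;
}
```

A qual like `i = 50000` would thus become a single key on attribute 1 with the equality strategy; the table AM can test it against each tuple before returning anything to the executor.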

In terms of security, we check that the function related to the operator
is not user-defined: only functions from the catalog are supported. We
also check that the function is "leakproof".

Pushing down quals does not guarantee to the executor that the tuples
returned during a table scan satisfy a qual, as we don't know whether
the table AM (potentially implemented via an extension) has applied
tuple filtering. To ensure the WHERE clause produces the right result,
pushed-down quals are executed twice per returned tuple: once by the
table AM, and once by the executor. This causes a performance
regression (15-17%) when almost the entire table is returned (see the
perf test results at the end of the message). This could be optimized
by flagging the tuples already filtered by the table AM; that way we
could avoid re-executing the pushed-down quals.
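The double evaluation described above can be modeled as follows (a standalone sketch with hypothetical names; `am_applies_filter` stands for whether the table AM honoured the pushed-down quals). The point is that the executor-side recheck makes the result identical either way:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Model of the recheck: the table AM may or may not have filtered, so the
 * executor re-applies the qual to every tuple the AM returns. The qual here
 * is simply "tuple == qual_const".
 */
static int
scan_with_recheck(const int *tuples, int ntuples, int qual_const,
                  bool am_applies_filter, int *out)
{
    int nout = 0;

    for (int i = 0; i < ntuples; i++)
    {
        /* Table AM side: optional early filtering. */
        if (am_applies_filter && tuples[i] != qual_const)
            continue;
        /* Executor side: the qual is always evaluated again. */
        if (tuples[i] != qual_const)
            continue;
        out[nout++] = tuples[i];
    }
    return nout;
}
```

The proposed optimization would amount to letting the AM flag returned tuples as already checked, so the second test could be skipped for them.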


Details about the patch files

v1-0001-Pass-the-number-of-ScanKeys-to-scan_rescan.patch: This patch
adds the number of ScanKeys as a new argument to scan_rescan(). The
number of ScanKeys was previously passed to the table AM only via
scan_begin(), not via scan_rescan().

v1-0002-Simple-quals-push-down-to-table-AMs.patch: Core of the feature,
this patch adds qual push down support for OpExpr expressions.

v1-0003-Add-the-table-reloption-quals_push_down.patch: Adds a new
reloption, quals_push_down, used to enable/disable qual push down for a
table. Disabled by default.

v1-0004-Add-tests-for-quals-push-down-to-table-AM.patch: Regression
tests.

v1-0005-Push-down-IN-NOT-IN-array-quals-to-table-AMs.patch:
ScalarArrayOpExpr support.

v1-0006-Push-down-IS-IS-NOT-NULL-quals-to-table-AMs.patch: NullTest
support.


Performance testing

Head:
CREATE TABLE t (i INTEGER);

Patch:
CREATE TABLE t (i INTEGER) WITH (quals_push_down = on);

n=1M:
INSERT INTO t SELECT generate_series(1, 1000000);
VACUUM t;

n=10M:
TRUNCATE t;
INSERT INTO t SELECT generate_series(1, 10000000);
VACUUM t;

n=100M:
TRUNCATE t;
INSERT INTO t SELECT generate_series(1, 100000000);
VACUUM t;


Case #1: SELECT COUNT(*) FROM t WHERE i = 50000;

        |       n=1M      |        n=10M      |         n=100M
        +--------+--------+---------+---------+----------+---------
        |  Head  |  Patch |  Head   |  Patch  |  Head    |  Patch
--------+--------+--------+---------+---------+----------+---------
Test #1 | 38.903 | 21.308 | 365.707 | 155.429 | 3939.937 | 1564.182
Test #2 | 39.239 | 21.271 | 364.206 | 153.127 | 3872.370 | 1527.988
Test #3 | 39.015 | 21.958 | 365.434 | 154.498 | 3812.382 | 1525.535
--------+--------+--------+---------+---------+----------+---------

--------+--------+--------+---------+---------+----------+---------
Average | 39.052 | 21.512 | 365.116 | 154.351 | 3874.896 | 1539.235
Std dev |  0.171 |  0.386 |   0.800 |   1.158 |   63.815 |   21.640
--------+--------+--------+---------+---------+----------+---------
Gain    |          44.91% |            57.73% |             60.28%


Case #2: SELECT COUNT(*) FROM t WHERE i >= 2;

        |       n=1M      |        n=10M      |         n=100M
        +--------+--------+---------+---------+----------+---------
        |  Head  |  Patch |  Head   |  Patch  |  Head    |  Patch
--------+--------+--------+---------+---------+----------+---------
Test #1 | 68.422 | 81.233 | 674.397 | 778.427 | 6845.165 | 8071.627
Test #2 | 69.237 | 80.868 | 682.976 | 774.417 | 6533.091 | 7668.477
Test #3 | 69.579 | 80.418 | 676.072 | 791.465 | 6917.964 | 7916.182
--------+--------+--------+---------+---------+----------+---------

--------+--------+--------+---------+---------+----------+---------
Average | 69.079 | 80.840 | 677.815 | 781.436 | 6765.407 | 7885.429
Std dev |  0.594 |  0.408 |   4.547 |   8.914 |  204.457 |  203.327
--------+--------+--------+---------+---------+----------+---------
Gain    |         -17.02% |           -15.29% |            -16.56%


Thoughts?

Best regards,

-- 
Julien Tachoires

Attachments

Re: Qual push down to table AM

From:
Andres Freund
Date:
On 2025-08-27 22:27:37 +0200, Julien Tachoires wrote:
> Please find attached a patch set proposal intended to implement WHERE
> clauses (qual) push down to the underlying table AM during
> table/sequential scan execution.
> 
> The primary goal of this project is to convert quals to ScanKeys and
> pass them to the table AMs. Table AMs are then allowed to apply early
> tuple filtering during table (sequential) scans. Applying filtering at
> the table storage level is necessary for non-row-oriented table
> storage such as columnar storage. Index-organized tables are another
> kind of table storage that would need qual push down.
> 
> AFAIK, CustomScan is currently the only way to have a table scan use
> pushed-down quals, but each table AM must implement its own mechanism.
> IMHO, having this feature available in core would help the development
> of new table AMs. For heap, some performance testing (detailed at the
> end of this message) shows between 45% and 60% improvement in seq scan
> execution time when only one tuple is returned from the table.

One problem with doing that in the case of heapam is that you're evaluating
scan keys with the buffer lock held - with basically arbitrary expressions
being evaluated. That's an easy path to undetected deadlocks.  You'd have to
redesign the relevant mechanism to filter outside of the lock...

Greetings,

Andres Freund



Re: Qual push down to table AM

From:
Kirill Reshke
Date:
On Thu, 28 Aug 2025 at 01:27, Julien Tachoires <julien@tachoires.me> wrote:
>
> Hi,
>
> Please find attached a patch set proposal intended to implement WHERE
> clauses (qual) push down to the underlying table AM during
> table/sequential scan execution.
> [...]

Hi!
I have also wondered whether something like qual push down could be
implemented in Postgres. It is indeed very beneficial for column-based
processing in MPP databases, Greenplum and Cloudberry to name a few. I
did my own micro-research a while ago (while working on some
Cloudberry features), so here are my thoughts on the subject.

What this patch set does is pass ScanKeys directly to the table AM
somewhat blindly, which speeds up the execution phase. While I do not
have strong objections against this approach, I suspect it breaks some
layers of abstraction and *silent* (or maybe documented) agreements
about what the responsibilities of table AM functions are. As far as I
know, passing ScanKeys directly to the table AM is used on HEAD for
catalog access only. Correct me if I'm wrong. For all other types of
relations each query is planned, which includes:

(1) building data access paths through the various data access methods
    (indexes)
(2) deciding for each qual which indexes can be used to satisfy it
(3) using the cost model to pick the best option

None of this can be done with your approach, can it?

The cost model can give hints to the optimizer that this table AM will
process some qual much faster than any by-index access. A smart cost
model/optimizer can realise that selecting only a few of all the
attributes from a column-oriented relation, plus filtering using SIMD
etc., can be really cheap.

So maybe the good shape of this patch would be something that could
choose between seqscan and indexscan at planning time?

-- 
Best regards,
Kirill Reshke



Re: Qual push down to table AM

From:
Julien Tachoires
Date:
Hi,

On Wed, Aug 27, 2025 at 05:50:01PM -0400, Andres Freund wrote:
> On 2025-08-27 22:27:37 +0200, Julien Tachoires wrote:
> > Please find attached a patch set proposal intended to implement WHERE
> > clauses (qual) push down to the underlying table AM during
> > table/sequential scan execution.
> > 
> > [...]
> 
> One problem with doing that in the case of heapam is that you're evaluating
> scan keys with the buffer lock held - with basically arbitrary expressions
> being evaluated. That's an easy path to undetected deadlocks.  You'd have to
> redesign the relevant mechanism to filter outside of the lock...

Thank you for this quick feedback.

One potential approach to solve this in heapgettup() would be:
1. hold the buffer lock
2. get the tuple from the buffer
3. if the tuple is not visible, move to the next tuple, back to 2. 
4. release the buffer lock
5. if the tuple does not satisfy the scan keys, take the buffer lock,
   move to the next tuple, back to 2.
6. return the tuple
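A standalone model of these steps (hypothetical lock/test helpers, not heapam's actual functions) that checks the key is never evaluated while the lock is held:

```c
#include <assert.h>
#include <stdbool.h>

static bool lock_held;                /* models the buffer content lock */
static bool key_tested_under_lock;    /* the situation we want to avoid */

static void lock_buffer(void)   { lock_held = true; }
static void unlock_buffer(void) { lock_held = false; }

/* Models HeapKeyTest(): records whether the lock was held when called. */
static bool
key_test(int value, int key_const)
{
    if (lock_held)
        key_tested_under_lock = true;
    return value == key_const;
}

/*
 * Sketch of the proposed flow: visibility is checked under the lock
 * (steps 1-3), the lock is released (step 4), the key is evaluated
 * unlocked, and the lock is re-taken only to advance (steps 5-6).
 */
static int
gettup(const int *page, const bool *visible, int ntuples, int key_const)
{
    lock_buffer();                               /* step 1 */
    for (int i = 0; i < ntuples; i++)
    {
        if (!visible[i])                         /* steps 2-3 */
            continue;
        unlock_buffer();                         /* step 4 */
        if (key_test(page[i], key_const))
            return page[i];                      /* step 6: lock released */
        lock_buffer();                           /* step 5: retake, advance */
    }
    unlock_buffer();
    return -1;                                   /* no matching tuple */
}
```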

Do you see something fundamentally wrong here?

In practice, I might be wrong, but I think this problem affects
heapgettup() only; heapgettup_pagemode() does not hold the buffer lock
when HeapKeyTest() is called. To reach this problematic part we need
two conditions: a non-NULL ScanKey array and not being in
page-at-a-time mode (pagemode). The only table_beginscan_something()
function able to meet these conditions before calling the scan_begin()
API is table_beginscan_sampling(): the caller can choose whether to use
pagemode. The only place where table_beginscan_sampling() is called is
in tablesample_init(), and in that case the ScanKey array is NULL, so
we cannot reach the problematic part of heapgettup(). There is only one
other case where pagemode is disabled in heapam: when the snapshot is
non-MVCC.

Do you know of any other code path/scenario I missed that can reach
this problematic part?

Best regards,

-- 
Julien Tachoires



Re: Qual push down to table AM

From:
Julien Tachoires
Date:
Hi,

On Thu, Aug 28, 2025 at 02:57:02AM +0500, Kirill Reshke wrote:
> [...]
> The cost model can give hints to the optimizer that this table AM will
> process some qual much faster than any by-index access. A smart cost
> model/optimizer can realise that selecting only a few of all the
> attributes from a column-oriented relation, plus filtering using SIMD
> etc., can be really cheap.
> 
> So maybe the good shape of this patch would be something that could
> choose between seqscan and indexscan at planning time?

Thank you for your quick feedback.

Correct, this patch does not add/reduce any cost when some quals are
planned to be pushed down. I agree with you that it would be nice
(necessary?) to have this. I think the table AM API should provide, via
new callbacks, cost estimation for table scans that takes into account
the cost of evaluating the pushed-down quals, if any.

Best regards,

-- 
Julien Tachoires



Re: Qual push down to table AM

From:
Mark Dilger
Date:
Thanks for the patchset, Julien.

v1-0001:
  All current callers of scan_rescan() pass NULL for the `key` parameter, making it unclear to
  Table AM authors if this is supposed to be a pointer to a single key, or an array.  If an array,
  how is it terminated?  None of this is addressed in the current code comments, and given
  that nobody uses this field, the intent is undiscoverable.  By adding `int nkeys` to the parameter
  list, your patch makes the intention clearer.  Perhaps you could also update the documentation
  for these functions?

v1-0002
  Changing the EXPLAIN output to include where the exclusion happened is quite nice!

  Out of curiosity, why did you wait until this patch to add the `int nkeys` parameter to
  initscan()?  It seems more on-topic for v1-0001.  Likewise, renaming `key` as `keys` helps,
  but could have been done in v1-0001.

As for Andres' concern upstream, I am including a Work-In-Progress patch (WIP) to check
that no light-weight locks are held during qual evaluation.  I don't intend this for commit so much
as for discussion.  It seems to me fairly clear that your patch does not evaluate the quals while
holding such locks, but I might have misunderstood the concern.



On Wed, Aug 27, 2025 at 1:28 PM Julien Tachoires <julien@tachoires.me> wrote:
> [...]


--

Mark Dilger
Attachments

Re: Qual push down to table AM

From:
Julien Tachoires
Date:
Hi Mark,

On Tue, Oct 07, 2025 at 04:55:53PM -0700, Mark Dilger wrote:
> v1-0001:
>   All current callers of scan_rescan() pass NULL for the `key` parameter,
>   making it unclear to Table AM authors if this is supposed to be a
>   pointer to a single key, or an array.  If an array, how is it
>   terminated?  None of this is addressed in the current code comments,
>   and given that nobody uses this field, the intent is undiscoverable.
>   By adding `int nkeys` to the parameter list, your patch makes the
>   intention clearer.  Perhaps you could also update the documentation
>   for these functions?
> 
> v1-0002
>   Changing the EXPLAIN output to include where the exclusion happened is
>   quite nice!
> 
>   Out of curiosity, why did you wait until this patch to add the `int
>   nkeys` parameter to initscan()?  It seems more on-topic for v1-0001.
>   Likewise, renaming `key` as `keys` helps, but could have been done in
>   v1-0001.

Thanks for your review. Please find a new version attached that addresses
these points.

> As for Andres' concern upstream, I am including a Work-In-Progress
> (WIP) patch to check that no light-weight locks are held during qual
> evaluation.  I don't intend this for commit so much as for discussion.
> It seems to me fairly clear that your patch does not evaluate the quals
> while holding such locks, but I might have misunderstood the concern.

Thanks, it helps to confirm that, due to the absence of a ScanKey,
HeapKeyTest() is not called from heapgettup(), at least when running the
regression tests.

To guarantee that the buffer lock is not held while evaluating a
ScanKey, v4-0001-Release-buffer-lock-before-scan-key-evaluation releases
the lock before calling HeapKeyTest().

Regards,

-- 
Julien Tachoires

Attachments

Re: Qual push down to table AM

From:
Robert Haas
Date:
On Fri, Aug 29, 2025 at 4:38 AM Julien Tachoires <julien@tachoires.me> wrote:
> Thank you for this quick feedback.
>
> One potential approach to solve this in heapgettup() would be:
> 1. hold the buffer lock
> 2. get the tuple from the buffer
> 3. if the tuple is not visible, move to the next tuple, back to 2.
> 4. release the buffer lock
> 5. if the tuple does not satisfy the scan keys, take the buffer lock,
>    move to the next tuple, back to 2.
> 6. return the tuple
>
> Do you see something fundamentally wrong here?

I spent a bit of time this afternoon looking at v4-0001. I noticed a
few spelling mistakes (abritrary x2, statisfied x1). As far as the
basic approach is concerned, I don't see how there can be a safety
problem here. If it's safe to release the buffer lock when we find a
tuple that matches the quals, for the purposes of returning that tuple
to the caller, then it seems like it must also be safe to release it
to evaluate a proposed qual.

Potentially, there could be a performance problem. Imagine that we
have some code right now that uses this code path and it's safe
because the qual that we're evaluating is something super-simple like
the integer less-than operator, so calling it under the buffer lock
doesn't create a stability hazard. Well, with the patch, we'd
potentially take and release the buffer lock a lot more times than we
do right now. Imagine that there are lots of tuples on each page but
only 1 or very few of them satisfy the qual: then we lock and unlock
the buffer a whole bunch of times instead of just once.

However, I don't think this really happens in practice. I believe it's
possible to take this code path if you set ignore_system_indexes=on,
because that turns index scans --- which, not surprisingly, have
scankeys --- into sequential scans which then end up also having
scankeys. Many of those scans use catalog snapshots so there's no
issue, but a little bit of debugging code seems to show that
systable_beginscan() can also be called with snapshot->snapshot_type
set to SNAPSHOT_ANY or SNAPSHOT_DIRTY. For example, see
GetNewOidWithIndex(). However, even if ignore_system_indexes=on gets a
little slower as a result of this or some other patch, I don't think
we really care, and without that setting, this code doesn't seem to
get exercised at all.

So, somewhat to my surprise, I think that v4-0001 might be basically
fine. I wonder if anyone else sees a problem that I'm missing?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Qual push down to table AM

From:
Andres Freund
Date:
Hi,

On 2025-12-09 16:40:17 -0500, Robert Haas wrote:
> On Fri, Aug 29, 2025 at 4:38 AM Julien Tachoires <julien@tachoires.me> wrote:
> Potentially, there could be a performance problem

I think the big performance hazard with this is repeated deforming. The
scankey infrastructure deforms attributes one-by-one *and* it does not
"persist" the work of deforming for later accesses.  So if you e.g. have
something like

  SELECT sum(col_29) FROM tbl WHERE col_30 = common_value;
or
  SELECT * FROM tbl WHERE col_30 = common_value;


we'll now deform col_30 in isolation for the ScanKey evaluation and then we'll
deform columns 1-29 in the slot (because we always deform all the leading
columns), during projection.

But even leaving the slot issue aside, I'd bet that you'll see overhead due to
*not* deforming multiple columns at once. If you have a ScanKey version of
something like
  WHERE column_20 = common_val AND column_21 = some_val AND column_22 = another_val;

and there's a NULL or varlena value in one of the leading columns, we'll redo
a fair bit of work during the fastgetattr() for column_22.
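A toy cost model (not heapam's actual deforming code) of this effect: once a cache-breaking column (NULL or varlena) appears, each isolated fastgetattr()-style access must re-walk from that column onward, while a single slot-style left-to-right pass pays the walk only once:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of tuple deforming cost, counted in "attributes walked".
 * Fixed-width, non-null leading columns have cached offsets (free); a
 * column flagged in breaks_cache invalidates offset caching, so every
 * later one-by-one access re-walks from that column onward.
 */
static int
deform_one_attr(const bool *breaks_cache, int target /* 0-based attno */)
{
    int walk_from = 0;

    /* Offsets before the first cache-breaking column are free (cached). */
    while (walk_from < target && !breaks_cache[walk_from])
        walk_from++;

    /* From there, walk attribute by attribute up to the target. */
    return target - walk_from + 1;
}

/* One-by-one access (ScanKey style): no work persists between attributes. */
static int
cost_one_by_one(const bool *breaks_cache, const int *targets, int n)
{
    int total = 0;

    for (int i = 0; i < n; i++)
        total += deform_one_attr(breaks_cache, targets[i]);
    return total;
}

/* Slot-style access: one left-to-right walk covers every needed attribute. */
static int
cost_slot(const bool *breaks_cache, const int *targets, int n)
{
    int max = 0;

    (void) breaks_cache;        /* a single pass amortizes the walk */
    for (int i = 0; i < n; i++)
        if (targets[i] > max)
            max = targets[i];
    return max + 1;
}
```

With a varlena in column 1 and keys on columns 20-22, the one-by-one scheme walks 20 + 21 + 22 = 63 attributes per tuple versus 22 for a single slot pass, which is the gap the two projects below are meant to close.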



I don't really see this being viable without first tackling two nontrivial
projects:

1) Make slot deforming for expressions & projections selective, i.e. don't
   deform all the leading columns, but only ones that will eventually be
   needed
2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
   and to make deforming of multiple columns sufficiently efficient.


> So, somewhat to my surprise, I think that v4-0001 might be basically
> fine. I wonder if anyone else sees a problem that I'm missing?

I doubt this would be safe as-is: ISTM that if you release the page lock
between tuples, things like the number of items on the page can change.
But we store stuff like that in registers / on the stack, which could
become stale while the lock is not held.

We could refetch the number of items on the page for every loop
iteration, but that'd probably not be free. OTOH, it's probably nothing
compared to the cost of relocking the page...

Greetings,

Andres Freund



Re: Qual push down to table AM

From
Robert Haas
Date:
On Tue, Dec 9, 2025 at 6:08 PM Andres Freund <andres@anarazel.de> wrote:
> On 2025-12-09 16:40:17 -0500, Robert Haas wrote:
> > On Fri, Aug 29, 2025 at 4:38 AM Julien Tachoires <julien@tachoires.me> wrote:
> > Potentially, there could be a performance problem
>
> I think the big performance hazard with this is repeated deforming. The
> scankey infrastructure deforms attributes one-by-one *and* it does not
> "persist" the work of deforming for later accesses.  So if you e.g. have
> something like
>
>   SELECT sum(col_29) FROM tbl WHERE col_30 = common_value;
> or
>   SELECT * FROM tbl WHERE col_30 = common_value;
>
> we'll now deform col_30 in isolation for the ScanKey evaluation and then we'll
> deform columns 1-29 in the slot (because we always deform all the leading
> columns), during projection.

Hmm, this is a good point, and I agree that it's a huge challenge for
this patch set. Repeated tuple deforming is REALLY expensive, which is
why we've spent so much energy trying to use slots in as many
places as possible. I find it easy to believe that HeapKeyTest's loop
over heap_getattr() is going to be prohibitively painful and that this
code will need to somehow also be slot-ified for this to be a viable
project.

> I don't really see this being viable without first tackling two nontrivial
> projects:
>
> 1) Make slot deforming for expressions & projections selective, i.e. don't
>    deform all the leading columns, but only ones that will eventually be
>    needed
> 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
>    and to make deforming of multiple columns sufficiently efficient.

IOW, I agree that we probably need to do #2. I am not entirely sure
about #1. I'm a little afraid that trying to skip over columns without
deforming them will add a bunch of code complexity that doesn't really
pay off. You have to do the bookkeeping to know what to skip, and then
how much are you really gaining by skipping it? If you can skip over a
bunch of fixed-width columns, that's cool, but it's probably fairly
normal to have lots of varlena columns, and then I don't really see
that you're gaining much here. You still have to iterate through the
tuple, and not storing the pointer to the start of each column as you
find it doesn't seem like it will save much. What's your reasoning
behind thinking that #1 will be necessary?

> > So, somewhat to my surprise, I think that v4-0001 might be basically
> > fine. I wonder if anyone else sees a problem that I'm missing?
>
> I doubt this would be safe as-is: ISTM that if you release the page lock
> between tuples, things like the number of items on the page can change. But we
> store stuff like that in registers / on the stack, which could change while
> the lock is not held.
>
> We could refetch the number of items on the page for every loop iteration, but
> that'd probably not be free. OTOH, it's probably nothing compared to the cost
> of relocking the page...

We still hold a pin, though, which I think means very little can
change. More items can be added to the page, so we might want to
refresh the number of items on the page at least when we think we're
done, but I believe that any sort of more invasive page rearrangement
would be precluded by the pin.

I kind of wonder if it would be good to make a change along the lines
of v4-0001 even if this patch set doesn't move forward overall, or
will need a lot of slot-ification to do so. It seems weird to me that
we're OK with calling out to arbitrary code with a buffer lock held,
and even weirder that whether or not we do that depends on whether
SO_ALLOW_PAGEMODE was set. I don't think a difference of this kind
between pagemode behavior and non-pagemode behavior would survive
review if someone proposed it today; the fact that it works the way it
does is probably an artifact of this mechanism having been added
twenty years ago when the project was in a very different place.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Qual push down to table AM

From
Amit Langote
Date:
On Thu, Dec 11, 2025 at 12:41 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 9, 2025 at 6:08 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-12-09 16:40:17 -0500, Robert Haas wrote:
> > > On Fri, Aug 29, 2025 at 4:38 AM Julien Tachoires <julien@tachoires.me> wrote:
> > > Potentially, there could be a performance problem
> >
> > I think the big performance hazard with this is repeated deforming. The
> > scankey infrastructure deforms attributes one-by-one *and* it does not
> > "persist" the work of deforming for later accesses.  So if you e.g. have
> > something like
> >
> >   SELECT sum(col_29) FROM tbl WHERE col_30 = common_value;
> > or
> >   SELECT * FROM tbl WHERE col_30 = common_value;
> >
> > we'll now deform col_30 in isolation for the ScanKey evaluation and then we'll
> > deform columns 1-29 in the slot (because we always deform all the leading
> > columns), during projection.
>
> Hmm, this is a good point, and I agree that it's a huge challenge for
> this patch set. Repeated tuple deforming is REALLY expensive, which is
> why we've spent so much energy trying to use slots in as many
> places as possible. I find it easy to believe that HeapKeyTest's loop
> over heap_getattr() is going to be prohibitively painful and that this
> code will need to somehow also be slot-ified for this to be a viable
> project.
>
> > I don't really see this being viable without first tackling two nontrivial
> > projects:
> >
> > 1) Make slot deforming for expressions & projections selective, i.e. don't
> >    deform all the leading columns, but only ones that will eventually be
> >    needed
> > 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
> >    and to make deforming of multiple columns sufficiently efficient.
>
> IOW, I agree that we probably need to do #2. I am not entirely sure
> about #1.

I'm also curious to understand why Andres sees #1 as a prerequisite
for qual pushdown.

> I'm a little afraid that trying to skip over columns without
> deforming them will add a bunch of code complexity that doesn't really
> pay off.

I think it might be worthwhile. I have a PoC [1] I worked on (at
Andres's suggestion) that showed ~2x improvement on simple aggregation
queries over wide tables (all pages buffered) by tracking the minimum
needed attribute and using cached offsets stored in TupleDesc to skip
fixed-not-null prefixes. I'm thinking of reviving it with proper
tracking of which attributes are needed and deformed (bitmapset or
flag array in TupleTableSlot).
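
The cached-offset idea can be sketched in miniature (a toy under stated assumptions, not Amit's actual PoC): the leading attributes are fixed-width and not-null, so their offsets can be computed once per tuple descriptor — like attcacheoff in TupleDesc — instead of being re-walked for every tuple.

```c
#include <assert.h>

/* Leading NFIXED attributes are fixed-width and not-null (assumption of
 * this toy); only such a prefix admits a per-descriptor offset cache. */
#define NFIXED 4

static const int att_width[NFIXED] = {4, 8, 4, 2};

/* Naive deforming: walk the widths from attribute 1 for every tuple. */
static int
offset_by_walking(int attnum, int *steps)
{
    int         off = 0;

    for (int i = 0; i < attnum - 1; i++)
    {
        off += att_width[i];
        (*steps)++;
    }
    return off;
}

/* Cached deforming: prefix sums computed once, then O(1) per lookup. */
static int
offset_by_cache(int attnum)
{
    static int  cache[NFIXED + 1];
    static int  cached = 0;

    if (!cached)                /* done once per "tuple descriptor" */
    {
        for (int i = 0; i < NFIXED; i++)
            cache[i + 1] = cache[i] + att_width[i];
        cached = 1;
    }
    return cache[attnum - 1];
}
```

For a query needing only attribute 4, the walk pays three additions per tuple while the cache answers immediately; the caveat is that the cache only covers the fixed-not-null prefix — anything after the first NULL or varlena attribute still has to be walked.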

> > > So, somewhat to my surprise, I think that v4-0001 might be basically
> > > fine. I wonder if anyone else sees a problem that I'm missing?
> >
> > I doubt this would be safe as-is: ISTM that if you release the page lock
> > between tuples, things like the number of items on the page can change. But we
> > store stuff like that in registers / on the stack, which could change while
> > the lock is not held.
> >
> > We could refetch the number of items on the page for every loop iteration, but
> > that'd probably not be free. OTOH, it's probably nothing compared to the cost
> > of relocking the page...
>
> We still hold a pin, though, which I think means very little can
> change. More items can be added to the page, so we might want to
> refresh the number of items on the page at least when we think we're
> done, but I believe that any sort of more invasive page rearrangement
> would be precluded by the pin.
>
> I kind of wonder if it would be good to make a change along the lines
> of v4-0001 even if this patch set doesn't move forward overall, or
> will need a lot of slot-ification to do so. It seems weird to me that
> we're OK with calling out to arbitrary code with a buffer lock held,
> and even weirder that whether or not we do that depends on whether
> SO_ALLOW_PAGEMODE was set. I don't think a difference of this kind
> between pagemode behavior and non-pagemode behavior would survive
> review if someone proposed it today; the fact that it works the way it
> does is probably an artifact of this mechanism having been added
> twenty years ago when the project was in a very different place.

One maybe crazy thought: what about only enabling qual pushdown when
pagemode is used, since it already processes all tuples on a page in
one locked phase? That raises the question of whether there's a class
of quals simple enough (built-in ops?) that evaluating them alongside
visibility checking would be acceptable, with lock held that is -- but
it would avoid the lock churn and racy loop termination issues with
v4-0001.

--
Thanks, Amit Langote

[1] https://www.postgresql.org/message-id/CA%2BHiwqHXDY6TxegR2Cr_4sRa_LY1QJnoL8XRmOqdfrx21pZ6cw%40mail.gmail.com



Re: Qual push down to table AM

From
Andres Freund
Date:
Hi,

On 2025-12-15 21:56:12 +0900, Amit Langote wrote:
> On Thu, Dec 11, 2025 at 12:41 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Dec 9, 2025 at 6:08 PM Andres Freund <andres@anarazel.de> wrote:
> > > On 2025-12-09 16:40:17 -0500, Robert Haas wrote:
> > > > On Fri, Aug 29, 2025 at 4:38 AM Julien Tachoires <julien@tachoires.me> wrote:
> > > > Potentially, there could be a performance problem
> > >
> > > I think the big performance hazard with this is repeated deforming. The
> > > scankey infrastructure deforms attributes one-by-one *and* it does not
> > > "persist" the work of deforming for later accesses.  So if you e.g. have
> > > something like
> > >
> > >   SELECT sum(col_29) FROM tbl WHERE col_30 = common_value;
> > > or
> > >   SELECT * FROM tbl WHERE col_30 = common_value;
> > >
> > > we'll now deform col_30 in isolation for the ScanKey evaluation and then we'll
> > > deform columns 1-29 in the slot (because we always deform all the leading
> > > columns), during projection.
> >
> > Hmm, this is a good point, and I agree that it's a huge challenge for
> > this patch set. Repeated tuple deforming is REALLY expensive, which is
> > why we've spent so much energy trying to use slots in as many
> > places as possible. I find it easy to believe that HeapKeyTest's loop
> > over heap_getattr() is going to be prohibitively painful and that this
> > code will need to somehow also be slot-ified for this to be a viable
> > project.
> >
> > > I don't really see this being viable without first tackling two nontrivial
> > > projects:
> > >
> > > 1) Make slot deforming for expressions & projections selective, i.e. don't
> > >    deform all the leading columns, but only ones that will eventually be
> > >    needed
> > > 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
> > >    and to make deforming of multiple columns sufficiently efficient.
> >
> > IOW, I agree that we probably need to do #2. I am not entirely sure
> > about #1.
> 
> I'm also curious to understand why Andres sees #1 as a prerequisite
> for qual pushdown.

I suspect you'll not see a whole lot of gain without it. When I experimented
with it, a good portion (but not all!) of the gain seemed to be from just
deforming the immediately required columns - but also a lot of the visible
regressions were from that.


> > We still hold a pin, though, which I think means very little can
> > change. More items can be added to the page, so we might want to
> > refresh the number of items on the page at least when we think we're
> > done, but I believe that any sort of more invasive page rearrangement
> > would be precluded by the pin.
> >
> > I kind of wonder if it would be good to make a change along the lines
> > of v4-0001 even if this patch set doesn't move forward overall, or
> > will need a lot of slot-ification to do so. It seems weird to me that
> > we're OK with calling out to arbitrary code with a buffer lock held,
> > and even weirder that whether or not we do that depends on whether
> > SO_ALLOW_PAGEMODE was set. I don't think a difference of this kind
> > between pagemode behavior and non-pagemode behavior would survive
> > review if someone proposed it today; the fact that it works the way it
> > does is probably an artifact of this mechanism having been added
> > twenty years ago when the project was in a very different place.
> 
> One maybe crazy thought: what about only enabling qual pushdown when
> pagemode is used, since it already processes all tuples on a page in
> one locked phase? That raises the question of whether there's a class
> of quals simple enough (built-in ops?) that evaluating them alongside
> visibility checking would be acceptable, with lock held that is -- but
> it would avoid the lock churn and racy loop termination issues with
> v4-0001.

I think that's the wrong direction to go. We shouldn't do more under the lock,
we should do less. You certainly couldn't just use builtin ops, as some of
them *do* other catalog lookups, which would lead to deadlock potential.

I think it's also just unnecessary to try to do anything under a lock here,
it's not hard to first do all the visibility checks while locked and then,
after unlocking, filter the tuples based on quals.
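
The two-phase loop shape suggested here can be sketched as a self-contained toy (illustrative only, not real heapam code; the `LockBuffer` calls are shown as comments to mark where the real lock would sit):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TUPS 32

/* Toy page item; 'visible' stands in for the MVCC visibility check. */
typedef struct ToyTuple
{
    int         value;
    bool        visible;
} ToyTuple;

/* Phase 1: visibility checks run in one pass while the page lock is held.
 * Phase 2: the lock is released before any qual -- which may be arbitrary
 * user code -- is evaluated. */
static int
scan_page(const ToyTuple *page, int ntups, bool (*qual) (int), int *out)
{
    int         vis[MAX_TUPS];
    int         nvis = 0;
    int         nout = 0;

    /* LockBuffer(buf, BUFFER_LOCK_SHARE); */
    for (int i = 0; i < ntups; i++)
        if (page[i].visible)
            vis[nvis++] = i;
    /* LockBuffer(buf, BUFFER_LOCK_UNLOCK); */

    /* Qual evaluation happens with only a pin held, never the lock. */
    for (int i = 0; i < nvis; i++)
        if (qual(page[vis[i]].value))
            out[nout++] = page[vis[i]].value;
    return nout;
}

static bool
qual_gt_ten(int v)
{
    return v > 10;
}
```

Because the qual runs unlocked, a qual that performs catalog lookups (or anything else that might block on a buffer lock) cannot deadlock against the scan, and the visibility loop keeps its cached line pointer count valid for its whole duration.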

Greetings,

Andres Freund



Re: Qual push down to table AM

From
Andres Freund
Date:
Hi,

On 2025-12-10 10:41:19 -0500, Robert Haas wrote:
> On Tue, Dec 9, 2025 at 6:08 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't really see this being viable without first tackling two nontrivial
> > projects:
> >
> > 1) Make slot deforming for expressions & projections selective, i.e. don't
> >    deform all the leading columns, but only ones that will eventually be
> >    needed
> > 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
> >    and to make deforming of multiple columns sufficiently efficient.
> 
> IOW, I agree that we probably need to do #2. I am not entirely sure
> about #1. I'm a little afraid that trying to skip over columns without
> deforming them will add a bunch of code complexity that doesn't really
> pay off. You have to do the bookkeeping to know what to skip, and then
> how much are you really gaining by skipping it? If you can skip over a
> bunch of fixed-width columns, that's cool, but it's probably fairly
> normal to have lots of varlena columns, and then I don't really see
> that you're gaining much here. You still have to iterate through the
> tuple, and not storing the pointer to the start of each column as you
> find it doesn't seem like it will save much.

FWIW, in experiments I observed that not storing columns that are never
used saves a pretty decent amount of cycles, just due to not having to store
the never-accessed datums in the slot (and in the case of non-varlena columns,
not having to fetch the relevant data). It's probably true that the gain for
varlenas is smaller, due to the cost of determining the length, but the
difference is that if the cost of storing the columns is relevant, fields
after a NULL or a varlena can still benefit from the optimization.


> What's your reasoning behind thinking that #1 will be necessary?

Tried to answer that downthread, in response to Amit.


> > > So, somewhat to my surprise, I think that v4-0001 might be basically
> > > fine. I wonder if anyone else sees a problem that I'm missing?
> >
> > I doubt this would be safe as-is: ISTM that if you release the page lock
> > between tuples, things like the number of items on the page can change. But we
> > store stuff like that in registers / on the stack, which could change while
> > the lock is not held.
> >
> > We could refetch the number of items on the page for every loop iteration, but
> > that'd probably not be free. OTOH, it's probably nothing compared to the cost
> > of relocking the page...
> 
> We still hold a pin, though, which I think means very little can
> change. More items can be added to the page, so we might want to
> refresh the number of items on the page at least when we think we're
> done, but I believe that any sort of more invasive page rearrangement
> would be precluded by the pin.
> 
> I kind of wonder if it would be good to make a change along the lines
> of v4-0001 even if this patch set doesn't move forward overall, or
> will need a lot of slot-ification to do so. It seems weird to me that
> we're OK with calling out to arbitrary code with a buffer lock held,
> and even weirder that whether or not we do that depends on whether
> SO_ALLOW_PAGEMODE was set.

I don't think it's stated clearly anywhere - the only reason this is remotely
within a stone's throw of ok is that the only code using ScanKeys for table
scans is catcache, which in turn means that the set of effectively allowed
operators is tiny (oideq, int2eq, int4eq, int8eq, nameeq, chareq, booleq and
perhaps 2-3 more).

And for those we only support ScanKey mode because it's required as a fallback
for index based catcache searches.

I'm not against fixing qual-eval-under-buffer-lock, it shouldn't ever be used
in particularly performance sensitive cases.

Greetings,

Andres Freund



Re: Qual push down to table AM

From
Maxime Schoemans
Date:
Hi,

> On 10 Dec 2025, at 00:08, Andres Freund <andres@anarazel.de> wrote:
> I don't really see this being viable without first tackling two nontrivial
> projects:
>
> 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
>   and to make deforming of multiple columns sufficiently efficient.

Am I right in understanding that you think that the repeated calls to
heap_getattr in HeapKeyTest are not ideal if we have NULL or varlena
columns? I have written a small patch (see attached) that stores the heap
tuple in a TupleTableSlot first and then calls slot_getattr instead, which
should benefit from caching. Is that the type of solution you were thinking of?

It is definitely not a complete patch (needs comments and a description),
and it is not merged into the patch set of Julien yet, but I just wanted to
check that this was what you were proposing and that I was not
misunderstanding something.

> 1) Make slot deforming for expressions & projections selective, i.e. don't
>   deform all the leading columns, but only ones that will eventually be
>   needed


Concerning 1), I’m also not certain I understand why this is a prerequisite for
the pushdown work. It could certainly be beneficial, but it seems to be
complementary. In any case, I’d be interested to look at your POC patch
on the subject, Amit.

Best,

Maxime Schoemans

Attachments

Re: Qual push down to table AM

От
Andres Freund
Дата:
Hi,

On 2025-12-18 20:40:31 +0100, Maxime Schoemans wrote:
> > On 10 Dec 2025, at 00:08, Andres Freund <andres@anarazel.de> wrote:
> > I don't really see this being viable without first tackling two nontrivial
> > projects:
> > 
> > 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
> >   and to make deforming of multiple columns sufficiently efficient.
> 
> Am I right in understanding that you think that the repeated calls to
> heap_getattr in HeapKeyTest are not ideal if we have NULL or varlena
> columns?

That's part of it, but not all of it: The other aspect is that if you do a
bunch of heap_getattr()s inside HeapKeyTest() and then project that column in
nodeSeqscan.c, you'll do the work to deform twice.


> > 1) Make slot deforming for expressions & projections selective, i.e. don't
> >   deform all the leading columns, but only ones that will eventually be
> >   needed
> 
> 
> Concerning 1), I’m also not certain I understand why this is a prerequisite
> for the pushdown work. It could certainly be beneficial, but it seems to be
> complementary.

As hinted at in [1], I suspect that you're just not going to see big enough
wins without the above optimization. A decent portion of the win from using
HeapKeyTest comes from deforming only selectively.

Greetings,

Andres Freund

[1] https://postgr.es/m/CAAh00EQUwG5khqJO7nSV0nsqsG1OP%3DkA6ACfxV3rnNSVd4b6TQ%40mail.gmail.com