Discussion: Qual push down to table AM
Hi,
Please find attached a patch set proposal intended to implement WHERE
clauses (qual) push down to the underlying table AM during
table/sequential scan execution.
The primary goal of this project is to convert quals to ScanKeys and
pass them to the table AMs. Table AMs are then allowed to apply early
tuple filtering during table (sequential) scans. Applying filtering at
the table storage level is necessary for non-row-oriented table storage
such as columnar storage. Index-organized tables are another kind of
table storage that would need qual push down.
AFAIK, CustomScan is currently the only way to get table scans that use
pushed-down quals, and each table AM must implement its own mechanism.
IMHO, having this feature available in core would help the development
of new table AMs. Regarding heap, some performance testing (detailed at
the end of this message) shows between 45% and 60% improvement in seq
scan execution time when only one tuple is returned from the table.
Only a few expressions are supported: OpExpr (<key> <operator> <value>),
ScalarArrayOpExpr (<key> <operator> ANY|ALL(ARRAY[...])), and NullTest.
Row comparison is not yet supported, as this part is still not clear to
me. On the right-hand side of the expression, we support: constant,
variable, function call, and subquery (InitPlan only).
In terms of security, we check that the function behind the operator is
not user-defined: only functions from the catalog are supported. We also
check that the function is "leakproof".
Pushing down quals does not guarantee to the executor that the tuples
returned during the table scan satisfy a qual, as we don't know whether
the table AM (potentially implemented via an extension) has applied
tuple filtering. To ensure a correct answer to the WHERE clause,
pushed-down quals are executed twice per returned tuple: once by the
table AM, and once by the executor. This produces a performance
regression (15-17%) when almost the entire table is returned (see perf.
test results at the end of the message). This could be optimized by
flagging the tuples already filtered by the table AM; this way we could
avoid re-executing the pushed-down quals.
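The flagging idea could be sketched roughly as follows. All names here
(the per-tuple flag, the helpers) are invented for illustration; nothing
like this exists in the patch set yet:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: each returned tuple carries a flag telling the
 * executor whether the table AM already evaluated the pushed-down
 * keys against it. */
typedef struct
{
    int  value;            /* stand-in for the tuple's key column */
    bool am_checked;       /* set by the AM when it applied the keys */
} MockTuple;

static bool
qual_matches(int value, int arg)
{
    return value == arg;   /* stand-in for re-running the qual */
}

/* Executor-side acceptance: skip the recheck when the AM vouched for
 * the tuple, otherwise fall back to re-evaluating the qual. */
static bool
tuple_passes(const MockTuple *tup, int arg)
{
    if (tup->am_checked)
        return true;
    return qual_matches(tup->value, arg);
}
```

With the flag set, the executor trusts the AM and the second qual
evaluation is skipped; without it, the current double evaluation remains.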
Details about the patch files
v1-0001-Pass-the-number-of-ScanKeys-to-scan_rescan.patch: This patch
adds the number of ScanKeys as a new argument to scan_rescan(). The
number of ScanKeys was previously passed to the table AM only via
begin_scan(), not via scan_rescan().
v1-0002-Simple-quals-push-down-to-table-AMs.patch: Core of the feature,
this patch adds qual push down support for OpExpr expressions.
v1-0003-Add-the-table-reloption-quals_push_down.patch: Adds a new
reloption, quals_push_down, used to enable/disable qual push down for a
table. Disabled by default.
v1-0004-Add-tests-for-quals-push-down-to-table-AM.patch: Regression
tests.
v1-0005-Push-down-IN-NOT-IN-array-quals-to-table-AMs.patch:
ScalarArrayOpExpr support.
v1-0006-Push-down-IS-IS-NOT-NULL-quals-to-table-AMs.patch: NullTest
support.
Performance testing
Head:
CREATE TABLE t (i INTEGER);
Patch:
CREATE TABLE t (i INTEGER) WITH (quals_push_down = on);
n=1M:
INSERT INTO t SELECT generate_series(1, 1000000);
VACUUM t;
n=10M:
TRUNCATE t;
INSERT INTO t SELECT generate_series(1, 10000000);
VACUUM t;
n=100M:
TRUNCATE t;
INSERT INTO t SELECT generate_series(1, 100000000);
VACUUM t;
Case #1: SELECT COUNT(*) FROM t WHERE i = 50000;
        |      n=1M       |       n=10M       |        n=100M
        +--------+--------+---------+---------+----------+----------
        | Head   | Patch  | Head    | Patch   | Head     | Patch
--------+--------+--------+---------+---------+----------+----------
Test #1 | 38.903 | 21.308 | 365.707 | 155.429 | 3939.937 | 1564.182
Test #2 | 39.239 | 21.271 | 364.206 | 153.127 | 3872.370 | 1527.988
Test #3 | 39.015 | 21.958 | 365.434 | 154.498 | 3812.382 | 1525.535
--------+--------+--------+---------+---------+----------+----------
Average | 39.052 | 21.512 | 365.116 | 154.351 | 3874.896 | 1539.235
Std dev |  0.171 |  0.386 |   0.800 |   1.158 |   63.815 |   21.640
--------+--------+--------+---------+---------+----------+----------
Gain    |     44.91%      |      57.73%       |       60.28%
Case #2: SELECT COUNT(*) FROM t WHERE i >= 2;
        |      n=1M       |       n=10M       |        n=100M
        +--------+--------+---------+---------+----------+----------
        | Head   | Patch  | Head    | Patch   | Head     | Patch
--------+--------+--------+---------+---------+----------+----------
Test #1 | 68.422 | 81.233 | 674.397 | 778.427 | 6845.165 | 8071.627
Test #2 | 69.237 | 80.868 | 682.976 | 774.417 | 6533.091 | 7668.477
Test #3 | 69.579 | 80.418 | 676.072 | 791.465 | 6917.964 | 7916.182
--------+--------+--------+---------+---------+----------+----------
Average | 69.079 | 80.840 | 677.815 | 781.436 | 6765.407 | 7885.429
Std dev |  0.594 |  0.408 |   4.547 |   8.914 |  204.457 |  203.327
--------+--------+--------+---------+---------+----------+----------
Gain    |    -17.02%      |     -15.29%       |      -16.56%
Thoughts?
Best regards,
--
Julien Tachoires
Attachments
- v1-0001-Pass-the-number-of-ScanKeys-to-scan_rescan.patch
- v1-0002-Simple-quals-push-down-to-table-AMs.patch
- v1-0003-Add-the-table-reloption-quals_push_down.patch
- v1-0004-Add-tests-for-quals-push-down-to-table-AM.patch
- v1-0005-Push-down-IN-NOT-IN-array-quals-to-table-AMs.patch
- v1-0006-Push-down-IS-IS-NOT-NULL-quals-to-table-AMs.patch
On 2025-08-27 22:27:37 +0200, Julien Tachoires wrote:
> Please find attached a patch set proposal intended to implement WHERE
> clauses (qual) push down to the underlying table AM during
> table/sequential scan execution.
> [...]
> improvement in seq scan execution time when only one tuple is returned
> from the table.

One problem with doing that in the case of heapam is that you're
evaluating scan keys with the buffer lock held - with basically
arbitrary expressions being evaluated. That's an easy path to undetected
deadlocks. You'd have to redesign the relevant mechanism to filter
outside of the lock...

Greetings,

Andres Freund
On Thu, 28 Aug 2025 at 01:27, Julien Tachoires <julien@tachoires.me> wrote:
> Please find attached a patch set proposal intended to implement WHERE
> clauses (qual) push down to the underlying table AM during
> table/sequential scan execution.
> [...]

Hi!

I was also always wondering whether something like qual pushing could be
implemented in Postgres. It is indeed very beneficial for column-based
processing in MPP databases, Greenplum and Cloudberry to name a few. I
did my own micro-research a while ago (while working on some Cloudberry
features), so here are my thoughts on the subject.

What this patchset does is pass ScanKeys directly to the table AM
somewhat blindly, which speeds up the execution phase. While I do not
have strong objections to this approach, I suspect it breaks some layers
of abstraction and *silent* (or maybe documented) agreements about the
responsibilities of table AM functions. So, passing ScanKeys directly to
the TAM is used on HEAD for catalog access only. Correct me if I'm
wrong. For all other types of relation, each query is planned, which
includes:

(1) building data access paths through various data access methods (indexes)
(2) deciding, for each qual, which indexes can be used to satisfy it
(3) using the cost model to pick the best options

None of this can be done with your approach?

The cost model can give hints to the optimizer that this TAM will
process some qual much faster than any by-index access. A smart cost
model/optimizer can realise that selecting only a few of all attributes
from a column-oriented relation, plus filtering using SIMD etc., can be
really cheap.

So maybe the good shape of this patch would be something that could
choose between seqscan and indexscan at planning time?

--
Best regards,
Kirill Reshke
Hi,

On Wed, Aug 27, 2025 at 05:50:01PM -0400, Andres Freund wrote:
> One problem with doing that in the case of heapam is that you're evaluating
> scan keys with the buffer lock held - with basically arbitrary expressions
> being evaluated. That's an easy path to undetected deadlocks. You'd have to
> redesign the relevant mechanism to filter outside of the lock...

Thank you for this quick feedback.

One potential approach to solve this in heapgettup() would be:
1. hold the buffer lock
2. get the tuple from the buffer
3. if the tuple is not visible, move to the next tuple, back to 2.
4. release the buffer lock
5. if the tuple does not satisfy the scan keys, take the buffer lock,
   move to the next tuple, back to 2.
6. return the tuple

Do you see something fundamentally wrong here?

In practice, I might be wrong, but I think this problem affects
heapgettup() only: heapgettup_pagemode() does not hold the buffer lock
when HeapKeyTest() is called.

To reach the problematic part we need two conditions: a non-NULL ScanKey
array and not being in page-at-a-time mode (pagemode). The only
table_beginscan_something() function able to meet these conditions
before calling the scan_begin() API is table_beginscan_sampling(): the
caller can choose whether to use pagemode. The only place where
table_beginscan_sampling() is called is in tablesample_init(), and in
that case the ScanKey array is NULL, so we cannot reach the problematic
part of heapgettup(). There is only one other case where we disable
pagemode in heapam: when the snapshot is non-MVCC.

Do you know of any other code path/scenario I missed that can lead to
this problematic part?

Best regards,

--
Julien Tachoires
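The lock-release flow proposed above can be modeled in a standalone
sketch, with mocked lock and visibility helpers (none of this is the
real heapam code), mainly to show that the scan keys are only ever
evaluated with the lock released:

```c
#include <assert.h>
#include <stdbool.h>

/* Mock buffer lock: a flag we can assert on. */
static bool lock_held = false;
static void lock_buffer(void)   { lock_held = true; }
static void unlock_buffer(void) { lock_held = false; }

static bool key_eval_under_lock = false;   /* records any violation */

static bool
visible(int tuple) { return tuple >= 0; }  /* mock visibility check */

static bool
satisfies_keys(int tuple)
{
    if (lock_held)
        key_eval_under_lock = true;        /* the invariant we avoid */
    return tuple % 2 == 0;                 /* mock scan key */
}

/* Return the index of the first matching visible tuple, or -1. */
static int
mock_gettup(const int *tuples, int ntuples)
{
    for (int i = 0; i < ntuples; i++)
    {
        lock_buffer();                     /* 1. hold the buffer lock  */
        bool vis = visible(tuples[i]);     /* 2-3. fetch + visibility  */
        unlock_buffer();                   /* 4. release before keys   */
        if (!vis)
            continue;
        if (satisfies_keys(tuples[i]))     /* 5. keys without the lock */
            return i;                      /* 6. return the tuple      */
    }
    return -1;
}
```

In the real code, step 5 would retake the lock before moving to the next
tuple; the point of the sketch is only the ordering of steps 4 and 5.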
Hi,

On Thu, Aug 28, 2025 at 02:57:02AM +0500, Kirill Reshke wrote:
> So maybe the good shape of this patch would be something that could
> choose between seqscan and indexscan in planner time?

Thank you for your quick feedback.

Exactly: this patch does not add or reduce any cost when some quals are
planned to be pushed down. I agree with you that it would be nice
(necessary?) to have this. I think the table AM API should provide, via
new APIs, cost estimation for table scans that takes into account the
cost of evaluating the quals, if any.

Best regards,

--
Julien Tachoires
Thanks for the patchset, Julien.
v1-0001:
All current callers of scan_rescan() pass NULL for the `key` parameter, making it unclear to
table AM authors whether this is supposed to be a pointer to a single key or an array. If an array,
how is it terminated? None of this is addressed in the current code comments, and given
that nobody uses this field, the intent is undiscoverable. By adding `int nkeys` to the parameter
list, your patch makes the intention clearer. Perhaps you could also update the documentation
for these functions?
v1-0002:
Changing the EXPLAIN output to include where the exclusion happened is quite nice!
Out of curiosity, why did you wait until this patch to add the `int nkeys` parameter to
initscan()? It seems more on-topic for v1-0001. Likewise, renaming `key` as `keys` helps,
but could have been done in v1-0001.
As for Andres' concern upstream, I am including a Work-In-Progress patch (WIP) to check
that no light-weight locks are held during qual evaluation. I don't intend this for commit so much
as for discussion. It seems to me fairly clear that your patch does not evaluate the quals while
holding such locks, but I might have misunderstood the concern.
Mark Dilger
Attachments
Hi Mark,

On Tue, Oct 07, 2025 at 04:55:53PM -0700, Mark Dilger wrote:
> By adding `int nkeys` to the parameter list, your patch makes the
> intention clearer. Perhaps you could also update the documentation
> for these functions?
> [...]
> Out of curiosity, why did you wait until this patch to add the `int
> nkeys` parameter to initscan()? It seems more on-topic for v1-0001.
> Likewise, renaming `key` as `keys` helps, but could have been done in
> v1-0001.

Thanks for your review. Please find a new version attached that
addresses these points.

> As for Andres' concern upstream, I am including a Work-In-Progress
> patch (WIP) to check that no light-weight locks are held during qual
> evaluation. I don't intend this for commit so much as for discussion.

Thanks, it helps to confirm that, due to the absence of ScanKeys,
HeapKeyTest() is not called from heapgettup(), at least when running the
regression tests.

In order to guarantee that the buffer lock is not held while evaluating
a ScanKey, v4-0001-Release-buffer-lock-before-scan-key-evaluation
releases the lock before calling HeapKeyTest().

Regards,

--
Julien Tachoires
Attachments
- v4-0001-Release-buffer-lock-before-scan-key-evaluation.patch
- v4-0002-Pass-the-number-of-ScanKeys-to-scan_rescan.patch
- v4-0003-Simple-quals-push-down-to-table-AMs.patch
- v4-0004-Add-the-table-reloption-quals_push_down.patch
- v4-0005-Add-tests-for-quals-push-down-to-table-AM.patch
- v4-0006-Push-down-IN-NOT-IN-array-quals-to-table-AMs.patch
- v4-0007-Push-down-IS-IS-NOT-NULL-quals-to-table-AMs.patch
On Fri, Aug 29, 2025 at 4:38 AM Julien Tachoires <julien@tachoires.me> wrote:
> One potential approach to solve this in heapgettup() would be:
> [...]
> Do you see something fundamentally wrong here?

I spent a bit of time this afternoon looking at v4-0001. I noticed a few
spelling mistakes (abritrary x2, statisfied x1).

As far as the basic approach is concerned, I don't see how there can be
a safety problem here. If it's safe to release the buffer lock when we
find a tuple that matches the quals, for the purposes of returning that
tuple to the caller, then it seems like it must also be safe to release
it to evaluate a proposed qual.

Potentially, there could be a performance problem. Imagine that we have
some code right now that uses this code path and it's safe because the
qual that we're evaluating is something super-simple like the integer
less-than operator, so calling it under the buffer lock doesn't create a
stability hazard. Well, with the patch, we'd potentially take and
release the buffer lock a lot more times than we do right now. Imagine
that there are lots of tuples on each page but only 1 or very few of
them satisfy the qual: then we lock and unlock the buffer a whole bunch
of times instead of just once.

However, I don't think this really happens in practice. I believe it's
possible to take this code path if you set ignore_system_indexes=on,
because that turns index scans --- which, not surprisingly, have
scankeys --- into sequential scans which then end up also having
scankeys. Many of those scans use catalog snapshots so there's no issue,
but a little bit of debugging code seems to show that
systable_beginscan() can also be called with snapshot->snapshot_type set
to SNAPSHOT_ANY or SNAPSHOT_DIRTY. For example, see
GetNewOidWithIndex(). However, even if ignore_system_indexes=on gets a
little slower as a result of this or some other patch, I don't think we
really care, and without that setting, this code doesn't seem to get
exercised at all.

So, somewhat to my surprise, I think that v4-0001 might be basically
fine. I wonder if anyone else sees a problem that I'm missing?

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2025-12-09 16:40:17 -0500, Robert Haas wrote:
> Potentially, there could be a performance problem

I think the big performance hazard with this is repeated deforming. The
scankey infrastructure deforms attributes one-by-one *and* it does not
"persist" the work of deforming for later accesses. So if you e.g. have
something like

  SELECT sum(col_29) FROM tbl WHERE col_30 = common_value;
or
  SELECT * FROM tbl WHERE col_30 = common_value;

we'll now deform col_30 in isolation for the ScanKey evaluation and then
we'll deform columns 1-29 in the slot (because we always deform all the
leading columns), during projection.

But even leaving the slot issue aside, I'd bet that you'll see overhead
due to *not* deforming multiple columns at once. If you have a ScanKey
version of something like

  WHERE column_20 = common_val AND column_21 = some_val AND column_22 = another_val;

and there's a NULL or varlena value in one of the leading columns, we'll
redo a fair bit of work during the fastgetattr() for column_22.

I don't really see this being viable without first tackling two
nontrivial projects:

1) Make slot deforming for expressions & projections selective, i.e.
   don't deform all the leading columns, but only ones that will
   eventually be needed
2) Perform ScanKey evaluation in slot form, to be able to cache the
   deforming and to make deforming of multiple columns sufficiently
   efficient.

> So, somewhat to my surprise, I think that v4-0001 might be basically
> fine. I wonder if anyone else sees a problem that I'm missing?

I doubt this would be safe as-is: ISTM that if you release the page lock
between tuples, things like the number of items on the page can change.
But we store stuff like that in registers / on the stack, which could
change while the lock is not held.

We could refetch the number of items on the page for every loop
iteration, but that'd probably not be free. OTOH, it's probably nothing
compared to the cost of relocking the page...

Greetings,

Andres Freund
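The repeated-walk effect described above can be modeled standalone. The
attribute widths and step counter below are invented for illustration;
this is not PostgreSQL's actual fastgetattr/heap_deform_tuple code:

```c
#include <assert.h>
#include <stddef.h>

/* With variable-width columns, locating attribute N means walking
 * attributes 1..N-1.  Per-attribute lookups (the ScanKey path) redo
 * that walk on every call, while a slot-style deform walks once and
 * caches every offset. */
#define NATTS 4

static const int widths[NATTS] = {4, 13, 7, 4};  /* varlena-ish widths */
static int walk_steps;                           /* work counter */

/* One-by-one access: walk from the start of the tuple every call.
 * attno is a 0-based attribute index. */
static size_t
getattr_offset(int attno)
{
    size_t off = 0;
    for (int i = 0; i < attno; i++)
    {
        off += widths[i];
        walk_steps++;
    }
    return off;
}

/* Slot-style deform: walk once, remember every attribute's offset. */
static void
deform_all(size_t offsets[NATTS])
{
    size_t off = 0;
    for (int i = 0; i < NATTS; i++)
    {
        offsets[i] = off;
        off += widths[i];
        walk_steps++;
    }
}
```

Fetching two later attributes individually costs more walk steps than a
single deform that covers all of them, which is the overhead argument
for evaluating ScanKeys in slot form.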
On Tue, Dec 9, 2025 at 6:08 PM Andres Freund <andres@anarazel.de> wrote:
> On 2025-12-09 16:40:17 -0500, Robert Haas wrote:
> > On Fri, Aug 29, 2025 at 4:38 AM Julien Tachoires <julien@tachoires.me> wrote:
> > Potentially, there could be a performance problem
>
> I think the big performance hazard with this is repeated deforming. The
> scankey infrastructure deforms attributes one-by-one *and* it does not
> "persist" the work of deforming for later accesses. So if you e.g. have
> something like
>
> SELECT sum(col_29) FROM tbl WHERE col_30 = common_value;
> or
> SELECT * FROM tbl WHERE col_30 = common_value;
>
> we'll now deform col_30 in isolation for the ScanKey evaluation and then we'll
> deform columns 1-29 in the slot (because we always deform all the leading
> columns), during projection.

Hmm, this is a good point, and I agree that it's a huge challenge for this patch set. Repeated tuple deforming is REALLY expensive, which is why we've spent so much energy trying to use slots in as many places as possible. I find it easy to believe that HeapKeyTest's loop over heap_getattr() is going to be prohibitively painful and that this code will need to somehow also be slot-ified for this to be a viable project.

> I don't really see this being viable without first tackling two nontrivial
> projects:
>
> 1) Make slot deforming for expressions & projections selective, i.e. don't
>    deform all the leading columns, but only ones that will eventually be
>    needed
> 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
>    and to make deforming of multiple columns sufficiently efficient.

IOW, I agree that we probably need to do #2. I am not entirely sure about #1. I'm a little afraid that trying to skip over columns without deforming them will add a bunch of code complexity that doesn't really pay off. You have to do the bookkeeping to know what to skip, and then how much are you really gaining by skipping it? If you can skip over a bunch of fixed-width columns, that's cool, but it's probably fairly normal to have lots of varlena columns, and then I don't really see that you're gaining much here. You still have to iterate through the tuple, and not storing the pointer to the start of each column as you find it doesn't seem like it will save much. What's your reasoning behind thinking that #1 will be necessary?

> > So, somewhat to my surprise, I think that v4-0001 might be basically
> > fine. I wonder if anyone else sees a problem that I'm missing?
>
> I doubt this would be safe as-is: ISTM that if you release the page lock
> between tuples, things like the number of items on the page can change. But we
> store stuff like that in registers / on the stack, which could change while
> the lock is not held.
>
> We could refetch the number of items on the page for every loop iteration, but
> that'd probably not be free. OTOH, it's probably nothing compared to the cost
> of relocking the page...

We still hold a pin, though, which I think means very little can change. More items can be added to the page, so we might want to refresh the number of items on the page at least when we think we're done, but I believe that any sort of more invasive page rearrangement would be precluded by the pin.

I kind of wonder if it would be good to make a change along the lines of v4-0001 even if this patch set doesn't move forward overall, or will need a lot of slot-ification to do so. It seems weird to me that we're OK with calling out to arbitrary code with a buffer lock held, and even weirder that whether or not we do that depends on whether SO_ALLOW_PAGEMODE was set. I don't think a difference of this kind between pagemode behavior and non-pagemode behavior would survive review if someone proposed it today; the fact that it works the way it does is probably an artifact of this mechanism having been added twenty years ago when the project was in a very different place.

--
Robert Haas
EDB: http://www.enterprisedb.com
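[Editor's note: a minimal, self-contained sketch of the deforming cost discussed above. This is plain C with a toy tuple encoding (1-byte length prefix per column), not PostgreSQL's actual HeapTuple format; `get_column` is a hypothetical name. The point it illustrates is that when column offsets are data-dependent (as with varlena columns), locating column n forces a walk over all n leading columns, just as a per-column heap_getattr() loop does.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy tuple encoding (NOT PostgreSQL's format): each column is a 1-byte
 * length followed by its payload, so the offset of column n depends on the
 * lengths of all preceding columns.  A naive per-column fetch therefore
 * must walk columns 0..n-1 to reach column n.
 */
static const uint8_t *
get_column(const uint8_t *tuple, int n, uint8_t *len_out, int *cols_walked)
{
    const uint8_t *p = tuple;
    int walked = 0;

    /* Skip the n leading columns; each skip requires reading a length byte. */
    for (int i = 0; i < n; i++)
    {
        p += 1 + p[0];          /* length byte + payload */
        walked++;
    }
    *len_out = p[0];
    if (cols_walked)
        *cols_walked = walked;
    return p + 1;               /* pointer to column n's payload */
}
```

Calling `get_column(tuple, 29, ...)` for a ScanKey and then fetching columns 0..28 separately for projection repeats that walk, which is the double-deforming hazard described in the email.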
On Thu, Dec 11, 2025 at 12:41 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 9, 2025 at 6:08 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-12-09 16:40:17 -0500, Robert Haas wrote:
> > > On Fri, Aug 29, 2025 at 4:38 AM Julien Tachoires <julien@tachoires.me> wrote:
> > > Potentially, there could be a performance problem
> >
> > I think the big performance hazard with this is repeated deforming. The
> > scankey infrastructure deforms attributes one-by-one *and* it does not
> > "persist" the work of deforming for later accesses. So if you e.g. have
> > something like
> >
> > SELECT sum(col_29) FROM tbl WHERE col_30 = common_value;
> > or
> > SELECT * FROM tbl WHERE col_30 = common_value;
> >
> > we'll now deform col_30 in isolation for the ScanKey evaluation and then we'll
> > deform columns 1-29 in the slot (because we always deform all the leading
> > columns), during projection.
>
> Hmm, this is a good point, and I agree that it's a huge challenge for
> this patch set. Repeated tuple deforming is REALLY expensive, which is
> why we've spent so much energy trying to use slots in as many places
> as possible. I find it easy to believe that HeapKeyTest's loop over
> heap_getattr() is going to be prohibitively painful and that this code
> will need to somehow also be slot-ified for this to be a viable
> project.
>
> > I don't really see this being viable without first tackling two nontrivial
> > projects:
> >
> > 1) Make slot deforming for expressions & projections selective, i.e. don't
> >    deform all the leading columns, but only ones that will eventually be
> >    needed
> > 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
> >    and to make deforming of multiple columns sufficiently efficient.
>
> IOW, I agree that we probably need to do #2. I am not entirely sure
> about #1.

I'm also curious to understand why Andres sees #1 as a prerequisite for qual pushdown.

> I'm a little afraid that trying to skip over columns without
> deforming them will add a bunch of code complexity that doesn't really
> pay off.

I think it might be worthwhile. I have a PoC [1] I worked on (at Andres's suggestion) that showed ~2x improvement on simple aggregation queries over wide tables (all pages buffered) by tracking the minimum needed attribute and using cached offsets stored in TupleDesc to skip fixed-not-null prefixes. I'm thinking of reviving it with proper tracking of which attributes are needed and deformed (bitmapset or flag array in TupleTableSlot).

> > > So, somewhat to my surprise, I think that v4-0001 might be basically
> > > fine. I wonder if anyone else sees a problem that I'm missing?
> >
> > I doubt this would be safe as-is: ISTM that if you release the page lock
> > between tuples, things like the number of items on the page can change. But we
> > store stuff like that in registers / on the stack, which could change while
> > the lock is not held.
> >
> > We could refetch the number of items on the page for every loop iteration, but
> > that'd probably not be free. OTOH, it's probably nothing compared to the cost
> > of relocking the page...
>
> We still hold a pin, though, which I think means very little can
> change. More items can be added to the page, so we might want to
> refresh the number of items on the page at least when we think we're
> done, but I believe that any sort of more invasive page rearrangement
> would be precluded by the pin.
>
> I kind of wonder if it would be good to make a change along the lines
> of v4-0001 even if this patch set doesn't move forward overall, or
> will need a lot of slot-ification to do so. It seems weird to me that
> we're OK with calling out to arbitrary code with a buffer lock held,
> and even weirder that whether or not we do that depends on whether
> SO_ALLOW_PAGEMODE was set. I don't think a difference of this kind
> between pagemode behavior and non-pagemode behavior would survive
> review if someone proposed it today; the fact that it works the way it
> does is probably an artifact of this mechanism having been added
> twenty years ago when the project was in a very different place.

One maybe crazy thought: what about only enabling qual pushdown when pagemode is used, since it already processes all tuples on a page in one locked phase? That raises the question of whether there's a class of quals simple enough (built-in ops?) that evaluating them alongside visibility checking would be acceptable, with lock held that is -- but it would avoid the lock churn and racy loop termination issues with v4-0001.

--
Thanks, Amit Langote

[1] https://www.postgresql.org/message-id/CA%2BHiwqHXDY6TxegR2Cr_4sRa_LY1QJnoL8XRmOqdfrx21pZ6cw%40mail.gmail.com
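[Editor's note: a hypothetical sketch of the cached-offset idea Amit describes (precomputed offsets for a fixed-width not-null prefix). The `RowDesc` struct, field names, and layout here are illustrative stand-ins, not PostgreSQL's TupleDesc or its attcacheoff machinery; the sketch only shows why fetching a column in such a prefix can be O(1) instead of a walk over all earlier columns.]

```c
#include <assert.h>
#include <stddef.h>

/*
 * If the leading columns of a row are fixed-width and NOT NULL, their byte
 * offsets never depend on the row's contents, so they can be precomputed
 * once per descriptor rather than rediscovered per row.
 */
#define MAX_COLS 8

typedef struct RowDesc
{
    int     ncols;
    int     widths[MAX_COLS];           /* fixed width of each column, bytes */
    int     cached_offset[MAX_COLS];    /* precomputed start offsets */
} RowDesc;

/* Precompute offsets once; analogous to caching offsets in the descriptor. */
static void
rowdesc_cache_offsets(RowDesc *desc)
{
    int off = 0;

    for (int i = 0; i < desc->ncols; i++)
    {
        desc->cached_offset[i] = off;
        off += desc->widths[i];
    }
}

/* O(1) fetch of any column in the fixed-not-null prefix. */
static const void *
fetch_column(const RowDesc *desc, const unsigned char *row, int col)
{
    return row + desc->cached_offset[col];
}
```

The prefix restriction matters: one nullable or variable-width column ends the region where offsets are static, which is why the PoC only skips "fixed-not-null prefixes".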
Hi,

On 2025-12-15 21:56:12 +0900, Amit Langote wrote:
> On Thu, Dec 11, 2025 at 12:41 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Tue, Dec 9, 2025 at 6:08 PM Andres Freund <andres@anarazel.de> wrote:
> > > On 2025-12-09 16:40:17 -0500, Robert Haas wrote:
> > > > On Fri, Aug 29, 2025 at 4:38 AM Julien Tachoires <julien@tachoires.me> wrote:
> > > > Potentially, there could be a performance problem
> > >
> > > I think the big performance hazard with this is repeated deforming. The
> > > scankey infrastructure deforms attributes one-by-one *and* it does not
> > > "persist" the work of deforming for later accesses. So if you e.g. have
> > > something like
> > >
> > > SELECT sum(col_29) FROM tbl WHERE col_30 = common_value;
> > > or
> > > SELECT * FROM tbl WHERE col_30 = common_value;
> > >
> > > we'll now deform col_30 in isolation for the ScanKey evaluation and then we'll
> > > deform columns 1-29 in the slot (because we always deform all the leading
> > > columns), during projection.
> >
> > Hmm, this is a good point, and I agree that it's a huge challenge for
> > this patch set. Repeated tuple deforming is REALLY expensive, which is
> > why we've spent so much energy trying to use slots in as many places
> > as possible. I find it easy to believe that HeapKeyTest's loop over
> > heap_getattr() is going to be prohibitively painful and that this code
> > will need to somehow also be slot-ified for this to be a viable
> > project.
> >
> > > I don't really see this being viable without first tackling two nontrivial
> > > projects:
> > >
> > > 1) Make slot deforming for expressions & projections selective, i.e. don't
> > >    deform all the leading columns, but only ones that will eventually be
> > >    needed
> > > 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
> > >    and to make deforming of multiple columns sufficiently efficient.
> >
> > IOW, I agree that we probably need to do #2. I am not entirely sure
> > about #1.
>
> I'm also curious to understand why Andres sees #1 as a prerequisite
> for qual pushdown.

I suspect you'll not see a whole lot of gain without it. When I experimented with it, a good portion (but not all!) of the gain seemed to be from just deforming the immediately required columns - but also a lot of the visible regressions were from that.

> > We still hold a pin, though, which I think means very little can
> > change. More items can be added to the page, so we might want to
> > refresh the number of items on the page at least when we think we're
> > done, but I believe that any sort of more invasive page rearrangement
> > would be precluded by the pin.
> >
> > I kind of wonder if it would be good to make a change along the lines
> > of v4-0001 even if this patch set doesn't move forward overall, or
> > will need a lot of slot-ification to do so. It seems weird to me that
> > we're OK with calling out to arbitrary code with a buffer lock held,
> > and even weirder that whether or not we do that depends on whether
> > SO_ALLOW_PAGEMODE was set. I don't think a difference of this kind
> > between pagemode behavior and non-pagemode behavior would survive
> > review if someone proposed it today; the fact that it works the way it
> > does is probably an artifact of this mechanism having been added
> > twenty years ago when the project was in a very different place.
>
> One maybe crazy thought: what about only enabling qual pushdown when
> pagemode is used, since it already processes all tuples on a page in
> one locked phase? That raises the question of whether there's a class
> of quals simple enough (built-in ops?) that evaluating them alongside
> visibility checking would be acceptable, with lock held that is -- but
> it would avoid the lock churn and racy loop termination issues with
> v4-0001.

I think that's the wrong direction to go. We shouldn't do more under the lock, we should do less. You certainly couldn't just use builtin ops, as some of them *do* other catalog lookups, which would lead to deadlock potential.

I think it's also just unnecessary to try to do anything under a lock here, it's not hard to first do all the visibility checks while locked and then, after unlocking, filter the tuples based on quals.

Greetings,

Andres Freund
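[Editor's note: a sketch of the two-phase approach Andres describes, with all PostgreSQL specifics (buffer locks, HeapTupleSatisfiesVisibility, ScanKey evaluation) replaced by hypothetical stand-ins. Phase 1 models the work done "under the lock" (visibility only); phase 2 models qual filtering done "after unlocking", where arbitrary or catalog-touching code is safe to run.]

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for a visibility check and a pushed-down qual, respectively. */
typedef bool (*visible_fn)(int item);
typedef bool (*qual_fn)(int item);

/* Illustrative predicates for demonstration only. */
static bool demo_visible(int item) { return item % 2 == 0; }
static bool demo_qual(int item)    { return item < 5; }

/*
 * Returns the number of items passing both phases; passing[] receives them.
 * nitems is captured once, as it would be while the page is still locked,
 * so there is no racy re-read of the page's item count between tuples.
 */
static int
scan_page_two_phase(int nitems, visible_fn visible, qual_fn qual, int *passing)
{
    int visible_items[64];
    int nvisible = 0;
    int npass = 0;

    /* Phase 1: visibility checks only -- the part done with the lock held. */
    for (int i = 0; i < nitems; i++)
        if (visible(i))
            visible_items[nvisible++] = i;

    /* Phase 2: qual evaluation -- done with the lock released. */
    for (int i = 0; i < nvisible; i++)
        if (qual(visible_items[i]))
            passing[npass++] = visible_items[i];

    return npass;
}
```

This mirrors how pagemode already separates visibility from later tuple processing: no user-visible code runs in phase 1, so quals of arbitrary complexity are confined to phase 2.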
Hi,

On 2025-12-10 10:41:19 -0500, Robert Haas wrote:
> On Tue, Dec 9, 2025 at 6:08 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't really see this being viable without first tackling two nontrivial
> > projects:
> >
> > 1) Make slot deforming for expressions & projections selective, i.e. don't
> >    deform all the leading columns, but only ones that will eventually be
> >    needed
> > 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
> >    and to make deforming of multiple columns sufficiently efficient.
>
> IOW, I agree that we probably need to do #2. I am not entirely sure
> about #1. I'm a little afraid that trying to skip over columns without
> deforming them will add a bunch of code complexity that doesn't really
> pay off. You have to do the bookkeeping to know what to skip, and then
> how much are you really gaining by skipping it? If you can skip over a
> bunch of fixed-width columns, that's cool, but it's probably fairly
> normal to have lots of varlena columns, and then I don't really see
> that you're gaining much here. You still have to iterate through the
> tuple, and not storing the pointer to the start of each column as you
> find it doesn't seem like it will save much.

FWIW, in experiments I observed that not storing all the columns that never are used saves a pretty decent amount of cycles, just due to not having to store the never-accessed datums in the slot (and in case of non-varlena columns, not having to fetch the relevant data). It's probably true that the gain for varlenas is smaller, due to the cost of determining the length, but the difference is that if the cost of storing the columns is relevant, fields after a NULL or a varlena still can benefit from the optimization.

> What's your reasoning behind thinking that #1 will be necessary?

Tried to answer that downthread, in response to Amit.

> > > So, somewhat to my surprise, I think that v4-0001 might be basically
> > > fine. I wonder if anyone else sees a problem that I'm missing?
> >
> > I doubt this would be safe as-is: ISTM that if you release the page lock
> > between tuples, things like the number of items on the page can change. But we
> > store stuff like that in registers / on the stack, which could change while
> > the lock is not held.
> >
> > We could refetch the number of items on the page for every loop iteration, but
> > that'd probably not be free. OTOH, it's probably nothing compared to the cost
> > of relocking the page...
>
> We still hold a pin, though, which I think means very little can
> change. More items can be added to the page, so we might want to
> refresh the number of items on the page at least when we think we're
> done, but I believe that any sort of more invasive page rearrangement
> would be precluded by the pin.
>
> I kind of wonder if it would be good to make a change along the lines
> of v4-0001 even if this patch set doesn't move forward overall, or
> will need a lot of slot-ification to do so. It seems weird to me that
> we're OK with calling out to arbitrary code with a buffer lock held,
> and even weirder that whether or not we do that depends on whether
> SO_ALLOW_PAGEMODE was set.

I don't think it's stated clearly anywhere - the only reason this is remotely within a stone's throw of ok is that the only code using ScanKeys for table scans is catcache, which in turn means that the set of effectively allowed operators is tiny (oideq, int2eq, int4eq, int8eq, nameeq, chareq, booleq and perhaps 2-3 more). And for those we only support ScanKey mode because it's required as a fallback for index based catcache searches.

I'm not against fixing qual-eval-under-buffer-lock, it shouldn't ever be used in particularly performance sensitive cases.

Greetings,

Andres Freund
Hi,

> On 10 Dec 2025, at 00:08, Andres Freund <andres@anarazel.de> wrote:
> I don't really see this being viable without first tackling two nontrivial
> projects:
>
> 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
>    and to make deforming of multiple columns sufficiently efficient.

Am I right in understanding that you think that the repeated calls to heap_getattr in HeapKeyTest are not ideal if we have NULL or varlena columns? I have written a small patch (see attached) that stores the heap tuple in a TupleTableSlot first and then calls slot_getattr instead, which should benefit from caching. Is that the type of solution you were thinking of? It is definitely not a complete patch (needs comments and a description), and it is not merged into the patch set of Julien yet, but I just wanted to check that this was what you were proposing and that I was not misunderstanding something.

> 1) Make slot deforming for expressions & projections selective, i.e. don't
>    deform all the leading columns, but only ones that will eventually be
>    needed

Concerning 1), I'm also not certain I understand why this is a prerequisite for the pushdown work. It could certainly be beneficial, but it seems to be complementary. In any case, I'd be interested to look at your POC patch on the subject, Amit.

Best,

Maxime Schoemans
Attachments
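[Editor's note: a sketch of the slot-style cached deforming Maxime describes (heap tuple stored in a slot, attributes fetched via slot_getattr so repeated accesses reuse earlier work). Names here (`ToySlot`, `slot_getattr_toy`) and the tuple encoding (1-byte length prefix per column) are illustrative, not the real TupleTableSlot API. The slot remembers how far it has deformed; a second fetch of an already-deformed column does no additional walking.]

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_MAX_COLS 16

typedef struct ToySlot
{
    const uint8_t *tuple;
    int     ndeformed;                      /* columns deformed so far */
    const uint8_t *values[SLOT_MAX_COLS];   /* cached payload pointers */
    uint8_t lengths[SLOT_MAX_COLS];         /* cached payload lengths */
    const uint8_t *next;                    /* resume point for deforming */
    int     walk_count;                     /* instrumentation: columns walked */
} ToySlot;

/* Store a tuple in the slot; deforming state is reset lazily. */
static void
slot_store_tuple(ToySlot *slot, const uint8_t *tuple)
{
    slot->tuple = tuple;
    slot->next = tuple;
    slot->ndeformed = 0;
    slot->walk_count = 0;
}

/* Deform up to and including column n, reusing previously deformed columns. */
static const uint8_t *
slot_getattr_toy(ToySlot *slot, int n, uint8_t *len_out)
{
    while (slot->ndeformed <= n)
    {
        int i = slot->ndeformed;

        slot->lengths[i] = slot->next[0];
        slot->values[i] = slot->next + 1;
        slot->next += 1 + slot->lengths[i];
        slot->ndeformed++;
        slot->walk_count++;
    }
    *len_out = slot->lengths[n];
    return slot->values[n];
}
```

The contrast with a per-call heap_getattr()-style walk is that evaluating a ScanKey on column n and then projecting columns 0..n shares one pass over the tuple instead of repeating it.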
Hi,

On 2025-12-18 20:40:31 +0100, Maxime Schoemans wrote:
> > On 10 Dec 2025, at 00:08, Andres Freund <andres@anarazel.de> wrote:
> > I don't really see this being viable without first tackling two nontrivial
> > projects:
> >
> > 2) Perform ScanKey evaluation in slot form, to be able to cache the deforming
> >    and to make deforming of multiple columns sufficiently efficient.
>
> Am I right in understanding that you think that the repeated calls to
> heap_getattr in HeapKeyTest are not ideal if we have NULL or varlena
> columns?

That's part of it, but not all of it: The other aspect is that if you do a bunch of heap_getattr()s inside HeapKeyTest() and then project that column in nodeSeqscan.c, you'll do the work to deform twice.

> > 1) Make slot deforming for expressions & projections selective, i.e. don't
> >    deform all the leading columns, but only ones that will eventually be
> >    needed
>
> Concerning 1), I'm also not certain I understand why this is a prerequisite
> for the pushdown work. It could certainly be beneficial, but it seems to be
> complementary.

As hinted at in [1] I suspect that you're just not going to see big enough wins without the above optimization. A decent portion of the win from using HeapKeyTest is to only selectively deform.

Greetings,

Andres Freund

[1] https://postgr.es/m/CAAh00EQUwG5khqJO7nSV0nsqsG1OP%3DkA6ACfxV3rnNSVd4b6TQ%40mail.gmail.com