Обсуждение: Convert NOT IN sublinks to anti-joins when safe

Поиск

Список

Период

Сортировка

Convert NOT IN sublinks to anti-joins when safe

От

Richard Guo

Дата:

03 февраля, 10:12:48

This topic has been discussed several times in the past.  Due to the
semantic mismatch regarding NULL handling, NOT IN is not ordinarily
equivalent to an anti-join.  However, if we can prove that neither the
outer expressions nor the subquery outputs can yield NULL values, it
should be safe to convert NOT IN to an anti-join.

I believe we are now in a much better position to attempt this again.
The planner has accumulated significant infrastructure that makes this
proof straightforward and reliable.  Specifically, we can now leverage
the outer-join-aware-Var infrastructure to tell whether a Var comes
from the nullable side of an outer join, and the not-null-attnums hash
table to efficiently check whether a Var is defined NOT NULL.  We also
have the expr_is_nonnullable() function that is smart enough to deduce
non-nullability for expressions more complex than simple Vars/Consts.

Attached is a draft patch for this attempt (part of the code is
adapted from an old patch [1] by David and Tom).  This patch aims for
a conservative implementation: the goal is not to handle every
theoretical case, but to handle canonical query patterns with minimal
code complexity.

The patch primarily targets patterns like:

  SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM banned_users);

... and

  SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM
banned_users WHERE user_id IS NOT NULL);

This is a very typical syntax for exclusion.  In well-modeled
databases, join keys like id and user_id are very likely to be defined
as NOT NULL.

It seems to me that the ROI here is quite positive: the added code
complexity is very low (thanks to the existing infrastructure), while
the benefit is that users writing this typical pattern will finally
get efficient anti-join plans without needing manual rewrites.

(For the outer expressions, we could potentially also use outer query
quals to prove non-nullability.  This patch does not attempt to do so.
Implementing this would require passing state down during the
pull_up_sublinks recursion; and given that find_nonnullable_vars can
fail to prove non-nullability in many cases due to the lack of
const-simplification at this stage, I'm not sure whether it is worth
the code complexity.  Besides, I haven't fully convinced myself that
doing this does not introduce correctness issues.)

Any thoughts?

[1] https://postgr.es/m/13766.1405037879@sss.pgh.pa.us

- Richard

Вложения

v1-0001-Convert-NOT-IN-sublinks-to-anti-joins-when-safe.patch

Re: Convert NOT IN sublinks to anti-joins when safe

От

wenhui qiu

Дата:

03 февраля, 12:41:59

Hi Richard

> I believe we are now in a much better position to attempt this again.
> The planner has accumulated significant infrastructure that makes this
> proof straightforward and reliable. Specifically, we can now leverage
> the outer-join-aware-Var infrastructure to tell whether a Var comes
> from the nullable side of an outer join, and the not-null-attnums hash

> table to efficiently check whether a Var is defined NOT NULL. We also
> have the expr_is_nonnullable() function that is smart enough to deduce
> non-nullability for expressions more complex than simple Vars/Consts.

Thank you for working on this.Indeed, the benefits are substantial and highly necessary, as Oracle, SQL Server, and MySQL have all implemented varying degrees of support.I shall test this path in my spare time.

Thanks ，

On Tue, Feb 3, 2026 at 3:13 PM Richard Guo <guofenglinux@gmail.com> wrote:

This topic has been discussed several times in the past. Due to the
semantic mismatch regarding NULL handling, NOT IN is not ordinarily
equivalent to an anti-join. However, if we can prove that neither the
outer expressions nor the subquery outputs can yield NULL values, it
should be safe to convert NOT IN to an anti-join.

I believe we are now in a much better position to attempt this again.
The planner has accumulated significant infrastructure that makes this
proof straightforward and reliable. Specifically, we can now leverage
the outer-join-aware-Var infrastructure to tell whether a Var comes
from the nullable side of an outer join, and the not-null-attnums hash
table to efficiently check whether a Var is defined NOT NULL. We also
have the expr_is_nonnullable() function that is smart enough to deduce
non-nullability for expressions more complex than simple Vars/Consts.

Attached is a draft patch for this attempt (part of the code is
adapted from an old patch [1] by David and Tom). This patch aims for
a conservative implementation: the goal is not to handle every
theoretical case, but to handle canonical query patterns with minimal
code complexity.

The patch primarily targets patterns like:

SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM banned_users);

... and

SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM
banned_users WHERE user_id IS NOT NULL);

This is a very typical syntax for exclusion. In well-modeled
databases, join keys like id and user_id are very likely to be defined
as NOT NULL.

It seems to me that the ROI here is quite positive: the added code
complexity is very low (thanks to the existing infrastructure), while
the benefit is that users writing this typical pattern will finally
get efficient anti-join plans without needing manual rewrites.

(For the outer expressions, we could potentially also use outer query
quals to prove non-nullability. This patch does not attempt to do so.
Implementing this would require passing state down during the
pull_up_sublinks recursion; and given that find_nonnullable_vars can
fail to prove non-nullability in many cases due to the lack of
const-simplification at this stage, I'm not sure whether it is worth
the code complexity. Besides, I haven't fully convinced myself that
doing this does not introduce correctness issues.)

Any thoughts?

[1] https://postgr.es/m/13766.1405037879@sss.pgh.pa.us

- Richard

Re: Convert NOT IN sublinks to anti-joins when safe

От

wenhui qiu

Дата:

04 февраля, 06:53:08

Hi Richard

That's just brilliant! Here’s my testing process, which also tackles a commonly criticized aspect of PostgreSQL, effectively addressing its shortcomings.Moreover, the path test scenarios are thorough, encompassing cases that involve "NOT IN".Once again, thank you for your hard work.

###########
create table join1 (id integer,name varchar(300),k1 integer);
create table join2 (id integer,name varchar(300),score integer);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(150201,1350300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join2 values ( generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',1);
insert into join2 values ( generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAABBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',2);
insert into join2 values ( generate_series(20001,22000),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
insert into join2 values ( generate_series(150201,950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
create index idx_j1 on join1(id);
create index idx_j2 on join2(id);
VACUUM ANALYZE JOIN1;
VACUUM ANALYZE JOIN2;

postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 WHERE T2.ID>10000);
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=20060.65..58729.40 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..17904.72 rows=862371 width=4)
Index Cond: (id > 10000)
(5 rows)

postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 WHERE T2.ID>10000 and t2.id is not null);
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=22216.57..60885.32 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..20060.65 rows=862371 width=4)
Index Cond: ((id > 10000) AND (id IS NOT NULL))
(5 rows)
postgres=# alter table join1 alter id set not null;
ALTER TABLE
postgres=# alter table join2 alter id set not null;
ALTER TABLE
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 WHERE T2.ID>10000 );
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=28684.35..72424.36 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=17904.72..17904.72 rows=862371 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..17904.72 rows=862371 width=4)
Index Cond: (id > 10000)
(6 rows)

postgres=#
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=27134.57..70874.58 rows=393902 width=8) (actual time=239.777..602.433 rows=400200.00 loops=1)
Hash Cond: (t1.id = t2.id)
Buffers: shared hit=25304
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8) (actual time=0.007..85.255 rows=1260300.00 loops=1)
Buffers: shared hit=22915
-> Hash (cost=16108.32..16108.32 rows=882100 width=4) (actual time=220.949..220.951 rows=882100.00 loops=1)
Buckets: 1048576 Batches: 1 Memory Usage: 39204kB
Buffers: shared hit=2389
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32 rows=882100 width=4) (actual time=0.008..103.306 rows=882100.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=2389
Planning:
Buffers: shared hit=12
Planning Time: 0.193 ms
Execution Time: 617.668 ms
(16 rows)

postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=27134.57..70874.58 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=16108.32..16108.32 rows=882100 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32 rows=882100 width=4)
(5 rows)

postgres=# alter table join2 alter id drop not null ;
ALTER TABLE
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=18313.57..56982.32 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32 rows=882100 width=4)
(4 rows)

postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=29339.83..73079.83 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=18313.58..18313.58 rows=882100 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..18313.58 rows=882100 width=4)
Index Cond: (id IS NOT NULL)
(6 rows)

postgres=# insert into join2 values ( generate_series(950300,1950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
INSERT 0 1000001
postgres=# analyze join2;
ANALYZE
postgres=#
postgres=# insert into join2 values ( generate_series(1950300,3950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
INSERT 0 2000001
postgres=# analyze join2;
ANALYZE
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.86..73003.87 rows=1 width=8) (actual time=348.632..371.158 rows=200.00 loops=1)
Workers Planned: 3
Workers Launched: 3
Buffers: shared hit=99166 read=3398
-> Merge Anti Join (cost=0.86..72003.77 rows=1 width=8) (actual time=177.574..334.714 rows=50.00 loops=4)
Merge Cond: (t1.id = t2.id)
Buffers: shared hit=99166 read=3398
-> Parallel Index Scan using idx_j1 on join1 t1 (cost=0.43..37380.06 rows=406548 width=8) (actual time=0.090..81.338 rows=315075.00 loops=4)
Index Searches: 1
Buffers: shared hit=85241 read=3398
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..80682.41 rows=3882102 width=4) (actual time=0.049..145.187 rows=1281766.50 loops=4)
Index Cond: (id IS NOT NULL)
Heap Fetches: 0
Index Searches: 4
Buffers: shared hit=13925
Planning:
Buffers: shared hit=12
Planning Time: 0.191 ms
Execution Time: 371.191 ms

postgres=# set work_mem ='128MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1477.040..1835.205 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.028..421.350 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.083 ms
Execution Time: 1845.041 ms
(11 rows)

postgres=# set work_mem ='8MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1468.254..1825.373 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.025..425.249 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.085 ms
Execution Time: 1835.148 ms
(11 rows)

postgres=# set work_mem ='128MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1477.040..1835.205 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.028..421.350 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.083 ms
Execution Time: 1845.041 ms
(11 rows)

postgres=# set work_mem ='8MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1468.254..1825.373 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.025..425.249 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.085 ms
Execution Time: 1835.148 ms
(11 rows)

postgres=# set work_mem ='2MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
^CCancel request sent
ERROR: canceling statement due to user request
postgres=# ^C
postgres=#

postgres=# set work_mem ='2MB';
SET
postgres=# select now();
now
-------------------------------
2026-02-04 11:31:42.957009+08
(1 row)

postgres=# select now();
now
-------------------------------
2026-02-04 11:32:10.811249+08
(1 row)

postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
^CCancel request sent
ERROR: canceling statement due to user request
postgres=# select now();
now
-------------------------------
2026-02-04 11:42:29.698854+08
(1 row)

postgres=#

Thanks

On Tue, Feb 3, 2026 at 5:41 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:

Hi Richard

> I believe we are now in a much better position to attempt this again.
> The planner has accumulated significant infrastructure that makes this
> proof straightforward and reliable. Specifically, we can now leverage
> the outer-join-aware-Var infrastructure to tell whether a Var comes
> from the nullable side of an outer join, and the not-null-attnums hash
> table to efficiently check whether a Var is defined NOT NULL. We also
> have the expr_is_nonnullable() function that is smart enough to deduce
> non-nullability for expressions more complex than simple Vars/Consts.
Thank you for working on this.Indeed, the benefits are substantial and highly necessary, as Oracle, SQL Server, and MySQL have all implemented varying degrees of support.I shall test this path in my spare time.

Thanks ，

On Tue, Feb 3, 2026 at 3:13 PM Richard Guo <guofenglinux@gmail.com> wrote:
This topic has been discussed several times in the past. Due to the
semantic mismatch regarding NULL handling, NOT IN is not ordinarily
equivalent to an anti-join. However, if we can prove that neither the
outer expressions nor the subquery outputs can yield NULL values, it
should be safe to convert NOT IN to an anti-join.

I believe we are now in a much better position to attempt this again.
The planner has accumulated significant infrastructure that makes this
proof straightforward and reliable. Specifically, we can now leverage
the outer-join-aware-Var infrastructure to tell whether a Var comes
from the nullable side of an outer join, and the not-null-attnums hash
table to efficiently check whether a Var is defined NOT NULL. We also
have the expr_is_nonnullable() function that is smart enough to deduce
non-nullability for expressions more complex than simple Vars/Consts.

Attached is a draft patch for this attempt (part of the code is
adapted from an old patch [1] by David and Tom). This patch aims for
a conservative implementation: the goal is not to handle every
theoretical case, but to handle canonical query patterns with minimal
code complexity.

The patch primarily targets patterns like:

SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM banned_users);

... and

SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM
banned_users WHERE user_id IS NOT NULL);

This is a very typical syntax for exclusion. In well-modeled
databases, join keys like id and user_id are very likely to be defined
as NOT NULL.

It seems to me that the ROI here is quite positive: the added code
complexity is very low (thanks to the existing infrastructure), while
the benefit is that users writing this typical pattern will finally
get efficient anti-join plans without needing manual rewrites.

(For the outer expressions, we could potentially also use outer query
quals to prove non-nullability. This patch does not attempt to do so.
Implementing this would require passing state down during the
pull_up_sublinks recursion; and given that find_nonnullable_vars can
fail to prove non-nullability in many cases due to the lack of
const-simplification at this stage, I'm not sure whether it is worth
the code complexity. Besides, I haven't fully convinced myself that
doing this does not introduce correctness issues.)

Any thoughts?

[1] https://postgr.es/m/13766.1405037879@sss.pgh.pa.us

- Richard

Re: Convert NOT IN sublinks to anti-joins when safe

От

Zhang Mingli

Дата:

04 февраля, 07:49:43

Hi,

On Feb 3, 2026 at 15:13 +0800, Richard Guo <guofenglinux@gmail.com>, wrote:

This topic has been discussed several times in the past. Due to the
semantic mismatch regarding NULL handling, NOT IN is not ordinarily
equivalent to an anti-join. However, if we can prove that neither the
outer expressions nor the subquery outputs can yield NULL values, it
should be safe to convert NOT IN to an anti-join.

Thanks for working on this!
I've reviewed the patch and it looks good overall.

I noticed several minor issues in the test case comments:

1. The comment doesn't match the SQL:

+-- ANTI JOIN: outer side is defined NOT NULL, inner side is forced nonnullable
+-- by qual clause
+SELECT * FROM not_null_tab
+WHERE id NOT IN (
+   SELECT t1.id
+   FROM null_tab t1
+   INNER JOIN null_tab t2 ON t1.id = t2.id
+   LEFT JOIN null_tab t3 ON TRUE
+);

The comment says "forced nonnullable by qual clause", but there's no explicit IS NOT NULL qual here.

2. Another test case that could use a more precise comment:

+-- ANTI JOIN: outer side is defined NOT NULL and is not nulled by outer join,
+-- inner side is defined NOT NULL
+SELECT * FROM not_null_tab t1
+LEFT JOIN not_null_tab t2
+ON t2.id NOT IN (SELECT id FROM not_null_tab);

Correct me if I’m wrong.
This is a subtle case - the key point is that the ON clause is evaluated on actual t2 rows *before* LEFT JOIN's null-padding.
The current comment is technically correct but might be clearer as:

-- ANTI JOIN: outer side(t2) is defined NOT NULL.
— ON clause is evaluated on actual t2 rows before LEFT JOIN's
-- null-padding, so t2.id is NOT NULL; inner side is also defined NOT NULL

3.Also, one suggestion for additional test coverage - the case where the subquery output comes from the nullable side of an outer join but is forced non-nullable by qual:

-- ANTI JOIN: inner side comes from nullable side of outer join
-- but is forced non-nullable by WHERE clause
EXPLAIN (COSTS OFF)
SELECT * FROM not_null_tab
WHERE id NOT IN (
   SELECT t2.id
   FROM not_null_tab t1
   LEFT JOIN not_null_tab t2 ON t1.id = t2.id
   WHERE t2.id IS NOT NULL
);

The existing tests cover t1.id (non-nullable side) with IS NOT NULL, but not t2.id (nullable side).

If I read the code correctly, this should work via find_subquery_safe_quals + find_nonnullable_vars, but explicit coverage would be good.

And I test it:
QUERY PLAN
----------------------------------------------------
Hash Anti Join
Hash Cond: (not_null_tab.id = t2.id)
->  Seq Scan on not_null_tab
->  Hash
->  Merge Join
Merge Cond: (t2.id = t1.id)
->  Sort
Sort Key: t2.id
->  Seq Scan on null_tab t2
Filter: (id IS NOT NULL)
->  Sort
Sort Key: t1.id
->  Seq Scan on null_tab t1
(13 rows)

--
Zhang Mingli
HashData

Re: Convert NOT IN sublinks to anti-joins when safe

От

Tender Wang

Дата:

04 февраля, 07:54:58

Zhang Mingli <zmlpostgres@gmail.com> 于2026年2月4日周三 12:50写道：
>
> Hi,
>
> On Feb 3, 2026 at 15:13 +0800, Richard Guo <guofenglinux@gmail.com>, wrote:
>
>
> This topic has been discussed several times in the past. Due to the
> semantic mismatch regarding NULL handling, NOT IN is not ordinarily
> equivalent to an anti-join. However, if we can prove that neither the
> outer expressions nor the subquery outputs can yield NULL values, it
> should be safe to convert NOT IN to an anti-join.
>
>
>
> Thanks for working on this!
> I've reviewed the patch and it looks good overall.
>
> I noticed several minor issues in the test case comments:
>
> 1. The comment doesn't match the SQL:
>
> +-- ANTI JOIN: outer side is defined NOT NULL, inner side is forced nonnullable
> +-- by qual clause
> +SELECT * FROM not_null_tab
> +WHERE id NOT IN (
> +    SELECT t1.id
> +    FROM null_tab t1
> +    INNER JOIN null_tab t2 ON t1.id = t2.id
> +    LEFT JOIN null_tab t3 ON TRUE
> +);
>
> The comment says "forced nonnullable by qual clause", but there's no explicit IS NOT NULL qual here.

I guess that it means "t1.id = t2.id". This join clause makes t1.id
forced non-nullable.





--
Thanks,
Tender Wang

Re: Convert NOT IN sublinks to anti-joins when safe

От

Richard Guo

Дата:

04 февраля, 12:47:37

On Tue, Feb 3, 2026 at 4:12 PM Richard Guo <guofenglinux@gmail.com> wrote:
> This topic has been discussed several times in the past.  Due to the
> semantic mismatch regarding NULL handling, NOT IN is not ordinarily
> equivalent to an anti-join.  However, if we can prove that neither the
> outer expressions nor the subquery outputs can yield NULL values, it
> should be safe to convert NOT IN to an anti-join.

I've noticed a loose end in the v1 patch.

The semantic gap between NOT IN and anti-join actually exists whenever
the operator returns NULL.  For NOT IN, if (A op B) returns NULL, then
NOT (NULL) evaluates to NULL (effectively false), and the row is
discarded.  In contrast, for an anti-join, if (A op B) returns NULL,
it implies no match was found, and the anti-join logic dictates that
the row should be kept.

To guarantee that (A op B) never returns NULL, the current patch
verifies that both A and B are non-nullable.  However, this is not
sufficient.  The "op" might be an operator that returns NULL on
non-null inputs.

On the other hand, if "op" does not return NULL on NULL inputs, like
IS DISTINCT FROM, we technically would not even need to require that A
and B are non-nullable.

Is there a convenient way to verify that an operator never returns
NULL on non-null inputs?  Would it be sufficient to insist that the
operator belongs to btree opclass (assuming that the strict ordering
requirements of btree imply this safety)?

And, is it worth checking if an operator never returns NULL even on
NULL inputs?  If we can identify such operators, we should be able to
remove the requirement that both sides of NOT IN must be non-nullable.
Is there a convenient way to check for such operators?

- Richard

Re: Convert NOT IN sublinks to anti-joins when safe

От

David Geier

Дата:

04 февраля, 17:59:22

Hi!

On 04.02.2026 10:47, Richard Guo wrote:
> On Tue, Feb 3, 2026 at 4:12 PM Richard Guo <guofenglinux@gmail.com> wrote:
>> This topic has been discussed several times in the past.  Due to the
>> semantic mismatch regarding NULL handling, NOT IN is not ordinarily
>> equivalent to an anti-join.  However, if we can prove that neither the
>> outer expressions nor the subquery outputs can yield NULL values, it
>> should be safe to convert NOT IN to an anti-join.

I used to play around with query rewrites of the same form, turning NOT
IN into NOT EXISTS. As you rightfully pointed out, the straight forward
rewrite only works if neither the outer expression nor the sub-query
cannot yield NULLs, limiting the optimization considerably. Both cases
can be fixed by amending the basic form of the rewrite.

Basic form:

SELECT t1.c1 FROM t1 WHERE t1.c1 NOT IN (SELECT t2.c1 FROM t2)

=>

SELECT t1.c1 FROM t1 WHERE
  NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1)

If the sub-select can yield NULLs, the rewrite can be fixed by adding an
OR t2.c1 IS NULL clause, such as:

SELECT t1.c1 FROM t1 WHERE
  NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1 OR t2.c1 IS NULL)

which is equivalent to the following SQL which avoids the OR:

SELECT t1.c1 FROM t1 WHERE
  NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) AND
  NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c1 IS NULL)

If the outer expression can yield NULLs, the rewrite can be fixed by
adding a t1.c1 IS NOT NULL clause, such as:

SELECT t1.c1 FROM T1 WHERE
  t1.c1 IS NOT NULL AND
  NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1)

Both fixes can of course be combined yielding:

SELECT t1.c1 FROM T1 WHERE
  t1.c1 IS NOT NULL AND
  NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1) AND
  NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c1 IS NULL)

What's our today's take on doing more involved transformations inside
the planner to support such cases? It would greatly open up the scope of
the optimization.

> I've noticed a loose end in the v1 patch.
> 
> The semantic gap between NOT IN and anti-join actually exists whenever
> the operator returns NULL.  For NOT IN, if (A op B) returns NULL, then
> NOT (NULL) evaluates to NULL (effectively false), and the row is
> discarded.  In contrast, for an anti-join, if (A op B) returns NULL,
> it implies no match was found, and the anti-join logic dictates that
> the row should be kept.
> 
> To guarantee that (A op B) never returns NULL, the current patch
> verifies that both A and B are non-nullable.  However, this is not
> sufficient.  The "op" might be an operator that returns NULL on
> non-null inputs.
> 
> On the other hand, if "op" does not return NULL on NULL inputs, like
> IS DISTINCT FROM, we technically would not even need to require that A
> and B are non-nullable.
> 
> Is there a convenient way to verify that an operator never returns
> NULL on non-null inputs?  Would it be sufficient to insist that the
> operator belongs to btree opclass (assuming that the strict ordering
> requirements of btree imply this safety)?

There's lots of code that e.g. calls FunctionCall2[Coll]() on operators,
where that function raises an ERROR in case the returned value is NULL.
Hence, I would say it's safe to make that assumption.

Otherwise, you could restrict the check to only builtin operators. For
those we know they have the expected semantics.

--
David Geier

Re: Convert NOT IN sublinks to anti-joins when safe

От

Richard Guo

Дата:

05 февраля, 09:09:17

On Wed, Feb 4, 2026 at 11:59 PM David Geier <geidav.pg@gmail.com> wrote:
> If the sub-select can yield NULLs, the rewrite can be fixed by adding an
> OR t2.c1 IS NULL clause, such as:
>
> SELECT t1.c1 FROM t1 WHERE
>   NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1 OR t2.c1 IS NULL)

I'm not sure if this rewrite results in a better plan.  The OR clause
would force a nested loop join, which could be much slower than a
hashed-subplan plan.

> If the outer expression can yield NULLs, the rewrite can be fixed by
> adding a t1.c1 IS NOT NULL clause, such as:
>
> SELECT t1.c1 FROM T1 WHERE
>   t1.c1 IS NOT NULL AND
>   NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1)

This rewrite doesn't seem correct to me.  If t2 is empty, you would
incorrectly lose the NULL rows from t1 in the final result.

> What's our today's take on doing more involved transformations inside
> the planner to support such cases? It would greatly open up the scope of
> the optimization.

As mentioned in my initial email, the goal of this patch is not to
handle every possible case, but rather only to handle the basic form
where both sides of NOT IN are provably non-nullable.  This keeps the
code complexity to a minimum, and I believe this would cover the most
common use cases in real world.

- Richard

Re: Convert NOT IN sublinks to anti-joins when safe

От

Richard Guo

Дата:

05 февраля, 09:51:20

On Wed, Feb 4, 2026 at 6:47 PM Richard Guo <guofenglinux@gmail.com> wrote:
> Is there a convenient way to verify that an operator never returns
> NULL on non-null inputs?  Would it be sufficient to insist that the
> operator belongs to btree opclass (assuming that the strict ordering
> requirements of btree imply this safety)?

I think we can insist that the operator be a member of a btree or hash
opfamily.  Btree operators must adhere to strict total order, and hash
operators must adhere to strict equality; if they return NULL for
non-null inputs, the indexes themselves would be corrupt.

I'm less confident about other access methods like gist or gin.  Their
semantics can be more flexible, and using such operators in a NOT IN
clause is quite rare.

Attached is the updated patch, which adds the check requiring the
operator to be a member of a btree or hash opfamily.

> And, is it worth checking if an operator never returns NULL even on
> NULL inputs?  If we can identify such operators, we should be able to
> remove the requirement that both sides of NOT IN must be non-nullable.
> Is there a convenient way to check for such operators?

I don't know how to check for such operators, so I didn't do it in the
patch.

- Richard

Вложения

v2-0001-Convert-NOT-IN-sublinks-to-anti-joins-when-safe.patch

Re: Convert NOT IN sublinks to anti-joins when safe

От

wenhui qiu

Дата:

05 февраля, 10:29:54

HI Richard

> As mentioned in my initial email, the goal of this patch is not to
> handle every possible case, but rather only to handle the basic form
> where both sides of NOT IN are provably non-nullable. This keeps the
> code complexity to a minimum, and I believe this would cover the most
> common use cases in real world.

Agree +1 ,The current path already covers common scenarios and is no less comprehensive than other databases.I'm already quite pleased that it can be merged.

Having tested a certain widely used open-source database, I found it unable to process the following query: `SELECT * FROM join1 WHERE id NOT IN (SELECT id FROM join2 WHERE id IS NOT NULL);` Note that join2 allows null values for id.

Thanks

On Thu, Feb 5, 2026 at 2:09 PM Richard Guo <guofenglinux@gmail.com> wrote:

On Wed, Feb 4, 2026 at 11:59 PM David Geier <geidav.pg@gmail.com> wrote:
> If the sub-select can yield NULLs, the rewrite can be fixed by adding an
> OR t2.c1 IS NULL clause, such as:
>
> SELECT t1.c1 FROM t1 WHERE
> NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1 OR t2.c1 IS NULL)

I'm not sure if this rewrite results in a better plan. The OR clause
would force a nested loop join, which could be much slower than a
hashed-subplan plan.

> If the outer expression can yield NULLs, the rewrite can be fixed by
> adding a t1.c1 IS NOT NULL clause, such as:
>
> SELECT t1.c1 FROM T1 WHERE
> t1.c1 IS NOT NULL AND
> NOT EXISTS (SELECT 1 FROM t2 WHERE t1.c1 = t2.c1)

This rewrite doesn't seem correct to me. If t2 is empty, you would
incorrectly lose the NULL rows from t1 in the final result.

> What's our today's take on doing more involved transformations inside
> the planner to support such cases? It would greatly open up the scope of
> the optimization.

As mentioned in my initial email, the goal of this patch is not to
handle every possible case, but rather only to handle the basic form
where both sides of NOT IN are provably non-nullable. This keeps the
code complexity to a minimum, and I believe this would cover the most
common use cases in real world.

- Richard

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Convert NOT IN sublinks to anti-joins when safe

Вложения

Вложения