Re: Convert NOT IN sublinks to anti-joins when safe
| От | wenhui qiu |
|---|---|
| Тема | Re: Convert NOT IN sublinks to anti-joins when safe |
| Дата | |
| Msg-id | CAGjGUAJh0bBaETV7Zin9J5pY++9ZEgXnPy--M66LUj3-nd63AQ@mail.gmail.com обсуждение исходный текст |
| Ответ на | Re: Convert NOT IN sublinks to anti-joins when safe (wenhui qiu <qiuwenhuifx@gmail.com>) |
| Список | pgsql-hackers |
Hi Richard
That's just brilliant! Here’s my testing process, which also tackles a commonly criticized aspect of PostgreSQL, effectively addressing its shortcomings.Moreover, the path test scenarios are thorough, encompassing cases that involve "NOT IN".Once again, thank you for your hard work.
###########
create table join1 (id integer,name varchar(300),k1 integer);
create table join2 (id integer,name varchar(300),score integer);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(150201,1350300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join2 values ( generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',1);
insert into join2 values ( generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAABBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',2);
insert into join2 values ( generate_series(20001,22000),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
insert into join2 values ( generate_series(150201,950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
create index idx_j1 on join1(id);
create index idx_j2 on join2(id);
VACUUM ANALYZE JOIN1;
VACUUM ANALYZE JOIN2;
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 WHERE T2.ID>10000);
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=20060.65..58729.40 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..17904.72 rows=862371 width=4)
Index Cond: (id > 10000)
(5 rows)
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 WHERE T2.ID>10000 and t2.id is not null);
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=22216.57..60885.32 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..20060.65 rows=862371 width=4)
Index Cond: ((id > 10000) AND (id IS NOT NULL))
(5 rows)
postgres=# alter table join1 alter id set not null;
ALTER TABLE
postgres=# alter table join2 alter id set not null;
ALTER TABLE
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 WHERE T2.ID>10000 );
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=28684.35..72424.36 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=17904.72..17904.72 rows=862371 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..17904.72 rows=862371 width=4)
Index Cond: (id > 10000)
(6 rows)
postgres=#
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=27134.57..70874.58 rows=393902 width=8) (actual time=239.777..602.433 rows=400200.00 loops=1)
Hash Cond: (t1.id = t2.id)
Buffers: shared hit=25304
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8) (actual time=0.007..85.255 rows=1260300.00 loops=1)
Buffers: shared hit=22915
-> Hash (cost=16108.32..16108.32 rows=882100 width=4) (actual time=220.949..220.951 rows=882100.00 loops=1)
Buckets: 1048576 Batches: 1 Memory Usage: 39204kB
Buffers: shared hit=2389
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32 rows=882100 width=4) (actual time=0.008..103.306 rows=882100.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=2389
Planning:
Buffers: shared hit=12
Planning Time: 0.193 ms
Execution Time: 617.668 ms
(16 rows)
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=27134.57..70874.58 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=16108.32..16108.32 rows=882100 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32 rows=882100 width=4)
(5 rows)
postgres=# alter table join2 alter id drop not null ;
ALTER TABLE
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=18313.57..56982.32 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32 rows=882100 width=4)
(4 rows)
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=29339.83..73079.83 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=18313.58..18313.58 rows=882100 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..18313.58 rows=882100 width=4)
Index Cond: (id IS NOT NULL)
(6 rows)
postgres=# insert into join2 values ( generate_series(950300,1950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
INSERT 0 1000001
postgres=# analyze join2;
ANALYZE
postgres=#
postgres=# insert into join2 values ( generate_series(1950300,3950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
INSERT 0 2000001
postgres=# analyze join2;
ANALYZE
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.86..73003.87 rows=1 width=8) (actual time=348.632..371.158 rows=200.00 loops=1)
Workers Planned: 3
Workers Launched: 3
Buffers: shared hit=99166 read=3398
-> Merge Anti Join (cost=0.86..72003.77 rows=1 width=8) (actual time=177.574..334.714 rows=50.00 loops=4)
Merge Cond: (t1.id = t2.id)
Buffers: shared hit=99166 read=3398
-> Parallel Index Scan using idx_j1 on join1 t1 (cost=0.43..37380.06 rows=406548 width=8) (actual time=0.090..81.338 rows=315075.00 loops=4)
Index Searches: 1
Buffers: shared hit=85241 read=3398
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..80682.41 rows=3882102 width=4) (actual time=0.049..145.187 rows=1281766.50 loops=4)
Index Cond: (id IS NOT NULL)
Heap Fetches: 0
Index Searches: 4
Buffers: shared hit=13925
Planning:
Buffers: shared hit=12
Planning Time: 0.191 ms
Execution Time: 371.191 ms
postgres=# set work_mem ='128MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1477.040..1835.205 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.028..421.350 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.083 ms
Execution Time: 1845.041 ms
(11 rows)
postgres=# set work_mem ='8MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1468.254..1825.373 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.025..425.249 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.085 ms
Execution Time: 1835.148 ms
(11 rows)
postgres=# set work_mem ='128MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1477.040..1835.205 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.028..421.350 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.083 ms
Execution Time: 1845.041 ms
(11 rows)
postgres=# set work_mem ='8MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1468.254..1825.373 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.025..425.249 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.085 ms
Execution Time: 1835.148 ms
(11 rows)
postgres=# set work_mem ='2MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
^CCancel request sent
ERROR: canceling statement due to user request
postgres=# ^C
postgres=#
postgres=# set work_mem ='2MB';
SET
postgres=# select now();
now
-------------------------------
2026-02-04 11:31:42.957009+08
(1 row)
postgres=# select now();
now
-------------------------------
2026-02-04 11:32:10.811249+08
(1 row)
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
^CCancel request sent
ERROR: canceling statement due to user request
postgres=# select now();
now
-------------------------------
2026-02-04 11:42:29.698854+08
(1 row)
postgres=#
create table join1 (id integer,name varchar(300),k1 integer);
create table join2 (id integer,name varchar(300),score integer);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(1,20000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(50201,50300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join1 values ( generate_series(150201,1350300),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAASSSSSAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',10);
insert into join2 values ( generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',1);
insert into join2 values ( generate_series(1,40000),'aaaaaaaaaaaaaaaaAAAAAAABBAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',2);
insert into join2 values ( generate_series(20001,22000),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
insert into join2 values ( generate_series(150201,950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
create index idx_j1 on join1(id);
create index idx_j2 on join2(id);
VACUUM ANALYZE JOIN1;
VACUUM ANALYZE JOIN2;
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 WHERE T2.ID>10000);
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=20060.65..58729.40 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..17904.72 rows=862371 width=4)
Index Cond: (id > 10000)
(5 rows)
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 WHERE T2.ID>10000 and t2.id is not null);
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=22216.57..60885.32 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..20060.65 rows=862371 width=4)
Index Cond: ((id > 10000) AND (id IS NOT NULL))
(5 rows)
postgres=# alter table join1 alter id set not null;
ALTER TABLE
postgres=# alter table join2 alter id set not null;
ALTER TABLE
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 WHERE T2.ID>10000 );
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=28684.35..72424.36 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=17904.72..17904.72 rows=862371 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..17904.72 rows=862371 width=4)
Index Cond: (id > 10000)
(6 rows)
postgres=#
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=27134.57..70874.58 rows=393902 width=8) (actual time=239.777..602.433 rows=400200.00 loops=1)
Hash Cond: (t1.id = t2.id)
Buffers: shared hit=25304
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8) (actual time=0.007..85.255 rows=1260300.00 loops=1)
Buffers: shared hit=22915
-> Hash (cost=16108.32..16108.32 rows=882100 width=4) (actual time=220.949..220.951 rows=882100.00 loops=1)
Buckets: 1048576 Batches: 1 Memory Usage: 39204kB
Buffers: shared hit=2389
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32 rows=882100 width=4) (actual time=0.008..103.306 rows=882100.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=2389
Planning:
Buffers: shared hit=12
Planning Time: 0.193 ms
Execution Time: 617.668 ms
(16 rows)
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=27134.57..70874.58 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=16108.32..16108.32 rows=882100 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32 rows=882100 width=4)
(5 rows)
postgres=# alter table join2 alter id drop not null ;
ALTER TABLE
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=18313.57..56982.32 rows=630150 width=8)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..16108.32 rows=882100 width=4)
(4 rows)
postgres=# explain SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY PLAN
-------------------------------------------------------------------------------------------------
Hash Anti Join (cost=29339.83..73079.83 rows=393902 width=8)
Hash Cond: (t1.id = t2.id)
-> Seq Scan on join1 t1 (cost=0.00..35518.00 rows=1260300 width=8)
-> Hash (cost=18313.58..18313.58 rows=882100 width=4)
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.42..18313.58 rows=882100 width=4)
Index Cond: (id IS NOT NULL)
(6 rows)
postgres=# insert into join2 values ( generate_series(950300,1950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
INSERT 0 1000001
postgres=# analyze join2;
ANALYZE
postgres=#
postgres=# insert into join2 values ( generate_series(1950300,3950300),'aaaaaaaaaaaaaaaaAACCCCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAaaAAAAAAAAAAAAAAAAAAAAaaaaaaaAAAAAAAAAAAAAAAAAAAAA',3);
INSERT 0 2000001
postgres=# analyze join2;
ANALYZE
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 where t2.id is not null);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.86..73003.87 rows=1 width=8) (actual time=348.632..371.158 rows=200.00 loops=1)
Workers Planned: 3
Workers Launched: 3
Buffers: shared hit=99166 read=3398
-> Merge Anti Join (cost=0.86..72003.77 rows=1 width=8) (actual time=177.574..334.714 rows=50.00 loops=4)
Merge Cond: (t1.id = t2.id)
Buffers: shared hit=99166 read=3398
-> Parallel Index Scan using idx_j1 on join1 t1 (cost=0.43..37380.06 rows=406548 width=8) (actual time=0.090..81.338 rows=315075.00 loops=4)
Index Searches: 1
Buffers: shared hit=85241 read=3398
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..80682.41 rows=3882102 width=4) (actual time=0.049..145.187 rows=1281766.50 loops=4)
Index Cond: (id IS NOT NULL)
Heap Fetches: 0
Index Searches: 4
Buffers: shared hit=13925
Planning:
Buffers: shared hit=12
Planning Time: 0.191 ms
Execution Time: 371.191 ms
postgres=# set work_mem ='128MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1477.040..1835.205 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.028..421.350 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.083 ms
Execution Time: 1845.041 ms
(11 rows)
postgres=# set work_mem ='8MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1468.254..1825.373 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.025..425.249 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.085 ms
Execution Time: 1835.148 ms
(11 rows)
postgres=# set work_mem ='128MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1477.040..1835.205 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.028..421.350 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.083 ms
Execution Time: 1845.041 ms
(11 rows)
postgres=# set work_mem ='8MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Seq Scan on join1 t1 (cost=80682.42..119351.17 rows=630150 width=8) (actual time=1468.254..1825.373 rows=200.00 loops=1)
Filter: (NOT (ANY (id = (hashed SubPlan any_1).col1)))
Rows Removed by Filter: 1260100
Buffers: shared hit=33502
SubPlan any_1
-> Index Only Scan using idx_j2 on join2 t2 (cost=0.43..70977.16 rows=3882102 width=4) (actual time=0.025..425.249 rows=3882102.00 loops=1)
Heap Fetches: 0
Index Searches: 1
Buffers: shared hit=10587
Planning Time: 0.085 ms
Execution Time: 1835.148 ms
(11 rows)
postgres=# set work_mem ='2MB';
SET
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
^CCancel request sent
ERROR: canceling statement due to user request
postgres=# ^C
postgres=#
postgres=# set work_mem ='2MB';
SET
postgres=# select now();
now
-------------------------------
2026-02-04 11:31:42.957009+08
(1 row)
postgres=# select now();
now
-------------------------------
2026-02-04 11:32:10.811249+08
(1 row)
postgres=# explain analyze SELECT T1.id,T1.K1 FROM join1 t1 WHERE T1.id NOT IN (SELECT T2.id FROM join2 t2 );
^CCancel request sent
ERROR: canceling statement due to user request
postgres=# select now();
now
-------------------------------
2026-02-04 11:42:29.698854+08
(1 row)
postgres=#
Thanks
On Tue, Feb 3, 2026 at 5:41 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
Hi Richard> I believe we are now in a much better position to attempt this again.
> The planner has accumulated significant infrastructure that makes this
> proof straightforward and reliable. Specifically, we can now leverage
> the outer-join-aware-Var infrastructure to tell whether a Var comes
> from the nullable side of an outer join, and the not-null-attnums hash> table to efficiently check whether a Var is defined NOT NULL. We also
> have the expr_is_nonnullable() function that is smart enough to deduce
> non-nullability for expressions more complex than simple Vars/Consts.Thank you for working on this.Indeed, the benefits are substantial and highly necessary, as Oracle, SQL Server, and MySQL have all implemented varying degrees of support.I shall test this path in my spare time.Thanks ,On Tue, Feb 3, 2026 at 3:13 PM Richard Guo <guofenglinux@gmail.com> wrote:This topic has been discussed several times in the past. Due to the
semantic mismatch regarding NULL handling, NOT IN is not ordinarily
equivalent to an anti-join. However, if we can prove that neither the
outer expressions nor the subquery outputs can yield NULL values, it
should be safe to convert NOT IN to an anti-join.
I believe we are now in a much better position to attempt this again.
The planner has accumulated significant infrastructure that makes this
proof straightforward and reliable. Specifically, we can now leverage
the outer-join-aware-Var infrastructure to tell whether a Var comes
from the nullable side of an outer join, and the not-null-attnums hash
table to efficiently check whether a Var is defined NOT NULL. We also
have the expr_is_nonnullable() function that is smart enough to deduce
non-nullability for expressions more complex than simple Vars/Consts.
Attached is a draft patch for this attempt (part of the code is
adapted from an old patch [1] by David and Tom). This patch aims for
a conservative implementation: the goal is not to handle every
theoretical case, but to handle canonical query patterns with minimal
code complexity.
The patch primarily targets patterns like:
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM banned_users);
... and
SELECT * FROM users WHERE id NOT IN (SELECT user_id FROM
banned_users WHERE user_id IS NOT NULL);
This is a very typical syntax for exclusion. In well-modeled
databases, join keys like id and user_id are very likely to be defined
as NOT NULL.
It seems to me that the ROI here is quite positive: the added code
complexity is very low (thanks to the existing infrastructure), while
the benefit is that users writing this typical pattern will finally
get efficient anti-join plans without needing manual rewrites.
(For the outer expressions, we could potentially also use outer query
quals to prove non-nullability. This patch does not attempt to do so.
Implementing this would require passing state down during the
pull_up_sublinks recursion; and given that find_nonnullable_vars can
fail to prove non-nullability in many cases due to the lack of
const-simplification at this stage, I'm not sure whether it is worth
the code complexity. Besides, I haven't fully convinced myself that
doing this does not introduce correctness issues.)
Any thoughts?
[1] https://postgr.es/m/13766.1405037879@sss.pgh.pa.us
- Richard
В списке pgsql-hackers по дате отправления: