Обсуждение: BUG #7556: "select not in sub query" plan very poor vs "not exists"
The following bug has been logged on the website: Bug reference: 7556 Logged by: lt Email address: l1t@tom.com PostgreSQL version: 9.2.0 Operating system: windows xp Description: = test=3D# create table sli_test (id int primary key,info varchar(20)); test=3D# insert into sli_test select generate_series(1,1000000),'digoal'||generate_series(1,1000000); test=3D# analyze verbose sli_test; test=3D# create table sli_test2 (id int not null,info varchar(20)); test=3D# insert into sli_test2 select generate_series(1,1000000),'dbase'||generate_series(1,1000000); test=3D# analyze verbose sli_test2; test=3D# explain select max(a.info)from sli_test a where a.id not in(select b.id from sli_test2 b where b.id<50000); QUERY PLAN ---------------------------------------------------------------------------= ------------ Aggregate (cost=3D9241443774.00..9241443774.01 rows=3D1 width=3D12)
<l1t@tom.com> wrote: > test=# explain select max(a.info)from sli_test a where a.id not > in(select b.id from sli_test2 b where b.id<50000); > Aggregate (cost=9241443774.00..9241443774.01 rows=1 width=12) A slower plan for NOT IN than NOT EXISTS is a fact, but definitely not a bug. According to the standard, the semantics of NOT IN are different from NOT EXISTS when there is a possibility of NULLs on either side. Because of those different semantics, NOT IN cannot use certain optimizations which are available for NOT EXISTS. Technically, if a NOT IN case could be analyzed to the point where it is clear that both sides of the predicate are definitely free of NULLs, it could be run the same as NOT EXISTS, but that would add complexity and run-time expense to the optimizer. Some cases are simple, but some are not at all practical. We have chosen instead to recommend that people use NOT EXISTS unless they really want the rather astonishing behavior of NOT IN. -Kevin
On 09/19/2012 01:48 PM, l1t@tom.com wrote: > The following bug has been logged on the website: > > Bug reference: 7556 > Logged by: lt > Email address: l1t@tom.com > PostgreSQL version: 9.2.0 > Operating system: windows xp > Description: > create table sli_test (id int primary key,info varchar(20)); insert into sli_test select generate_series(1,1000000),'digoal'||generate_series(1,1000000); analyze verbose sli_test; create table sli_test2 (id int not null,info varchar(20)); insert into sli_test2 select generate_series(1,1000000),'dbase'||generate_series(1,1000000); analyze verbose sli_test2; explain select max(a.info)from sli_test a where a.id not in(select b.id from sli_test2 b where b.id<50000); > QUERY PLAN > --------------------------------------------------------------------------------------- > Aggregate (cost=9241443774.00..9241443774.01 rows=1 width=12) Here's what I get on 9.1: regress=# explain select max(a.info)from sli_test a where a.id not in(select regress(# b.id from sli_test2 b where b.id<50000); QUERY PLAN --------------------------------------------------------------------------------- Aggregate (cost=38050.82..38050.83 rows=1 width=12) -> Seq Scan on sli_test a (cost=18026.82..36800.82 rows=500000 width=12) Filter: (NOT (hashed SubPlan 1)) SubPlan 1 -> Seq Scan on sli_test2 b (cost=0.00..17906.00 rows=48329 width=4) Filter: (id < 50000) (6 rows) It runs in about 500ms here. You don't appear to have posted the full query plan, so it's hard to compare. In general, `NOT IN` is a poor formulation for a query; you're better off with a JOIN or with `NOT EXISTS`. See eg http://stackoverflow.com/questions/12444142/postgresql-how-to-figure-out-missing-numbers-in-a-column-using-generate-series/12444165#12444165 -- Craig Ringer