NOT IN vs. NOT EXISTS performance

Поиск

Список

Период

Сортировка

От	Lincoln Swaine-Moore
Тема	NOT IN vs. NOT EXISTS performance
Дата	8 ноября 2018 г. 22:35:32
Msg-id	CABcidkLPtsYLjW0hiVdRweYriMfZbzOpVh8mr3J9j8nh4=0EaQ@mail.gmail.com обсуждение исходный текст
Ответы	Re: NOT IN vs. NOT EXISTS performance
Список	pgsql-performance

Дерево обсуждения

Hi all,

I've figured out how to solve the performance issues I've been encountering with a particular query, but I'm interested in better understanding the intuition behind why the better query is so much more performant.

The query in question involves a NOT IN filter from a CTE:

WITH temp as (
SELECT temp_id
FROM other_table
WHERE ...
)
SELECT COUNT(*)

FROM main_table m

WHERE m.main_id NOT IN (SELECT temp_id from temp);

The query plan for (the non-masked version of) this query includes:

Filter: (NOT (SubPlan 1))

   Rows Removed by Filter: 1

   SubPlan 1

     ->  CTE Scan on temp  (cost=0.00..1950.60 rows=97530 width=418) (actual time=0.018..4.170 rows=15697 loops=100731)

My understanding is that PostgreSQL is not able to make this into a hashed Subplan because the expected number of rows * width of rows is too large for the work_mem setting, and so instead it has to do many repeated linear passes to find whether the various values of main_id are in the list of temp_ids.

The resolution to this problem was discovered via https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/, and involves rewriting as so:

WITH temp as (
  SELECT temp_id
  FROM other_table
  WHERE ...
)
SELECT COUNT(*)FROM main_table m 
WHERE NOT EXISTS (SELECT temp_id from temp where temp_id = m.main_id);

The query plan for this query (which I believe is equivalent to what it would be if I instead explicitly LEFT JOINed main_table to temp and used a WHERE to filter out NULL values of temp_id) does not involve a high number of loops like above:

->  Merge Anti Join  (cost=147305.45..149523.68 rows=192712 width=4) (actual time=5050.773..5622.266 rows=158850 loops=1)
   Merge Cond: (m.main_id = temp.temp_id)
   ->  Sort  (cost=115086.98..115637.10 rows=220050 width=19) (actual time=1290.829..1655.066 rows=199226 loops=1)
         Sort Key: m.main_id
         Sort Method: external merge  Disk: 5632kB
   ->  Materialize  (cost=32218.47..32764.37 rows=109180 width=418) (actual time=3759.936..3787.724 rows=38268 loops=1)
         ->  Sort  (cost=32218.47..32491.42 rows=109180 width=418) (actual time=3759.933..3771.117 rows=38268 loops=1)
               Sort Key: temp.temp_id
               Sort Method: quicksort  Memory: 3160kB
               ->  CTE Scan on temp  (cost=0.00..2183.60 rows=109180 width=418) (actual time=2316.745..3735.486 rows=38268 loops=1)

Instead it sorts using disk, and uses a MERGE ANTI JOIN, which I understand to eliminate the need for multiple passes through the table temp.

My primary question is: why is this approach only possible (for data too large for memory) when using NOT EXISTS, and not when using NOT IN? 

I understand that there is a slight difference in the meaning of the two expressions, in that NOT IN will produce NULL if there are any NULL values in the right hand side (in this case there are none, and the queries should return the same COUNT). But if anything, I would expect that to improve performance of the NOT IN operation, since a single pass through that data should reveal if there are any NULL values, at which point that information could be used to short-circuit. So I am a bit baffled.

Thanks very much for your help!

Best,
Lincoln

--
Lincoln Swaine-Moore

В списке pgsql-performance по дате отправления:

Предыдущее

От: Richard Lee
Дата: 06 ноября 2018 г., 14:40:50
Сообщение: Re: Fwd: Query with high planning time compared to execution time

Следующее

От: David Rowley
Дата: 09 ноября 2018 г., 00:11:40
Сообщение: Re: NOT IN vs. NOT EXISTS performance

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

NOT IN vs. NOT EXISTS performance

Предыдущее

Следующее