optimizing queries using IN and EXISTS

От

Nick Hofstede

Тема

Дата

18 июля 2012 г. в 13:10:41

Msg-id

BC885F9E3DB48248A4C9FC7F2C57215C2DAAFF71@Hoefnix.dc.intranet

обсуждение

Список

pgsql-performance

Дерево обсуждения

optimizing queries using IN and EXISTS Nick Hofstede <Nick.Hofstede@inventivegroup.com> 18 июля 2012 г. в 13:10:41

Re: optimizing queries using IN and EXISTS Tom Lane <tgl@sss.pgh.pa.us> 18 июля 2012 г. в 19:36:33

Re: optimizing queries using IN and EXISTS Nick Hofstede <Nick.Hofstede@inventivegroup.com> 19 июля 2012 г. в 08:53:53

Re: optimizing queries using IN and EXISTS Peter Geoghegan <peter@2ndquadrant.com> 18 июля 2012 г. в 15:40:47

Re: optimizing queries using IN and EXISTS Nick Hofstede <Nick.Hofstede@inventivegroup.com> 18 июля 2012 г. в 17:29:03

Hi,

I'm surprised at the difference in speed/execution plan between two logically equivalent queries, one using IN, the other using EXISTS. (At least I think they are logically equivalent)

I've created a small setup that illustrates what I mean.

Consider the following tables:

CREATE TABLE foo
(
  id integer NOT NULL,
  CONSTRAINT foo_pkey PRIMARY KEY (id )
)

CREATE TABLE bar
(
  foo_ref integer,
  value character varying,
  id integer NOT NULL,
  CONSTRAINT bar_pkey PRIMARY KEY (id ),
  CONSTRAINT bar_foo_ref_fkey FOREIGN KEY (foo_ref)
      REFERENCES foo (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION
)

The following two queries have very different query plans:

SELECT *
FROM   foo
WHERE  'text6' IN (SELECT value
                   FROM   bar
                          JOIN foo AS foo2
                            ON bar.foo_ref = foo2.id
                   WHERE  foo2.id = foo.id)

and

SELECT *
FROM   foo
WHERE  EXISTS(SELECT 0
              FROM   bar
                     JOIN foo AS foo2
                       ON bar.foo_ref = foo2.id
              WHERE  foo2.id = foo.id
                     AND bar.value = 'text6')

Whereas the second one uses the indexes to look up the matching bar rows, the first one performs a full table scan on bar.
Given that both queries are logically equivalent, I'm wondering why this optimization isn't made. Is there information missing for the optimizer to make the better decision? Are these queries not equivalent perhaps?

An EXPLAIN ANALYZE on the two queries on filled and analyzed tables highlighting the difference:

EXPLAIN ANALYZE SELECT * FROM foo WHERE 'text6' IN (SELECT value FROM bar JOIN foo AS foo2 ON bar.foo_ref = foo2.id WHERE foo2.id = foo.id)

"Seq Scan on foo  (cost=0.00..3316934.60 rows=5000 width=4) (actual time=6.416..10803.056 rows=1 loops=1)"
"  Filter: (SubPlan 1)"
"  SubPlan 1"
"    ->  Nested Loop  (cost=0.00..663.29 rows=1 width=8) (actual time=0.667..1.079 rows=1 loops=10000)"
"          ->  Seq Scan on bar  (cost=0.00..655.00 rows=1 width=12) (actual time=0.660..1.072 rows=1 loops=10000)"
"                Filter: (foo_ref = foo.id)"
"          ->  Index Scan using foo_pkey on foo foo2  (cost=0.00..8.28 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=10000)"
"                Index Cond: (id = foo.id)"
"Total runtime: 10803.088 ms"


EXPLAIN ANALYZE SELECT * FROM foo WHERE EXISTS(SELECT 0 FROM bar JOIN foo AS foo2 ON bar.foo_ref = foo2.id WHERE foo2.id = foo.id AND bar.value = 'text6')

"Nested Loop  (cost=16.58..24.88 rows=1 width=4) (actual time=0.032..0.032 rows=1 loops=1)"
"  ->  HashAggregate  (cost=16.58..16.59 rows=1 width=8) (actual time=0.029..0.029 rows=1 loops=1)"
"        ->  Nested Loop  (cost=0.00..16.58 rows=1 width=8) (actual time=0.025..0.025 rows=1 loops=1)"
"              ->  Index Scan using bar_value_idx on bar  (cost=0.00..8.29 rows=1 width=4) (actual time=0.019..0.020 rows=1 loops=1)"
"                    Index Cond: ((value)::text = 'text6'::text)"
"              ->  Index Scan using foo_pkey on foo foo2  (cost=0.00..8.28 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=1)"
"                    Index Cond: (id = bar.foo_ref)"
"  ->  Index Scan using foo_pkey on foo  (cost=0.00..8.28 rows=1 width=4) (actual time=0.001..0.002 rows=1 loops=1)"
"        Index Cond: (id = bar.foo_ref)"
"Total runtime: 0.064 ms"

Hoping someone sheds some light on this and restores my confidence in the optimizer,

Nick Hofstede

PS: I know the EXIST can also be rewritten to a JOIN

SELECT foo.id
FROM   foo
       JOIN bar
         ON bar.foo_ref = foo.id
       JOIN foo AS foo2
         ON bar.foo_ref = foo2.id
WHERE  foo2.id = foo.id
       AND bar.value = 'text6'

and ultimately to (thanks to foo2.id = foo.id)

SELECT foo.id
FROM   foo
       JOIN bar
         ON bar.foo_ref = foo.id
WHERE  bar.value = 'text6'

.. all of wich have an execution plan and performance similar to the EXISTS query.
What I'm concerned about is that the first step from IN to EXISTS isn't made (which also precludes all following optimization steps)


________________________________

Inventive Designers' Email Disclaimer:
http://www.inventivedesigners.com/email-disclaimer

В списке pgsql-performance по дате отправления

Предыдущее

От: Merlin Moncure

Дата: 18 июля 2012 г. в 11:34:01

Сообщение: Re: PostgreSQL db, 30 tables with number of rows < 100 (not huge) - the fastest way to clean each non-empty table and reset unique identifier column of empty ones.

Следующее

От: Peter Geoghegan

Дата: 18 июля 2012 г. в 15:40:47

Сообщение: Re: optimizing queries using IN and EXISTS