Re: SQL WHERE: many sql or large IN()
| От | Listmail | 
|---|---|
| Тема | Re: SQL WHERE: many sql or large IN() | 
| Дата | |
| Msg-id | op.tqdj53ckzcizji@apollo13 обсуждение исходный текст | 
| Ответ на | Re: SQL WHERE: many sql or large IN() (Markus Schiltknecht <markus@bluegap.ch>) | 
| Ответы | Re: SQL WHERE: many sql or large IN() | 
| Список | pgsql-general | 
On Fri, 06 Apr 2007 18:45:15 +0200, Markus Schiltknecht
<markus@bluegap.ch> wrote:
> Hi,
>
> tom wrote:
>> Initially it seems that the WHERE IN (...) approach takes a turn for
>> the worse when the list gets very large.
>
    Since I use this a lot on webpages, I thought maybe a little benchmark is
in order ?
CREATE TABLE test (id SERIAL PRIMARY KEY, value TEXT NOT NULL );
INSERT INTO test (value) SELECT * FROM generate_series( 1, 1000000 );
CREATE INDEX test_value ON test( value );
ANALYZE test;
    My script runs EXPLAIN ANALYZE 10 times and keeps the fastest timing,
then displays it.
    Then it executes the query enough times so that 1 second is elapsed (also
fetching the results), then prints the real runtime.
    Well, the table is clustered (since it was created from generate_series)
so of course, bitmap index scan rules.
    Get the first 300 rows of the table using column 'value' which is an
indexed TEXT column.
    (using 'id' which is an integer is a bit faster)
SELECT * FROM test WHERE value IN <300 values>
Bitmap Heap Scan on test  (cost=1225.83..2317.08 rows=300 width=13)
(actual time=3.736..3.807 rows=300 loops=1)
   Recheck Cond: (value = ANY ('{<300 values>}'::text[]))
   ->  Bitmap Index Scan on test_value  (cost=0.00..1225.75 rows=300
width=0) (actual time=3.717..3.717 rows=300 loops=1)
         Index Cond: (value = ANY ('{<300 values>}'::text[]))
Explain Analyze runtime: 1 x 3.896 ms = 3.896 ms
Real runtime: 1 x 6.027 ms = 6.027 ms (timed on 257 iterations)
----------------------------------------
SELECT t.* FROM test t, ( VALUES <300 values> ) AS v WHERE
t.value=v.column1
Nested Loop  (cost=0.00..2447.27 rows=300 width=13) (actual
time=0.035..4.724 rows=300 loops=1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..3.75 rows=300 width=32)
(actual time=0.002..0.219 rows=300 loops=1)
   ->  Index Scan using test_value on test t  (cost=0.00..8.13 rows=1
width=13) (actual time=0.013..0.014 rows=1 loops=300)
         Index Cond: (t.value = "*VALUES*".column1)
Explain Analyze runtime: 1 x 4.814 ms = 4.814 ms
Real runtime: 1 x 6.786 ms = 6.786 ms (timed on 208 iterations)
----------------------------------------
SELECT * FROM test WHERE value='1'
Index Scan using test_value on test  (cost=0.00..8.40 rows=1 width=13)
(actual time=0.014..0.015 rows=1 loops=1)
   Index Cond: (value = '1'::text)
Explain Analyze runtime: 300 x 0.032 ms = 9.600 ms
Real runtime: 300 x 0.149 ms = 44.843 ms (timed on 31251 iterations)
    Now if we ask for 300 random rows out of the million in the table, which
is a lot more likely situation...
SELECT * FROM test WHERE value IN <300 values>
Bitmap Heap Scan on test  (cost=1225.83..2317.08 rows=300 width=13)
(actual time=4.516..4.945 rows=300 loops=1)
   Recheck Cond: (value = ANY ('{<300 values>}'::text[]))
   ->  Bitmap Index Scan on test_value  (cost=0.00..1225.75 rows=300
width=0) (actual time=4.451..4.451 rows=300 loops=1)
         Index Cond: (value = ANY ('{<300 values>}'::text[]))
Explain Analyze runtime: 1 x 5.034 ms = 5.034 ms
Real runtime: 1 x 7.278 ms = 7.278 ms (timed on 199 iterations)
----------------------------------------
SELECT t.* FROM test t, ( VALUES <300 values> ) AS v WHERE
t.value=v.column1
Nested Loop  (cost=0.00..2447.27 rows=300 width=13) (actual
time=0.046..5.503 rows=300 loops=1)
   ->  Values Scan on "*VALUES*"  (cost=0.00..3.75 rows=300 width=32)
(actual time=0.001..0.200 rows=300 loops=1)
   ->  Index Scan using test_value on test t  (cost=0.00..8.13 rows=1
width=13) (actual time=0.016..0.016 rows=1 loops=300)
         Index Cond: (t.value = "*VALUES*".column1)
Explain Analyze runtime: 1 x 5.625 ms = 5.625 ms
Real runtime: 1 x 7.572 ms = 7.572 ms (timed on 178 iterations)
    Doing a query per row is a lot slower than it appears in EXPLAIN ANALYZE
when you add all the overhead. In fact it sucks. That was ecpected of
course. I could tell you about when I got this php/mysql website where the
previous developer's idea of a join was a select in a PHP for() loop
because "joins are slow". It was hell.
    I tried to query 1, 2, 10, 1000, 10k and 100k random rows (thats abuse)
and :
    - one query per row is (obviously) the slowest unless you select only one
row
    - IN reverts to a single index scan if it contains one value, else it
always uses a bitmap scan
    - VALUES join uses nested loop until about 100k rows where it switches to
hash join
    I also tried on a real table with many columns. In that case, there are
less tuples per page, and bitmap scans are more efficient than nested
loops, so IN wins.
    So, IN() does not turn to crap anymore like it used to !
    However, if you are asking for this question because you don't want to
use temporary tables, it seems that temp tables have been upgraded so they
are now usable even in a webpage (well, not ALL webpages, but that big
nasty slow webpage, you know which one I'm talking about).
		
	В списке pgsql-general по дате отправления: