Re: slow IN() clause for many cases

От:

Tom Lane <tgl@sss.pgh.pa.us>

Дата:

29 ноября 2005 г. в 18:21:34

Re: slower merge join on sorted data chosen over

От:

Simon Riggs <simon@2ndquadrant.com>

Дата:

11 октября 2005 г. в 06:59:22

Re: slow IN() clause for many cases

От:

Simon Riggs <simon@2ndquadrant.com>

Дата:

17 октября 2005 г. в 09:35:04

Re: slower merge join on sorted data chosen over

От:

Simon Riggs <simon@2ndquadrant.com>

Дата:

17 октября 2005 г. в 17:31:02

Re: slow IN() clause for many cases

От:

Simon Riggs <simon@2ndquadrant.com>

Дата:

29 ноября 2005 г. в 18:07:10

Re: slow IN() clause for many cases

От:

Simon Riggs <simon@2ndquadrant.com>

Дата:

29 ноября 2005 г. в 18:53:43

Re: slow IN() clause for many cases

От:

Simon Riggs <simon@2ndquadrant.com>

Дата:

2 декабря 2005 г. в 04:18:51

Re: slow IN() clause for many cases

От:

Tom Lane <tgl@sss.pgh.pa.us>

Дата:

17 октября 2005 г. в 01:38:07

Re: slow IN() clause for many cases

От:

Tom Lane <tgl@sss.pgh.pa.us>

Дата:

12 октября 2005 г. в 11:51:52

Re: slow IN() clause for many cases

От:

Tom Lane <tgl@sss.pgh.pa.us>

Дата:

30 ноября 2005 г. в 03:05:44

Re: slow IN() clause for many cases

От:

Josh Berkus <josh@agliodbs.com>

Дата:

11 октября 2005 г. в 15:58:16

Re: slow IN() clause for many cases

От:

David Fetter <david@fetter.org>

Дата:

14 октября 2005 г. в 21:21:06

Re: slow IN() clause for many cases

От:

mark@mark.mielke.cc

Дата:

15 октября 2005 г. в 09:15:26

Re: slow IN() clause for many cases

От:

Martijn van Oosterhout <kleptog@svana.org>

Дата:

16 октября 2005 г. в 13:34:11

Re: slow IN() clause for many cases

От:

Martijn van Oosterhout <kleptog@svana.org>

Дата:

16 октября 2005 г. в 18:41:12

Re: slower merge join on sorted data chosen over

От:

"Jim C. Nasby" <jnasby@pervasive.com>

Дата:

17 октября 2005 г. в 16:55:22

Re: slower merge join on sorted data chosen over

От:

"Jim C. Nasby" <jnasby@pervasive.com>

Дата:

26 октября 2005 г. в 18:06:39

Re: slow IN() clause for many cases

От:

Martijn van Oosterhout <kleptog@svana.org>

Дата:

30 ноября 2005 г. в 02:19:08

Re: slow IN() clause for many cases

От:

Martijn van Oosterhout <kleptog@svana.org>

Дата:

2 декабря 2005 г. в 05:48:45

Re: slow IN() clause for many cases

От:

Tom Lane <tgl@sss.pgh.pa.us>

Дата:

12 октября 2005 г. в 02:23:01

Re: slow IN() clause for many cases

От:

Tom Lane <tgl@sss.pgh.pa.us>

Дата:

15 октября 2005 г. в 12:31:42

Re: slow IN() clause for many cases

От:

Tom Lane <tgl@sss.pgh.pa.us>

Дата:

15 октября 2005 г. в 19:45:30

Re: slow IN() clause for many cases

От:

Tom Lane <tgl@sss.pgh.pa.us>

Дата:

15 октября 2005 г. в 18:42:35

Re: slow IN() clause for many cases

От:

"Jonah H. Harris" <jonah.harris@gmail.com>

Дата:

11 октября 2005 г. в 12:03:16

Please post an explain analyze on your query with a 20-30 item IN clause so that we can see what plan is being generated.

On 10/11/05, Ilia Kantor <ilia@obnovlenie.ru> wrote:

When in clause becomes large enough (>20-30 cases),
It is much better to use "join" way of processing..

I mean,
"SELECT * FROM table WHERE field IN (1,2...30)" will be slower than
"SELECT * FROM table JOIN (SRF returning 1...30) USING(field)"

I'm not quite sure, where the difference starts, but sometimes I need to
make selects with 30 or more items by primary key and I get significant
speed up by this transform:

CREATE OR REPLACE FUNCTION array2table(arr int[]) RETURNS SETOF int

select * from persons join (select array2table as id from
array2table(array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,2
3,24,25,26,27,28,29,30])) a using(id);

I'm sure that backend could do that in a much faster and elegant fasion.
Bitmap-or is nice, but for many IN arguments it is still much slower than
join.

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

Re: slow IN() clause for many cases

От:

"Jonah H. Harris" <jonah.harris@gmail.com>

Дата:

12 октября 2005 г. в 03:26:24

true dat :)

On 10/12/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"Ilia Kantor" <ilia@obnovlenie.ru> writes:
> Bitmap Heap Scan on objects_hier  (cost=60.29..179.57 rows=80 width=600)
> (actual time=0.835..1.115 rows=138 loops=1)

vs

>  Merge Join  (cost=62.33..576.80 rows=1117 width=600) (actual
> time=0.542..2.898 rows=138 loops=1)

Hmm, sure looks from here like the bitmap plan is faster.

                        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

Re: slow IN() clause for many cases

От:

"Jonah H. Harris" <jonah.harris@gmail.com>

Дата:

12 октября 2005 г. в 16:37:11

IMHO, we should first look at whether the O(N^2) behavior is needed.

On 10/12/05, Ilia Kantor <ilia@obnovlenie.ru > wrote:

1) merge join is faster for me then hash join (6 vs 3-4ms):

explain analyze select * from objects_hier where id in (select
(array[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
27,28,29,30])[i] from generate_series(1,30) s(i));

Hash Join  (cost= 17.50..162.93 rows=223 width=600) (actual
time=1.148..32.654 rows=138 loops=1)
   Hash Cond: ("outer".id =
('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,2
8,29,30}'::integer[])["inner".i])
   ->  Seq Scan on objects_hier  (cost=0.00..134.80 rows=1680 width=600)
(actual time=0.235..12.271 rows=1680 loops=1)
   ->  Hash  (cost=17.00..17.00 rows=200 width=4) (actual time=0.893..0.893
rows=30 loops=1)
         ->  HashAggregate  (cost=15.00..17.00 rows=200 width=4) (actual
time=0.161..0.830 rows=30 loops=1)
               ->  Function Scan on generate_series s  (cost=0.00..12.50
rows=1000 width=4) (actual time=0.037..0.097 rows=30 loops=1)

2) Is it possible to make the planner plan that better, e.g use Bitmap-OR
for <20numbers, use some other method for >20 numbers (actual number depends
on costs) ?

Maybe that should be added to TODO ?

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto: pgsql-hackers-owner@postgresql.org] On Behalf Of Andrew - Supernews
Sent: Wednesday, October 12, 2005 1:41 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] slow IN() clause for many cases

On 2005-10-11, "Ilia Kantor" <ilia@obnovlenie.ru> wrote:
> When in clause becomes large enough (>20-30 cases),
> It is much better to use "join" way of processing..

or even a different way of writing the IN clause.

This one is one I've used after considerable research:

select * from table
where field in (select (some_array_of_N_items)[i]
                   from generate_series(1,N) as s(i));

This generally plans as a nestloop, with a HashAggregate of the function
scan (of generate_series) on the outer path, and index lookups on the inner
path.

It's worth noting that EXPLAIN ANALYZE doesn't tell the whole story when
comparing queries of this kind. The IN (1,2,...30) form is much slower to
plan, and usually can't usefully be used in prepared form (you'd need to
prepare it separately for every different number of items); in contrast,
the array version can be prepared once and reused.

As the number of items in the IN clause increases, the planning time grows
rather radically. For example with 1000 items (here stashed in a psql
convenience variable for brevity), using 8.1beta2:

test=# prepare tstplan1 as select * from test where id in (:v);
Time: 4881.702 ms

compare:

test=# prepare tstplan2 as select * from test where id in
  (select (ARRAY[:v])[i] from generate_series(1,1000) s(i));
Time: 10.889 ms

(on my machine the break-even point for these two is less than 20 items,
or even less if the array is passed in as a literal or a single parameter
rather than constructed with ARRAY[].)

The actual execution time of these two is very close, with the second
being about 10% slower on my system (31ms vs 34ms, based on \timing values
from psql and averaged over several goes). However, the timings returned
from EXPLAIN ANALYZE are much more skewed: 42ms vs 66ms as reported in the
"total runtime" line. So not only is the planning time different, but also
the instrumentation overhead of EXPLAIN ANALYZE is wildly different between
the two forms.

What this means is that unless you're going to prepare in advance every
possible number of parameters to IN that your app is ever going to use,
the only way to get useful performance for IN queries with more than a
handful of literal values is to use an array method, in spite of the fact
that the bitmap-OR execution plan is actually at least as fast.

--
Andrew, Supernews
http://www.supernews.com - individual and corporate NNTP services

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend