New hashed IN code ignores distinctiveness of subquery

Поиск
Список
Период
Сортировка
От Bradley Baetz
Тема New hashed IN code ignores distinctiveness of subquery
Дата
Msg-id 20030126122612.GA3820@mango.home
обсуждение исходный текст
Ответы Re: New hashed IN code ignores distinctiveness of subquery  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-bugs
I've been trying out the new hased subselect code from CVS. It appears
that the planner isn't taking the distinctiveness of the values from the
subselect into account:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

QUERY PLAN


---------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3485675.30..3485675.30 rows=1 width=8) (actual
time=1430065.54..1430065.55 rows=1 loops=1)
   ->  Merge Join  (cost=0.00..3485661.38 rows=5570 width=8) (actual
time=0.15..1429696.69 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs
(cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43
rows=50000 loops=1)
         ->  Index Scan using bugs_product_id_idx on bugs
(cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44
rows=277884160 loops=1)
 Total runtime: 1430102.08 msec
(6 rows)

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT distinct product_id FROM bugs);

QUERY PLAN


-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3033.30..3033.30 rows=1 width=8) (actual
time=505.17..505.17 rows=1 loops=1)
   ->  Hash Join  (cost=1959.54..3032.67 rows=251 width=8) (actual
time=282.19..456.66 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.01..68.94 rows=50000 loops=1)
         ->  Hash  (cost=1959.52..1959.52 rows=9 width=4) (actual
time=282.14..282.14 rows=0 loops=1)
               ->  Subquery Scan "IN_subquery"  (cost=0.00..1959.52
rows=9 width=4) (actual time=0.13..282.08 rows=9 loops=1)
                     ->  Unique  (cost=0.00..1959.52 rows=9 width=4)
(actual time=0.13..282.03 rows=9 loops=1)
                           ->  Index Scan using bugs_product_id_idx on
bugs  (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..245.46
rows=50000 loops=1)
 Total runtime: 505.30 msec
(9 rows)

bbaetz=# set enable_mergejoin=false;
SET

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

                                                       QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4281600.45..4281600.45 rows=1 width=8) (actual
time=486.80..486.80 rows=1 loops=1)
   ->  Hash Join  (cost=1091.00..4281586.50 rows=5580 width=8) (actual
time=146.58..425.92 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.04..75.73 rows=50000 loops=1)
         ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual
time=146.34..146.34 rows=0 loops=1)
               ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000
width=4) (actual time=0.06..64.98 rows=50000 loops=1)
 Total runtime: 486.91 msec
(7 rows)

bugs is a table with 50000 rows, and products has 10 rows. (Only 9 of
the products are actually used in bugs.product_id, due to an off-by-one
error in the script I used to generate the table)

I still haven't tuned the various optimiser settings, which may explain
part of the enable_mergejoin=false result, although the DISTINCT
probably takes some time too (Side note - is it possible to notice that
DISTINCT on a column with a unique index doesn't need a Unique pass?).
However, 23 minutes vs 0.5 seconds isn't due to that. This is a fairly
useless and silly query though - I was just playing arround.

The tables have been analyzed, and there are separate unique indexes on
products.id, bugs.bug_id and bugs.product_id.

FWIW:

bbaetz=# select n_distinct, correlation FROM pg_stats WHERE
tablename='products' AND attname='product_id';
 n_distinct | correlation
------------+-------------
          9 |   0.0919474
(1 row)

Thanks,

Bradley

В списке pgsql-bugs по дате отправления:

Предыдущее
От: pgsql-bugs@postgresql.org
Дата:
Сообщение: Bug #887: pg_restore blocks
Следующее
От: Tom Lane
Дата:
Сообщение: Re: New hashed IN code ignores distinctiveness of subquery