New hashed IN code ignores distinctiveness of subquery

Поиск

Список

Период

Сортировка

От	Bradley Baetz
Тема	New hashed IN code ignores distinctiveness of subquery
Дата	26 января 2003 г. 10:26:16
Msg-id	20030126122612.GA3820@mango.home обсуждение исходный текст
Ответы	Re: New hashed IN code ignores distinctiveness of subquery
Список	pgsql-bugs

Дерево обсуждения

I've been trying out the new hased subselect code from CVS. It appears
that the planner isn't taking the distinctiveness of the values from the
subselect into account:

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

QUERY PLAN


---------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3485675.30..3485675.30 rows=1 width=8) (actual
time=1430065.54..1430065.55 rows=1 loops=1)
   ->  Merge Join  (cost=0.00..3485661.38 rows=5570 width=8) (actual
time=0.15..1429696.69 rows=50000 loops=1)
         Merge Cond: ("outer".product_id = "inner".product_id)
         ->  Index Scan using bugs_product_id_idx on bugs
(cost=0.00..2313.33 rows=50000 width=4) (actual time=0.12..358.43
rows=50000 loops=1)
         ->  Index Scan using bugs_product_id_idx on bugs
(cost=0.00..2313.33 rows=50000 width=4) (actual time=0.01..1152455.44
rows=277884160 loops=1)
 Total runtime: 1430102.08 msec
(6 rows)

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT distinct product_id FROM bugs);

QUERY PLAN


-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3033.30..3033.30 rows=1 width=8) (actual
time=505.17..505.17 rows=1 loops=1)
   ->  Hash Join  (cost=1959.54..3032.67 rows=251 width=8) (actual
time=282.19..456.66 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.01..68.94 rows=50000 loops=1)
         ->  Hash  (cost=1959.52..1959.52 rows=9 width=4) (actual
time=282.14..282.14 rows=0 loops=1)
               ->  Subquery Scan "IN_subquery"  (cost=0.00..1959.52
rows=9 width=4) (actual time=0.13..282.08 rows=9 loops=1)
                     ->  Unique  (cost=0.00..1959.52 rows=9 width=4)
(actual time=0.13..282.03 rows=9 loops=1)
                           ->  Index Scan using bugs_product_id_idx on
bugs  (cost=0.00..1834.52 rows=50000 width=4) (actual time=0.13..245.46
rows=50000 loops=1)
 Total runtime: 505.30 msec
(9 rows)

bbaetz=# set enable_mergejoin=false;
SET

bbaetz=# explain analyze select count(*) FROM bugs where product_id IN
(SELECT product_id FROM bugs);

                                                       QUERY PLAN

-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4281600.45..4281600.45 rows=1 width=8) (actual
time=486.80..486.80 rows=1 loops=1)
   ->  Hash Join  (cost=1091.00..4281586.50 rows=5580 width=8) (actual
time=146.58..425.92 rows=50000 loops=1)
         Hash Cond: ("outer".product_id = "inner".product_id)
         ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000 width=4)
(actual time=0.04..75.73 rows=50000 loops=1)
         ->  Hash  (cost=795.00..795.00 rows=50000 width=4) (actual
time=146.34..146.34 rows=0 loops=1)
               ->  Seq Scan on bugs  (cost=0.00..795.00 rows=50000
width=4) (actual time=0.06..64.98 rows=50000 loops=1)
 Total runtime: 486.91 msec
(7 rows)

bugs is a table with 50000 rows, and products has 10 rows. (Only 9 of
the products are actually used in bugs.product_id, due to an off-by-one
error in the script I used to generate the table)

I still haven't tuned the various optimiser settings, which may explain
part of the enable_mergejoin=false result, although the DISTINCT
probably takes some time too (Side note - is it possible to notice that
DISTINCT on a column with a unique index doesn't need a Unique pass?).
However, 23 minutes vs 0.5 seconds isn't due to that. This is a fairly
useless and silly query though - I was just playing arround.

The tables have been analyzed, and there are separate unique indexes on
products.id, bugs.bug_id and bugs.product_id.

FWIW:

bbaetz=# select n_distinct, correlation FROM pg_stats WHERE
tablename='products' AND attname='product_id';
 n_distinct | correlation
------------+-------------
          9 |   0.0919474
(1 row)

Thanks,

Bradley

В списке pgsql-bugs по дате отправления:

Предыдущее

От: pgsql-bugs@postgresql.org
Дата: 26 января 2003 г., 08:04:12
Сообщение: Bug #887: pg_restore blocks

Следующее

От: Tom Lane
Дата: 26 января 2003 г., 17:10:22
Сообщение: Re: New hashed IN code ignores distinctiveness of subquery

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

New hashed IN code ignores distinctiveness of subquery

Предыдущее

Следующее