PATCH: use foreign keys to improve join estimates v1

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема PATCH: use foreign keys to improve join estimates v1
Дата
Msg-id 552335D9.3090707@2ndquadrant.com
обсуждение исходный текст
Ответы Re: PATCH: use foreign keys to improve join estimates v1  (David Rowley <dgrowleyml@gmail.com>)
Список pgsql-hackers
Hi,

attached is a first version of a patch that aims to improve cardinality
estimates of joins by matching foreign keys between the tables (which
was ignored by the planner until now).

This significantly improves estimates when joining two tables using
multi-column conditions, matching a foreign key between the tables.

Consider for example this simple case

CREATE TABLE dim (a int, b int, primary key (a,b));
CREATE TABLE fact (a int, b int, foreign key (a,b) references dim(a,b));

INSERT INTO dim SELECT i,i FROM generate_series(1,1000000) s(i);
INSERT INTO fact SELECT i,i FROM generate_series(1,1000000) s(i);

ANALYZE dim;
ANALYZE fact;

EXPLAIN SELECT * FROM fact f JOIN dim d USING (a,b);

                            QUERY PLAN
---------------------------------------------------------------------------
Hash Join  (cost=29425.00..51350.01 rows=1 width=16)
   Hash Cond: ((f.a = d.a) AND (f.b = d.b))
   ->  Seq Scan on fact f (cost=0.00..14425.00 rows=1000000 width=8)
   ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
       ->  Seq Scan on dim d (cost=0.00..14425.00 rows=1000000 width=8)
(5 rows)

which is of course completely off, because the query produces 1M rows.

In practice, underestimates like this often cause much more serious
issues in the subsequent steps - for example the next join would
probably be executed as nested loop, which makes sense with a single row
but is an awful choice with 1M rows.

With the patch, the planner realizes there is a matching foreign key,
and tweaks the selectivities in calc_joinrel_size_estimate().

                              QUERY PLAN
-------------------------------------------------------------------------
  Hash Join  (cost=29426.25..250323877.62 rows=1000050 width=8)
    Hash Cond: ((fact.a = dim.a) AND (fact.b = dim.b))
    ->  Seq Scan on fact  (cost=0.00..14425.50 rows=1000050 width=8)
    ->  Hash  (cost=14425.50..14425.50 rows=1000050 width=8)
          ->  Seq Scan on dim  (cost=0.00..14425.50 rows=1000050 width=8)
(5 rows)


There are a few unsolved questions/issues:

(1) The current patch only does the trick when the FK matches the
     conditions perfectly - when there are no missing columns (present
     in the FK, not covered by a condition).

     I believe this might be relaxed in both directions. When the
     conditions don't cover all the FK columns, we know there's at least
     one matching row (and we can estimate the number of matches). In
     the other direction, we can estimate just the 'extra' conditions.

(2) Adding further conditions may further break the estimates, for
     example after adding "WHERE d.a = d.b" this happens

                                QUERY PLAN
------------------------------------------------------------------------
  Hash Join  (cost=16987.50..33931.50 rows=25 width=8)
    Hash Cond: (f.a = d.a)
    ->  Seq Scan on fact f  (cost=0.00..16925.00 rows=5000 width=8)
          Filter: (a = b)
    ->  Hash  (cost=16925.00..16925.00 rows=5000 width=8)
          ->  Seq Scan on dim d  (cost=0.00..16925.00 rows=5000 width=8)
                Filter: (a = b)
(7 rows)

     One issue is that "a=b" condition is poorly estimated due to
     correlation (which might be improved by multi-variate stats). It
     however removes one of the conditions from the join restrict list,
     so it only contains "f.a = d.a" and thus only covers one of the FK
     columns, and thus the patch fails to detect the FK :-(

     Not sure how to fix this - one way might be performing the check
     sooner, before the second join clause is removed (not sure where
     that happens). Another option is reconstructing clauses somehow.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Simon Riggs
Дата:
Сообщение: Re: Auditing extension for PostgreSQL (Take 2)
Следующее
От: Sawada Masahiko
Дата:
Сообщение: Re: Freeze avoidance of very large table.