PATCH: use foreign keys to improve join estimates v1
From    | Tomas Vondra
Subject | PATCH: use foreign keys to improve join estimates v1
Date    |
Msg-id  | 552335D9.3090707@2ndquadrant.com
Replies | Re: PATCH: use foreign keys to improve join estimates v1
          (David Rowley <dgrowleyml@gmail.com>)
List    | pgsql-hackers
Hi,

attached is a first version of a patch that aims to improve cardinality
estimates of joins by matching foreign keys between the tables (which
the planner has ignored until now). This significantly improves the
estimates when joining two tables on a multi-column condition that
matches a foreign key between them. Consider for example this simple case:

    CREATE TABLE dim (a int, b int, primary key (a,b));

    CREATE TABLE fact (a int, b int,
                       foreign key (a,b) references dim(a,b));

    INSERT INTO dim  SELECT i,i FROM generate_series(1,1000000) s(i);
    INSERT INTO fact SELECT i,i FROM generate_series(1,1000000) s(i);

    ANALYZE dim;
    ANALYZE fact;

    EXPLAIN SELECT * FROM fact f JOIN dim d USING (a,b);

                                   QUERY PLAN
    ---------------------------------------------------------------------------
     Hash Join  (cost=29425.00..51350.01 rows=1 width=16)
       Hash Cond: ((f.a = d.a) AND (f.b = d.b))
       ->  Seq Scan on fact f  (cost=0.00..14425.00 rows=1000000 width=8)
       ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8)
             ->  Seq Scan on dim d  (cost=0.00..14425.00 rows=1000000 width=8)
    (5 rows)

This is of course completely off, because the query actually produces
1M rows. In practice, underestimates like this often cause much more
serious issues in the subsequent planning steps - for example the next
join would probably be executed as a nested loop, which makes sense for
a single row but is an awful choice for 1M rows.

With the patch, the planner realizes there is a matching foreign key
and tweaks the selectivities in calc_joinrel_size_estimate():
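To illustrate where the rows=1 estimate comes from, here is a minimal
arithmetic sketch (not from the patch itself - it merely assumes the
planner multiplies the per-column equality selectivities as if they
were independent, with 1M distinct values in each column):

    # Join size arithmetic for the example above, assuming per-column
    # independence (unpatched planner) vs. using the FK (patched).
    rows_fact = rows_dim = 1_000_000
    ndistinct_a = ndistinct_b = 1_000_000  # every value is unique here

    # Without the FK: the selectivities of the two equality clauses are
    # multiplied as if independent, so the estimate collapses to ~1 row.
    sel_independent = (1.0 / ndistinct_a) * (1.0 / ndistinct_b)
    rows_independent = rows_fact * rows_dim * sel_independent  # ~1

    # With the FK: each fact row matches exactly one dim row, so the
    # join selectivity is 1 / rows_dim and the estimate is ~1M rows.
    sel_fk = 1.0 / rows_dim
    rows_fk = rows_fact * rows_dim * sel_fk  # ~1,000,000

The same multiplication is what goes wrong for any multi-column FK: the
more columns the key has, the smaller the product of per-column
selectivities, and the worse the underestimate.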
                                  QUERY PLAN
    -------------------------------------------------------------------------
     Hash Join  (cost=29426.25..250323877.62 rows=1000050 width=8)
       Hash Cond: ((fact.a = dim.a) AND (fact.b = dim.b))
       ->  Seq Scan on fact  (cost=0.00..14425.50 rows=1000050 width=8)
       ->  Hash  (cost=14425.50..14425.50 rows=1000050 width=8)
             ->  Seq Scan on dim  (cost=0.00..14425.50 rows=1000050 width=8)
    (5 rows)

There are a few unsolved questions/issues:

(1) The current patch only does the trick when the FK matches the join
    conditions perfectly - that is, when no column of the FK is left
    uncovered by a condition. I believe this might be relaxed in both
    directions. When the conditions don't cover all the FK columns, we
    still know there is at least one matching row (and we can estimate
    the number of matches). In the other direction, when there are
    conditions on columns beyond the FK, we can estimate just those
    'extra' conditions separately.

(2) Adding further conditions may break the estimates again - for
    example after adding "WHERE d.a = d.b" this happens:

                                  QUERY PLAN
    ------------------------------------------------------------------------
     Hash Join  (cost=16987.50..33931.50 rows=25 width=8)
       Hash Cond: (f.a = d.a)
       ->  Seq Scan on fact f  (cost=0.00..16925.00 rows=5000 width=8)
             Filter: (a = b)
       ->  Hash  (cost=16925.00..16925.00 rows=5000 width=8)
             ->  Seq Scan on dim d  (cost=0.00..16925.00 rows=5000 width=8)
                   Filter: (a = b)
    (7 rows)

    One issue is that the "a=b" condition itself is poorly estimated
    due to the correlation between the columns (which might be improved
    by multivariate statistics). It also removes one of the clauses
    from the join restrict list, so the list only contains "f.a = d.a"
    and thus covers just one of the FK columns, and the patch fails to
    detect the FK :-(

    Not sure how to fix this - one way might be to perform the FK check
    earlier, before the second join clause is removed (not sure where
    that happens). Another option is to reconstruct the clauses somehow.

regards

--
Tomas Vondra                http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments