Improvement of var_eq_non_const()
От | Teodor Sigaev |
---|---|
Тема | Improvement of var_eq_non_const() |
Дата | |
Msg-id | 9789f79b-34f0-49ee-9852-783392a3615c@sigaev.ru обсуждение исходный текст |
Ответы |
Re: Improvement of var_eq_non_const()
|
Список | pgsql-hackers |
Hi! Comment in var_eq_non_const() says: /* * Search is for a value that we do not know a priori, but we will * assume it is not NULL. Estimate the selectivity as non-null * fraction divided by number of distinct values, so that we get a * result averaged over all possible values whether common or * uncommon. (Essentially, we are assuming that the not-yet-known * comparison value is equally likely to be any of the possible * values, regardless of their frequency in the table. Is that a good * idea?) */ Seems, it is workable, but not so good idea, because it could be a reason of significant underestimation in case of non-uniform data distribution. Often, this leads to a choice of nested loop instead of other join algorithms and, with underestimation, planner chooses a wrong plan. In attachment there is a dump and query from real application, but clipped and obfuscated, sorry. explain analyze SELECT COUNT(*) FROM t1 LEFT JOIN t2 ON t2.b = t1.a AND t2.c = 0 //line of interest ; With commented 'line of interest' everything works fine without any problems, but with that line we will get another plan, much worst. Interesting, that t2.c column contains only one value (zero). Planner chooses nested loop instead of hash join and makes a wrong estimation: Index Only Scan using i1_t2 on t2 (.. rows=6 ..) (.. rows=3555 loops=2000) Index Cond: ((c = 0) AND (b = t1.a)) It supposed six rows but executor got 3555. That caused because var_eq_non_const() supposes uniform distribution of t2.b column which is wrong. Actually, this example works fine anyway, but real query contains a lot other joins connected to t2 and this planner's mistake becomes a huge performance issue, up to three orders of magnitude. I'd like to suggest to improve var_eq_non_const() by using knowledge of MCV and estimate the selectivity as quadratic mean of non-null fraction divided by number of distinct values (as it was before) and set of MCV selectivities. Use quadratic mean because it includes the squared deviation (error) as well and here it would be nice to compute upper limit of estimation to prevent wrong choose of nested loop, for example. If nested loop is forced and patch is applied it will produce much better estimation: Index Only Scan using i1_t2 on t2 (.. rows=3380 ..) (.. rows=3555 loops=2000) -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Вложения
В списке pgsql-hackers по дате отправления: