Improvement of var_eq_non_const()

Поиск

Список

Период

Сортировка

От	Teodor Sigaev
Тема	Improvement of var_eq_non_const()
Дата	20 февраля 20:05:48
Msg-id	9789f79b-34f0-49ee-9852-783392a3615c@sigaev.ru обсуждение исходный текст
Ответы	Re: Improvement of var_eq_non_const()
Список	pgsql-hackers

Дерево обсуждения

Hi!

Comment in var_eq_non_const() says:

/*
  * Search is for a value that we do not know a priori, but we will
  * assume it is not NULL.  Estimate the selectivity as non-null
  * fraction divided by number of distinct values, so that we get a
  * result averaged over all possible values whether common or
  * uncommon.  (Essentially, we are assuming that the not-yet-known
  * comparison value is equally likely to be any of the possible
  * values, regardless of their frequency in the table.  Is that a good
  * idea?)
  */

Seems, it is workable, but not so good idea, because it could be a reason of 
significant underestimation in case of non-uniform data distribution. Often, 
this leads to a choice of nested loop instead of other join algorithms and, with 
underestimation, planner chooses a wrong plan. In attachment there is a dump and 
query from real application, but clipped and obfuscated, sorry.

explain analyze
     SELECT
        COUNT(*)
     FROM
        t1
     LEFT JOIN
         t2
             ON t2.b = t1.a
             AND t2.c = 0  //line of interest
             ;

With commented 'line of interest' everything works fine without any problems, 
but with that line we will get another plan, much worst. Interesting, that t2.c 
column contains only one value (zero). Planner chooses nested loop instead of 
hash join and makes a wrong estimation:
   Index Only Scan using i1_t2 on t2  (.. rows=6 ..) (.. rows=3555 loops=2000)
     Index Cond: ((c = 0) AND (b = t1.a))

It supposed six rows but executor got 3555. That caused because 
var_eq_non_const() supposes uniform distribution of t2.b column which is wrong. 
Actually, this example works fine anyway, but real query contains a lot other 
joins connected to t2 and this planner's mistake becomes a huge performance 
issue, up to three orders of magnitude.

I'd like to suggest to improve var_eq_non_const() by using knowledge of MCV and 
estimate the selectivity as quadratic mean of non-null fraction divided by 
number of distinct values (as it was before) and set of MCV selectivities. Use 
quadratic mean because it includes the squared deviation (error) as well and 
here it would be nice to compute upper limit of estimation to prevent wrong 
choose of nested loop, for example.

If nested loop is forced and patch is applied  it will produce much better 
estimation:
Index Only Scan using i1_t2 on t2  (.. rows=3380 ..) (.. rows=3555 loops=2000)

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Вложения

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Improvement of var_eq_non_const()

Вложения