Re: proposal : cross-column stats

From tv@fuzzy.cz
Subject Re: proposal : cross-column stats
Msg-id 0f997798b71d041e31262a49e361db88.squirrel@sq.gransy.com
In reply to Re: proposal : cross-column stats  (Florian Pflug <fgp@phlo.org>)
Responses Re: proposal : cross-column stats  (Florian Pflug <fgp@phlo.org>)
List pgsql-hackers
> On Dec17, 2010, at 23:12 , Tomas Vondra wrote:
>> Well, not really - I haven't done any experiments with it. For two
>> columns selectivity equation is
>>
>>      (dist(A) * sel(A) + dist(B) * sel(B)) / (2 * dist(A,B))
>>
>> where A and B are columns, dist(X) is number of distinct values in
>> column X and sel(X) is selectivity of column X.
>
> Huh? This is the selectivity estimate for "A = x AND B = y"? Surely,
> if A and B are independent, the formula must reduce to sel(A) * sel(B),
> and I cannot see how that'd work with the formula above.

Yes, it's a selectivity estimate for P(A=a and B=b). It's based on
conditional probability, as
  P(A=a and B=b) = P(A=a|B=b)*P(B=b) = P(B=b|A=a)*P(A=a)

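and on a "uniform correlation" assumption, which makes it possible to
replace the conditional probabilities with constants. Those constants are
then estimated as dist(B)/dist(A,B) for P(A=a|B=b) and dist(A)/dist(A,B)
for P(B=b|A=a), and the formula above is the average of the two resulting
estimates.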

So it does not reduce to sel(A)*sel(B) exactly, as dist(A)/dist(A,B) is
just an estimate of P(B|A). The paper states that this works best for
highly correlated data, while for weakly correlated data it at least
matches the usual estimates.
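
To make the behaviour concrete, here is a minimal sketch of the formula in Python. The function name and the sample numbers are made up for illustration, and it assumes uniformly distributed columns, so that sel(X) = 1/dist(X). Under that assumption the estimate happens to equal sel(A)*sel(B) for independent columns, and equals sel(A) for perfectly correlated ones.

```python
def cross_column_selectivity(sel_a, sel_b, dist_a, dist_b, dist_ab):
    """Estimate P(A=a and B=b) with the formula quoted above.

    sel_a, sel_b    -- per-column selectivities sel(A), sel(B)
    dist_a, dist_b  -- distinct value counts dist(A), dist(B)
    dist_ab         -- distinct (A, B) combinations dist(A,B)
    """
    # P(B=b|A=a) is approximated by dist(A)/dist(A,B),
    # P(A=a|B=b) by dist(B)/dist(A,B); the two products are averaged.
    est_via_a = dist_a / dist_ab * sel_a
    est_via_b = dist_b / dist_ab * sel_b
    return (est_via_a + est_via_b) / 2


# Hypothetical independent, uniform columns: dist(A,B) = dist(A) * dist(B).
independent = cross_column_selectivity(1 / 10, 1 / 20, 10, 20, 200)

# Hypothetical perfectly correlated columns: dist(A,B) = dist(A) = dist(B).
correlated = cross_column_selectivity(1 / 10, 1 / 10, 10, 10, 10)

print(independent, correlated)
```

In the correlated case the plain product sel(A)*sel(B) would give 0.01, ten times lower than the estimate from the formula. With skewed (non-uniform) data the independent case would no longer match sel(A)*sel(B) exactly.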

I'm not saying it's perfect, but it seems to produce reasonable estimates.

Tomas


