Re: proposal : cross-column stats

From Florian Pflug
Subject Re: proposal : cross-column stats
Msg-id 7DE5DDDD-1E3B-447A-AD43-D63BD0420FCE@phlo.org
In reply to Re: proposal : cross-column stats  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses Re: proposal : cross-column stats  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-hackers
On Dec12, 2010, at 15:43 , Heikki Linnakangas wrote:
> The way I think of that problem is that once you know the postcode, knowing the city name doesn't add any
> information. The postcode implies the city name. So the selectivity for "postcode = ? AND city = ?" should be the
> selectivity of "postcode = ?" alone. The measurement we need is "implicativeness": How strongly does column A imply a
> certain value for column B. Perhaps that could be measured by counting the number of distinct values of column B for
> each value of column A, or something like that. I don't know what the statisticians call that property, or if there's
> some existing theory on how to measure that from a sample.

The statistical term for this is "conditional probability", written P(A|B), meaning the probability of A under the
assumption or knowledge of B. The basic tool for working with conditional probabilities is its defining identity (the
one Bayes' theorem is derived from), which states that

P(A|B) = P(A and B) / P(B).

Currently, we assume that P(A|B) = P(A), meaning the probability (or selectivity as we call it) of an event (like a=3)
does not change under additional assumptions like b=4. The identity above thus becomes

P(A) = P(A and B) / P(B)    <=>
P(A and B) = P(A)*P(B)

which is how we currently compute the selectivity of a clause such as "WHERE a=3 AND b=4".
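
To make the effect of that assumption concrete, here is a small Python sketch comparing the product P(A)*P(B) with the
actual joint selectivity when one column determines the other. The rows and the function names are made up for
illustration; this is not how the planner stores or reads its statistics.

```python
from fractions import Fraction

# Hypothetical (postcode, city) rows; every postcode belongs to exactly one city.
ROWS = [
    ("10115", "Berlin"), ("10115", "Berlin"),
    ("10117", "Berlin"), ("10117", "Berlin"),
    ("80331", "Munich"), ("80331", "Munich"),
    ("80333", "Munich"), ("80333", "Munich"),
]


def selectivity(rows, pred):
    """Fraction of rows satisfying the predicate."""
    return Fraction(sum(1 for r in rows if pred(r)), len(rows))


def independent_estimate(rows, postcode, city):
    """P(A and B) under the independence assumption: P(A) * P(B)."""
    p_a = selectivity(rows, lambda r: r[0] == postcode)
    p_b = selectivity(rows, lambda r: r[1] == city)
    return p_a * p_b


def true_selectivity(rows, postcode, city):
    """P(A and B) counted directly from the rows."""
    return selectivity(rows, lambda r: r[0] == postcode and r[1] == city)


if __name__ == "__main__":
    print("independent:", independent_estimate(ROWS, "10115", "Berlin"))
    print("true:       ", true_selectivity(ROWS, "10115", "Berlin"))
```

With these rows the product comes out at half the true selectivity, and the true selectivity equals that of
"postcode = ?" alone, which is the postcode/city effect described in the quoted text.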

I believe that measuring this by counting the number of distinct values of column B for each A is basically the right
idea. Maybe we could count the number of distinct values of "b" for every one of the most common values of "a", and
compare that to the overall number of distinct values of "b"...
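
A rough Python sketch of that counting idea, just to illustrate it: for each of the most common values of "a", count
the distinct "b" values seen together with it and put that next to the overall number of distinct "b" values. The
function name, the num_common parameter and the sample rows are all hypothetical, and the sketch works on the full set
of rows; how to get a usable estimate from a sample remains the open question.

```python
from collections import Counter, defaultdict


def distinct_b_per_common_a(rows, num_common=2):
    """For each of the num_common most common values of a, return a pair
    (distinct b values seen with that a, overall distinct b values)."""
    a_counts = Counter(a for a, _ in rows)
    b_by_a = defaultdict(set)
    for a, b in rows:
        b_by_a[a].add(b)
    total_distinct_b = len({b for _, b in rows})
    return {
        a: (len(b_by_a[a]), total_distinct_b)
        for a, _ in a_counts.most_common(num_common)
    }


# Hypothetical data where a (postcode) determines b (city).
DEPENDENT = (
    [("10115", "Berlin")] * 3 + [("80331", "Munich")] * 2 + [("10117", "Berlin")]
)

# Hypothetical data where every a occurs with every b.
INDEPENDENT = [(a, b) for a in ("x", "y") for b in ("p", "q", "r")]

if __name__ == "__main__":
    print(distinct_b_per_common_a(DEPENDENT))
    print(distinct_b_per_common_a(INDEPENDENT))
```

If each common "a" is seen with only one of many distinct "b" values, "a" strongly implies "b"; if each common "a" is
seen with nearly all of them, the independence assumption looks reasonable for that pair of columns.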

A (very) quick search on scholar.google.com for "estimate conditional probability" didn't turn up anything useful, but
it's hard to believe that there isn't at least some literature on the subject.

best regards,
Florian Pflug
