Re: proposal : cross-column stats

From Heikki Linnakangas
Subject Re: proposal : cross-column stats
Date
Msg-id 4D04DF74.1010706@enterprisedb.com
In response to Re: proposal : cross-column stats  (Martijn van Oosterhout <kleptog@svana.org>)
Responses Re: proposal : cross-column stats  (Tomas Vondra <tv@fuzzy.cz>)
Re: proposal : cross-column stats  (Florian Pflug <fgp@phlo.org>)
Re: proposal : cross-column stats  (Robert Haas <robertmhaas@gmail.com>)
Re: proposal : cross-column stats  (Tomas Vondra <tv@fuzzy.cz>)
List pgsql-hackers
On 12.12.2010 15:17, Martijn van Oosterhout wrote:
> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
> Very cool that you're working on this.

+1

>> Let's talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> improvements are possible, what issues are there. Anyway this is by no
>> means a perfect or complete solution - it's just a starting point.
>
> It looks like you handled most of the issues. Just a few points:
>
> - This is obviously applicable to more than just integers, probably
>    anything with a b-tree operator class. What you've coded seems to rely
>    on calculations on the values. Have you thought about how it could
>    work for, for example, strings?
>
> The classic failure case has always been: postcodes and city names.
> Strongly correlated, but in a way that the computer can't easily see.

Yeah, and that's actually analogous to the example I used in my 
presentation.

The way I think of that problem is that once you know the postcode, 
knowing the city name doesn't add any information. The postcode implies 
the city name. So the selectivity for "postcode = ? AND city = ?" should 
be the selectivity of "postcode = ?" alone. The measurement we need is
"implicativeness": how strongly does column A imply a certain value for
column B? Perhaps that could be measured by counting the number of
distinct values of column B for each value of column A, or something 
like that. I don't know what the statisticians call that property, or if 
there's some existing theory on how to measure that from a sample.
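
To make the "count the distinct values of B for each value of A" idea
concrete, here is a rough sketch in Python. The function name and the
sample rows are made up for illustration; nothing like this exists in
the tree, and it says nothing about how well the number behaves when
computed from a sample rather than the whole table.

```python
from collections import defaultdict


def implicativeness(rows):
    """Average number of distinct B values seen per distinct A value.

    rows is an iterable of (a, b) pairs, e.g. a sample of the table.
    A result of 1.0 means every A value seen maps to exactly one B value,
    i.e. A fully implies B within the rows examined.
    """
    b_values_per_a = defaultdict(set)
    for a, b in rows:
        b_values_per_a[a].add(b)
    if not b_values_per_a:
        raise ValueError("empty sample")
    total_distinct = sum(len(bs) for bs in b_values_per_a.values())
    return total_distinct / len(b_values_per_a)


# Hypothetical sample of (postcode, city) rows.
sample = [
    ("00100", "Helsinki"),
    ("00100", "Helsinki"),
    ("00120", "Helsinki"),
    ("02100", "Espoo"),
    ("02100", "Espoo"),
]

# postcode -> city: each postcode maps to exactly one city.
postcode_implies_city = implicativeness(sample)

# city -> postcode: Helsinki has two postcodes, so the average is above 1.
city_implies_postcode = implicativeness((city, pc) for pc, city in sample)
```

Note that the measure is not symmetric: the postcode implies the city,
but not the other way around. A small sample sees fewer (A, B) pairs
than the full table, so it can make a column look more implicative than
it really is.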

That's assuming the combination has any matches. It's possible that the 
user chooses a postcode and city combination that doesn't exist, but 
that's no different from a user doing "city = 'fsdfsdfsd'" on a single 
column, returning no matches. We should assume that the combination 
makes sense.
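
Under that assumption, the difference in the estimate could be sketched
like this. The selectivity numbers are invented, and the function is
only an illustration of the arithmetic, not planner code: multiplying
the two selectivities is what you get when the clauses are treated as
independent, while the "implied" case uses the postcode selectivity
alone.

```python
def and_selectivity(sel_a, sel_b, a_implies_b):
    """Estimate the selectivity of "A = ? AND B = ?".

    If A implies B, the clause on B adds no information, so the estimate
    is sel_a alone. Otherwise fall back to treating the clauses as
    independent and multiply.
    """
    if a_implies_b:
        return sel_a
    return sel_a * sel_b


# Hypothetical per-column selectivities.
sel_postcode = 0.001  # one postcode matches 0.1% of the rows
sel_city = 0.01       # one city matches 1% of the rows

independent_estimate = and_selectivity(sel_postcode, sel_city, a_implies_b=False)
implied_estimate = and_selectivity(sel_postcode, sel_city, a_implies_b=True)

# With these numbers the independence assumption underestimates the
# row count by a factor of 1 / sel_city.
underestimate_factor = implied_estimate / independent_estimate
```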

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

