Re: proposal : cross-column stats

Поиск

Список

Период

Сортировка

От	Heikki Linnakangas
Тема	Re: proposal : cross-column stats
Дата	12 декабря 2010 г. 10:43:14
Msg-id	4D04DF74.1010706@enterprisedb.com обсуждение исходный текст
Ответ на	Re: proposal : cross-column stats (Martijn van Oosterhout <kleptog@svana.org>)
Ответы	Re: proposal : cross-column stats Re: proposal : cross-column stats Re: proposal : cross-column stats Re: proposal : cross-column stats
Список	pgsql-hackers

Дерево обсуждения

On 12.12.2010 15:17, Martijn van Oosterhout wrote:
> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
> Very cool that you're working on this.

+1

>> Lets talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> improvements are possible, what issues are there. Anyway this is by no
>> means a perfect or complete solution - it's just a starting point.
>
> It looks like you handled most of the issues. Just a few points:
>
> - This is obviously applicable to more than just integers, probably
>    anything with a b-tree operator class. What you've coded seems rely
>    on calculations on the values. Have you thought about how it could
>    work for, for example, strings?
>
> The classic failure case has always been: postcodes and city names.
> Strongly correlated, but in a way that the computer can't easily see.

Yeah, and that's actually analogous to the example I used in my 
presentation.

The way I think of that problem is that once you know the postcode, 
knowing the city name doesn't add any information. The postcode implies 
the city name. So the selectivity for "postcode = ? AND city = ?" should 
be the selectivity of "postcode = ?" alone. The measurement we need is 
"implicativeness": How strongly does column A imply a certain value for 
column B. Perhaps that could be measured by counting the number of 
distinct values of column B for each value of column A, or something 
like that. I don't know what the statisticians call that property, or if 
there's some existing theory on how to measure that from a sample.

That's assuming the combination has any matches. It's possible that the 
user chooses a postcode and city combination that doesn't exist, but 
that's no different from a user doing "city = 'fsdfsdfsd'" on a single 
column, returning no matches. We should assume that the combination 
makes sense.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: proposal : cross-column stats