Re: Cross-column statistics revisited

From	Gregory Stark
Subject	Re: Cross-column statistics revisited
Date	17 October 2008, 05:01:11
Msg-id	8763nrhcr9.fsf@oxford.xeocode.com
In reply to	Re: Cross-column statistics revisited (Martijn van Oosterhout <kleptog@svana.org>)
Replies	Re: Cross-column statistics revisited
List	pgsql-hackers

Martijn van Oosterhout <kleptog@svana.org> writes:

> On Fri, Oct 17, 2008 at 12:20:58AM +0200, Greg Stark wrote:
>> Correlation is the wrong tool. In fact zip codes and city have nearly  
>> zero correlation.  Zip codes near 00000 are no more likely to be in  
>> cities starting with A than Z.
>
> I think we need to define our terms better. In terms of linear
> correlation you are correct. However, you can define invertible mappings
> from zip codes and cities onto the integers which will then have an
> almost perfect correlation.
>
> According to a paper I found this is related to the "principle of
> maximum entropy". The fact that you can't determine such functions
> easily in practice doesn't change the fact that zip codes and city
> names are highly correlated.

They're certainly very much not independent variables. There are lots of ways
of measuring how much dependence there is between them. I don't know enough
about the math to know if your maps are equivalent to any of them.
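
As an illustration of the gap between linear correlation and dependence, here
is a minimal sketch on made-up data. The city names, the zip numbering and the
helper functions are all assumptions for the example, not anything PostgreSQL
collects. Each city gets its own block of zip codes, laid out so that zip order
has nothing to do with alphabetical order. Pearson correlation comes out at
zero, while mutual information (one of many possible dependence measures)
equals the full entropy of the city column, because the zip determines the
city.

```python
from collections import Counter
from math import log2, sqrt


def pearson(xs, ys):
    """Plain linear (Pearson) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def entropy(values):
    """Shannon entropy in bits of the empirical distribution."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(pairs):
    """Mutual information in bits between the two columns of `pairs`."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )


# Hypothetical data: four cities, each owning a block of 50 zip codes.
# Block order (Boston, Denver, Austin, Chicago) is deliberately unrelated
# to alphabetical order (Austin, Boston, Chicago, Denver).
blocks = ["Boston", "Denver", "Austin", "Chicago"]
rows = [(10000 * i + k, city) for i, city in enumerate(blocks) for k in range(50)]

alpha_rank = {city: rank for rank, city in enumerate(sorted(blocks))}
zips = [z for z, _ in rows]
ranks = [alpha_rank[city] for _, city in rows]

r = pearson(zips, ranks)                 # linear correlation
mi = mutual_information(rows)            # dependence in bits
h_city = entropy([c for _, c in rows])   # upper bound on mi

print(f"pearson={r:.3f}  mutual_information={mi:.3f}  H(city)={h_city:.3f}")
```

Mutual information is only one choice here; the point of the sketch is just
that a dependence measure and a linear correlation can disagree completely on
the same rows.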

In any case, as I described, it's not enough to know that the two data sets
are heavily dependent. You need to know for which pairs (or n-tuples) that
dependency results in a higher density, for which it results in a lower
density, and how much higher or lower. That seems like a lot of information to
encode (and a lot to find in the sample).
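
To make the per-pair point concrete, here is a small sketch with invented rows.
The sample, the column values and the `density_ratio` helper are hypothetical.
For each (zip, city) pair it compares the observed joint frequency with what an
independence assumption would predict from the two single-column frequencies.
A ratio above 1 means the pair is denser than independence predicts, below 1
means sparser, and 0 means the pair never occurs in the sample.

```python
from collections import Counter

# Hypothetical sample of (zip, city) rows.
sample = (
    [("02139", "Cambridge")] * 3
    + [("02139", "Boston")] * 1
    + [("02108", "Boston")] * 4
)

n = len(sample)
joint = Counter(sample)
zip_counts = Counter(z for z, _ in sample)
city_counts = Counter(c for _, c in sample)


def density_ratio(zip_code, city):
    """Observed joint frequency divided by the independence estimate.

    Independence would estimate P(zip) * P(city); the ratio says how far
    off that estimate is for this particular pair.
    """
    observed = joint[(zip_code, city)] / n
    independent = (zip_counts[zip_code] / n) * (city_counts[city] / n)
    return observed / independent


for z in sorted(zip_counts):
    for c in sorted(city_counts):
        print(f"{z} {c:<10} ratio={density_ratio(z, c):.2f}")
```

Even this toy sample needs one number per pair, which is the storage problem:
the table of ratios grows with the product of the distinct values in each
column, and most of its entries cannot be estimated reliably from a sample.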

Perhaps just knowing that there's a dependence between two data sets might be
somewhat useful if the planner kept a confidence value for all its estimates.
Presumably it would know to assign a lower confidence value to estimates coming
from highly dependent clauses. It wouldn't be very easy for the planner to
distinguish "safe" plans for low-confidence estimates from "risky" plans which
might blow up if the estimates are wrong, though. And of course that's a lot
less interesting than just getting better estimates :)

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!
