Re: Cross-field statistics

Поиск
Список
Период
Сортировка
От Decibel!
Тема Re: Cross-field statistics
Дата
Msg-id F03A6C3F-A325-4BB7-AEA0-784DFE2F5F54@decibel.org
обсуждение исходный текст
Ответ на Re: Cross-field statistics  (Gregory Stark <stark@enterprisedb.com>)
Список pgsql-hackers
On Apr 17, 2008, at 12:22 PM, Gregory Stark wrote:
> "Decibel!" <decibel@decibel.org> writes:
>
>> For each field that isn't already in a set of field groupings
>>  * Sort sample rows on that field
>>  * Calculate correlation for all other fields
>>  * If there are other fields that have a correlation to this sort   
>> order over
>> some threshold, save them along with the field we  originally  
>> sorted on as a
>> new 'field grouping'
>>  * Else, there are no other fields that group with this field;  
>> it's  a "loner"
>
> I think this is going somewhere. But "correlation" isn't quite  
> right. It has
> the same problem our use of correlation for clusteredness has.  
> Consider the
> case of Zip code and City. They're nearly very non-independent  
> variables but
> there's basically no correlation.

If we have a limited number of values in one of the fields, it would  
be possible to build a histogram of values for other fields based on  
the values in the first field. But I can't see how we could possibly  
represent something like city, zip in a compact form. You would have  
to keep a range of zips that cover a city.

Hmm... but we only care about cities that have a substantial number  
of zip codes. This is what the equivalent of the most-common-values  
list would be for cross-platform stats: for the most_common_vals in  
column a, you store a range or histogram of the corresponding values  
in b, assuming that there is a good correspondence.

>> For each field grouping, at a minimum we'd need to store a  
>> histogram  for that
>> grouping.
>
> This is a problem. What does a histogram on a grouping mean? It's  
> not clear
> how to come up with a histogram which can help answer questions like
>   A between ? and ? and B between ? and ?
>
> You can do a histogram on <a,b> or <b,a> but neither are going to be
> especially useful. Heikki and I came up with a weird hybrid thing  
> which might
> be useful for avoiding overestimating selectivity like
>   WHERE city='BOS' AND areacode = '617'
>
> But it didn't help at all with the converse, ie:
>  WHERE city='BOS' AND areacode = '212'
>
> It's hard to see how we could possibly catch cases like that though.

If the two fields share the same correlation, then the histogram is  
just what we use right now. We could actually do this today, but only  
for fields with a high physical correlation. What I was describing  
allowed extending this to fields that have a high correlation to each  
other, even if they didn't have a high physical correlation. I know  
that this doesn't help us for things like city/area code or city/zip,  
but other than my idea above I'm rather at a loss on how to represent  
that in a compact fashion.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Aidan Van Dyk
Дата:
Сообщение: Re: Lessons from commit fest
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Lessons from commit fest