Discussion: Cross-field statistics


Cross-field statistics

From
Decibel!
Date:
I just had an idea about how to create cross-field statistics, which  
could greatly improve the quality of estimates involving multiple  
conditions on one table. This is rather arm-wavy, but I wanted to at  
least get the idea out...

If we built a table via

CREATE TABLE moo AS SELECT i, i*2 AS j FROM generate_series(1,9999) i;

Then it would be nice if the planner produced the same estimate for  
all of these:

SELECT * FROM moo WHERE i>8888 AND j>8888*2;
SELECT * FROM moo WHERE i>8888 OR j>8888*2;
SELECT * FROM moo WHERE i>8888;
SELECT * FROM moo WHERE j>8888*2;

It only actually gets the last 2 correct (1117 rows, close enough to  
the actual 1111 rows). On my laptop, it guesses 125 for the AND case  
and 2109 for the OR case. The problem is that it doesn't know how  
closely i and j are related. But in this (contrived) example, it  
actually *could* make an inference between these two columns, because  
each field has a correlation of 1. That means that you can actually  
compute how much those two conditions will overlap by comparing how  
much they overlap in the histogram that's stored in pg_stats. As a  
first pass, it might be worth having the planner actually take this  
simple case into account.
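The inference sketched above can be illustrated outside the planner. Assuming equi-depth histograms like those in pg_stats, a minimal (hypothetical, non-PostgreSQL) sketch of the idea: when two columns are known to be perfectly correlated, take min/max of the per-column range estimates for AND/OR instead of assuming independence. The function names are made up for the example.

```python
# Sketch (not planner code): combining range estimates for two columns
# that are known to be perfectly correlated.

def fraction_above(histogram_bounds, cutoff):
    """Estimate the fraction of rows with value > cutoff from an
    equi-depth histogram (a list of bucket boundaries), roughly the way
    a single-column estimate works."""
    n_buckets = len(histogram_bounds) - 1
    for b in range(n_buckets):
        lo, hi = histogram_bounds[b], histogram_bounds[b + 1]
        if cutoff < hi:
            within = (hi - cutoff) / (hi - lo)   # interpolate in-bucket
            return (within + (n_buckets - b - 1)) / n_buckets
    return 0.0

# Equi-depth histograms for i (1..9999) and j = i*2, 10 buckets each.
hist_i = [1 + 1000 * k for k in range(10)] + [9999]
hist_j = [2 * v for v in hist_i]

f_i = fraction_above(hist_i, 8888)        # ~0.111
f_j = fraction_above(hist_j, 8888 * 2)    # ~0.111

independent_and = f_i * f_j               # independence assumption: ~0.012
correlated_and = min(f_i, f_j)            # ~0.111, matches the single-column case
correlated_or = max(f_i, f_j)             # ~0.111, likewise
```

With perfect correlation the AND and OR estimates collapse to the single-column estimate, which is exactly the behavior asked for in the four queries above.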

For all the other fields, what if ANALYZE constructed artificial  
correlation orderings? We don't actually care about how well these  
artificial correlations correspond to physical table ordering, we  
only care about how many fields line up with a particular artificial  
ordering. What I'm proposing is that once we have our set of sample  
records in ANALYZE:

For each field that isn't already in a set of field groupings
 * Sort sample rows on that field
 * Calculate correlation for all other fields
 * If there are other fields that have a correlation to this sort order over some threshold, save them along with the field we originally sorted on as a new 'field grouping'
 * Else, there are no other fields that group with this field; it's a "loner"
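The steps above can be sketched in a few lines. This is a hypothetical illustration, not ANALYZE code; it uses a simple Pearson correlation against sort position as a stand-in for the correlation statistic ANALYZE computes, and all names are invented.

```python
# Sketch of the proposed ANALYZE pass: sort the sample on each
# not-yet-grouped field and see which other fields follow that order.

def correlation_to_order(values):
    """Pearson correlation between a column's values and their
    positions (0, 1, 2, ...) in the current sort order."""
    n = len(values)
    mx, my = (n - 1) / 2, sum(values) / n
    sxy = sum((x - mx) * (y - my) for x, y in enumerate(values))
    sxx = sum((x - mx) ** 2 for x in range(n))
    syy = sum((y - my) ** 2 for y in values)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def find_field_groupings(sample, fields, threshold=0.9):
    """sample: list of dicts, one per sampled row."""
    grouped, groupings, loners = set(), [], []
    for f in fields:
        if f in grouped:
            continue
        ordered = sorted(sample, key=lambda row: row[f])
        members = [f]
        for g in fields:
            if g == f or g in grouped:
                continue
            corr = correlation_to_order([row[g] for row in ordered])
            if abs(corr) >= threshold:
                members.append(g)
        if len(members) > 1:
            groupings.append(members)
            grouped.update(members)
        else:
            loners.append(f)           # no other field tracks this one
    return groupings, loners

# i and j move together; k is a scrambled function of i.
rows = [{'i': i, 'j': 2 * i, 'k': (i * 37) % 101} for i in range(1, 200)]
groups, loners = find_field_groupings(rows, ['i', 'j', 'k'])
# groups -> [['i', 'j']], loners -> ['k']
```

Because each field joins at most one grouping and grouped fields are skipped on later passes, the number of full sort-and-scan passes stays below the number of fields, which is the sub-O(n^2) property claimed below.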

For each field grouping, at a minimum we'd need to store a histogram  
for that grouping. It might be worth looking at how things change  
when we sort on different fields in the grouping... the lower the  
correlation threshold used to identify groupings, the more  
variability there will be. I think we'd also want to consider how  
well each field in the grouping correlated to that grouping. It might  
also be worth iteratively dropping the correlation threshold and  
searching again for groupings. At some point we lose the ability to  
draw meaningful conclusions from the information, but I'd expect
there's some way we can calculate epsilon for different groupings and  
take that into account with query plans.

The important thing is that this scheme adds less than O(n) (n being  
the number of fields), and not O(n^2), both in terms of ANALYZE (ok,  
maybe not entirely true since presumably we don't do any sorting  
there right now) and in terms of storing statistics. I'm not sure  
what it would do to the planner; the entire key there would be  
identifying field groupings that covered sets of fields in the WHERE  
clause.
--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828



Re: Cross-field statistics

From
Gregory Stark
Date:
"Decibel!" <decibel@decibel.org> writes:

> For each field that isn't already in a set of field groupings
>  * Sort sample rows on that field
>  * Calculate correlation for all other fields
>  * If there are other fields that have a correlation to this sort  order over
> some threshold, save them along with the field we  originally sorted on as a
> new 'field grouping'
>  * Else, there are no other fields that group with this field; it's  a "loner"

I think this is going somewhere. But "correlation" isn't quite right. It has
the same problem our use of correlation for clusteredness has. Consider the
case of Zip code and City. They're clearly very non-independent variables but
there's basically no correlation.
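This failure mode is easy to demonstrate (a sketch, not planner code): below, b is a deterministic function of a, so the two are perfectly dependent, yet their linear correlation is essentially zero, so a correlation-threshold grouping pass would never pair them. The scrambled mapping plays the role of the arbitrary assignment of zip codes to cities.

```python
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(0)
# An arbitrary fixed mapping, like the assignment of zip codes to cities.
perm = list(range(10000))
random.shuffle(perm)

a = [random.randrange(10000) for _ in range(5000)]
b = [perm[x] for x in a]      # b is completely determined by a

r = pearson(a, b)             # yet the linear correlation is near zero
```

Any metric driven by sort order (Pearson, and even rank correlation) misses this, which is why a different clusteredness/dependence metric would be needed.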

If we found the right metric for clusteredness we could probably use it here
too.

> For each field grouping, at a minimum we'd need to store a histogram  for that
> grouping. 

This is a problem. What does a histogram on a grouping mean? It's not clear
how to come up with a histogram which can help answer questions like
  A between ? and ? and B between ? and ?

You can do a histogram on <a,b> or <b,a> but neither are going to be
especially useful. Heikki and I came up with a weird hybrid thing which might
be useful for avoiding overestimating selectivity like  WHERE city='BOS' AND areacode = '617'

But it didn't help at all with the converse, ie:
  WHERE city='BOS' AND areacode = '212'

It's hard to see how we could possibly catch cases like that though.
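For numeric range predicates, one standard answer outside PostgreSQL is a 2-D equi-width grid histogram rather than a histogram on the lexicographically sorted pair; a minimal sketch, with invented names. Sorting on <a,b> destroys b's marginal distribution, while a grid keeps both axes.

```python
# Sketch: a 2-D grid histogram answering
#   a BETWEEN ? AND ? AND b BETWEEN ? AND ?

def cell_index(v, lo, hi, nbins):
    """Equi-width bin index of v within [lo, hi]."""
    return min(int((v - lo) / (hi - lo) * nbins), nbins - 1)

def build_grid(rows, nbins, a_rng, b_rng):
    grid = [[0] * nbins for _ in range(nbins)]
    for a, b in rows:
        grid[cell_index(a, *a_rng, nbins)][cell_index(b, *b_rng, nbins)] += 1
    return grid

def selectivity(grid, total, nbins, a_rng, b_rng, qa, qb):
    """Fraction of rows in cells touched by both ranges (coarse:
    whole cells at the query edges are counted)."""
    ia0, ia1 = (cell_index(v, *a_rng, nbins) for v in qa)
    ib0, ib1 = (cell_index(v, *b_rng, nbins) for v in qb)
    hit = sum(grid[ia][ib]
              for ia in range(ia0, ia1 + 1)
              for ib in range(ib0, ib1 + 1))
    return hit / total

# The moo table from the first message: i in 1..9999, j = 2*i.
rows = [(i, 2 * i) for i in range(1, 10000)]
a_rng, b_rng, nbins = (1, 10000), (2, 20000), 10
grid = build_grid(rows, nbins, a_rng, b_rng)

# WHERE i > 8888 AND j > 8888*2, expressed as ranges.
est = selectivity(grid, len(rows), nbins, a_rng, b_rng,
                  (8889, 9999), (17778, 19998))
# ~0.2 with this coarse grid: an overestimate of the true ~0.111, but
# far closer than the ~0.012 the independence assumption yields.
```

Note this does nothing for categorical columns like city and areacode, which have no meaningful ordering to bin on, so it doesn't address the 'BOS'/'212' case above.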

> The important thing is that this scheme adds less than O(n) (n being  the
> number of fields), and not O(n^2), both in terms of ANALYZE

It looks like a good method for *finding* column sets which will be
interesting to keep more stats on. That's definitely one of the challenges.

I'm still not sure what stats to actually gather on the resulting column sets.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


Re: Cross-field statistics

From
Decibel!
Date:
On Apr 17, 2008, at 12:22 PM, Gregory Stark wrote:
> "Decibel!" <decibel@decibel.org> writes:
>
>> For each field that isn't already in a set of field groupings
>>  * Sort sample rows on that field
>>  * Calculate correlation for all other fields
>>  * If there are other fields that have a correlation to this sort   
>> order over
>> some threshold, save them along with the field we  originally  
>> sorted on as a
>> new 'field grouping'
>>  * Else, there are no other fields that group with this field;  
>> it's  a "loner"
>
> I think this is going somewhere. But "correlation" isn't quite  
> right. It has
> the same problem our use of correlation for clusteredness has.  
> Consider the
> case of Zip code and City. They're clearly very non-independent
> variables but
> there's basically no correlation.

If we have a limited number of values in one of the fields, it would  
be possible to build a histogram of values for other fields based on  
the values in the first field. But I can't see how we could possibly  
represent something like city, zip in a compact form. You would have  
to keep a range of zips that cover a city.

Hmm... but we only care about cities that have a substantial number  
of zip codes. This is what the equivalent of the most-common-values  
list would be for cross-field stats: for the most_common_vals in
column a, you store a range or histogram of the corresponding values  
in b, assuming that there is a good correspondence.
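The idea can be sketched as follows (hypothetical, not a patch): alongside column a's most-common-values list, store the range of b values seen with each common a value, and use it to veto impossible combinations before falling back to the independence assumption. All function names are invented.

```python
from collections import Counter, defaultdict

def build_stats(rows, n_mcv=10):
    a_counts = Counter(a for a, _ in rows)
    b_counts = Counter(b for _, b in rows)
    mcvs = dict(a_counts.most_common(n_mcv))
    b_by_a = defaultdict(list)
    for a, b in rows:
        if a in mcvs:
            b_by_a[a].append(b)
    # For each common a value, the range of co-occurring b values.
    b_range = {a: (min(vs), max(vs)) for a, vs in b_by_a.items()}
    return a_counts, b_counts, b_range

def selectivity_eq(a_counts, b_counts, b_range, total, a_val, b_val):
    """Selectivity of a = a_val AND b = b_val."""
    if a_val in b_range:
        lo, hi = b_range[a_val]
        if not (lo <= b_val <= hi):
            return 0.0        # the stored range rules this combination out
    # Otherwise fall back to today's independence assumption.
    return (a_counts[a_val] / total) * (b_counts[b_val] / total)

# Toy city/zip data: BOS owns zips 2100-2119, NYC owns 10000-10019.
rows = ([('BOS', 2100 + z) for z in range(20)]
        + [('NYC', 10000 + z) for z in range(20)]) * 5
a_counts, b_counts, b_range = build_stats(rows)
total = len(rows)

impossible = selectivity_eq(a_counts, b_counts, b_range, total, 'BOS', 10005)
plausible = selectivity_eq(a_counts, b_counts, b_range, total, 'BOS', 2105)
```

This catches the contradictory city/zip pairing (estimate 0) while leaving consistent pairings to the existing machinery; storage stays proportional to the MCV list length.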

>> For each field grouping, at a minimum we'd need to store a  
>> histogram  for that
>> grouping.
>
> This is a problem. What does a histogram on a grouping mean? It's  
> not clear
> how to come up with a histogram which can help answer questions like
>   A between ? and ? and B between ? and ?
>
> You can do a histogram on <a,b> or <b,a> but neither are going to be
> especially useful. Heikki and I came up with a weird hybrid thing  
> which might
> be useful for avoiding overestimating selectivity like
>   WHERE city='BOS' AND areacode = '617'
>
> But it didn't help at all with the converse, ie:
>  WHERE city='BOS' AND areacode = '212'
>
> It's hard to see how we could possibly catch cases like that though.

If the two fields share the same correlation, then the histogram is  
just what we use right now. We could actually do this today, but only  
for fields with a high physical correlation. What I was describing  
allowed extending this to fields that have a high correlation to each  
other, even if they didn't have a high physical correlation. I know  
that this doesn't help us for things like city/area code or city/zip,  
but other than my idea above I'm rather at a loss on how to represent  
that in a compact fashion.
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828