Re: Cross-column statistics revisited

From Greg Stark
Subject Re: Cross-column statistics revisited
Date
Msg-id FD242D4C-AAB8-4F62-8D61-3DB904A348F1@enterprisedb.com
In reply to Re: Cross-column statistics revisited  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Cross-column statistics revisited  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-hackers
Correlation is the wrong tool. In fact zip codes and cities have nearly
zero correlation: zip codes near 00000 are no more likely to be in
cities starting with A than in cities starting with Z.
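
To make that concrete, here is a quick sketch against a hypothetical
addresses(city text, zip text) table; corr() is PostgreSQL's built-in
Pearson correlation aggregate, and any reasonable linear encoding of
the two columns gives much the same answer:

    -- zip functionally determines city, yet the linear correlation
    -- between these numeric encodings comes out near zero
    SELECT corr(zip::float8, ascii(upper(city)))
    FROM addresses;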

Even with an appropriate tool, I'm not clear what to do with the
information. Consider WHERE city='boston' AND zip='02139' versus
WHERE city='boston' AND zip='90210'. One will produce many more
records than the separate histograms would predict, and the other will
produce zero. How do you determine which category a given pair of
constants falls into?
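
Today the planner simply multiplies the single-column selectivities
for an AND, so both queries get the same estimate (same hypothetical
addresses table as above):

    -- Estimated rows ~ reltuples * sel(city='boston') * sel(zip=...)
    -- for both queries, even though the first clause pair is
    -- redundant and the second is contradictory:
    EXPLAIN SELECT * FROM addresses
      WHERE city = 'boston' AND zip = '02139';
    EXPLAIN SELECT * FROM addresses
      WHERE city = 'boston' AND zip = '90210';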

Separately, you mention cross-table stats, but that's a whole other
kettle of worms. I'm not sure which is easier, but let's do one at a
time?


greg

On 17 Oct 2008, at 12:12 AM, Josh Berkus <josh@agliodbs.com> wrote:

>
>> Yes, or to phrase that another way: What kinds of queries are being
>> poorly optimized now and why?
>
> Well, we have two different correlation problems.  One is the common
> problem of dependent correlation, such as the 1.0 correlation of ZIP
> and CITY fields.  This could in fact be fixed, I believe, via a
> linear math calculation based on the sampled level of correlation,
> assuming we have enough samples.  And it's really only an issue if
> the correlation is > 0.5.
>
> The second type of correlation issue we have is correlating values
> in a parent table with *rows* in a child table (i.e. FK joins).
> Currently, the planner assumes that all rows in the child table are
> evenly distributed against keys in the parent table.  But many
> real-world databases have this kind of problem:
>
> A          B
> 1          10000 rows
> 2          10000 rows
> 3          1000 rows
> 4..1000    0 to 1 rows
>
> For queries which cover values between 4..1000 on A, the misestimate
> won't be much of a real execution problem.  But for values 1, 2, and
> 3, the query will bomb.
>
>> The other half of this is that bad selectivity estimates only matter
>> if they're bad enough to change the plan, and I'm not sure whether
>> cases like this are actually a problem in practice.
>
> My experience is that any estimate which is more than 5x wrong (i.e.
> < 0.2 or > 5.0) usually causes problems, and 3x sometimes causes
> problems.
>
> -- 
> --Josh
>
> Josh Berkus
> PostgreSQL
> San Francisco
>
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
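
To put rough numbers on the parent/child skew above (the schema here
is hypothetical): with roughly 21,000+ child rows spread over at most
1,000 distinct keys, the planner's average-fanout assumption credits
each key with a few dozen rows, so a filter that happens to select
keys 1-3 is underestimated by a couple of orders of magnitude:

    -- parent(a int primary key, flag bool),
    -- child(a int references parent(a))
    EXPLAIN
    SELECT *
    FROM parent p
    JOIN child c ON c.a = p.a
    WHERE p.flag;
    -- If the flag picks out keys 1, 2 and 3, the join really returns
    -- ~21,000 rows, but the estimate is roughly
    --   (parent rows passing the filter) * child_rows / n_distinct(c.a),
    -- i.e. on the order of a hundred rows.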

