Re: Cross-column statistics revisited

From Greg Stark
Subject Re: Cross-column statistics revisited
Date
Msg-id FD242D4C-AAB8-4F62-8D61-3DB904A348F1@enterprisedb.com
In reply to Re: Cross-column statistics revisited  (Josh Berkus <josh@agliodbs.com>)
Responses Re: Cross-column statistics revisited  (Martijn van Oosterhout <kleptog@svana.org>)
List pgsql-hackers
Correlation is the wrong tool. In fact zip codes and cities have nearly
zero correlation: zip codes near 00000 are no more likely to be in
cities starting with A than in cities starting with Z.
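
To make that concrete, here is a quick sketch against a hypothetical
addresses(city text, zip text) table; corr() is PostgreSQL's built-in
Pearson correlation aggregate, and any reasonable linear encoding of
the two columns gives much the same answer:

    -- zip functionally determines city, yet the linear correlation
    -- between these numeric encodings comes out near zero
    SELECT corr(zip::float8, ascii(upper(city)))
    FROM addresses;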

Even with an appropriate tool, I'm not clear what to do with the
information. Consider WHERE city='boston' AND zip='02139' versus
WHERE city='boston' AND zip='90210'. One will produce many more
records than the separate histograms would predict, and the other will
produce zero. How do you determine which category a given pair of
constants falls into?
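
Today the planner simply multiplies the single-column selectivities
for an AND, so both queries get the same estimate (same hypothetical
addresses table as above):

    -- Estimated rows ~ reltuples * sel(city='boston') * sel(zip=...)
    -- for both queries, even though the first clause pair is
    -- redundant and the second is contradictory:
    EXPLAIN SELECT * FROM addresses
      WHERE city = 'boston' AND zip = '02139';
    EXPLAIN SELECT * FROM addresses
      WHERE city = 'boston' AND zip = '90210';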

Separately, you mention cross-table stats, but that's a whole other
kettle of worms. I'm not sure which is easier, but let's do one at a
time?


greg

On 17 Oct 2008, at 12:12 AM, Josh Berkus <josh@agliodbs.com> wrote:

>
>> Yes, or to phrase that another way: What kinds of queries are being
>> poorly optimized now and why?
>
> Well, we have two different correlation problems.  One is the common
> problem of dependent correlation, such as the 1.0 correlation of ZIP
> and CITY fields.  This could in fact be fixed, I believe, via a
> linear math calculation based on the sampled level of correlation,
> assuming we have enough samples.  And it's really only an issue if
> the correlation is > 0.5.
>
> The second type of correlation issue we have is correlating values
> in a parent table with *rows* in a child table (i.e. FK joins).
> Currently, the planner assumes that all rows in the child table are
> evenly distributed against keys in the parent table.  But many
> real-world databases have this kind of problem:
>
> A          B
> 1          10000 rows
> 2          10000 rows
> 3          1000 rows
> 4..1000    0 to 1 rows
>
> For queries which cover values between 4..1000 on A, the misestimate
> won't be much of a real execution problem.  But for values 1, 2, and
> 3, the query will bomb.
>
>> The other half of this is that bad selectivity estimates only matter
>> if they're bad enough to change the plan, and I'm not sure whether
>> cases like this are actually a problem in practice.
>
> My experience is that any estimate which is more than 5x wrong (i.e.
> < 0.2 or > 5.0) usually causes problems, and 3x sometimes causes
> problems.
>
> -- 
> --Josh
>
> Josh Berkus
> PostgreSQL
> San Francisco
>
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
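
To put rough numbers on the parent/child skew above (the schema here
is hypothetical): with roughly 21,000+ child rows spread over at most
1,000 distinct keys, the planner's average-fanout assumption credits
each key with a few dozen rows, so a filter that happens to select
keys 1-3 is underestimated by a couple of orders of magnitude:

    -- parent(a int primary key, flag bool),
    -- child(a int references parent(a))
    EXPLAIN
    SELECT *
    FROM parent p
    JOIN child c ON c.a = p.a
    WHERE p.flag;
    -- If the flag picks out keys 1, 2 and 3, the join really returns
    -- ~21,000 rows, but the estimate is roughly
    --   (parent rows passing the filter) * child_rows / n_distinct(c.a),
    -- i.e. on the order of a hundred rows.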

