Re: Cross-column statistics revisited
From | Greg Stark |
---|---|
Subject | Re: Cross-column statistics revisited |
Date | |
Msg-id | FD242D4C-AAB8-4F62-8D61-3DB904A348F1@enterprisedb.com |
In reply to | Re: Cross-column statistics revisited (Josh Berkus <josh@agliodbs.com>) |
Responses | Re: Cross-column statistics revisited (Martijn van Oosterhout <kleptog@svana.org>) |
List | pgsql-hackers |
Correlation is the wrong tool. In fact zip codes and city have nearly zero correlation: zip codes near 00000 are no more likely to be in cities starting with A than Z.

Even if you use an appropriate tool, I'm not clear what to do with the information. Consider the case of WHERE city='boston' and zip='02139' and another query with WHERE city='boston' and zip='90210'. One will produce many more records than the separate histograms would predict and the other would produce zero. How do you determine which category a given pair of constants falls into?

Separately, you mention cross-table stats - but that's a whole other kettle of worms. I'm not sure which is easier, but let's do one at a time?

greg

On 17 Oct 2008, at 12:12 AM, Josh Berkus <josh@agliodbs.com> wrote:

>> Yes, or to phrase that another way: What kinds of queries are being
>> poorly optimized now and why?
>
> Well, we have two different correlation problems. One is the problem of
> dependent correlation, such as the 1.0 correlation of ZIP and CITY fields,
> which is a common problem. This could in fact be fixed, I believe, via a
> linear math calculation based on the sampled level of correlation,
> assuming we have enough samples. And it's really only an issue if the
> correlation is > 0.5.
>
> The second type of correlation issue we have is correlating values in a
> parent table with *rows* in a child table (i.e. FK joins). Currently, the
> planner assumes that all rows in the child table are evenly distributed
> against keys in the parent table. But many real-world databases have this
> kind of problem:
>
>     A          B
>     1          10000 rows
>     2          10000 rows
>     3          1000 rows
>     4 .. 1000  0 to 1 rows
>
> For queries which cover values between 4..1000 on A, the misestimate
> won't be much of a real execution problem. But for values 1, 2, 3, the
> query will bomb.
>
>> The other half of this is that bad selectivity estimates only matter
>> if they're bad enough to change the plan, and I'm not sure whether
>> cases like this are actually a problem in practice.
>
> My experience is that any estimate which is more than 5x wrong (i.e.
> < 0.2 or > 5.0) usually causes problems, and 3x sometimes causes
> problems.
>
> --
> --Josh
>
> Josh Berkus
> PostgreSQL
> San Francisco
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
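[Editor's sketch] As a rough illustration of the independence assumption Greg is pointing at (this is not planner code; the table size and per-column frequencies are made up), multiplying single-column selectivities gives the same row estimate for a consistent city/zip pair and a contradictory one:

```python
# Hypothetical illustration of per-column selectivity multiplication.
# All numbers are invented; the point is only the shape of the error.
n_rows = 1_000_000

freq_city = {"boston": 0.05}                  # assume 5% of rows are in Boston
freq_zip = {"02139": 0.001, "90210": 0.001}   # assume each zip covers 0.1% of rows

def independent_estimate(city, zipcode):
    # Treat city and zip as independent: selectivity(city AND zip)
    # is the product of the individual selectivities.
    return n_rows * freq_city[city] * freq_zip[zipcode]

# Both pairs get the same estimate of 50 rows, while the "true" answer is
# roughly 1000 rows for the consistent pair ('boston', '02139') and 0 for
# the contradictory pair ('boston', '90210').
print(independent_estimate("boston", "02139"))  # 50.0
print(independent_estimate("boston", "90210"))  # 50.0
```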
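[Editor's sketch] Likewise, a back-of-the-envelope version of Josh's parent/child example (hypothetical numbers matching his table) shows how an even-distribution assumption behaves across skewed keys:

```python
# Hypothetical child-table distribution from Josh's example:
# keys 1 and 2 have 10000 rows, key 3 has 1000, keys 4..1000 have ~1 row.
child_rows_per_key = {1: 10_000, 2: 10_000, 3: 1_000}
child_rows_per_key.update({k: 1 for k in range(4, 1001)})

total_child = sum(child_rows_per_key.values())   # 21997 rows
n_keys = len(child_rows_per_key)                 # 1000 parent keys
uniform_estimate = total_child / n_keys          # ~22 rows per key

# For keys 4..1000 an estimate of ~22 rows is only mildly wrong, but for
# keys 1 and 2 the true 10000 rows are underestimated by roughly 450x,
# which is easily enough to pick a plan that "bombs" at execution time.
print(uniform_estimate)
print(child_rows_per_key[1] / uniform_estimate)
```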