Re: proposal : cross-column stats
From | Heikki Linnakangas |
---|---|
Subject | Re: proposal : cross-column stats |
Date | |
Msg-id | 4D04DF74.1010706@enterprisedb.com |
In response to | Re: proposal : cross-column stats (Martijn van Oosterhout <kleptog@svana.org>) |
Responses |
Re: proposal : cross-column stats
Re: proposal : cross-column stats Re: proposal : cross-column stats Re: proposal : cross-column stats |
List | pgsql-hackers |
On 12.12.2010 15:17, Martijn van Oosterhout wrote:
> On Sun, Dec 12, 2010 at 03:58:49AM +0100, Tomas Vondra wrote:
>
> Very cool that you're working on this.

+1

>> Lets talk about one special case - I'll explain how the proposed
>> solution works, and then I'll explain how to make it more general, what
>> improvements are possible, what issues are there. Anyway this is by no
>> means a perfect or complete solution - it's just a starting point.
>
> It looks like you handled most of the issues. Just a few points:
>
> - This is obviously applicable to more than just integers, probably
>   anything with a b-tree operator class. What you've coded seems rely
>   on calculations on the values. Have you thought about how it could
>   work for, for example, strings?
>
> The classic failure case has always been: postcodes and city names.
> Strongly correlated, but in a way that the computer can't easily see.

Yeah, and that's actually analogous to the example I used in my
presentation.

The way I think of that problem is that once you know the postcode,
knowing the city name doesn't add any information. The postcode implies
the city name. So the selectivity for "postcode = ? AND city = ?" should
be the selectivity of "postcode = ?" alone. The measurement we need is
"implicativeness": How strongly does column A imply a certain value for
column B? Perhaps that could be measured by counting the number of
distinct values of column B for each value of column A, or something
like that. I don't know what the statisticians call that property, or if
there's some existing theory on how to measure that from a sample.

That's assuming the combination has any matches. It's possible that the
user chooses a postcode and city combination that doesn't exist, but
that's no different from a user doing "city = 'fsdfsdfsd'" on a single
column, returning no matches. We should assume that the combination
makes sense.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
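
The "implicativeness" idea in the message above (count the distinct values of column B seen for each value of column A) could be sketched roughly as follows. This is only an illustration, not anything from the proposal or from PostgreSQL: the function names `implicativeness` and `combined_selectivity`, the `tolerance` parameter, and the sample rows are all hypothetical, and a real estimator would also have to correct for sampling error.

```python
from collections import defaultdict


def implicativeness(rows):
    """Average number of distinct B values per distinct A value.

    rows is an iterable of (a, b) pairs taken from a sample.
    A result of 1.0 means that, within this sample, A fully determines B.
    """
    b_values_per_a = defaultdict(set)
    for a, b in rows:
        b_values_per_a[a].add(b)
    if not b_values_per_a:
        raise ValueError("empty sample")
    total_distinct = sum(len(values) for values in b_values_per_a.values())
    return total_distinct / len(b_values_per_a)


def combined_selectivity(sel_a, sel_b, avg_distinct_b_per_a, tolerance=1e-9):
    """Selectivity guess for "A = ? AND B = ?".

    If A (nearly) implies B, use the selectivity of "A = ?" alone, as the
    message suggests. Otherwise fall back to the independence assumption.
    The hard threshold is an assumption made for this sketch.
    """
    if avg_distinct_b_per_a <= 1.0 + tolerance:
        return sel_a
    return sel_a * sel_b


# Made-up sample of (postcode, city) pairs, for illustration only.
sample = [
    ("10001", "New York"),
    ("10002", "New York"),
    ("10003", "New York"),
    ("94103", "San Francisco"),
    ("94105", "San Francisco"),
    ("10001", "New York"),
]

postcode_to_city = implicativeness(sample)
city_to_postcode = implicativeness([(city, code) for code, city in sample])

print(postcode_to_city)  # each postcode maps to exactly one city
print(city_to_postcode)  # each city maps to several postcodes
```

Note that the measure is asymmetric: in this sample the postcode implies the city, but the city does not imply the postcode, so only the "postcode = ?" selectivity can stand in for the combined condition.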