Re: Statistics and selectivity estimation for ranges
От | Heikki Linnakangas |
---|---|
Тема | Re: Statistics and selectivity estimation for ranges |
Дата | |
Msg-id | 502B5A76.7000205@enterprisedb.com обсуждение исходный текст |
Ответ на | Re: Statistics and selectivity estimation for ranges (Alexander Korotkov <aekorotkov@gmail.com>) |
Ответы |
Re: Statistics and selectivity estimation for ranges
|
Список | pgsql-hackers |
On 15.08.2012 10:38, Alexander Korotkov wrote: > On Tue, Aug 14, 2012 at 7:46 PM, Heikki Linnakangas< > heikki.linnakangas@enterprisedb.com> wrote: > >> It would be quite easy to provide reasonable estimates for those >> operators, if we had a separate histogram of upper bounds. I also note that >> the estimation of overlap selectivity could be implemented using separate >> histograms of lower bounds and upper bounds, without requiring a histogram >> of range lengths, because a&& b == NOT (a<< b OR a>> b). I'm not sure if >> the estimates we'd get that way would be better or worse than your current >> method, but I think it would be easier to understand. >> >> I don't think the&< and&> operators could be implemented in terms of a >> lower and upper bound histogram, though, so you'd still need the current >> length histogram method for that. > > Oh, actually I didn't touch those operators. Selectivity estimation > functions for them were already in the catalog, they didn't work previously > just because no statistics. Yeah, without the histogram, the scalar selectivity estimator sort-of works, in that it returns the estimate just based on the most common values and a constant. > Histogram of upper bounds would be both more > accurate and natural for some operators. However, it requires collecting > additional statistics while AFAICS it doesn't liberate us from having > histogram of range lengths. Hmm, if we collected a histogram of lower bounds and a histogram of upper bounds, that would be roughly the same amount of data as for the "standard" histogram with both bounds in the same histogram. >> The code in that traverses the lower bound and length histograms in >> lockstep looks quite scary. Any ideas on how to simplify that? My first >> thought is that there should be helper function that gets a range length as >> argument, and returns the fraction of tuples with length>= argument. It >> would do the lookup in the length histogram to find the right histogram >> bin, and do the linear interpolation within the bin. You're assuming that >> length is independent of lower/upper bound, so you shouldn't need any other >> parameters than range length for that estimation. >> >> You could then loop through only the lower bounds, and call the helper >> function for each bin to get the fraction of ranges long enough in that >> bin, instead dealing with both histograms in the same loop. I think a >> helper function like that might simplify those scary loops significantly, >> but I wasn't sure if there's some more intelligence in the way you combine >> values from the length and lower bound histograms that you couldn't do with >> such a helper function. > > Yes, I also thought about something like this. But, in order to save > current estimate accuracy, it should be more complicated in following > reasons: > 1) In last version, I don't estimate just fraction of tuples with length>= > argument, but area under length histogram between two length bounds > (length_hist_summ). > 2) In histogram ends up before reaching given length bound we also need to > return place where it happened. Now it is performed by hist_frac *= (length > - prev_dist) / (dist - prev_dist). > I'm going to try some simplification with taking care about both mentioned > aspects. Thanks. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Alexander KorotkovДата:
Сообщение: Re: Statistics and selectivity estimation for ranges
Следующее
От: Alexander KorotkovДата:
Сообщение: Re: Statistics and selectivity estimation for ranges