Re: Statistics and selectivity estimation for ranges

From: Alexander Korotkov
Subject: Re: Statistics and selectivity estimation for ranges
Date:
Msg-id: CAPpHfdsKfL_ZAxp4H-6Z3MUy2engekFB9DE7_Yrob9X8rXzxKA@mail.gmail.com
In reply to: Re: Statistics and selectivity estimation for ranges  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses: Re: Statistics and selectivity estimation for ranges
List: pgsql-hackers
On Tue, Aug 14, 2012 at 7:46 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
On 14.08.2012 09:45, Alexander Korotkov wrote:
After fixing a few more bugs, I have a version with much more reasonable
accuracy.

Great! One little thing just occurred to me:

You're relying on the regular scalar selectivity estimators for the <<, >>, &< and &> operators. That seems bogus, in particular for << and &<, because ineq_histogram_selectivity then performs a binary search of the histogram using those operators. << and &< compare the *upper* bound of the value in the table against the lower bound of the constant, but the histogram is constructed using the regular < operator, which sorts the entries by lower bound. I think the estimates you now get for those operators are quite bogus if there is a non-trivial amount of overlap between ranges. For example:

postgres=# create table range_test as
  select int4range(-a, a) as r from generate_series(1,1000000) a;
SELECT 1000000
postgres=# analyze range_test;
ANALYZE
postgres=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r << int4range(200000, 200001);
                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on range_test  (cost=0.00..17906.00 rows=100 width=14) (actual time=0.060..1340.147 rows=200000 loops=1)
   Filter: (r << '[200000,200001)'::int4range)
   Rows Removed by Filter: 800000
 Total runtime: 1371.865 ms
(4 rows)

It would be quite easy to provide reasonable estimates for those operators, if we had a separate histogram of upper bounds. I also note that the estimation of overlap selectivity could be implemented using separate histograms of lower bounds and upper bounds, without requiring a histogram of range lengths, because a && b == NOT (a << b OR a >> b). I'm not sure if the estimates we'd get that way would be better or worse than your current method, but I think it would be easier to understand.
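[Editor's note: the identity a && b == NOT (a << b OR a >> b) can be turned into an estimator once both bound histograms exist. The sketch below is illustrative Python, not the patch code; the equi-depth histogram representation and the helper names are assumptions. Here "<<" means the range's upper bound falls before the constant's lower bound, and ">>" means its lower bound falls after the constant's upper bound.]

```python
import bisect

def hist_fraction_less_than(hist, value):
    """Fraction of sampled values < value. `hist` is a sorted list of
    histogram bound values with equal row counts between consecutive
    bounds (a PostgreSQL-style equi-depth histogram). Linear
    interpolation is applied inside the containing bin."""
    if value <= hist[0]:
        return 0.0
    if value >= hist[-1]:
        return 1.0
    i = bisect.bisect_right(hist, value) - 1      # bin [hist[i], hist[i+1])
    bin_frac = (value - hist[i]) / (hist[i + 1] - hist[i])
    return (i + bin_frac) / (len(hist) - 1)

def overlap_selectivity(lower_hist, upper_hist, const_lower, const_upper):
    """P(a && const) = 1 - P(a << const) - P(a >> const).
    a << const: a's upper bound is below const_lower (use upper_hist).
    a >> const: a's lower bound is above const_upper (use lower_hist)."""
    frac_left = hist_fraction_less_than(upper_hist, const_lower)
    frac_right = 1.0 - hist_fraction_less_than(lower_hist, const_upper)
    return max(0.0, 1.0 - frac_left - frac_right)
```

A constant range lying entirely to the right of every sampled upper bound yields selectivity 0, and one spanning all sampled bounds yields 1, as expected.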

I don't think the &< and &> operators could be implemented in terms of a lower and upper bound histogram, though, so you'd still need the current length histogram method for that.

Oh, actually I didn't touch those operators. Selectivity estimation functions for them were already in the catalog; they just didn't work previously because there were no statistics. A histogram of upper bounds would be both more accurate and more natural for some operators. However, it requires collecting additional statistics, while AFAICS it doesn't free us from needing the histogram of range lengths.
 
The code that traverses the lower bound and length histograms in lockstep looks quite scary. Any ideas on how to simplify it? My first thought is that there should be a helper function that takes a range length as an argument and returns the fraction of tuples with length >= argument. It would do the lookup in the length histogram to find the right histogram bin, and do the linear interpolation within the bin. You're assuming that length is independent of the lower/upper bound, so you shouldn't need any parameters other than the range length for that estimation.

You could then loop through only the lower bounds, and call the helper function for each bin to get the fraction of ranges long enough in that bin, instead of dealing with both histograms in the same loop. I think a helper function like that might simplify those scary loops significantly, but I wasn't sure if there's some more intelligence in the way you combine values from the length and lower bound histograms that couldn't be expressed with such a helper function.
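[Editor's note: the helper Heikki proposes might be sketched as follows. This is illustrative Python, not patch code; the equi-depth length histogram and the per-bin midpoint approximation in the caller are assumptions made for the sketch.]

```python
import bisect

def frac_length_at_least(length_hist, length):
    """Fraction of sampled ranges whose length >= `length`, from an
    equi-depth histogram of lengths, interpolating linearly inside
    the bin that `length` falls into."""
    if length <= length_hist[0]:
        return 1.0
    if length >= length_hist[-1]:
        return 0.0
    i = bisect.bisect_right(length_hist, length) - 1
    within = (length - length_hist[i]) / (length_hist[i + 1] - length_hist[i])
    frac_below = (i + within) / (len(length_hist) - 1)
    return 1.0 - frac_below

def contains_point_selectivity(lower_hist, length_hist, point):
    """Loop over the lower-bound histogram alone, asking the helper,
    for each bin, what fraction of ranges starting there are long
    enough to reach past `point` (midpoint used as a crude per-bin
    representative)."""
    nbins = len(lower_hist) - 1
    total = 0.0
    for i in range(nbins):
        if lower_hist[i] > point:
            break  # ranges in this bin start entirely after the point
        mid = (lower_hist[i] + lower_hist[i + 1]) / 2.0
        total += frac_length_at_least(length_hist, point - mid) / nbins
    return min(total, 1.0)
```

The point is structural: the length-histogram lookup is isolated in one place, so the outer loop walks a single histogram.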

Yes, I also thought about something like this. But, in order to preserve the current estimate accuracy, it would have to be more complicated, for the following reasons:
1) In the last version, I don't estimate just the fraction of tuples with length >= argument, but the area under the length histogram between two length bounds (length_hist_summ).
2) If the histogram ends before reaching the given length bound, we also need to return the place where that happened. Currently this is done by hist_frac *= (length - prev_dist) / (dist - prev_dist).
I'm going to try some simplification that takes care of both of these aspects.
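[Editor's note: the within-bin adjustment in point 2 is plain linear interpolation over the bin [prev_dist, dist). A toy version of that factor, for illustration only:]

```python
def partial_bin_fraction(length, prev_dist, dist):
    """Fraction of the length-histogram bin [prev_dist, dist) lying
    below `length`; used to scale hist_frac when the requested length
    cuts off inside the bin rather than at a bin boundary."""
    if dist == prev_dist:
        return 1.0  # degenerate (zero-width) bin
    return min(max((length - prev_dist) / (dist - prev_dist), 0.0), 1.0)
```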

------
With best regards,
Alexander Korotkov.
