Re: Statistics and selectivity estimation for ranges

Поиск

Список

Период

Сортировка

От	Heikki Linnakangas
Тема	Re: Statistics and selectivity estimation for ranges
Дата	15 августа 2012 г. 08:14:59
Msg-id	502B5A76.7000205@enterprisedb.com обсуждение исходный текст
Ответ на	Re: Statistics and selectivity estimation for ranges (Alexander Korotkov <aekorotkov@gmail.com>)
Ответы	Re: Statistics and selectivity estimation for ranges
Список	pgsql-hackers

Дерево обсуждения

On 15.08.2012 10:38, Alexander Korotkov wrote:
> On Tue, Aug 14, 2012 at 7:46 PM, Heikki Linnakangas<
> heikki.linnakangas@enterprisedb.com>  wrote:
>
>> It would be quite easy to provide reasonable estimates for those
>> operators, if we had a separate histogram of upper bounds. I also note that
>> the estimation of overlap selectivity could be implemented using separate
>> histograms of lower bounds and upper bounds, without requiring a histogram
>> of range lengths, because a&&  b == NOT (a<<  b OR a>>  b). I'm not sure if
>> the estimates we'd get that way would be better or worse than your current
>> method, but I think it would be easier to understand.
>>
>> I don't think the&<  and&>  operators could be implemented in terms of a
>> lower and upper bound histogram, though, so you'd still need the current
>> length histogram method for that.
>
> Oh, actually I didn't touch those operators. Selectivity estimation
> functions for them were already in the catalog, they didn't work previously
> just because no statistics.

Yeah, without the histogram, the scalar selectivity estimator sort-of 
works, in that it returns the estimate just based on the most common 
values and a constant.

> Histogram of upper bounds would be both more
> accurate and natural for some operators. However, it requires collecting
> additional statistics while AFAICS it doesn't liberate us from having
> histogram of range lengths.

Hmm, if we collected a histogram of lower bounds and a histogram of 
upper bounds, that would be roughly the same amount of data as for the 
"standard" histogram with both bounds in the same histogram.

>> The code in that traverses the lower bound and length histograms in
>> lockstep looks quite scary. Any ideas on how to simplify that? My first
>> thought is that there should be helper function that gets a range length as
>> argument, and returns the fraction of tuples with length>= argument. It
>> would do the lookup in the length histogram to find the right histogram
>> bin, and do the linear interpolation within the bin. You're assuming that
>> length is independent of lower/upper bound, so you shouldn't need any other
>> parameters than range length for that estimation.
>>
>> You could then loop through only the lower bounds, and call the helper
>> function for each bin to get the fraction of ranges long enough in that
>> bin, instead dealing with both histograms in the same loop. I think a
>> helper function like that might simplify those scary loops significantly,
>> but I wasn't sure if there's some more intelligence in the way you combine
>> values from the length and lower bound histograms that you couldn't do with
>> such a helper function.
>
> Yes, I also thought about something like this. But, in order to save
> current estimate accuracy, it should be more complicated in following
> reasons:
> 1) In last version, I don't estimate just fraction of tuples with length>=
> argument, but area under length histogram between two length bounds
> (length_hist_summ).
> 2) In histogram ends up before reaching given length bound we also need to
> return place where it happened. Now it is performed by hist_frac *= (length
> - prev_dist) / (dist - prev_dist).
> I'm going to try some simplification with taking care about both mentioned
> aspects.

Thanks.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Statistics and selectivity estimation for ranges