Re: Collect frequency statistics for arrays

Поиск

Список

Период

Сортировка

От	Alexander Korotkov
Тема	Re: Collect frequency statistics for arrays
Дата	12 марта 2012 г. 08:07:19
Msg-id	CAPpHfdsoBcKokndg76RTKwo-Az8eoxdzVyLSKK5oGdhuZWopkQ@mail.gmail.com обсуждение исходный текст
Ответ на	Re: Collect frequency statistics for arrays (Tom Lane <tgl@sss.pgh.pa.us>)
Список	pgsql-hackers

Дерево обсуждения

On Thu, Mar 8, 2012 at 4:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Korotkov <aekorotkov@gmail.com> writes:
> True. If (max count - min count + 1) is small, enumerating of frequencies
> is both more compact and more precise representation. Simultaneously,
> if (max count - min count + 1) is large, we can run out of
> statistics_target with such representation. We can use same representation
> of count distribution as for scalar column value: MCV and HISTOGRAM, but it
> would require additional statkind and statistics slot. Probably, you've
> better ideas?

I wasn't thinking of introducing two different representations,
but just trimming the histogram length when it's larger than necessary.

On reflection my idea above is wrong; for example assume that we have a
column with 900 arrays of length 1 and 100 arrays of length 2. Going by
what I said, we'd reduce the histogram to {1,2}, which might accurately
capture the set of lengths present but fails to show that 1 is much more
common than 2. However, a histogram {1,1,1,1,1,1,1,1,1,2} (ten entries)
would capture the situation perfectly in one-tenth the space that the
current logic does.

More generally, by limiting the histogram to statistics_target entries,
we are already accepting errors of up to 1/(2*statistics_target) in the
accuracy of the bin-boundary representation. What the above example
shows is that sometimes we could meet the same accuracy requirement with
fewer entries. I'm not sure how this could be mechanized but it seems
worth thinking about.

I can propose following representation of histogram.

If (max_count - min_count + 1) <= statistics_target then

1) store max_count and min_count in stavalues

2) store frequencies from min_count ot max_count in numvalues

If (max_count - min_count + 1) > statistics_target then

store histogram in current manner. I think in this case it's unlikely to be many repeating values.

I can propose patch which change histogram representation to this.

Comments?

------
With best regards,
Alexander Korotkov.

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Etsuro Fujita
Дата: 12 марта 2012 г., 08:05:25
Сообщение: Corrected: Re: pgsql_fdw, FDW for PostgreSQL server

Следующее

От: Simon Riggs
Дата: 12 марта 2012 г., 09:02:39
Сообщение: Re: elegant and effective way for running jobs inside a database

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Collect frequency statistics for arrays

Предыдущее

Следующее