Re: Improving N-Distinct estimation by ANALYZE

Поиск

Список

Период

Сортировка

От	Simon Riggs
Тема	Re: Improving N-Distinct estimation by ANALYZE
Дата	4 января 2006 г. 20:57:01
Msg-id	1136422621.21025.244.camel@localhost.localdomain обсуждение исходный текст
Ответ на	Re: Improving N-Distinct estimation by ANALYZE (Greg Stark <gsstark@mit.edu>)
Список	pgsql-hackers

Дерево обсуждения

On Wed, 2006-01-04 at 19:22 -0500, Greg Stark wrote:
> I think you're right that a reasonable sample size for this kind of
> estimate
> is going to be proportional to the table size, not the constant sized
> sample
> that regular statistics need. 

Agreed [I said exactly that in April]; the counter argument at that time
was that proportional samples on large tables lead to external sorts in
many cases which would lead to unacceptably long run times - since we
need to sort the values for each attribute in turn.

I've proposed limiting ourselves to maintenance_work_mem (I credit Josh
with that idea from April). If you can allocate 1 GB of memory to an
ANALYZE then this would be a very large proportion of a medium sized
table (or partition).

Considering how few rows we sample at the moment, increasing the actual
sample size by many 000s would have a very beneficial effect.

> On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote:
> > This one looks *really* good. 
> > 
> >  http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf
> > 
> > It does require a single full table scan 

The Distinct Sampling approach you mention would scan the whole table
and also use large in-memory hash tables. So it uses more I/O, the same
memory and probably less CPU - no sorting required. The technique is the
same implementation as a HashAgg, just one that loses rows in a
predictable manner when it spills out of memory. It doesn't identify
columns that scale with N, nor does it calculate correlation.

Thats the same as re-writing Count(Distinct) to use hashing, which is a
TODO item. So perhaps you could plan the code to do the Distinct
Sampling approach at the same time. Hmmm. I'll think about that.

Best Regards, Simon Riggs

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Improving N-Distinct estimation by ANALYZE