Re: Improving N-Distinct estimation by ANALYZE

Поиск
Список
Период
Сортировка
От Greg Stark
Тема Re: Improving N-Distinct estimation by ANALYZE
Дата
Msg-id 87wthf4j82.fsf@stark.xeocode.com
обсуждение исходный текст
Ответ на Re: Improving N-Distinct estimation by ANALYZE  (Josh Berkus <josh@agliodbs.com>)
Ответы Re: Improving N-Distinct estimation by ANALYZE  (Simon Riggs <simon@2ndquadrant.com>)
Список pgsql-hackers
Josh Berkus <josh@agliodbs.com> writes:

> Tom,
> 
> > In general, estimating n-distinct from a sample is just plain a hard
> > problem, and it's probably foolish to suppose we'll ever be able to
> > do it robustly.  What we need is to minimize the impact when we get
> > it wrong.  
> 
> Well, I think it's pretty well proven that to be accurate at all you need 
> to be able to sample at least 5%, even if some users choose to sample 
> less.   Also I don't think anyone on this list disputes that the current 
> algorithm is very inaccurate for large tables.  Or do they?

I think it's worse than that. It's proven that to get any kind of accuracy in
this kind of estimate you basically have to look at the entire table.

Someone posted a URL to a paper that had a very clever way of estimating
distinct values. It required scanning the entire table but only kept a sample
of the values found using a method that guaranteed the sample was
representative not of the entire table but of the distinct values.

I think you're right that a reasonable sample size for this kind of estimate
is going to be proportional to the table size, not the constant sized sample
that regular statistics need. However that same paper has some preliminary
verbiage about how previous papers had found that the sample sizes needed were
unacceptably large. Something like 90%.

Unfortunately I can't find the reference to the paper this moment. I'll look
again later.

I think it would be interesting to get the algorithm for sampling from that
paper in. It would only be able to be used on a VACUUM ANALYZE but currently
VACUUMs are necessary anyways. I do wonder what would happen when we get the
incremental VACUUMs and the bitmaps to avoid vacuuming pages unnecessarily
though. Then it would be less useful.

-- 
greg



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Simon Riggs
Дата:
Сообщение: Re: Improving N-Distinct estimation by ANALYZE
Следующее
От: Simon Riggs
Дата:
Сообщение: Re: Improving N-Distinct estimation by ANALYZE