Re: benchmarking the query planner

From: Simon Riggs
Subject: Re: benchmarking the query planner
Date:
Msg-id: 1229093975.13078.275.camel@ebony.2ndQuadrant
In response to: Re: benchmarking the query planner  ("Robert Haas" <robertmhaas@gmail.com>)
Responses: Re: benchmarking the query planner  (Gregory Stark <stark@enterprisedb.com>)
List: pgsql-hackers
On Fri, 2008-12-12 at 06:44 -0500, Robert Haas wrote:
> On Fri, Dec 12, 2008 at 4:04 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> >> The existing sampling mechanism is tied to solid statistics. It
> >> provides the correct sample size to get a consistent confidence range
> >> for range queries. This is the same mathematics which governs election
> >> polling and other surveys. The sample size you need to get +/- 5% 19
> >> times out of 20 increases as the population increases, but not by very
> >> much.
> >
> > Sounds great, but it's not true. The sample size is not linked to data
> > volume, so how can it possibly give a consistent confidence range?
> 
> I'm not 100% sure how relevant it is to this case, but I think what
> Greg is referring to is:
> 
> http://en.wikipedia.org/wiki/Margin_of_error#Effect_of_population_size
> 
> It is a pretty well-known mathematical fact that for something like an
> opinion poll your margin of error does not depend on the size of the
> population but only on the size of your sample.

Yes, I agree with that, *but* it refers to estimating population
proportions and has nothing at all to do with estimating ndistinct,
which is what we were discussing. You can't just switch topics and have
the statement remain true.

ndistinct estimation improves with larger sample sizes; that's what the
maths says, and it's what experimentation shows as well.
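
To put numbers on both points, here's a quick Python sketch (my own toy,
nothing to do with the real analyze.c code): a fixed 3,000-row sample
pins down a column's boolean fraction equally well at 1 million or 100
million rows, yet the same sample sees only a sliver of the larger
table's distinct values.

import random

random.seed(1)
SAMPLE = 3000   # 300 * the default stats_target of 10

for n_rows in (1_000_000, 100_000_000):
    ndistinct = n_rows // 100                   # every key has ~100 duplicates
    idx = random.sample(range(n_rows), SAMPLE)  # sampled row numbers
    keys  = [i % ndistinct for i in idx]        # synthetic key column
    flags = [i % 10 < 3 for i in idx]           # boolean column, true for 30%

    prop = sum(flags) / SAMPLE
    seen = len(set(keys))
    print(f"{n_rows:>11} rows: fraction estimate {prop:.3f} (true 0.300); "
          f"distinct keys in sample: {seen} of {ndistinct}")

On a typical run both fraction estimates land within about +/- 0.02 of
the truth, just as the polling maths promises, while the distinct-value
count tells you next to nothing at the 100 million row scale.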

Note that the estimator we use was shown to be stable for sampling
fractions in the range of 5-20% of the table.
http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf
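
For reference, the estimator in question is Duj1 from that paper, which
as far as I remember is the one analyze.c implements:

    D = n*d / (n - f1 + f1*n/N)

where n is the sample size, N the table size, d the number of distinct
values in the sample, and f1 the number of values seen exactly once. A
toy harness (again mine, not the real code) shows it well behaved at a
5% sample and swinging wildly at 0.003%:

import random
from collections import Counter

def duj1(sample, table_rows):
    # Duj1 estimator of ndistinct (Haas & Stokes).
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)                                 # distinct values in sample
    f1 = sum(1 for c in counts.values() if c == 1)  # values seen exactly once
    return n * d / (n - f1 + f1 * n / table_rows)

random.seed(7)
for table_rows, n in ((1_000_000, 50_000),     # 5% sample
                      (100_000_000, 3_000)):   # 0.003% sample
    true_d = table_rows // 100                 # every value ~100 duplicates
    estimates = []
    for _ in range(20):
        idx = random.sample(range(table_rows), n)
        estimates.append(duj1([i % true_d for i in idx], table_rows))
    print(f"N={table_rows:>11}, sample n={n:>6}: estimates "
          f"{min(estimates):>12,.0f} .. {max(estimates):>12,.0f} "
          f"(true {true_d:,})")

On my toy data the 5% runs stay within a percent or so of the truth,
while the 0.003% runs scatter across a factor of several, which squares
with the 5-20% stability range above.
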
We currently use a sample size of 300 * stats_target rows. With the
default of 10, that's a 3,000-row sample: 0.3% of a 1 million row table
and just 0.003% of a 100 million row table (and such tables are common,
I find).
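
Spelling the arithmetic out:

stats_target = 10                  # the default
sample_rows = 300 * stats_target   # = 3000 rows sampled by ANALYZE
for table_rows in (1_000_000, 100_000_000):
    print(f"{table_rows:>11} rows -> sample fraction "
          f"{sample_rows / table_rows:.4%}")
# prints 0.3000% and 0.0030% respectively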

That size of sample is OK for some kinds of statistics, but not much
good for ndistinct estimation.

These issues only show up in the field; they never show up on optimizer
test platforms, because those are typically several orders of magnitude
too small. (Conversely, the stats system works very well indeed for
smaller tables... so I'm not trying to debunk everything.)

-- 
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


