Re: ANALYZE sampling is too good

From Sergey E. Koposov
Subject Re: ANALYZE sampling is too good
Msg-id alpine.LRH.2.00.1312110506150.19468@lnfm1.sai.msu.ru
In reply to Re: ANALYZE sampling is too good  (Peter Geoghegan <pg@heroku.com>)
Responses Re: ANALYZE sampling is too good  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
For what it's worth.

I'll quote the first line of the abstract of Chaudhuri et al. on block 
sampling:
"Block-level sampling is far more efficient than true uniform-random 
sampling over a large database, but prone to significant errors if used 
to create database statistics."

And after briefly glancing through the paper, my opinion is that it works 
because, after building one version of the statistics, they cross-validate, 
see how well it does, and then collect more data if the cross-validation 
error is large (for example, because the data is clustered). Without this 
bit, as far as I can tell, a simple block-based sampler is bound to make 
catastrophic mistakes, depending on the distribution.
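
To make the cross-validation idea concrete, here is a rough sketch in 
Python. It is NOT the algorithm from the paper, only my own toy version of 
"build a histogram from one block sample, check it against a second block 
sample, double the sample if they disagree". All names (make_table, 
adaptive_block_sample, the tol threshold, the block and bucket counts) are 
made up for illustration.

```python
import random
from bisect import bisect_right

def make_table(n_blocks, rows_per_block, clustered, rng):
    """Toy table: a list of blocks, each block a list of values."""
    values = [rng.random() for _ in range(n_blocks * rows_per_block)]
    if clustered:
        values.sort()  # physical order follows value: each block is a narrow range
    return [values[i * rows_per_block:(i + 1) * rows_per_block]
            for i in range(n_blocks)]

def sample_rows(table, n_blocks, rng):
    """Block-level sample: pick whole blocks and keep every row in them."""
    rows = []
    for block in rng.sample(table, n_blocks):
        rows.extend(block)
    return rows

def equi_depth_bounds(rows, n_buckets):
    """Bucket boundaries of an equi-depth histogram built from a sample."""
    rows = sorted(rows)
    return [rows[len(rows) * k // n_buckets] for k in range(1, n_buckets)]

def validation_error(bounds, rows):
    """Largest deviation of a bucket's share of rows from the ideal 1/n_buckets."""
    n_buckets = len(bounds) + 1
    counts = [0] * n_buckets
    for v in rows:
        counts[bisect_right(bounds, v)] += 1
    return max(abs(c / len(rows) - 1 / n_buckets) for c in counts)

def adaptive_block_sample(table, start_blocks=4, n_buckets=10, tol=0.05, seed=0):
    """Return how many blocks were needed before the histogram validated.

    Hypothetical stopping rule: double the sample until a second,
    independently drawn block sample agrees with the histogram to within tol.
    """
    rng = random.Random(seed)
    n = start_blocks
    while True:
        if 2 * n >= len(table):
            return len(table)  # sampling did not converge: read the whole table
        bounds = equi_depth_bounds(sample_rows(table, n, rng), n_buckets)
        err = validation_error(bounds, sample_rows(table, n, rng))
        if err <= tol:
            return n
        n *= 2

rng = random.Random(1)
shuffled = make_table(1024, 100, clustered=False, rng=rng)
clustered = make_table(1024, 100, clustered=True, rng=rng)
print("blocks needed, shuffled table: ", adaptive_block_sample(shuffled))
print("blocks needed, clustered table:", adaptive_block_sample(clustered))
```

On the shuffled table a handful of blocks should validate, while on the 
clustered table the check keeps failing and the sample keeps growing. A 
sampler that stops after the first fixed-size block sample has no way of 
noticing the difference.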

Also, just another point about targets (e.g. X%) for estimating stuff from 
the samples (as was discussed in the thread). Basically, there is a point 
in talking about sampling a fixed target (5%) of the data ONLY if you fix 
the actual distribution of your data in the table and decide what 
statistic you are trying to find, e.g. average, std. dev., a 90th 
percentile, ndistinct or a histogram and so forth. There won't be a 
general answer, as the percentages will be distribution dependent and 
statistic dependent.
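
A small illustration of that point, again as a toy sketch under my own 
assumptions: the same 5% row sample is scored on two statistics over two 
made-up data sets. The ndistinct here is the naive count of distinct values 
in the sample with no estimator correction, so it is not what ANALYZE 
actually computes; the function names and data sets are hypothetical.

```python
import random
import statistics

def ndistinct(values):
    """Naive distinct count: no scaling from the sample up to the table."""
    return len(set(values))

def relative_error(data, statistic, fraction, rng):
    """Relative error of a statistic computed on a uniform row sample."""
    sample = rng.sample(data, int(len(data) * fraction))
    truth = statistic(data)
    return abs(statistic(sample) - truth) / abs(truth)

rng = random.Random(42)
n = 100_000
few_distinct = [rng.randrange(100) for _ in range(n)]   # 100 distinct values
many_distinct = [rng.randrange(n) for _ in range(n)]    # mostly rare values

for name, data in [("few distinct", few_distinct),
                   ("many distinct", many_distinct)]:
    for stat in (statistics.fmean, ndistinct):
        err = relative_error(data, stat, 0.05, rng)
        print(f"{name:14s} {stat.__name__:10s} relative error {err:.3f}")
```

The mean should come out fine in both cases, the naive ndistinct is exact 
when there are only 100 values and badly off when most values are rare. So 
"5% is enough" is true or false depending on which column of that output 
you care about.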

Cheers,    Sergey

PS I'm not a statistician, but I use statistics a lot

*******************************************************************
Sergey E. Koposov, PhD, Research Associate
Institute of Astronomy, University of Cambridge
Madingley road, CB3 0HA, Cambridge, UK
Tel: +44-1223-337-551 Web: http://www.ast.cam.ac.uk/~koposov/


