Re: ANALYZE sampling is too good

Поиск
Список
Период
Сортировка
От Josh Berkus
Тема Re: ANALYZE sampling is too good
Дата
Msg-id 52A4C266.50000@agliodbs.com
обсуждение исходный текст
Ответ на ANALYZE sampling is too good  (Greg Stark <stark@mit.edu>)
Ответы Re: ANALYZE sampling is too good
Re: ANALYZE sampling is too good
Список pgsql-hackers
On 12/08/2013 10:14 AM, Greg Stark wrote:
> With rows_per_block=1   the MCV frequency list ranges from .0082 to .0123
> With rows_per_block=4   the MCV frequency list ranges from .0063 to .0125
> With rows_per_block=16 the MCV frequency list ranges from .0058 to .0164
> With rows_per_block=64 the MCV frequency list ranges from .0021 to .0213
> 
> I'm not really sure if this is due to the blocky sample combined with
> the skewed pgbench run or not. It doesn't seem to be consistently
> biasing towards or against bid 1 which I believe are the only rows
> that would have been touched by pgbench. Still it's suspicious that
> they seem to be consistently getting less accurate as the blockiness
> increases.

They will certainly do so if you don't apply any statistical adjustments
for selecting more rows from the same pages.

So there's a set of math designed to calculate for the skew introduced
by reading *all* of the rows in each block.  That's what I meant by
"block-based sampling"; you read, say, 400 pages, you compile statistics
on *all* of the rows on those pages, you apply some algorithms to adjust
for groupings of rows based on how grouped they are.  And you have a
pretty good estimate of how grouped they are, because you just looked a
complete sets of rows on a bunch of nonadjacent pages.

Obviously, you need to look at more rows than you would with a
pure-random sample.  Like I said, the 80%+ accurate point in the papers
seemed to be at a 5% sample.  However, since those rows come from the
same pages, the cost of looking at more rows is quite small, compared to
the cost of looking at 64 times as many disk pages.

My ACM subscription has lapsed, though; someone with a current ACM
subscription could search for this; there are several published papers,
with math and pseudocode.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Pavel Stehule
Дата:
Сообщение: Re: Re: [BUGS] BUG #7873: pg_restore --clean tries to drop tables that don't exist
Следующее
От: Heikki Linnakangas
Дата:
Сообщение: Re: ANALYZE sampling is too good