Re: ANALYZE sampling is too good

From Josh Berkus
Subject Re: ANALYZE sampling is too good
Date
Msg-id 52AA06EA.1030508@agliodbs.com
In reply to ANALYZE sampling is too good  (Greg Stark <stark@mit.edu>)
Responses Re: ANALYZE sampling is too good
List pgsql-hackers
On 12/12/2013 10:33 AM, Claudio Freire wrote:
> Well, why not take a supersample containing all visible tuples from N
> selected blocks, and do bootstrapping over it, with subsamples of M
> independent rows each?
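
For concreteness, my reading of that suggestion as a rough sketch --
everything in it is illustrative, including the choice of a plain
distinct count as the per-subsample statistic:

import random

def bootstrap_over_supersample(supersample, m, iterations=100):
    # supersample: all visible values pulled from the N sampled blocks.
    # Draw repeated subsamples of m rows with replacement and average
    # the statistic across them.
    estimates = []
    for _ in range(iterations):
        subsample = [random.choice(supersample) for _ in range(m)]
        estimates.append(len(set(subsample)))
    return sum(estimates) / len(estimates)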

Well, we still need to look at each individual block to determine
grouping correlation.  Let's take a worst-case example: imagine a table
that has *just* been created by:

CREATE TABLE newdata AS SELECT * FROM olddata ORDER BY category, item;

If "category" is fairly low cardinality, then grouping will be severe;
we can reasonably expect that if we sample 100 blocks, many of them will
have only one category value present.  The answer to this is to make our
block samples fairly widely spaced and compare them.
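
A minimal sketch of that spacing -- the stride arithmetic and the
starting offset are mine, chosen to match the block numbers below:

def spaced_block_sample(total_blocks, n_samples):
    # e.g. 1000 blocks, 10 samples -> blocks 1, 101, 201, ..., 901
    stride = total_blocks // n_samples
    return [i * stride + 1 for i in range(n_samples)]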

In this simplified example, if the table had 1000 blocks, we would take
blocks 1,101,201,301,401,etc.  Then we would compare the number and
content of values found on each block with the number and content found
on each other block.  For example, if we see that block 101 is entirely
the category "cats", block 701 is entirely the category "shopping", and
block 901 is split 60/40 between the categories "transportation" and
"voting", then we can assume that the level of grouping is very high,
and that the number of values we haven't seen is also high.

Whereas if 101 is "cats" and 201 is "cats" and 301 through 501 are
"cats" with 2% other stuff, then we assume that the level of grouping
is moderate and that most of the dataset is simply "cats", which means
that the number of values we haven't seen is low.

Whereas if 101, 201, 501, and 901 have near-identical distributions of
values, we assume that the level of grouping is very low, and that there
are very few values we haven't seen.
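
One crude way to reduce those comparisons to a single number -- total
variation distance between the per-block value distributions is my
choice here, purely illustrative:

from collections import Counter
from itertools import combinations

def grouping_level(block_values):
    # block_values: one list of sampled values per block.
    # Pairwise distance is 0 for identical distributions (no grouping)
    # and 1 for disjoint ones (complete grouping).
    def dist(a, b):
        fa = {v: c / len(a) for v, c in Counter(a).items()}
        fb = {v: c / len(b) for v, c in Counter(b).items()}
        keys = set(fa) | set(fb)
        return sum(abs(fa.get(v, 0.0) - fb.get(v, 0.0)) for v in keys) / 2
    dists = [dist(a, b) for a, b in combinations(block_values, 2)]
    return sum(dists) / len(dists)  # 0 = random, 1 = fully grouped

On the first scenario above this averages out near 1.0; on the
mostly-"cats" scenario it lands near 0.02; near-identical blocks give
roughly 0.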

As someone else pointed out, full-block sampling (the proposal) vs.
random-row sampling (our current style) doesn't have a very significant
effect on estimates of histograms and nullfrac, as long as the sampled
blocks are widely spaced.  Granted, nullfrac is affected in the extreme
example of a totally ordered table where the nulls are all in one
block, but I'll point out that we can (and do) also miss that with our
current algorithm.

Estimated grouping should, however, affect MCVs.  In cases where we
estimate that grouping levels are high, the frequency we assume for
each observed value should be "discounted" somehow.  That is, with a
totally random distribution you have a 1:1 ratio between the observed
frequency of a value and its assumed frequency, but with highly
grouped values you might have a 2:1 ratio.
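
To pin down what such a discount might look like -- the linear
interpolation here is mine, and is exactly the sort of thing that
needs real statistical backing:

def discounted_frequency(observed_freq, grouping):
    # grouping in [0, 1], e.g. from comparing block contents as above.
    # Interpolates the observed:assumed ratio from 1:1 (random) to
    # 2:1 (fully grouped), per the figures above.
    return observed_freq / (1.0 + grouping)

So a value seen in 30% of sampled rows on a heavily grouped table
would go into the MCV list at more like 15%.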

Again, more math (backed by statistical analysis) is needed.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


