Re: ANALYZE sampling is too good
From: Josh Berkus
Subject: Re: ANALYZE sampling is too good
Msg-id: 52AA06EA.1030508@agliodbs.com
In reply to: ANALYZE sampling is too good (Greg Stark <stark@mit.edu>)
Responses: Re: ANALYZE sampling is too good
List: pgsql-hackers
On 12/12/2013 10:33 AM, Claudio Freire wrote:
> Well, why not take a supersample containing all visible tuples from N
> selected blocks, and do bootstrapping over it, with subsamples of M
> independent rows each?

Well, we still need to look at each individual block to determine grouping correlation.

Let's take a worst-case example: imagine a table that has *just* been created by:

CREATE TABLE newdata AS SELECT * FROM olddata ORDER BY category, item;

If "category" is of fairly low cardinality, then grouping will be severe; we can reasonably expect that if we sample 100 blocks, many of them will contain only one category value.

The answer to this is to make our block samples fairly widely spaced and compare them. In this simplified example, if the table had 1000 blocks, we would take blocks 1, 101, 201, 301, 401, etc. Then we would compare the number and content of the values found on each block with the number and content found on every other sampled block.

For example, if we see that block 101 is entirely the category "cats", block 701 is entirely the category "shopping", and block 901 is split 60/40 between the categories "transportation" and "voting", then we can assume that the level of grouping is very high, and the number of unknown values we haven't seen is also high.

Whereas if 101 is "cats", 201 is "cats", and 301 through 501 are "cats" with 2% other stuff, then we assume that the level of grouping is moderate and it's just the case that most of the dataset is "cats". Which means that the number of unknown values we haven't seen is low.

Whereas if 101, 201, 501, and 901 have near-identical distributions of values, we assume that the level of grouping is very low, and that there are very few values we haven't seen.

As someone else pointed out, full-block sampling (the proposal) vs. random-row sampling (our current style) doesn't have a very significant effect on estimates of histograms and nullfrac, as long as the sampled blocks are widely spaced.
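The pairwise comparison above can be sketched in a few lines of Python. This is purely illustrative (function names and the similarity metric are mine, not anything in PostgreSQL's ANALYZE code): each sampled block is a list of values, cross-block similarity is measured as multiset overlap, and a low average similarity signals heavy grouping.

```python
# Illustrative sketch of comparing widely spaced block samples to
# estimate grouping. Not PostgreSQL code; names and metric are made up.
from collections import Counter
from itertools import combinations

def block_similarity(block_a, block_b):
    """Multiset overlap between two blocks' values, 0.0..1.0."""
    ca, cb = Counter(block_a), Counter(block_b)
    overlap = sum((ca & cb).values())  # shared rows, counting duplicates
    return overlap / max(len(block_a), len(block_b))

def grouping_score(blocks):
    """Average pairwise similarity across sampled blocks.
    Near 0.0 -> highly grouped data (many unseen values likely);
    near 1.0 -> well-mixed data (few unseen values)."""
    pairs = list(combinations(blocks, 2))
    if not pairs:
        return 1.0
    return sum(block_similarity(a, b) for a, b in pairs) / len(pairs)

# The "high grouping" case from the example: block 101 all "cats",
# block 701 all "shopping", block 901 split 60/40.
clustered = [["cats"] * 10,
             ["shopping"] * 10,
             ["transportation"] * 6 + ["voting"] * 4]

# The "mostly cats" case: every block is cats with 2% other stuff.
mixed = [["cats"] * 49 + ["other"] * 1 for _ in range(3)]

print(grouping_score(clustered))  # 0.0 -> grouping very high
print(grouping_score(mixed))      # 1.0 -> grouping very low
```

The metric deliberately ignores row order within a block; all it asks is whether widely spaced blocks see the same mix of values, which is exactly the signal described above.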
Well, nullfrac is affected in the extreme example of a totally ordered table where the nulls are all in one block, but I'll point out that we can (and do) also miss that using our current algorithm.

Estimated grouping should, however, affect MCVs. In cases where we estimate that grouping levels are high, the expected % of observed values should be "discounted" somehow. That is, with totally random distribution you have a 1:1 ratio between the observed frequency of a value and its assumed frequency; with highly grouped values, you might have a 2:1 ratio. Again, more math (backed by statistical analysis) is needed.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
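The discounting idea above reduces to simple arithmetic; here is a minimal sketch, with the caveat that the ratios (1:1 for random, 2:1 for highly grouped) are the placeholder figures from the message, not a worked-out statistical model, and the function name is mine.

```python
# Illustrative only: discount an MCV's observed sample frequency by an
# estimated grouping ratio. The ratios are placeholders, not real math.
def assumed_frequency(observed_freq, grouping_ratio):
    """grouping_ratio = 1.0 for randomly distributed data (1:1),
    2.0 for the highly grouped 2:1 case mentioned above."""
    return observed_freq / grouping_ratio

print(assumed_frequency(0.40, 1.0))  # random: keep 0.40
print(assumed_frequency(0.40, 2.0))  # grouped: discount to 0.20
```

The point is only the shape of the correction: the more grouped the sample looks, the less an observed frequency should be trusted as a population frequency.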