Re: ANALYZE sampling is too good

From: Florian Pflug
Subject: Re: ANALYZE sampling is too good
Msg-id: 34A68E9C-6716-4A01-BFCF-2815F0B738D8@phlo.org
In reply to: Re: ANALYZE sampling is too good (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: ANALYZE sampling is too good
List: pgsql-hackers
Here's an analysis of Jeff Janes' simple example of a table where our n_distinct estimate is way off.

On Dec 11, 2013, at 00:03, Jeff Janes <jeff.janes@gmail.com> wrote:
> create table baz as select floor(random()*10000000), md5(random()::text) from generate_series(1,100000000);
> create table baz2 as select * from baz order by floor;
> create table baz3 as select * from baz order by md5(floor::text);
>
> baz unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation.
>
> After analyzing all of them:
>
> select tablename, n_distinct, correlation from pg_stats where tablename like 'baz%' and attname='floor';
>  tablename | n_distinct  | correlation
> -----------+-------------+-------------
>  baz       | 8.56006e+06 |  0.00497713
>  baz2      |      333774 |           1
>  baz3      |      361048 |  -0.0118147

I think I understand what's going on here. I worked with a reduced test case of 1e7 rows containing random values between 0 and 5e5, and instrumented ANALYZE to print the raw ndistinct and nmultiple values of the sample population (i.e. the number of distinct values in the sample, and the number of distinct values which appeared more than once). I've also considered only baz and baz2, and thus removed the then-unnecessary md5 column. To account for the reduced table size, I adjusted default_statistics_target to 10 instead of 100. The resulting estimates are

 tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
 baz       |           391685 |           500000
 baz2      |            36001 |           500000

ANALYZE assumes that both tables contain 10000048 rows and samples 3000 of those. The sample of baz contains 2989 distinct values, 11 of which appear more than once. The sample of baz2 contains 2878 distinct values, 117 (!) of which appear more than once.

The very different results stem from the Duj1 estimator we use. It estimates n_distinct by computing

  n*d / (n - f1 + f1*n/N)

where n is the number of samples, N the number of rows, d the number of distinct samples, and f1 the number of distinct samples occurring exactly once. If all samples are unique (i.e. n = d = f1), this yields N. But if f1 is less than d, the result drops very quickly - sampling baz2 produces 117 non-unique values out of 2878 distinct ones, roughly 4%, yet the estimate is already less than a tenth of what it would be if f1 were 0.
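Plugging those sample counts into the formula reproduces the estimates almost exactly. Below is a small standalone sketch (plain C, not ANALYZE's code; it assumes f1 = ndistinct - nmultiple, i.e. that the instrumented counts translate into f1 that way):

#include <stdio.h>

/*
 * Duj1 estimate, using the formula quoted above:
 *   n*d / (n - f1 + f1*n/N)
 */
static double
duj1(double n, double N, double d, double f1)
{
    return n * d / (n - f1 + f1 * n / N);
}

int
main(void)
{
    double n = 3000;      /* rows sampled */
    double N = 10000048;  /* estimated total row count */

    /* baz: 2989 distinct values, 11 seen more than once => f1 = 2989 - 11 */
    printf("baz   %.0f\n", duj1(n, N, 2989, 2989 - 11));

    /* baz2: 2878 distinct values, 117 seen more than once => f1 = 2878 - 117 */
    printf("baz2  %.0f\n", duj1(n, N, 2878, 2878 - 117));

    return 0;
}

This prints roughly 391686 and 36001, i.e. it reproduces the estimates above up to rounding.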
Which leaves us with the question why sampling baz2 produces more duplicate values than sampling baz does.

Now, if we used block sampling, that behaviour would be unsurprising - baz2 is sorted, so each block very likely contains each of its values more than once, since the row count exceeds the number of possible values by more than an order of magnitude. In fact, with block sampling, we'd probably see f1 values close to 0 and thus our estimate of n_distinct would be roughly equal to the number of distinct values in the *sample* population, i.e. about 300 or so.

So why does Vitter's algorithm fail here? Given that we see inflated numbers of duplicate values in our sample, yet still far fewer than block-based sampling would yield, my guess is that it's the initial reservoir filling that bites us here. After that initial filling step, the reservoir contains a lot of consecutive rows, i.e. a block-based sample taken from just a few blocks. If the replacement phase that follows somehow uses a slightly smaller replacement probability than it should, more of these rows will survive replacement, resulting in exactly the kind of inflated numbers of duplicate values we're seeing. I've yet to validate this by looking at the reservoir before and after the replacement stage, though.
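To make the two phases I'm talking about concrete, here's a minimal sketch of reservoir sampling (the textbook Algorithm R; ANALYZE uses the skip-based Vitter variant and picks blocks first, but the fill-then-replace structure is the same). This is only an illustration, not the actual acquire_sample_rows() code, and the names are made up:

#include <stdlib.h>

typedef struct
{
    long rownum;   /* stand-in for a heap tuple */
} SampleRow;

void
sample_rows(long nrows, long sample_size, SampleRow *reservoir)
{
    long t;

    /*
     * Phase 1: fill the reservoir with the first sample_size rows the scan
     * produces.  For a clustered table like baz2 these are consecutive rows
     * from just a handful of blocks - effectively a block sample.
     */
    for (t = 0; t < sample_size && t < nrows; t++)
        reservoir[t].rownum = t;

    /*
     * Phase 2: each later row t replaces a randomly chosen reservoir slot
     * with probability sample_size / (t + 1).  If that probability ends up
     * even slightly too small, rows from phase 1 survive too often and the
     * sample keeps its block-ish character.
     */
    for (; t < nrows; t++)
    {
        long k = random() % (t + 1);

        if (k < sample_size)
            reservoir[k].rownum = t;
    }
}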
So at least for the purpose of estimating n_distinct, our current sampling method seems to exhibit the worst rather than the best properties of block- and row-based sampling. What conclusions to draw from that, I'm not sure yet - other than that if we move to block-based sampling, we'll certainly have to change the way we estimate n_distinct.

best regards,
Florian Pflug