Re: ANALYZE sampling is too good

From Jeff Janes
Subject Re: ANALYZE sampling is too good
Date
Msg-id CAMkU=1wRH_jopyCAyUKbdQY4DWhsx1-1e2s0VVgfrryfXDe2SQ@mail.gmail.com
In reply to Re: ANALYZE sampling is too good  (Florian Pflug <fgp@phlo.org>)
Responses Re: ANALYZE sampling is too good
List pgsql-hackers
On Thu, Dec 12, 2013 at 6:39 AM, Florian Pflug <fgp@phlo.org> wrote:
Here's an analysis of Jeff Janes' simple example of a table where our
n_distinct estimate is way off.

On Dec11, 2013, at 00:03 , Jeff Janes <jeff.janes@gmail.com> wrote:
> create table baz as select floor(random()*10000000), md5(random()::text) from generate_series(1,100000000);
> create table baz2 as select * from baz order by floor;
> create table baz3 as select * from baz order by md5(floor::text);
>
> baz unclustered, baz2 is clustered with perfect correlation, baz3 is clustered but without correlation.
>
> After analyzing all of them:
>
> select tablename, n_distinct,correlation  from pg_stats where tablename  like 'baz%' and attname='floor'  ;
>  tablename | n_distinct  | correlation
> -----------+-------------+-------------
>  baz       | 8.56006e+06 |  0.00497713
>  baz2      |      333774 |           1
>  baz3      |      361048 |  -0.0118147

I think I understand what's going on here. I worked with a reduced test case
of 1e7 rows containing random values between 0 and 5e5 and instrumented
analyze to print the raw ndistinct and nmultiple values of the sample
population (i.e. the number of distinct values in the sample, and the number
of distinct values which appeared more than once). I've also considered only
baz and baz2, and thus removed the then-unnecessary md5 column. To account for
the reduced table sizes, I adjusted default_statistics_target to 10 instead of
100. The resulting estimates are then

 tablename | n_distinct (est) | n_distinct (act)
-----------+------------------+------------------
 baz       |           391685 |           500000
 baz2      |            36001 |           500000

ANALYZE assumes that both tables contain 10000048 rows and samples 3000 of
those.

The sample of baz contains 2989 distinct values, 11 of which appear more than
once. The sample of baz2 contains 2878 distinct values, 117 (!) of which
appear more than once.

The very different results stem from the Duj1 estimator we use. It estimates
n_distinct by computing n*d/(n - f1 + f1*n/N) where n is the number of
samples, N the number of rows, d the number of distinct samples, and f1 the
number of distinct samples occurring exactly once. If all samples are unique
(i.e. n=d=f1) this yields N. But if f1 is less than d, the result drops very
quickly - sampling baz2 produces 117 non-unique values out of 2878 - roughly
4% - and the estimate is already less than a tenth of what it would be if
nearly all sampled values were unique (compare baz, where only 11 of the 2989
sampled values repeat and the estimate comes out at 391685).
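For reference, the Duj1 arithmetic can be checked by plugging those sample
counts into the formula in plain SQL, taking n = 3000, N = 10000048, and f1
as d minus the number of values seen more than once (my reading of how the
instrumented counts map onto f1):

  -- Sanity check of the Duj1 estimator with the sample counts quoted above;
  -- f1 = d minus the number of distinct values that appear more than once.
  SELECT 'baz'  AS tablename,
         3000.0 * 2989 / (3000 - 2978 + 2978 * 3000.0 / 10000048) AS duj1_estimate
  UNION ALL
  SELECT 'baz2',
         3000.0 * 2878 / (3000 - 2761 + 2761 * 3000.0 / 10000048);

That comes out at roughly 391685 and 36001 respectively, matching the table
above.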

Which leaves us with the question of why sampling baz2 produces more duplicate
values than sampling baz does. Now, if we used block sampling, that behaviour
would be unsurprising - baz2 is sorted, so each block very likely contains
each value more than once, since the row count exceeds the number of possible
values by more than an order of magnitude. In fact, with block sampling, we'd
probably see f1 values close to 0 and thus our estimate of n_distinct would be
roughly equal to the number of distinct values in the *sample* population,
i.e. about 300 or so.

So why does Vitter's algorithm fail here? Given that we see inflated numbers
of duplicate values in our sample, yet still far fewer than block-based
sampling would yield, my guess is that it's the initial reservoir filling that
bites us here. After that initial filling step, the reservoir contains a lot of
consecutive rows, i.e. a block-based sample taken from just a few blocks. If
the replacement phase that follows somehow uses a slightly smaller replacement
probability than it should, more of these rows will survive replacement,
resulting in exactly the kind of inflated numbers of duplicate values we're
seeing. I've yet to validate this by looking at the reservoir before and after
the replacement stage, though.


I think the problem is more subtle than that.  It is easier to visualize if you think of every block having the same number of rows, with that number being fairly large.  If you pick 30,000 rows at random from 1,000,000 blocks, the number of rows chosen from any given block should be close to following a Poisson distribution with an average of 0.03, which means about 29113 blocks should have exactly 1 row chosen from them while 441 would have two or more rows chosen from them.

But if you instead select 30,000 rows from 30,000 blocks, which is what we ask Vitter's algorithm to do, you get roughly a Poisson distribution with an average of 1.0.  Then about 11036 blocks have exactly one row chosen from them, and 7927 blocks have two or more rows sampled from them.  Another 11,036 blocks get zero rows selected from them due to Vitter, in addition to the 970,000 that didn't even get submitted to Vitter in the first place.  That is why you see too many duplicates for clustered data, as too many blocks are sampled multiple times.
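To show where those figures come from, here is the Poisson arithmetic written
out as a quick SQL query (the 1,000,000 and 30,000 block counts are the ones
from the thought experiment above):

  -- Expected block counts under Poisson sampling:
  --   lambda = 0.03  (30,000 rows spread over 1,000,000 blocks)
  --   lambda = 1.0   (30,000 rows spread over 30,000 blocks)
  SELECT round(1000000 * exp(-0.03) * 0.03)        AS one_row_lambda_003,
         round(1000000 * (1 - exp(-0.03) * 1.03))  AS two_plus_lambda_003,
         round(30000 * exp(-1.0))                  AS one_row_lambda_1,
         round(30000 * (1 - 2 * exp(-1.0)))        AS two_plus_lambda_1;

That reproduces the 29113 / 441 and 11036 / 7927 figures; for lambda = 1.0 the
zero-row count is the same e^-1 term, i.e. another 11036 blocks.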

The Poisson argument doesn't apply cleanly when blocks have a variable number of rows, but the general principle still applies.  Too few blocks have exactly one row sampled from them, and too many blocks have either zero rows or more than one row.

It would be relatively easy to fix this if we trusted the number of visible rows in each block to be fairly constant.  But without that assumption, I don't see a way to fix the sample selection process without reading the entire table.
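As an aside, how constant the per-block row counts actually are is easy to
eyeball from SQL. This is just an illustrative query against the baz table
from the example - it scans the whole table, so it's nothing ANALYZE itself
could use - relying on the block number embedded in each row's ctid:

  -- Histogram of visible rows per block; (ctid::text::point)[0] extracts
  -- the block number from the ctid.
  SELECT rows_per_block, count(*) AS blocks
  FROM (SELECT (ctid::text::point)[0]::bigint AS blkno, count(*) AS rows_per_block
        FROM baz
        GROUP BY 1) AS per_block
  GROUP BY rows_per_block
  ORDER BY rows_per_block;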


 
So at least for the purpose of estimating n_distinct, our current sampling
method seems to exhibit the worst rather than the best properties of block-
and row-based sampling. What conclusions to draw from that, I'm not sure yet -
other than that if we move to block-based sampling, we'll certainly have to
change the way we estimate n_distinct.

Perhaps we should consider changing that even if we don't change to block-based sampling--but what would the other candidates be?  I guess the same paper that presented Duj1 would be a good place to start, but are there others?  It sounds like this was discussed several years ago, but I didn't find it in the archives with a casual search.
 
Cheers,

Jeff
