Re: ANALYZE sampling is too good

From: Gavin Flower
Subject: Re: ANALYZE sampling is too good
Date:
Msg-id: 52A3DA05.7040506@archidevsys.co.nz
In response to: Re: ANALYZE sampling is too good  (Greg Stark <stark@mit.edu>)
List: pgsql-hackers
On 08/12/13 10:27, Greg Stark wrote:
> On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based on
>> a less-than-5% sample is going to be wildly inaccurate.  Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
> This is nonsense. The math behind the deductions the planner makes is
> well understood and we know how to estimate the precision based on the
> sample size. There are some questions that really need a proportional
> sample such as n_distinct (and as Robert points out the MCV list) but
> the main motivation of the sample size is the histograms and those are
> used to answer questions which very clearly do not need a proportional
> sample. The statistics is very clear there. Using a proportional
> sample for the histograms would just be silly. It would be
> substituting a gut feeling for real statistics.
>
> The problems with using a proportional sample for things like
> n_distinct or the MCV list is that you're talking about sorting or
> hashing an unboundedly large set of values and storing an unboundedly
> large array in the statistics table. I don't think that would be
> tenable without dramatically changing the way we process and store that
> data to be more scalable. Using a lossy counting algorithm and
> something more scalable than toasted arrays would be prerequisites I
> think.
>
> And as Robert mentions even if we solved those problems it wouldn't
> help n_distinct. You really need to read all the values to deal with
> n_distinct.
>
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables.  We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
> This would be helpful if you could specify what you mean by
> "black-based sampling". The reason these previous pleas didn't go
> anywhere is not because we can't do math. It's because of the lack of
> specifics here. What we do *today* is called "block-based sampling" by
> the literature.
>
> What I'm saying is we need to change the "block-based sampling" that
> we do to extract more rows per block. We currently extract the
> "correct" number of rows to get a strong guarantee of uniformity. If
> we could get a weaker guarantee of being "close enough" to uniform
> samples for the deductions we want to make and get many more rows per
> block then we could read a lot fewer blocks.
>
> Or to put it another way people could increase the statistics target
> dramatically and still be reading the same number of blocks as today.
> In an ideal world perhaps we could have people reading 1% of the
> blocks they read today and get statistics targets 10x better than
> today.
>
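As an aside on the lossy counting mentioned above: presumably that means
something along the lines of the Manku & Motwani algorithm. A rough,
self-contained sketch (all names invented here, nothing to do with the
existing ANALYZE code) could look like this -- it keeps approximate counts
of frequent values in bounded memory, which is the property that matters
for building an MCV list without sorting or hashing the whole sample:

/*
 * Rough sketch of Manku/Motwani-style lossy counting.  Feed it one value
 * per row scanned; at the end, entries with large counts are candidate
 * MCVs, and each count undercounts the true frequency by at most
 * epsilon * nseen.
 */
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    int32_t     value;
    long        count;          /* occurrences observed since insertion */
    long        delta;          /* maximum possible undercount */
} LCEntry;

typedef struct
{
    LCEntry    *entries;
    int         nentries;
    int         maxentries;     /* simplification: hard cap for the sketch */
    long        nseen;          /* values fed in so far */
    long        width;          /* bucket width, ceil(1 / epsilon) */
} LossyCounter;

static LossyCounter *
lc_create(double epsilon, int maxentries)
{
    LossyCounter *lc = malloc(sizeof(LossyCounter));

    lc->entries = malloc(sizeof(LCEntry) * maxentries);
    lc->nentries = 0;
    lc->maxentries = maxentries;
    lc->nseen = 0;
    lc->width = (long) ceil(1.0 / epsilon);
    return lc;
}

static void
lc_add(LossyCounter *lc, int32_t value)
{
    long        bucket;
    int         i;
    int         found = 0;

    lc->nseen++;
    bucket = (lc->nseen + lc->width - 1) / lc->width;   /* ceil(nseen/width) */

    /* linear scan keeps the sketch simple; a real version would hash */
    for (i = 0; i < lc->nentries; i++)
    {
        if (lc->entries[i].value == value)
        {
            lc->entries[i].count++;
            found = 1;
            break;
        }
    }
    if (!found && lc->nentries < lc->maxentries)
    {
        lc->entries[lc->nentries].value = value;
        lc->entries[lc->nentries].count = 1;
        lc->entries[lc->nentries].delta = bucket - 1;
        lc->nentries++;
    }

    /* At each bucket boundary, evict entries that cannot be frequent. */
    if (lc->nseen % lc->width == 0)
    {
        int         keep = 0;

        for (i = 0; i < lc->nentries; i++)
        {
            if (lc->entries[i].count + lc->entries[i].delta > bucket)
                lc->entries[keep++] = lc->entries[i];
        }
        lc->nentries = keep;
    }
}

The attraction is that the working set stays bounded (on the order of
(1/epsilon) * log(epsilon * nseen) entries) no matter how many rows are
scanned, so a proportional -- or even full -- pass for the MCV list would
not need an unboundedly large sort or toasted array.
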
I suspect that the number of rows to sample should be something of the form:

      { N                       where N <= a * 100
n =   { (a * N) / 100           where a * 100 < N <= 10000
      { (a * SQRT(N)) / 100     where N > 10000

n: the number of rows to sample
N: total number of rows
a: configurable coefficient  (1 <= a <= 100)
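
Or, expressed as a rough C sketch (proposed_sample_rows is just a name made
up for illustration):

#include <math.h>

/* n = rows to sample, N = total rows, a = configurable coefficient (1..100) */
static double
proposed_sample_rows(double a, double N)
{
    if (N <= a * 100.0)
        return N;                       /* small table: sample everything */
    else if (N <= 10000.0)
        return (a * N) / 100.0;         /* medium table: a% of the rows */
    else
        return (a * sqrt(N)) / 100.0;   /* large table: grow only with sqrt(N) */
}

For example, proposed_sample_rows(1, 10000) = 100 and
proposed_sample_rows(100, 10000) = 10000, matching the first two rows of the
table below.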


For very large numbers of rows, like 10^8, taking a constant fraction will oversample for most purposes, hence the square root.
  a       N        n   sampled%
  1   10000      100         1%
100   10000    10000       100%
  1    10^8    10000      0.01%
100    10^8     10^6         1%
  1   10^10     10^5     0.001%
100   10^10     10^7      0.01%


For very large values of N, you can still get a representative sample with a very small fraction of the rows sampled.

Hmm... Yes, I can see some issues now that I have tried it out with test values!
However, it might inspire a more useful and practical approach!


Cheers,
Gavin


