Re: ANALYZE sampling is too good
| From | Gavin Flower |
|---|---|
| Subject | Re: ANALYZE sampling is too good |
| Date | |
| Msg-id | 52A3DA05.7040506@archidevsys.co.nz |
| In reply to | Re: ANALYZE sampling is too good (Greg Stark <stark@mit.edu>) |
| List | pgsql-hackers |
On 08/12/13 10:27, Greg Stark wrote:
> On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> The only approach which makes sense is to base it on a % of the table.
>> In fact, pretty much every paper which has examined statistics
>> estimation for database tables has determined that any estimate based on
>> a less-than-5% sample is going to be wildly inaccurate. Not that 5%
>> samples are 100% accurate, but at least they fit the 80/20 rule.
> This is nonsense. The math behind the deductions the planner makes is
> well understood and we know how to estimate the precision based on the
> sample size. There are some questions that really need a proportional
> sample such as n_distinct (and as Robert points out the MCV list) but
> the main motivation of the sample size is the histograms and those are
> used to answer questions which very clearly do not need a proportional
> sample. The statistics is very clear there. Using a proportional
> sample for the histograms would just be silly. It would be
> substituting a gut feeling for real statistics.
>
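
As an illustration of that point, here is a rough Python simulation (nothing to do with the ANALYZE code itself) showing that the error of a sample-based percentile estimate tracks the absolute sample size, not the fraction of the table sampled; the table sizes, sample size, and data distribution are all made up for the example:

```python
# Rough simulation (not PostgreSQL code): estimate the 95th percentile of a
# synthetic column from a fixed-size random sample, for two "tables" of very
# different sizes.  The relative error is similar in both cases even though
# the sampled fraction differs by 10x.
import random
import statistics

def percentile(values, q):
    """Naive percentile by sorting; fine for a simulation."""
    s = sorted(values)
    return s[int(q * (len(s) - 1))]

random.seed(42)
SAMPLE_SIZE = 10_000

for table_rows in (100_000, 1_000_000):
    table = [random.expovariate(1.0) for _ in range(table_rows)]  # skewed column
    true_p95 = percentile(table, 0.95)

    errors = []
    for _ in range(20):
        sample = random.sample(table, SAMPLE_SIZE)
        errors.append(abs(percentile(sample, 0.95) - true_p95) / true_p95)

    print(f"N={table_rows:>9,}  sampled {SAMPLE_SIZE / table_rows:6.2%}  "
          f"mean relative error of p95: {statistics.mean(errors):.4f}")
```
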
> The problems with using a proportional sample for things like
> n_distinct or the MCV list is that you're talking about sorting or
> hashing an unboundedly large set of values and storing an unboundedly
> large array in the statistics table. I don't think that would be
> tenable without dramatically changing the way we process and store that
> data to be more scalable. Using a lossy counting algorithm and
> something more scalable than toasted arrays would be prerequisites I
> think.
>
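
To make the "lossy counting" idea concrete, here is a sketch of one common scheme (Manku/Motwani-style lossy counting) in Python; this is only to show the shape of the algorithm, not a proposal for what ANALYZE should store, and the epsilon/support numbers are arbitrary:

```python
# Sketch of Manku/Motwani-style lossy counting: one pass, bounded memory,
# approximate counts.  Any value with true frequency >= epsilon * N survives,
# and reported counts undercount by at most epsilon * N.
import random

def lossy_count(stream, epsilon):
    bucket_width = int(1 / epsilon)
    counts = {}                 # value -> (count, max undercount)
    current_bucket = 1
    seen = 0
    for value in stream:
        seen += 1
        if value in counts:
            c, d = counts[value]
            counts[value] = (c + 1, d)
        else:
            counts[value] = (1, current_bucket - 1)
        if seen % bucket_width == 0:
            # Bucket boundary: prune entries that cannot be frequent.
            counts = {v: (c, d) for v, (c, d) in counts.items()
                      if c + d > current_bucket}
            current_bucket += 1
    return counts, seen

# Example: a skewed stream -- a few hot values plus a long tail.
random.seed(0)
stream = (random.choice("aaaaabbc") if random.random() < 0.7
          else random.randrange(100_000)
          for _ in range(200_000))
epsilon, support = 0.001, 0.05
counts, n = lossy_count(stream, epsilon)
mcvs = {v: c for v, (c, d) in counts.items() if c >= (support - epsilon) * n}
print(f"tracked {len(counts)} candidates, reported {len(mcvs)} MCVs: {mcvs}")
```
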
> And as Robert mentions even if we solved those problems it wouldn't
> help n_distinct. You really need to read all the values to deal with
> n_distinct.
>
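
A toy illustration of why n_distinct is so resistant to sampling (naive scale-up estimator, made-up data; the estimator ANALYZE actually uses is more sophisticated than this, but it faces the same fundamental problem of guessing how many values it never saw):

```python
# Toy simulation (not the ANALYZE estimator): estimate n_distinct from a 1%
# sample by naively scaling up the number of distinct values seen.  The same
# sample size gives a wildly wrong answer for the low-cardinality column,
# because the sample cannot tell how many duplicates the unseen rows hold.
import random

random.seed(1)
N = 1_000_000
SAMPLE_SIZE = N // 100

columns = {
    "low_cardinality":  [random.randrange(500) for _ in range(N)],      # ~500 distinct
    "high_cardinality": [random.randrange(10 * N) for _ in range(N)],   # nearly all distinct
}

for name, col in columns.items():
    true_distinct = len(set(col))
    sample_distinct = len(set(random.sample(col, SAMPLE_SIZE)))
    naive_estimate = sample_distinct * (N / SAMPLE_SIZE)   # pretend the sample scales linearly
    print(f"{name:16s} true n_distinct = {true_distinct:>9,}   "
          f"naive estimate from 1% sample = {int(naive_estimate):>11,}")
```
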
>> This is the reason why implementing block-based sampling is critical;
>> using our current "take one row out of every page" method, sampling 5%
>> of the table means scanning the whole thing in most tables. We also
>> need to decouple the number of MCVs we keep from the sample size.
>> Certainly our existing sampling algo seems designed to maximize IO for
>> the sample size.
> This would be helpful if you could specify what you mean by
> "black-based sampling". The reason these previous pleas didn't go
> anywhere is not because we can't do math. It's because of the lack of
> specifics here. What we do *today* is called "block-based sampling" by
> the literature.
>
> What I'm saying is we need to change the "block-based sampling" that
> we do to extract more rows per block. We currently extract the
> "correct" number of rows to get a strong guarantee of uniformity. If
> we could get a weaker guarantee of being "close enough" to uniform
> samples for the deductions we want to make and get many more rows per
> block then we could read a lot fewer blocks.
>
> Or to put it another way people could increase the statistics target
> dramatically and still be reading the same number of blocks as today.
> In an ideal world perhaps we could have people reading 1% of the
> blocks they read today and get statistics targets 10x better than
> today.
>
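
To put rough numbers on the I/O side of this, a back-of-envelope sketch; the rows-per-block figure, table size, and sample size are all assumptions for illustration rather than values taken from the code:

```python
# Back-of-envelope block I/O comparison (all parameters are assumptions for
# illustration, not values read out of PostgreSQL).
rows_per_block = 100        # assumed average heap tuples per 8kB page
table_blocks = 1_000_000    # a table of roughly 8GB
sample_rows = 30_000        # e.g. 300 rows per statistics-target unit at the default of 100

# Roughly today's behaviour: about one sampled row per visited block.
blocks_one_row_each = sample_rows
# The alternative being discussed: keep many (here: all) rows from each visited block.
blocks_all_rows = sample_rows // rows_per_block

print(f"blocks read, ~1 row per block  : {blocks_one_row_each:>8,}")
print(f"blocks read, all rows per block: {blocks_all_rows:>8,}")

# Josh's 5%-of-rows scenario: with uniform row-level sampling, the chance that
# a given page contains no sampled row at all is (1 - 0.05)**rows_per_block,
# so nearly every page ends up being read.
p_untouched = (1 - 0.05) ** rows_per_block
print(f"pages untouched by a uniform 5% row sample: {p_untouched:.2%}")
print(f"pages read by a uniform 5% row sample: "
      f"{table_blocks * (1 - p_untouched):,.0f} of {table_blocks:,}")
```
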
I suspect that the number of rows to sample should be something of the form:
        { N                      where N <= a * 100
    n = { (a * N) / 100          where a * 100 < N <= 10000
        { (a * SQRT(N)) / 100    where N > 10000
n: the number of rows to sample
N: total number of rows
a: configurable coefficient (1 <= a <= 100)
For very large numbers of rows, like 10^8, taking a constant fraction will
oversample for most purposes, hence the square root.
| a | N | n | sampled % |
|---|---|---|---|
| 1 | 10000 | 100 | 1% |
| 100 | 10000 | 10000 | 100% |
| 1 | 10^8 | 10000 | 0.01% |
| 100 | 10^8 | 10^6 | 1% |
| 1 | 10^10 | 10^5 | 0.001% |
| 100 | 10^10 | 10^7 | 0.01% |
For very large values of N, you can still get a representative sample with a
very small fraction of the rows sampled.
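
For anyone who wants to play with the numbers, the rule above as a literal Python transcription (including the /100 scaling exactly as written; purely for trying out test values):

```python
# Literal transcription of the piecewise rule above, for trying out test values.
from math import sqrt

def sample_rows(N, a):
    """N: total number of rows; a: configurable coefficient, 1 <= a <= 100."""
    if not 1 <= a <= 100:
        raise ValueError("a must be between 1 and 100")
    if N <= a * 100:
        return N                      # small table: sample everything
    if N <= 10_000:
        return (a * N) / 100          # mid-size: a percent of the rows
    return (a * sqrt(N)) / 100        # large table: grow only with sqrt(N)

for a in (1, 100):
    for N in (10_000, 10**8, 10**10):
        n = sample_rows(N, a)
        print(f"a={a:>3}  N={N:>14,}  n={n:>12,.0f}  sampled {n / N:.6%}")
```
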
Hmm... Yes, I can see some issues once I tried it out with test values!
However, it might inspire a more useful and practical approach!
Cheers,
Gavin