Re: ANALYZE sampling is too good
From:        Gavin Flower
Subject:     Re: ANALYZE sampling is too good
Date:
Msg-id:      52A3DA05.7040506@archidevsys.co.nz
In reply to: Re: ANALYZE sampling is too good (Greg Stark <stark@mit.edu>)
List:        pgsql-hackers
On 08/12/13 10:27, Greg Stark wrote:
> On Sat, Dec 7, 2013 at 8:25 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> The only approach which makes sense is to base it on a % of the
>> table. In fact, pretty much every paper which has examined
>> statistics estimation for database tables has determined that any
>> estimate based on a less-than-5% sample is going to be wildly
>> inaccurate. Not that 5% samples are 100% accurate, but at least they
>> fit the 80/20 rule.
>
> This is nonsense. The math behind the deductions the planner makes
> is well understood, and we know how to estimate the precision based
> on the sample size. There are some questions that really need a
> proportional sample, such as n_distinct (and, as Robert points out,
> the MCV list), but the main motivation of the sample size is the
> histograms, and those are used to answer questions which very
> clearly do not need a proportional sample. The statistics is very
> clear there. Using a proportional sample for the histograms would
> just be silly. It would be substituting a gut feeling for real
> statistics.
>
> The problem with using a proportional sample for things like
> n_distinct or the MCV list is that you're talking about sorting or
> hashing an unboundedly large set of values and storing an unboundedly
> large array in the statistics table. I don't think that would be
> tenable without dramatically changing the way we process and store
> that data to be more scalable. Using a lossy counting algorithm and
> something more scalable than toasted arrays would be prerequisites,
> I think.
>
> And as Robert mentions, even if we solved those problems it wouldn't
> help n_distinct. You really need to read all the values to deal with
> n_distinct.
>
>> This is the reason why implementing block-based sampling is
>> critical; using our current "take one row out of every page" method,
>> sampling 5% of the table means scanning the whole thing in most
>> tables. We also need to decouple the number of MCVs we keep from the
>> sample size. Certainly our existing sampling algo seems designed to
>> maximize IO for the sample size.
>
> This would be helpful if you could specify what you mean by
> "block-based sampling". The reason these previous pleas didn't go
> anywhere is not because we can't do math. It's because of the lack
> of specifics here. What we do *today* is called "block-based
> sampling" by the literature.
>
> What I'm saying is we need to change the "block-based sampling" that
> we do to extract more rows per block. We currently extract the
> "correct" number of rows to get a strong guarantee of uniformity. If
> we could get a weaker guarantee of being "close enough" to uniform
> samples for the deductions we want to make, and get many more rows
> per block, then we could read a lot fewer blocks.
>
> Or to put it another way, people could increase the statistics
> target dramatically and still be reading the same number of blocks
> as today. In an ideal world perhaps we could have people reading 1%
> of the blocks they read today and get statistics targets 10x better
> than today.

I suspect that the number of rows to sample should be something of the
form:

        { N                     where N <= a * 100
    n = { (a * N) / 100         where a * 100 < N <= 10000
        { (a * SQRT(N)) / 100   where N > 10000

    n: the number of rows to sample
    N: total number of rows
    a: configurable coefficient (1 <= a <= 100)

For very large numbers of rows, like 10^8, taking a constant fraction
will oversample for most purposes, hence the square root.
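A minimal sketch of that rule in Python (the function name is mine; it
implements the three branches exactly as written above):

    import math

    def sample_size(N, a):
        """Proposed number of rows to sample from a table of N rows,
        with configurable coefficient a (1 <= a <= 100)."""
        if N <= a * 100:
            return N                            # tiny table: take every row
        elif N <= 10000:
            return (a * N) // 100               # mid-size: a% of the rows
        else:
            return int(a * math.sqrt(N) / 100)  # large: grow with sqrt(N)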
    a     N       n       sampled%
    ----  ------  ------  --------
    1     10000   100     1%
    100   10000   10000   100%
    1     10^8    10000   0.01%
    100   10^8    10^6    1%
    1     10^10   10^5    0.001%
    100   10^10   10^7    0.01%

For very large values of N, you can still get a representative sample
with a very small fraction of the rows sampled.

Hmm... Yes, I can see some issues, once I tried out with test values!
However, it might inspire a more useful and practical approach!

Cheers,
Gavin
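As an aside on the "lossy counting algorithm" Greg mentions above:
this is presumably Manku and Motwani's one-pass frequency sketch,
which bounds both the memory used and the undercount error. A minimal
Python sketch (my naming, not PostgreSQL code):

    import math

    def lossy_count(stream, epsilon):
        """One pass over `stream`; returns approximate item counts.
        Each count undercounts the true count by at most epsilon * n
        (n = stream length) while keeping only about
        (1/epsilon) * log(epsilon * n) entries."""
        w = math.ceil(1.0 / epsilon)   # bucket width
        entries = {}                   # item -> (count, max_undercount)
        n = 0
        for x in stream:
            n += 1
            bucket = math.ceil(n / w)  # id of the current bucket
            if x in entries:
                count, delta = entries[x]
                entries[x] = (count + 1, delta)
            else:
                entries[x] = (1, bucket - 1)
            if n % w == 0:             # bucket boundary: prune rare items
                entries = {k: (c, d) for k, (c, d) in entries.items()
                           if c + d > bucket}
        return entries, n

An MCV-style list would then keep every item whose estimated count is
at least (s - epsilon) * n for a chosen support threshold s > epsilon;
the algorithm guarantees no item with true frequency >= s * n is
missed.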