Re: ANALYZE sampling is too good

From: Jeff Janes
Subject: Re: ANALYZE sampling is too good
Date:
Msg-id: CAMkU=1yVmSVjSgw0qL9imbabLa7_kUYQxu9ZcdO1o357kq=_Zw@mail.gmail.com
In reply to: Re: ANALYZE sampling is too good  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: ANALYZE sampling is too good
           Re: ANALYZE sampling is too good
List: pgsql-hackers
On Sat, Dec 7, 2013 at 11:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark <stark@mit.edu> wrote:
> I always gave the party line that ANALYZE only takes a small
> constant-sized sample so even very large tables should be very quick.
> But after hearing the same story again in Heroku I looked into it a
> bit further. I was kind of shocked by the numbers.
>
> ANALYZE takes a sample of 300 * statistics_target rows. That sounds
> pretty reasonable but with default_statistics_target set to 100 that's
> 30,000 rows. If I'm reading the code right, it takes this sample by
> sampling 30,000 blocks and then (if the table is large enough) taking
> an average of one row per block. Each block is 8192 bytes so that
> means it's reading 240MB of each table. That's a lot more than I
> realized.
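
As a quick aside, the arithmetic is easy to reproduce (this sketch assumes the default 8 kB block size and default_statistics_target = 100):

    -- rows sampled by ANALYZE: 300 * statistics target
    SELECT 300 * 100 AS sample_rows;                           -- 30000

    -- worst case of one sampled row per block: ~30000 blocks of 8192 bytes
    SELECT pg_size_pretty(30000 * 8192::bigint) AS data_read;  -- ~234 MB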

That is a lot.  On the other hand, I question the subject line:
sometimes, our ANALYZE sampling is not good enough.  Before we raised
the default statistics target from 10 to 100, complaints about bad
plan choices due to insufficiently-precise statistics were legion --
and we still have people periodically proposing to sample a fixed
percentage of the table instead of a fixed amount of the table, even
on large tables, which is going the opposite direction.  I think this
is because they're getting really bad n_distinct estimates, and no
fixed-size sample can reliably give a good one.


I don't recall ever tracing a bad plan down to a bad n_distinct.  I have seen several that were due to bad frequency estimates in the MCV list, because hash join planning is extremely sensitive to that.  Do we have some kind of catalog of generators of problematic data, so that changes can be tested on known problem sets?  Perhaps a wiki page to accumulate them would be useful.  For automated testing I guess the generator and query are the easy part; the hard part is the cost settings/caching/RAM needed to trigger the problem, and parsing and interpreting the results.
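
As a strawman for what one catalog entry could look like (the table name, the 50% skew, and the row count below are arbitrary choices for illustration, not a known problem case):

    -- heavily skewed column: half the rows share one value, the rest are spread thin
    CREATE TABLE skewed AS
    SELECT CASE WHEN random() < 0.5 THEN 1
                ELSE (random() * 1000000)::int
           END AS id
    FROM generate_series(1, 10000000);

    ANALYZE skewed;

    -- the frequency the sample assigns to the dominant value is what hash join
    -- planning ends up leaning on
    SELECT n_distinct, most_common_freqs[1] AS top_mcv_freq
    FROM pg_stats
    WHERE tablename = 'skewed' AND attname = 'id';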
 

More generally, I think the basic problem that people are trying to
solve by raising the statistics target is to avoid index scans on
gigantic tables.  Obviously, there are a lot of other situations where
inadequate statistics can cause problems, but that's a pretty
easy-to-understand one that we do not always get right.  We know that
an equality search looking for some_column = 'some_constant', where
some_constant is not an MCV, must be more selective than a search for the
least-frequent MCV.  If you store more and more MCVs for a table,
eventually you'll have enough that the least-frequent one is pretty
infrequent, and then things work a lot better.
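
For a concrete view of that effect, one can raise the per-column target and look at the tail of the stored MCV list (table and column names below are placeholders):

    ALTER TABLE some_table ALTER COLUMN some_column SET STATISTICS 400;
    ANALYZE some_table;

    -- frequency of the least common stored MCV; a longer list pushes this tail
    -- frequency down, tightening the bound on non-MCV equality estimates
    SELECT most_common_freqs[array_length(most_common_freqs, 1)] AS least_mcv_freq
    FROM pg_stats
    WHERE tablename = 'some_table' AND attname = 'some_column';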

My reading of the code is that if it is not in the MCV list, then it is assumed to have the average selectivity (about 1/n_distinct, but deflating top and bottom for the MCV list).  There is also a check that it is less than the least common of the MCVs, but I don't know why that situation would ever prevail--that should always be higher than or equal to the average selectivity.
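
That average can be read back out of pg_stats, roughly like this (names are placeholders again, and a negative n_distinct, i.e. a fraction of the row count, would need scaling by reltuples first):

    -- non-null, non-MCV fraction spread evenly over the remaining distinct values
    SELECT (1 - null_frac - (SELECT sum(f) FROM unnest(most_common_freqs) AS f))
           / (n_distinct - array_length(most_common_freqs, 1)) AS avg_selectivity
    FROM pg_stats
    WHERE tablename = 'some_table' AND attname = 'some_column';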

 

This is more of a problem for big tables than for small tables.  MCV
#100 can't have a frequency of greater than 1/100 = 0.01, but that's a
lot more rows on a big table than on a small one.  On a table with 10
million rows we might estimate something close to 100,000 rows when
the real number is just a handful; when the table has only 10,000
rows, we just can't be off by as many orders of magnitude.  Things
don't always work out that badly, but in the worst case they do.

Maybe there's some highly-principled statistical approach which could
be taken here, and if so that's fine, but I suspect not.  So what I
think we should do is auto-tune the statistics target based on the
table size.  If, say, we think that the generally useful range for the
statistics target is something like 10 to 400, then let's come up with
a formula based on table size that outputs 10 for small tables, 400
for really big tables, and intermediate values for tables in the
middle.
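
Purely as a sketch of the shape such a formula could take (the cube-root scaling and the divisor below are arbitrary, not anything the planner does today):

    -- clamp a slowly growing function of the row count into the 10..400 range
    SELECT relname, reltuples,
           least(400, greatest(10, (cbrt(reltuples) / 2)::int)) AS auto_stats_target
    FROM pg_class
    WHERE relname = 'some_table';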

I think that parts of the planner are N^2 in the size of the histogram (or was that the size of the MCV list?).  So we would probably need a way to use a larger sample size to get more accurate n_distinct and MCV frequencies, but not save the entire histogram that goes with that sample size.
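
To make the coupling concrete (taking a per-column target of 400, and assuming it is the MCV list that the join-selectivity code walks pairwise):

    SELECT 300 * 400 AS rows_sampled,        -- sample size scales with the target
           400       AS max_mcvs_stored,     -- so do the stored MCV list and histogram
           400 * 400 AS mcv_pairs_compared;  -- pairwise work when joining two such columns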

 
Cheers,

Jeff
