Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?

From: Robert Haas
Subject: Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?
Date:
Msg-id: CA+TgmoYbm+RL21JsQr=q4KD6r5ia+izO_DUqpaoqMuyv=7ZYMA@mail.gmail.com
In response to: [HACKERS] Make ANALYZE more selective about what is a "most common value"?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: [HACKERS] Make ANALYZE more selective about what is a "most common value"?  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Sun, Jun 4, 2017 at 5:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I've been thinking about the behavior discussed in
> https://www.postgresql.org/message-id/flat/20170522132017.29944.48391%40wrigleys.postgresql.org
> and it seems to me that there are a couple of things we ought to do about
> it.
>
> First, I think we need a larger hard floor on the number of occurrences
> of a value that're required to make ANALYZE decide it is a "most common
> value".  The existing coding is willing to believe that anything that
> appears at least twice in the sample is a potential MCV, but that design
> originated when we were envisioning stats samples of just a few thousand
> rows --- specifically, default_statistics_target was originally just 10,
> leading to a 3000-row sample size.  So accepting two-appearance values as
> MCVs would lead to a minimum MCV frequency estimate of 1/1500.  Now it
> could be a tenth or a hundredth of that.
>
> As a round number, I'm thinking that a good floor would be a frequency
> estimate of 1/1000.  With today's typical sample size of 30000 rows,
> a value would have to appear at least 30 times in the sample to be
> believed to be an MCV.  That seems like it gives us a reasonable margin
> of error against the kind of sampling noise seen in the above-cited
> thread.
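
(To make the quoted arithmetic concrete, here is a rough sketch in
Python, assuming the usual sample size of 300 * default_statistics_target
rows; it is purely illustrative, not the ANALYZE code itself.)

    # Proposed floor: a value must have an estimated frequency of at
    # least 1/1000 to be treated as an MCV.
    default_statistics_target = 100                   # current default
    sample_size = 300 * default_statistics_target     # 30000 rows
    frequency_floor = 1.0 / 1000
    min_sample_occurrences = sample_size * frequency_floor
    print(min_sample_occurrences)    # 30.0 -> 30 appearances required

    # The historical default target of 10 gave a 3000-row sample, where
    # the existing two-appearance rule implies the minimum frequency
    # estimate mentioned above.
    old_sample_size = 300 * 10
    print(2 / old_sample_size)       # ~0.000667, i.e. 1/1500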

This kind of math isn't my strong point, but it seems to me that,
while sampling noise is a problem, it's not obvious how to tell
whether a given result is signal or noise.  I mean, if we sample
30,000 rows from a table with a billion rows and we see the same
value twice, then it could be the case that the row occurs just twice
in the whole table and we unluckily saw both occurrences.
Contrariwise, it could also be that the typical value in that table
occurs 1000 times and the value in question occurs 50,000 times, so
our sample was right on the money.  There's no way to know whether, if
we were to take a bunch more samples, we'd see that value coming up
twice in most of those samples or whether the fact that it showed up
twice this time is a statistical fluke.
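
(A rough way to put numbers on that ambiguity, assuming a simple
binomial model of the sampling and the hypothetical figures above;
this is just a sketch, not how ANALYZE itself reasons about it.)

    from math import comb

    def prob_seen_exactly(k, sample, table_rows, occurrences):
        # Probability of drawing the value exactly k times when it fills
        # `occurrences` of `table_rows` rows and we sample `sample` rows.
        p = occurrences / table_rows
        return comb(sample, k) * p**k * (1 - p)**(sample - k)

    sample, table_rows = 30_000, 1_000_000_000
    # Value that occurs just twice in the whole table:
    print(prob_seen_exactly(2, sample, table_rows, 2))       # ~1.8e-09
    # Value that occurs 50,000 times (expected sample count ~1.5):
    print(prob_seen_exactly(2, sample, table_rows, 50_000))  # ~0.25

The same observed count of two arises in both scenarios, just with very
different probabilities, and a single sample can't tell us which one we
are actually in.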

What makes me a bit cautious about this approach overall is that I've
seen cases where the length of the MCV list turns out to be key to
getting a good plan, and I needed to make it longer in order for
things to work.  Any non-MCV is, IIRC, assumed to be less frequent
than the least-common MCV.  If you raise the minimum frequency that is
required for something to be considered an MCV, then the cap on the
frequency of a non-MCV also goes up, and I think it's not impossible
to imagine that being a problem for somebody.  Now maybe you're going
to say those people can fix it by twiddling n_distinct using DDL.  I'm
not sure whether that works in all cases, but even if it does, some of
those people might not need to twiddle n_distinct today, so from their
point of view it would be a regression.
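
(Roughly, the estimate I have in mind works like this -- a simplified
paraphrase, from memory, of var_eq_const() in selfuncs.c, ignoring
nulls and other details, with made-up numbers.)

    def non_mcv_selectivity(mcv_freqs, n_distinct):
        # Spread the fraction of rows not covered by the MCV list evenly
        # over the remaining distinct values...
        selec = (1.0 - sum(mcv_freqs)) / (n_distinct - len(mcv_freqs))
        # ...but never estimate a non-MCV as more frequent than the
        # least common MCV.  That minimum is the cap in question.
        return min(selec, min(mcv_freqs))

    # A longer MCV list whose tail entry has frequency 0.001:
    print(non_mcv_selectivity([0.10, 0.02, 0.001], n_distinct=20))  # 0.001
    # Drop that tail entry (say it falls below a new frequency floor)
    # and the cap on every non-MCV's estimated frequency jumps to 0.02:
    print(non_mcv_selectivity([0.10, 0.02], n_distinct=20))         # 0.02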

Another way to state it is: is this problem one-sided?  If we *only*
have a problem with things that aren't really MCVs being considered as
MCVs, then tightening up the rules for including something in the MCV
list will fix it.  Also, the statistics will be more stable and
planning will be faster, which all sounds like a win.  However, I'm
guessing (on the basis of some personal experience) that we *also*
have a problem with things that really are MCVs *not* being considered
MCVs.  If it's true that both of those things are problems, then the
only real fix is to increase the sample size -- which I notice you
recommended on the original thread.

In general, I'm pretty skeptical of the idea that sampling 30,000
rows out of an arbitrarily large table will produce a
sufficiently-accurate MCV list.  There's a comment in std_typanalyze()
citing a 1998 paper by Chaudhuri, Motwani, and Narasayya for the
proposition that we'll get a reasonably accurate histogram with that
sample size, but histogram bins are not MCVs.  Moreover, in practice,
we need *increased* accuracy, not just *equal* accuracy, as the table
size increases.  On a billion row table, a minimum MCV frequency of
1/1000 means that we can't distinguish between a value that occurs 1
time and a value that occurs 999,999 times -- they are both non-MCVs.
As you shrink the table that sort of thing becomes a smaller and
smaller problem.
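
(Illustrative arithmetic for that last point, using the same round
numbers.)

    table_rows = 1_000_000_000
    frequency_floor = 1.0 / 1000
    # Smallest occurrence count an MCV could represent under the floor:
    print(int(table_rows * frequency_floor))   # 1000000
    # Anything rarer -- whether it occurs once or 999,999 times -- lands
    # in the same undifferentiated non-MCV bucket.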

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


