Re: More stable query plans via more predictable column statistics

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: More stable query plans via more predictable column statistics
Дата
Msg-id CA+Tgmobh+u_Km5O5P-_5o9fVWrTK-5Gb17iVvDVvcmkibmtGbg@mail.gmail.com
обсуждение исходный текст
Ответ на More stable query plans via more predictable column statistics  ("Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de>)
Ответы Re: More stable query plans via more predictable column statistics  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: More stable query plans via more predictable column statistics  ("Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de>)
Список pgsql-hackers
On Tue, Dec 1, 2015 at 10:21 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:
> Hi Hackers!
>
> This post summarizes a few weeks of research of ANALYZE statistics
> distribution on one of our bigger production databases with some real-world
> data and proposes a patch to rectify some of the oddities observed.
>
>
> Introduction
> ============
>
> We have observed that for certain data sets the distribution of samples
> between most_common_vals and histogram_bounds can be unstable: so that it
> may change dramatically with the next ANALYZE run, thus leading to radically
> different plans.
>
> I was revisiting the following performance thread and I've found some
> interesting details about statistics in our environment:
>
>
>
http://www.postgresql.org/message-id/flat/CAMkU=1zxyNMN11YL8G7AGF7k5u4ZHVJN0DqCc_ecO1qs49uJgA@mail.gmail.com#CAMkU=1zxyNMN11YL8G7AGF7k5u4ZHVJN0DqCc_ecO1qs49uJgA@mail.gmail.com
>
> My initial interest was in evaluation if distribution of samples could be
> made more predictable and less dependent on the factor of luck, thus leading
> to more stable execution plans.
>
>
> Unexpected findings
> ===================
>
> What I have found is that in a significant percentage of instances, when a
> duplicate sample value is *not* put into the MCV list, it does produce
> duplicates in the histogram_bounds, so it looks like the MCV cut-off happens
> too early, even though we have enough space for more values in the MCV list.
>
> In the extreme cases I've found completely empty MCV lists and histograms
> full of duplicates at the same time, with only about 20% of distinct values
> in the histogram (as it turns out, this happens due to high fraction of
> NULLs in the sample).

Wow, this is very interesting work.  Using values_cnt rather than
samplerows to compute avgcount seems like a clear improvement.  It
doesn't make any sense to raise the threshold for creating an MCV
based on the presence of additional nulls or too-wide values in the
table.  I bet compute_distinct_stats needs a similar fix.  But for
plan stability considerations, I'd say we should back-patch this all
the way, but those considerations might mitigate for a more restrained
approach.  Still, maybe we should try to sneak at least this much into
9.5 RSN, because I have to think this is going to help people with
mostly-NULL (or mostly-really-wide) columns.

As far as the rest of the fix, your code seems to remove the handling
for ndistinct < 0.  That seems unlikely to be correct, although it's
possible that I am missing something.  Aside from that, the rest of
this seems like a policy change, and I'm not totally sure off-hand
whether it's the right policy.  Having more MCVs can increase planning
time noticeably, and the point of the existing cutoff is to prevent us
from choosing MCVs that aren't actually "C".  I think this change
significantly undermines those protections.  It seems to me that it
might be useful to evaluate the effects of this part of the patch
separately from the samplerows -> values_cnt change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Masahiko Sawada
Дата:
Сообщение: Re: Freeze avoidance of very large table.
Следующее
От: Tom Lane
Дата:
Сообщение: Size of Path nodes