Re: More stable query plans via more predictable column statistics

From: Shulgin, Oleksandr
Subject: Re: More stable query plans via more predictable column statistics
Date:
Msg-id: CACACo5TvrNZcB_K2zc2CbcgL4xXGR55re1G4gp9TtqL++WBvCg@mail.gmail.com
In response to: Re: More stable query plans via more predictable column statistics  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Fri, Dec 4, 2015 at 6:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 1, 2015 at 10:21 AM, Shulgin, Oleksandr
<oleksandr.shulgin@zalando.de> wrote:
>
> What I have found is that in a significant percentage of instances, when a
> duplicate sample value is *not* put into the MCV list, it does produce
> duplicates in the histogram_bounds, so it looks like the MCV cut-off happens
> too early, even though we have enough space for more values in the MCV list.
>
> In the extreme cases I've found completely empty MCV lists and histograms
> full of duplicates at the same time, with only about 20% of distinct values
> in the histogram (as it turns out, this happens due to high fraction of
> NULLs in the sample).

> Wow, this is very interesting work.  Using values_cnt rather than
> samplerows to compute avgcount seems like a clear improvement.  It
> doesn't make any sense to raise the threshold for creating an MCV
> based on the presence of additional nulls or too-wide values in the
> table.  I bet compute_distinct_stats needs a similar fix.

Yes, and there's also the magic 1.25 multiplier in that code.  I think it would make sense to agree first on exactly what the patch for compute_scalar_stats() should look like, then port the relevant bits to compute_distinct_stats().
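To make the effect concrete, here is a small sketch (not the actual analyze.c code; the row counts are made up for illustration) of how the MCV inclusion threshold changes for a NULL-heavy column when avgcount is computed over samplerows versus values_cnt, with the 1.25 multiplier applied in both cases:

```python
# Sketch: MCV threshold for a column that is 80% NULL.
samplerows = 30000                    # total rows in the sample
null_cnt   = 24000                    # NULLs in the sample
values_cnt = samplerows - null_cnt    # non-NULL values actually analyzed
ndistinct  = 100                      # distinct non-NULL values observed

# avgcount: occurrences expected of an "average" value; the 1.25
# multiplier is the magic factor mentioned above.
mincount_old = samplerows / ndistinct * 1.25   # 375: almost nothing qualifies
mincount_new = values_cnt / ndistinct * 1.25   # 75: a realistic cut-off

print(mincount_old, mincount_new)
```

With the denominator inflated by NULLs, a value would need 375 appearances among only 6000 non-NULL samples to make the MCV list, which is why such columns end up with empty MCV lists and duplicate-ridden histograms.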
 
> Were it not for plan stability considerations, I'd say we should
> back-patch this all the way, but those considerations might militate
> for a more restrained approach.  Still, maybe we should try to sneak
> at least this much into 9.5 RSN, because I have to think this is
> going to help people with mostly-NULL (or mostly-really-wide)
> columns.

I'm not sure.  People would likely have complained, or found this out on their own, if they were seriously affected.

What I would be interested in is people running the queries I've shown against their own data, to see if there are any interesting or unexpected patterns.
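For reference, the symptom to look for is duplicates among a column's histogram_bounds in pg_stats.  A hypothetical helper (this is an illustration, not one of the queries from the earlier message) that quantifies it:

```python
# Given a column's histogram_bounds (as fetched from pg_stats),
# report the fraction of distinct entries.  Values well below 1.0
# mean duplicates are leaking into the histogram, as described above.
def distinct_fraction(histogram_bounds):
    if not histogram_bounds:
        return 1.0
    return len(set(histogram_bounds)) / len(histogram_bounds)

# A histogram with only ~30% distinct values, like the extreme
# NULL-heavy cases mentioned earlier:
bounds = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
print(distinct_fraction(bounds))  # 0.3
```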

> As far as the rest of the fix, your code seems to remove the handling
> for ndistinct < 0.  That seems unlikely to be correct, although it's
> possible that I am missing something.

The difference here is that ndistinct at this scope in the original code hid a variable from an outer scope.  That inner one could be < 0, but in my code there is no inner-scope ndistinct; we refer to the outer-scope variable, which cannot be < 0.
 
> Aside from that, the rest of this seems like a policy change, and I'm
> not totally sure off-hand whether it's the right policy.  Having more
> MCVs can increase planning time noticeably, and the point of the
> existing cutoff is to prevent us from choosing MCVs that aren't
> actually "C".  I think this change significantly undermines those
> protections.  It seems to me that it might be useful to evaluate the
> effects of this part of the patch separately from the
> samplerows -> values_cnt change.

Yes, that's why I was wondering whether a frequency cut-off approach might be helpful here.  I'm going to take a deeper look at the array typanalyze implementation, at the least.
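The idea of a frequency cut-off, loosely in the spirit of the Lossy-Counting-based approach used by the array typanalyze code (this is a simplified sketch, not its actual algorithm), would be to admit a value into the MCV list only if its sample frequency clears a fixed threshold, rather than comparing its count against the average:

```python
from collections import Counter

def mcv_by_frequency(sample, cutoff_freq):
    """Keep values whose frequency in the sample is at least cutoff_freq."""
    counts = Counter(sample)
    n = len(sample)
    return {v: c / n for v, c in counts.items() if c / n >= cutoff_freq}

# With a 10% cut-off, only genuinely common values survive:
sample = ['a'] * 50 + ['b'] * 30 + ['c'] * 15 + ['d'] * 5
print(mcv_by_frequency(sample, 0.10))  # keeps 'a', 'b', 'c'
```

A fixed frequency threshold would make MCV membership independent of the NULL fraction and of how many other values happen to be in the sample, which is the kind of stability being discussed.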

--
Alex
