Re: More stable query plans via more predictable column statistics

From: Shulgin, Oleksandr
Subject: Re: More stable query plans via more predictable column statistics
Date:
Msg-id: CACACo5R_+UZcDPbF38_yG7E7W1zA=Bm9U8BR_wvrLchDELkR7g@mail.gmail.com
In reply to: Re: More stable query plans via more predictable column statistics  ("Shulgin, Oleksandr" <oleksandr.shulgin@zalando.de>)
Replies: Re: More stable query plans via more predictable column statistics  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On Mon, Jan 25, 2016 at 5:11 PM, Shulgin, Oleksandr <oleksandr.shulgin@zalando.de> wrote:
>
> On Sat, Jan 23, 2016 at 11:22 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> Overall, I think this is really about deciding when to cut off the MCV list, so that it does not grow needlessly large - as Robert pointed out, the larger the list, the more expensive the estimation (and thus planning).
>>
>> So even if we could fit the whole sample into the MCV list (i.e. we believe we've seen all the values and we can fit them into the MCV list), it may not make sense to do so. The ultimate goal is to estimate conditions, and if we can do that reasonably even after cutting off the least frequent values from the MCV list, then why not?
>>
>> From this point of view, the analysis deals just with the ANALYZE part and does not discuss the estimation counter-part at all.
>
>
> True, this aspect still needs verification.  As stated, my primary motivation was to improve the plan stability for relatively short MCV lists.
>
> Longer MCV lists might be a different story, but see the "Increasing stats target" section of the original mail: increasing the target doesn't give quite the expected results with unpatched code either.

To address this concern I've run my queries again on the same dataset, now focusing on how the number of MCV items changes with the patched code (using the CTEs from my original mail):

WITH ...

SELECT count(1),
       min(num_mcv)::real,
       avg(num_mcv)::real,
       max(num_mcv)::real,
       stddev(num_mcv)::real
  FROM stats2
 WHERE num_mcv IS NOT NULL;
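
The stats2 CTE itself is spelled out only in the original mail; roughly, num_mcv can be derived from pg_stats along these lines (a hypothetical reconstruction, not necessarily the exact CTE used above):

-- num_mcv is the length of each column's MCV list as exposed by the
-- standard pg_stats view (NULL when no MCV list was built).
SELECT schemaname,
       tablename,
       attname,
       array_length(most_common_vals, 1) AS num_mcv
  FROM pg_stats;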

(ORIGINAL)
count  | 27452
min    | 1
avg    | 32.7115
max    | 100
stddev | 40.6927

(PATCHED)
count  | 27527
min    | 1
avg    | 38.4341
max    | 100
stddev | 43.3596

A significant portion of the MCV lists occupies all 100 slots available with the default statistics target, so it is also interesting to look at the stats that have "underfilled" MCV lists, by changing the condition of the WHERE clause to read "num_mcv < 100":
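
That is, the same aggregate query (with the same CTEs as above) and only the filter changed:

WITH ...

SELECT count(1),
       min(num_mcv)::real,
       avg(num_mcv)::real,
       max(num_mcv)::real,
       stddev(num_mcv)::real
  FROM stats2
 WHERE num_mcv < 100;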

(<100 ORIGINAL)
count  | 20980
min    | 1
avg    | 11.9541
max    | 99
stddev | 18.4132

(<100 PATCHED)
count  | 19329
min    | 1
avg    | 12.3222
max    | 99
stddev | 19.6959

As one can see, with the patched code the average length of the MCV lists doesn't change all that dramatically, while the patched code still delivers all the improvements described in the original mail.

>> After fixing the estimator to consider fraction of NULLs, the estimates look like this:
>>
>>     statistics target |   master  |  patched
>>    ------------------------------------------
>>                   100 |     1302  |     5356
>>                  1000 |     6022  |     6791
>>
>> So this seems to significantly improve the ndistinct estimate (patch attached).
>
>
> Hm... this looks correct.  And compute_distinct_stats() needs the same treatment, obviously.

I've incorporated this fix into v2 of my patch, since I think it is closely enough related.  I've also added the corresponding changes to compute_distinct_stats(), which doesn't produce a histogram.
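
As a quick sanity check after running ANALYZE with the patched code, the affected numbers can be read back from pg_stats (table and column names below are placeholders):

ANALYZE some_table;

-- null fraction, ndistinct estimate, and MCV list length for one column
SELECT null_frac,
       n_distinct,
       array_length(most_common_vals, 1) AS num_mcv
  FROM pg_stats
 WHERE tablename = 'some_table'
   AND attname = 'some_column';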

I'm adding this to the next CommitFest.  Further reviews are very much appreciated!

--
Alex