Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Поиск
Список
Период
Сортировка
От Dean Rasheed
Тема Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Дата
Msg-id CAEZATCVqG5gBkn0RM8rSU1ZKz1boVYshz_XM-=s+TQovqX4kdw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Список pgsql-hackers
On 17 July 2018 at 14:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> For equalities it's going to be hard. The only thing I can think of at the
> moment is checking if there are any matching buckets at all, and using that
> to decide whether to extrapolate the MCV selectivity to the non-MCV part or
> not (or perhaps to what part of the non-MCV part).
>

So I decided to play a little more with this, experimenting with a
much simpler approach -- this is for MCV's only at the moment, see the
attached (very much WIP) patch (no doc or test updates, and lots of
areas for improvement).

The basic idea when building the MCV stats is to not just record the
frequency of each combination of values, but also what I'm calling the
"base frequency" -- that is the frequency that that combination of
values would have if the columns were independent (i.e., the product
of each value's individual frequency).

The reasoning then, is that if we find an MCV entry matching the query
clauses, the difference (frequency - base_frequency) can be viewed as
a correction to be applied to the selectivity returned by
clauselist_selectivity_simple(). If all possible values were covered
by matching MCV entries, the sum of the base frequencies of the
matching MCV entries would approximately cancel out with the simple
selectivity, and only the MCV frequencies would be left (ignoring
second order effects arising from the fact that
clauselist_selectivity_simple() doesn't just sum up disjoint
possibilities). For partial matches, it will use what multivariate
stats are available to improve upon the simple selectivity.

I wondered about just storing the difference (frequency -
base_frequency) in the stats, but it's actually useful to have both
values, because then the total of all the MCV frequencies can be used
to set an upper bound on the non-MCV part.

The advantage of this approach is that it is very simple, and in
theory ought to be reasonably applicable to arbitrary combinations of
clauses. Also, it naturally falls back to the univariate-based
estimate when there are no matching MCV entries. In fact, even when
there are no matching MCV entries, it can still improve upon the
univariate estimate by capping it to 1-total_mcv_sel.

I tested it with the same data posted previously and a few simple
queries, and the initial results are quite encouraging. Where the
previous patch sometimes gave noticeable over- or under-estimates,
this patch generally did better:


Query  Actual rows  Est (HEAD)  Est (24 Jun patch)  Est (new patch)
  Q1       50000      12625           48631              49308
  Q2       40000       9375           40739              38710
  Q3       90000      21644          172688              88018
  Q4      140000      52048          267528             138228
  Q5      140000      52978          267528             138228
  Q6      140000      52050          267528             138228
  Q7      829942     777806          149886             822788
  Q8      749942     748302          692686             747922
  Q9       15000      40989           27595              14131
 Q10       15997      49853           27595              23121

Q1: a=1 and b=1
Q2: a=1 and b=2
Q3: a=1 and (b=1 or b=2)
Q4: (a=1 or a=2) and (b=1 or b=2)
Q5: (a=1 or a=2) and (b<=2)
Q6: (a=1 or a=2 or a=4) and (b=1 or b=2)
Q7: (a=1 or a=2) and not (b=2)
Q8: (a=1 or a=2) and not (b=1 or b=2)
Q9: a=3 and b>0 and b<3
Q10: a=3 and b>0 and b<1000


I've not tried anything with histograms. Possibly the histograms could
be used as-is, to replace the non-MCV part (other_sel). Or, a similar
approach could be used, recording the base frequency of each histogram
bucket, and then using that to refine the other_sel estimate. Either
way, I think it would be necessary to exclude equality clauses from
the histograms, otherwise MCVs might end up being double-counted.

Regards,
Dean

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: David Steele
Дата:
Сообщение: Re: Standby trying "restore_command" before local WAL
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.