Re: [HACKERS] PATCH: multivariate histograms and MCV lists

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Дата
Msg-id 341ca23f-144d-130d-1b45-ddddf581797e@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Ответы Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Список pgsql-hackers

On 08/03/2018 04:24 PM, Dean Rasheed wrote:
> On 17 July 2018 at 14:03, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> For equalities it's going to be hard. The only thing I can think of at the
>> moment is checking if there are any matching buckets at all, and using that
>> to decide whether to extrapolate the MCV selectivity to the non-MCV part or
>> not (or perhaps to what part of the non-MCV part).
>>
> 
> So I decided to play a little more with this, experimenting with a
> much simpler approach -- this is for MCV's only at the moment, see the
> attached (very much WIP) patch (no doc or test updates, and lots of
> areas for improvement).
> 
> The basic idea when building the MCV stats is to not just record the
> frequency of each combination of values, but also what I'm calling the
> "base frequency" -- that is the frequency that that combination of
> values would have if the columns were independent (i.e., the product
> of each value's individual frequency).
> 
> The reasoning then, is that if we find an MCV entry matching the query
> clauses, the difference (frequency - base_frequency) can be viewed as
> a correction to be applied to the selectivity returned by
> clauselist_selectivity_simple(). If all possible values were covered
> by matching MCV entries, the sum of the base frequencies of the
> matching MCV entries would approximately cancel out with the simple
> selectivity, and only the MCV frequencies would be left (ignoring
> second order effects arising from the fact that
> clauselist_selectivity_simple() doesn't just sum up disjoint
> possibilities). For partial matches, it will use what multivariate
> stats are available to improve upon the simple selectivity.
> 
> I wondered about just storing the difference (frequency -
> base_frequency) in the stats, but it's actually useful to have both
> values, because then the total of all the MCV frequencies can be used
> to set an upper bound on the non-MCV part.
> 
> The advantage of this approach is that it is very simple, and in
> theory ought to be reasonably applicable to arbitrary combinations of
> clauses. Also, it naturally falls back to the univariate-based
> estimate when there are no matching MCV entries. In fact, even when
> there are no matching MCV entries, it can still improve upon the
> univariate estimate by capping it to 1-total_mcv_sel.
> 
> I tested it with the same data posted previously and a few simple
> queries, and the initial results are quite encouraging. Where the
> previous patch sometimes gave noticeable over- or under-estimates,
> this patch generally did better:
> 
> 
> Query  Actual rows  Est (HEAD)  Est (24 Jun patch)  Est (new patch)
>    Q1       50000      12625           48631              49308
>    Q2       40000       9375           40739              38710
>    Q3       90000      21644          172688              88018
>    Q4      140000      52048          267528             138228
>    Q5      140000      52978          267528             138228
>    Q6      140000      52050          267528             138228
>    Q7      829942     777806          149886             822788
>    Q8      749942     748302          692686             747922
>    Q9       15000      40989           27595              14131
>   Q10       15997      49853           27595              23121
> 
> Q1: a=1 and b=1
> Q2: a=1 and b=2
> Q3: a=1 and (b=1 or b=2)
> Q4: (a=1 or a=2) and (b=1 or b=2)
> Q5: (a=1 or a=2) and (b<=2)
> Q6: (a=1 or a=2 or a=4) and (b=1 or b=2)
> Q7: (a=1 or a=2) and not (b=2)
> Q8: (a=1 or a=2) and not (b=1 or b=2)
> Q9: a=3 and b>0 and b<3
> Q10: a=3 and b>0 and b<1000
> 

Interesting idea, and the improvements certainly seem encouraging.

I wonder what a counter-example would look like - I think the MCV and 
non-MCV parts would have to behave very differently (one perfectly 
dependent, the other perfectly independent). But that does seem very 
likely, and even if it was there's not much we can do about such cases.

> 
> I've not tried anything with histograms. Possibly the histograms could
> be used as-is, to replace the non-MCV part (other_sel). Or, a similar
> approach could be used, recording the base frequency of each histogram
> bucket, and then using that to refine the other_sel estimate. Either
> way, I think it would be necessary to exclude equality clauses from
> the histograms, otherwise MCVs might end up being double-counted.
> 

I do have an idea about histograms. I didn't have time to hack on it 
yet, but I think it could work in combination with your MCV algorithm.

Essentially there are two related issues with histograms:


1) equality conditions

Histograms work nicely with inequalities, not that well for equalities. 
For equality clauses, we can estimate the selectivity as 1/ndistinct, 
similarly to what we do in 1-D cases (we can use ndistinct coefficients 
if we have them, and MCV tracking the common combinations).

If there are both equalities and inequalities, we can then use the 
equality clauses merely as condition (to limit the set of buckets), and 
evaluate the inequalities for those buckets only. Essentially compute

     P(equals + inequals) = P(equals) * P(inequals | equals)

IMHO that should help with estimating selectivity of equality clauses.


2) estimating bucket selectivity

The other question is how to combine selectivities of multiple clauses 
for a single bucket. I think the linear approximation (convert_scalar or 
something like that) and computing geometric mean (as you proposed) is a 
solid plan.

I do have this on my TODO list for this week, unless something urgent 
comes up.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Aleksander Alekseev
Дата:
Сообщение: Re: [GSoC]The project summary
Следующее
От: Stephen Frost
Дата:
Сообщение: Re: Allow COPY's 'text' format to output a header