Re: [HACKERS] PATCH: multivariate histograms and MCV lists

From: Tomas Vondra
Subject: Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Date:
Msg-id: 80bc98b2-1c43-1fe8-b316-d447bb68cff2@2ndquadrant.com
In reply to: Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Responses: Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Re: [HACKERS] PATCH: multivariate histograms and MCV lists  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List: pgsql-hackers

On 07/16/2018 02:54 PM, Dean Rasheed wrote:
> On 16 July 2018 at 13:23, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>> The top-level clauses allow us to make such deductions, with deeper
>>>> clauses it's much more difficult (perhaps impossible). Because for
>>>> example with (a=1 AND b=1) there can be just a single match, so if we
>>>> find it in MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
>>>> b=2)) it's not that simple, because there may be multiple combinations
>>>> and so a match in MCV does not guarantee anything.
>>>
>>> Actually, it guarantees a lower bound on the overall selectivity, and
>>> maybe that's the best that we can do in the absence of any other
>>> stats.
>>>
>> Hmmm, is that actually true? Let's consider a simple example, with two
>> columns, each with just 2 values, and a "perfect" MCV list:
>>
>>     a | b | frequency
>>    -------------------
>>     1 | 1 | 0.5
>>     2 | 2 | 0.5
>>
>> And let's estimate sel(a=1 & b=2).
> 
> OK. In this case, there are no MCV matches, so there is no lower bound (it's 0).
> 
> What we could do though is also impose an upper bound, based on the
> sum of non-matching MCV frequencies. In this case, the upper bound is
> also 0, so we could actually say the resulting selectivity is 0.
> 

Hmmm, it's not very clear to me how we would decide which of these cases
applies, because in most cases we don't have an MCV list covering 100% of
the rows.
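For reference, here is how I read your bound idea on the toy example above, as a throwaway Python sketch (the helper and its name are mine, not from the patch) -- with a "perfect" MCV list covering 100% of the rows, the lower and upper bounds coincide:

```python
# "Perfect" MCV list from the example above: (a, b) -> frequency,
# covering 100% of the table.
mcv = {(1, 1): 0.5, (2, 2): 0.5}

def bounds(mcv, pred):
    """Selectivity bounds for a predicate on (a, b).

    Lower bound: sum of frequencies of MCV items matching the predicate.
    Upper bound: 1 minus the frequencies of items known NOT to match.
    """
    match = sum(f for item, f in mcv.items() if pred(*item))
    nonmatch = sum(f for item, f in mcv.items() if not pred(*item))
    return match, 1.0 - nonmatch

# sel(a=1 AND b=2): no MCV item matches, and the list is complete,
# so both bounds collapse to 0.
lo, hi = bounds(mcv, lambda a, b: a == 1 and b == 2)
```

With an incomplete MCV list the non-matching frequencies no longer sum to 1, so the upper bound stays above the lower bound -- which is exactly the situation I'm unsure how to detect.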

Anyway, I've been thinking about how to modify the code to work the way
you proposed (in a way sufficient for a PoC). But after struggling with
it for a while, it occurred to me it might be useful to do it on paper
first, to verify how it would work on your examples.

So let's use this data

create table foo(a int, b int);
insert into foo select 1,1 from generate_series(1,50000);
insert into foo select 1,2 from generate_series(1,40000);
insert into foo select 1,x/10 from generate_series(30,250000) g(x);
insert into foo select 2,1 from generate_series(1,30000);
insert into foo select 2,2 from generate_series(1,20000);
insert into foo select 2,x/10 from generate_series(30,500000) g(x);
insert into foo select 3,1 from generate_series(1,10000);
insert into foo select 3,2 from generate_series(1,5000);
insert into foo select 3,x from generate_series(3,600000) g(x);
insert into foo select x,x/10 from generate_series(4,750000) g(x);
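As a sanity check on the numbers below (this is just my scratch script, not part of the patch), the group counts can be reproduced by mirroring each INSERT in Python:

```python
from collections import Counter

# Mirror each INSERT above; generate_series(lo, hi) is inclusive on
# both ends, hence range(lo, hi + 1). Integer division matches x/10.
rows = Counter()
rows[(1, 1)] += 50000
rows[(1, 2)] += 40000
for x in range(30, 250001):
    rows[(1, x // 10)] += 1
rows[(2, 1)] += 30000
rows[(2, 2)] += 20000
for x in range(30, 500001):
    rows[(2, x // 10)] += 1
rows[(3, 1)] += 10000
rows[(3, 2)] += 5000
for x in range(3, 600001):
    rows[(3, x)] += 1
for x in range(4, 750001):
    rows[(x, x // 10)] += 1

total = sum(rows.values())                             # 2254937 rows
a1 = sum(c for (a, b), c in rows.items() if a == 1)    # 339971
b1 = sum(c for (a, b), c in rows.items() if b == 1)    # 90010
```

This reproduces the 2254937 total used as the denominator in the frequency queries below.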

Assuming we have perfectly exact statistics, we have these MCV lists
(both univariate and multivariate):

  select a, count(*), round(count(*) /2254937.0, 4) AS frequency
    from foo group by a order by 2 desc;

     a    | count  | frequency
  --------+--------+-----------
        3 | 614998 |    0.2727
        2 | 549971 |    0.2439
        1 | 339971 |    0.1508
     1014 |      1 |    0.0000
    57220 |      1 |    0.0000
    ...

  select b, count(*), round(count(*) /2254937.0, 4) AS frequency
    from foo group by b order by 2 desc;

     b    | count | frequency
  --------+-------+-----------
        1 | 90010 |    0.0399
        2 | 65010 |    0.0288
        3 |    31 |    0.0000
        7 |    31 |    0.0000
       ...

  select a, b, count(*), round(count(*) /2254937.0, 4) AS frequency
    from foo group by a, b order by 3 desc;

     a    |   b    | count | frequency
  --------+--------+-------+-----------
        1 |      1 | 50000 |    0.0222
        1 |      2 | 40000 |    0.0177
        2 |      1 | 30000 |    0.0133
        2 |      2 | 20000 |    0.0089
        3 |      1 | 10000 |    0.0044
        3 |      2 |  5000 |    0.0022
        2 |  12445 |    10 |    0.0000
        ...

And let's estimate the two queries with complex clauses, where the
multivariate stats gave 2x overestimates.

SELECT * FROM foo WHERE a=1 and (b=1 or b=2);
-- actual 90000, univariate: 24253, multivariate: 181091

   univariate:

     sel(a=1) = 0.1508
     sel(b=1) = 0.0399
     sel(b=2) = 0.0288
     sel(b=1 or b=2) = 0.0673

   multivariate:
     sel(a=1 and (b=1 or b=2)) = 0.0399 (0.0770)

The second multivariate estimate comes from assuming only the first 5
items make it to the multivariate MCV list (covering 6.87% of the data)
and extrapolating the selectivity to the non-MCV data too.

(Notice it's about 2x the actual selectivity, so this extrapolation, due
to not realizing the MCV list already contains all the matches, is pretty
much responsible for the whole over-estimate.)
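To make the arithmetic explicit (this is my reading of where the numbers come from; in particular the extrapolation formula at the end is an assumption on my part, chosen because it reproduces the 0.0770 figure):

```python
# Univariate estimate, combining the clauses under independence.
sel_a = 0.1508                       # sel(a=1)
sel_b1, sel_b2 = 0.0399, 0.0288      # sel(b=1), sel(b=2)
sel_b = sel_b1 + sel_b2 - sel_b1 * sel_b2   # sel(b=1 OR b=2), ~0.067
sel_univariate = sel_a * sel_b              # ~0.0101

# Multivariate MCV: matching items are (1,1) and (1,2).
mcv_sel = 0.0222 + 0.0177            # = 0.0399
mcv_total = 0.0687                   # fraction of rows covered by the MCV

# Assumed extrapolation: apply the MCV selectivity to the non-MCV
# fraction as well, even though all matches are already in the MCV.
sel_multivariate = mcv_sel + (1 - mcv_total) * mcv_sel   # ~0.0770
```

The extrapolated estimate is roughly 2x mcv_sel, which matches the 2x over-estimate discussed above.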

So, how would the proposed algorithm work? Let's start with "a=1":

   sel(a=1) = 0.1508

I don't see much point in applying the two "b" clauses independently (nor
how it would be done, as it's effectively a single clause):

   sel(b=1 or b=2) = 0.0673

And we get

   total_sel = sel(a=1) * sel(b=1 or b=2) = 0.0101

From the multivariate MCV we get

   mcv_sel = 0.0399

And finally

   total_sel = Max(total_sel, mcv_sel) = 0.0399

Which is great, but I don't see how that actually helped anything. We
still only have the estimate for the ~7% covered by the MCV list, and
not for the remaining non-MCV part.
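In code form, this is my understanding of the proposed combination step (a sketch of the idea as I've applied it above, not the patch itself):

```python
def combine(clause_sels, mcv_sel):
    """Multiply the per-clause selectivities (i.e. assume independence),
    then take the multivariate MCV matches as a lower bound."""
    total = 1.0
    for s in clause_sels:
        total *= s
    return max(total, mcv_sel)

# The walkthrough above: sel(a=1) and sel(b=1 OR b=2) vs. the MCV matches.
# The MCV lower bound wins, giving 0.0399.
est = combine([0.1508, 0.0673], 0.0399)
```

Note that the result is just the MCV selectivity again, which is the point: the max() never adds anything for the non-MCV part.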

I could do the same thing for the second query, but the problem there is
actually exactly the same - extrapolation from MCV to non-MCV part
roughly doubles the estimate.

So unless I'm applying your algorithm incorrectly, this does not seem
like a very promising direction :-(

There may be valuable information we could learn from the univariate
estimates (using a Max() of them as an upper boundary seems reasonable),
but that's still quite crude. And it will only ever work with simple
top-level clauses. Once the clauses get more complicated, it seems
rather tricky - presumably multivariate stats would be only used for
correlated columns, so trying to deduce something from univariate
estimates on complex clauses on such columns seems somewhat suspicious.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

