Re: multivariate statistics / patch v6

From: Tomas Vondra
Subject: Re: multivariate statistics / patch v6
Date:
Msg-id: 55547A86.8020400@2ndquadrant.com
In reply to: Re: multivariate statistics / patch v6  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses: Re: multivariate statistics / patch v6  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List: pgsql-hackers

On 05/13/15 10:31, Kyotaro HORIGUCHI wrote:
> Hello, this might be somewhat out of place but it is strongly
> related to this patch, so I'll propose it here.
>
> This is a proposal of a new feature for this patch, or a request
> for your approval to move on with this as a different (but very
> close) project.
>
> ===
>
>> Attached is v6 of the multivariate stats, with a number of
>> improvements:
> ...
>> 2) fix of pg_proc issues (reported by Jeff)
>>
>> 3) rebase to current master
>
> Unfortunately, the v6 patch suffers from some system OID conflicts
> with recently added ones. And what is more unfortunate for me is
> that the code for functional dependencies looks unfinished :)

I'll fix the OID conflicts once the CF completes, which should be in a
few days, I guess. Until then you can apply it on top of master from
about May 6 (that's when v6 was created, and there should be no
conflicts).

Regarding the functional dependencies - you're right, there's room for
improvement. For example, it only works with dependencies between pairs
of columns, not multi-column dependencies. Is this what you mean by
incomplete?
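
To illustrate the pairwise limitation, here is a constructed example of
my own (not something from the patch) where (a,b) determines c but
neither column alone does, so pairwise detection finds nothing:

    create table dep (a int, b int, c int);
    insert into dep
      select i % 10, (i / 10) % 10, (i % 10) + ((i / 10) % 10)
        from generate_series(0, 999) i;
    -- c = a + b, so (a,b) -> c holds, but for a fixed a (or b) alone
    -- c still takes 10 different values, hence no pairwise dependency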

> I mention this because I recently had an issue with strong
> correlation between two columns in the dbt3 benchmark. Two columns
> in some table are strongly correlated but not functionally
> dependent; there are too many values and their distribution is very
> uniform, so MCV is of no use for the table (and a histogram has
> nothing to do with equality conditions). As a result, the planner
> estimates the number of rows largely wrong, as expected, especially
> for joins.

I think the other statistics types (esp. histograms) might be more 
useful here, but I assume you haven't tried that because of the conflicts.

The current patch does not handle joins at all, though.


> I then tried calculating the ratio between the product of the
> distinctness of every column and the distinctness of the set of the
> columns - call it the multivariate coefficient here - and found that
> it looks greatly useful for its small storage space, little
> calculation, and simple code.

So when you have two columns A and B, you compute this:

   ndistinct(A) * ndistinct(B)
   ---------------------------
         ndistinct(A,B)

where ndistinct(...) means the number of distinct values in the column(s)?
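
(Just to check I understand it correctly - this is how I'd compute it
by hand in plain SQL; my own sketch, not code from your patch:)

    select count(distinct a)::numeric * count(distinct b)
         / count(distinct (a, b)) as mv_coef
      from t;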


> The first attachment is a script to generate problematic tables,
> and the second is a patch to make use of the mv coefficient on
> current master. The patch is a very primitive POC, so no syntactic
> interfaces are involved.
>
> For the case of your first example,
>
>> =# create table t (a int, b int, c int);
>> =# insert into t (select a/10000, a/10000, a/10000
>>                    from generate_series(0, 999999) a);
>> =# analyze t;
>> =# explain analyze select * from t where a = 1 and b = 1 and c = 1;
>>   Seq Scan on t  (cost=0.00..22906.00 rows=1 width=12)
>>                  (actual time=3.878..250.628 rows=10000 loops=1)
>
> Make use of mv coefficient.
>
>> =# insert into pg_mvcoefficient values ('t'::regclass, 1, 2, 3, 0);
>> =# analyze t;
>> =# explain analyze select * from t where a = 1 and b = 1 and c = 1;
>>   Seq Scan on t  (cost=0.00..22906.00 rows=9221 width=12)
>>                  (actual time=3.740..242.330 rows=10000 loops=1)
>
> Row count estimation was greatly improved.

With my patch:

alter table t add statistics (mcv) on (a,b,c);
analyze t;
select * from pg_mv_stats;
 tablename | attnums | mcvbytes |  mcvinfo
-----------+---------+----------+------------
 t         | 1 2 3   |     2964 | nitems=100

explain (analyze,timing off)  select * from t where a = 1 and b = 1 and c = 1;
                           QUERY PLAN
------------------------------------------------------------
 Seq Scan on t  (cost=0.00..22906.00 rows=9533 width=12)
                (actual rows=10000 loops=1)
   Filter: ((a = 1) AND (b = 1) AND (c = 1))
   Rows Removed by Filter: 990000
 Planning time: 0.233 ms
 Execution time: 93.212 ms
(5 rows)

alter table t drop statistics all;
alter table t add statistics (histogram) on (a,b,c);
analyze t;

explain (analyze,timing off) select * from t where a = 1 and b = 1 and c = 1;
                           QUERY PLAN
--------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..22906.00 rows=9667 width=12)
                (actual rows=10000 loops=1)
   Filter: ((a = 1) AND (b = 1) AND (c = 1))
   Rows Removed by Filter: 990000
 Planning time: 0.594 ms
 Execution time: 109.917 ms
(5 rows)

So both the MCV list and the histogram do quite a good job here, but
there are certainly cases where they do not work and the mv coefficient
works better.
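
For example, here is a sketch (my own construction) of the uniform case
you described: 100,000 distinct combinations appearing 10 times each,
so even a large MCV list covers only a tiny fraction of the values,
while the coefficient captures the correlation exactly:

    create table u (a int, b int);
    insert into u
      select i % 100000, i % 100000 from generate_series(0, 999999) i;
    -- ndistinct(a) = ndistinct(b) = ndistinct(a,b) = 100,000, so the
    -- coefficient is 100,000 * 100,000 / 100,000 = 100,000, and the
    -- estimate for (a = x and b = x) becomes ~10 rows, which is exact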

> Well, my example,
>
>> $ perl gentbl.pl 10000 | psql postgres
>> $ psql postgres
>> =# explain analyze select * from t1 where a = 1 and b = 2501;
>>   Seq Scan on t1  (cost=0.00..6216.00 rows=1 width=8)
>>                   (actual time=0.030..66.005 rows=8 loops=1)
>>
>> =# explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
>>   Hash Join  (cost=1177.00..11393.76 rows=76 width=16)
>>              (actual time=29.811..322.271 rows=320000 loops=1)
>
> A very bad estimate for the join.
>
>> =# insert into pg_mvcoefficient values ('t1'::regclass, 1, 2, 0, 0);
>> =# analyze t1;
>> =# explain analyze select * from t1 where a = 1 and b = 2501;
>>   Seq Scan on t1  (cost=0.00..6216.00         rows=8 width=8)
>>                   (actual time=0.032..104.144 rows=8 loops=1)
>>
>> =# explain analyze select * from t1 join t2 on (t1.a = t2.a and t1.b = t2.b);
>>   Hash Join  (cost=1177.00..11393.76      rows=305652 width=16)
>>              (actual time=40.642..325.679 rows=320000 loops=1)
>
> It gives almost correct estimates.

The current patch does not handle joins, but it's one of the TODO items.

>
> I think the results above show that the multivariate coefficient
> significantly improves the estimates when correlated columns are
> involved.

Yes, it looks interesting. I'm wondering what the "failure cases" are,
where the coefficient approach does not work. It seems to me it relies
on an assumption of consistency for all the ndistinct values. For
example, let's assume you have two columns - A and B, each with 1,000
distinct values, and that each value in A has 100 matching values in B,
so the coefficient is ~10:
   1,000 * 1,000 / 100,000 = 10

Now, let's assume the distribution looks different - with the first 100
values in A matching all 1,000 values of B, and the remaining 900
values matching just a single B value. Then

  1,000 * 1,000 / (100,000 + 900) = ~9.9

So a very different distribution, but almost the same coefficient.
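
(For the record, here is how one might construct the two distributions
to check those numbers - again just a sketch of mine:)

    -- distribution 1: each of the 1,000 A values pairs with 100 B values
    create table d1 as
      select a, (a + b) % 1000 as b
        from generate_series(0, 999) a, generate_series(0, 99) b;

    -- distribution 2: the first 100 A values pair with all 1,000 B
    -- values, the remaining 900 A values pair with a single B value
    create table d2 as
      select a, b
        from generate_series(0, 99) a, generate_series(0, 999) b
      union all
      select a, 0 from generate_series(100, 999) a;

    -- coefficient: exactly 10.0 for d1, ~9.91 for d2
    select count(distinct a)::numeric * count(distinct b)
         / count(distinct (a, b)) from d1;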

Are there any other assumptions like this?

Also, does the coefficient work for equality conditions only?

>
> Would you consider this for your patch? Otherwise I'll move on with
> this as a project separate from yours, if you don't mind. Except for
> the user interface, it won't conflict with yours, I suppose, but
> eventually they would need some consolidation work.

I think it's a neat idea, and I think it might be added to the patch.
It would fit in quite nicely, actually - I already have other kinds of
stats in mind for addition, but I'm not going to work on that in the
near future. It will require changes in some parts of the patch
(selecting the stats for a list of clauses), and I'd like to complete
the current patch first, and then add features in follow-up patches.

>
> regards,

regards
Tomas


