Re: multivariate statistics / patch v7

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: multivariate statistics / patch v7
Дата
Msg-id 55982ED6.7000807@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: multivariate statistics / patch v7  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Ответы Re: multivariate statistics / patch v7  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Список pgsql-hackers
Hello Horiguchi-san!

On 07/03/2015 07:30 AM, Kyotaro HORIGUCHI wrote:
> Hello, I started to work on this patch.
>
>> attached is v7 of the multivariate stats patch. The main improvement
>> is major refactoring of the clausesel.c portion - splitting the
>> awfully long spaghetti-style functions into smaller pieces, making it
>> much more understandable etc.
>
> Thank you, it looks clearer. I have some comment for the brief look
> at this. This patchset is relatively large so I will comment on
> "per-notice" basis.. which means I'll send comment before examining
> the entire of this patchset. Sorry in advance for the desultory
> comments.

Sure. If you run into something that's not clear enough, I'm happy to 
explain that (I tried to cover all the important details in the 
comments, but it's a large patch, indeed.)

> =======
> General comments:
>
> - You included unnecessary stuffs such like regression.diffs in
>    these patches.

Ahhhh :-/ Will fix.

>
> - Now OID 3307 is used by pg_stat_file. I moved
>    pg_mv_stats_dependencies_info/show to 3311/3312.

Will fix while rebasing to current master.

>
> - Single-variate stats have a mechanism to inject arbitrary
>    values as statistics, that is, get_relation_stats_hook and the
>    similar stuffs. I want the similar mechanism for multivariate
>    statistics, too.

Fair point, although I'm not sure where should we place the hook, how 
exactly should it be defined and how useful that would be in the end. 
Can you give an example of how you'd use such hook?

I've never used get_relation_stats_hook, but if I get it right, the 
plugins can use the hook to create the stats (for each column), either 
from scratch or tweaking the existing stats.

I'm not sure how this should work with multivariate stats, though, 
because there can be arbitrary number of stats for a column, and it 
really depends on all the clauses (so examine_variable() seems a bit 
inappropriate, as it only sees a single variable at a time).

Moreover, with multivariate stats
   (a) there may be arbitrary number of stats for a column
   (b) only some of the stats end up being used for the estimation

I see two or three possible places for calling such hook:
   (a) at the very beginning, after fetching the list of stats
       - sees all the existing stats on a table       - may add entirely new stats or tweak the existing ones
   (b) after collecting the list of variables compatible with       multivariate stats
       - like (a) and additionally knows which columns are interesting         for the query (but only with respect to
theexisting stats)
 
   (c) after optimization (selection of the right combination if stats)
       - like (b), but can't affect the optimization

But I can't really imagine anyone building multivariate stats on the 
fly, in the hook.

It's more complicated, though, because the query may call 
clauselist_selectivity multiple times, depending on how complex the 
WHERE clauses are.


> 0001:
>
> - I also don't think it is right thing for expression_tree_walker
>    to recognize RestrictInfo since it is not a part of expression.

Yes. In my working git repo, I've reworked this to use the second 
option, i.e. adding RestrictInfo pull_(varno|varattno)_walker:

https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f

Do you think this is the correct solution? If not, how to fix it?

>
> 0003:
>
> - In clauselist_selectivity, find_stats is uselessly called for
>    single clause. This should be called after the clauselist found
>    to consist more than one clause.

Ok, will fix.

>
> - Searching vars to be compared with mv-stat columns which
>    find_stats does should stop at disjunctions. But this patch
>    doesn't behave so and it should be an unwanted behavior. The
>    following steps shows that.

Why should it stop at disjunctions? There's nothing wrong with using 
multivariate stats to estimate OR-clauses, IMHO.

>
> ====
>   =# CREATE TABLE t1 (a int, b int, c int);
>   =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0, 9999) a);
>   =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>    Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
>   =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
>   =# ANALZYE t1;
>   =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>    Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)
> ====
>   Rows changed unwantedly.

That has nothing to do with OR clauses, but rather with using a type of 
statistics that does not fit the data and queries. Histograms are quite 
inaccurate for discrete data and equality conditions - in this case the 
clauses probably match one bucket, and so we use 1/2 the bucket as an 
estimate. There's nothing wrong with that.

So let's use MCV instead:

ALTER TABLE t1 ADD STATISTICS (MCV) ON (a, b, c);
ANALYZE t1;
EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;                     QUERY PLAN
----------------------------------------------------- Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)   Filter:
(((a= 1) AND (b = 2)) OR (c = 3))
 
(2 rows)

>   It seems not so simple thing as your code assumes.

Maybe, but I don't see what assumption is invalid? I see nothing wrong 
with the previous query.

kind regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: Re: PostgreSQL 9.5 Alpha 1 build fail with perl 5.22
Следующее
От: Robert Haas
Дата:
Сообщение: Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?