Re: multivariate statistics / patch v7

Поиск
Список
Период
Сортировка
От Kyotaro HORIGUCHI
Тема Re: multivariate statistics / patch v7
Дата
Msg-id 20150707.150554.43512301.horiguchi.kyotaro@lab.ntt.co.jp
обсуждение исходный текст
Ответ на Re: multivariate statistics / patch v7  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Ответы Re: multivariate statistics / patch v7  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Список pgsql-hackers
Hi, Tomas. I'll kick the gas pedal.

> > Thank you, it looks clearer. I have some comment for the brief look
> > at this. This patchset is relatively large so I will comment on
> > "per-notice" basis.. which means I'll send comment before examining
> > the entire of this patchset. Sorry in advance for the desultory
> > comments.
> 
> Sure. If you run into something that's not clear enough, I'm happy to
> explain that (I tried to cover all the important details in the
> comments, but it's a large patch, indeed.)


> > - Single-variate stats have a mechanism to inject arbitrary
> >    values as statistics, that is, get_relation_stats_hook and the
> >    similar stuffs. I want the similar mechanism for multivariate
> >    statistics, too.
> 
> Fair point, although I'm not sure where should we place the hook, how
> exactly should it be defined and how useful that would be in the
> end. Can you give an example of how you'd use such hook?

It's my secret, but is open:p. this is crucial for us to examine
many planner-related problems occurred in our customer in-vitro.

http://pgdbmsstats.osdn.jp/pg_dbms_stats-en.html

# Mmm, this doc is a bit too old..

One tool of ours does like following, 

- Copy pg_statistics and some attributes of pg_class into some table. Of course this is exportable.

- For example, in examine_simple_variable, using the hook get_relation_stats_hook, inject the saved statistics in place
ofthe real statistics.
 

The hook point is placed where the parameters to specify what
statistics is needed are avaiable in compact shape, and all the
hook function should do is returning corresponding statistics
values.

So the parallel stuff for this mv stats will look like this.

MVStatisticInfo *
get_mv_statistics(PlannerInfo *root, relid);

or 

MVStatisticInfo *
get_mv_statistics(PlannerInfo *root, relid, <bitmap or list of attnos>);

So by simplly applying this, the current clauselist_selectivity
code will turn into following.

> if (list_length(clauses) == 1)
>    return clause_selectivity(....);
> 
> Index varrelid = find_singleton_relid(root, clauses, varRelid);
> 
> if (varrelid)
> {
> // Bitmapset attnums = collect_attnums(root, clauses, varrelid);
>   if (get_mv_statistics_hook)
>     stats = get_mv_statistics_hook(root, varrelid /*, attnums */);
>   else
>     statis = get_mv_statistics(root, varrelid /*, attnums*/);
> 
>   ....

In comparison to single statistics, statistics values might be
preferable to separate from definition.

> I've never used get_relation_stats_hook, but if I get it right, the
> plugins can use the hook to create the stats (for each column), either
> from scratch or tweaking the existing stats.

Mostly existing stats without change. I saw few hackers wanted to
provide predefined statistics for typical cases. I haven't see
anyone who tweaks existing stats.

> I'm not sure how this should work with multivariate stats, though,
> because there can be arbitrary number of stats for a column, and it
> really depends on all the clauses (so examine_variable() seems a bit
> inappropriate, as it only sees a single variable at a time).

Restriction clauses are not a problem. What is needed to replace
stats value is defining few APIs to retrieve them, and to
retrieve the stats values only in a way that compatible with the
API. It would be okay to be a substitute views for mv stats as an
extreme case but it is not good.

> Moreover, with multivariate stats
> 
>    (a) there may be arbitrary number of stats for a column
> 
>    (b) only some of the stats end up being used for the estimation
> 
> I see two or three possible places for calling such hook:
> 
>    (a) at the very beginning, after fetching the list of stats
> 
>        - sees all the existing stats on a table
>        - may add entirely new stats or tweak the existing ones

Getting all stats for a table would be okay but attnum list can
restrict the possibilities, as the second form of the example
APIs above. And we may forget the case of forged or tweaked
stats, they are their problem, not ours.


>    (b) after collecting the list of variables compatible with
>        multivariate stats
> 
>        - like (a) and additionally knows which columns are interesting
>          for the query (but only with respect to the existing stats)

We should carefully design the API to be able to point the
pertinent stats for every situation. Mv stats is based on the
correlation of multiple columns so I think only relid and
attributes list are enough as the parameter.

| if (st.relid == param.relid && bms_equal(st.attnums, param.attnums))
|    /* This is the stats to be wanted  */

If we can filter the appropriate stats from all the stats using
clauselist, we definitely can make the appropriate parameter
(column set) prior to retrieving mv statistics. Isn't it correct?

>    (c) after optimization (selection of the right combination if stats)
> 
>        - like (b), but can't affect the optimization
> 
> But I can't really imagine anyone building multivariate stats on the
> fly, in the hook.
> 
> It's more complicated, though, because the query may call
> clauselist_selectivity multiple times, depending on how complex the
> WHERE clauses are.
> 
> 
> > 0001:
> >
> > - I also don't think it is right thing for expression_tree_walker
> >    to recognize RestrictInfo since it is not a part of expression.
> 
> Yes. In my working git repo, I've reworked this to use the second
> option, i.e. adding RestrictInfo pull_(varno|varattno)_walker:
> 
> https://github.com/tvondra/postgres/commit/2dc79b914c759d31becd8ae670b37b79663a595f
> 
> Do you think this is the correct solution? If not, how to fix it?

The reason why I think it is not appropreate is that RestrictInfo
is not a part of expression.

Increasing selectivity of a condition by column correlation is
occurs only for a set of conjunctive clauses. OR operation
devides the sets. Is it agreeable? RestrictInfos can be nested
each other and we should be aware of the AND/OR operators. This
is what expression_tree_walker doesn't.

Perhaps we should provide the dedicate function such like
find_conjunctive_attr_set which does this,

- Check the type top expression of the clause
 - If it is a RestrictInfo, check clause_relids then check   clause.
 - If it is a bool OR, stop to search and return empty set of   attributes.
 - If it is a bool AND, make further check of the components. A   list of RestrictInfo should be treaed as AND
connection.
 - If it is operator exression, collect used relids and attrs   walking the expression tree.

I should missing something but I think the outline is correct.

Addition to that we should carefully avoid duplicate correction
using the same mv statistics.

I haven't understood what choose_mv_satistics precisely but I
suppose what this function does would be split into the 'making
parameter to find stats' part and 'matching the parameter with
stats in order to retrieve desired stats' part. Could you
reconstruct this process into the form like this?

I feel it is too invasive, or exccesively intermix(?)ed.

> > 0003:
> >
> > - In clauselist_selectivity, find_stats is uselessly called for
> >    single clause. This should be called after the clauselist found
> >    to consist more than one clause.
> 
> Ok, will fix.
> 
> >
> > - Searching vars to be compared with mv-stat columns which
> >    find_stats does should stop at disjunctions. But this patch
> >    doesn't behave so and it should be an unwanted behavior. The
> >    following steps shows that.
> 
> Why should it stop at disjunctions? There's nothing wrong with using
> multivariate stats to estimate OR-clauses, IMHO.

Mv statistics represents how often *every combination of the
column values* occurs. Is it correct? Where the combination can
be replaced with coexists, that is AND. For example MV-MCV.

(a, b, c) freq
(1, 2, 3)  100
(1, 2, 5)   50
(1, 3, 8)   20
(1, 7, 2)    5
===============
total      175

| select * from t where a = 1 and b = 2 and c = 3;
| SELECT 100

This is correct,

| select * from t where a = 1 and b = 2 or c = 3;
| SELECT 100

This is *not* correct. The correct number of tuples is 150.
This is a simple example where OR breaks MV stats assumption.

> > ====
> >   =# CREATE TABLE t1 (a int, b int, c int);
> >   =# INSERT INTO t1 (SELECT a, a * 2, a * 3 FROM generate_series(0,
> >   9999) a);
> >   =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
> >    Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
> >   =# ALTER TABLE t1 ADD STATISTICS (HISTOGRAM) ON (a, b, c);
> >   =# ANALZYE t1;
> >   =# EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
> >    Seq Scan on t1  (cost=0.00..230.00 rows=268 width=12)
> > ====
> >   Rows changed unwantedly.
> 
> That has nothing to do with OR clauses, but rather with using a type
> of statistics that does not fit the data and queries. Histograms are
> quite inaccurate for discrete data and equality conditions - in this
> case the clauses probably match one bucket, and so we use 1/2 the
> bucket as an estimate. There's nothing wrong with that.
> 
> So let's use MCV instead:

Hmm, it's not a problem what specific number is displayed as
rows. What is crucial is the fact that rows has changed even
though it shouldn't have changed. As I demonstrated above.

> ALTER TABLE t1 ADD STATISTICS (MCV) ON (a, b, c);
> ANALYZE t1;
> EXPLAIN SELECT * FROM t1 WHERE a = 1 AND b = 2 OR c = 3;
>                      QUERY PLAN
> -----------------------------------------------------
>  Seq Scan on t1  (cost=0.00..230.00 rows=1 width=12)
>    Filter: (((a = 1) AND (b = 2)) OR (c = 3))
> (2 rows)
> 
> >   It seems not so simple thing as your code assumes.
> 
> Maybe, but I don't see what assumption is invalid? I see nothing wrong
> with the previous query.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: Support for N synchronous standby servers - take 2
Следующее
От: Jeff Davis
Дата:
Сообщение: Re: Parallel Seq Scan