Re: multivariate statistics / patch v6

From: Tomas Vondra
Subject: Re: multivariate statistics / patch v6
Msg-id: 55562D07.2090908@2ndquadrant.com
In reply to: Re: multivariate statistics / patch v6  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List: pgsql-hackers
Hello,

On 05/15/15 08:29, Kyotaro HORIGUCHI wrote:
> Hello,
>
>> Regarding the functional dependencies - you're right there's room
>> for improvement. For example it only works with dependencies
>> between pairs of columns, not multi-column dependencies. Is this
>> what you mean by incomplete?
>
> No, it overruns dependencies->deps because build_mv_dependencies
> stores many elements into dependencies->deps[n] although it
> really has room for only one element. I suppose that you paused
> writing it when you noticed that the number of required elements
> is unknown before finishing the walk through all pairs of
> values. palloc'ing numattrs^2 is reasonable enough as POC code
> for now. Am I looking at the wrong version of the patch?
>
> -    dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData))
> +    dependencies = (MVDependencies)palloc0(sizeof(MVDependenciesData) +
> +                                sizeof(MVDependency) * numattrs * numattrs);

Ah! That's clearly a bug. Thanks for noticing it; I'll fix it in the next 
version of the patch.
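
Since the current code only considers dependencies between pairs of 
columns, there can be at most numattrs * (numattrs - 1) of them (one per 
ordered pair), so the allocation you suggest is a safe upper bound. The 
fix will probably look much like your hunk (a sketch, not the final code):

    /*
     * Worst case: one dependency per ordered pair of attributes
     * (i -> j, i != j), i.e. at most numattrs * (numattrs - 1) entries,
     * so numattrs * numattrs is a slightly loose but safe upper bound.
     */
    dependencies = (MVDependencies) palloc0(sizeof(MVDependenciesData) +
                        sizeof(MVDependency) * numattrs * numattrs);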

>>> I mention this because I recently had an issue with strong
>>> correlation between two columns in the dbt3 benchmark. Two columns
>>> in some table are strongly correlated but not functionally
>>> dependent; there are too many values and their distribution is
>>> very uniform, so MCV is no use for the table (a histogram
>>> has nothing to do with equality conditions). As a result, the
>>> planner estimates the number of rows badly wrong, as expected,
>>> especially for joins.
>>
>> I think the other statistics types (esp. histograms) might be more
>> useful here, but I assume you haven't tried that because of the
>> conflicts.
>>
>> The current patch does not handle joins at all, though.
>
> Well, that's one of the reasons. But when I saw what caused the
> wrong estimate, I understood that no deterministic estimation can
> be applied to such a distribution. eqsel and eqjoinsel ultimately
> rely on a random-match assumption over a uniform distribution
> when the value is not found in the MCV list. And the functional
> dependencies stuff in your old patch (which works) (rightfully)
> failed to find such a relationship between the problematic
> columns. So I tried ndistinct, which is not contained in your
> patch, to see how well it works.

Yes, that's certainly true. I think you're right that the mv coefficient 
might be quite useful in some cases.

>> With my patch:
>>
>> alter table t add statistics (mcv) on (a,b,c);
> ...
>>   Seq Scan on t  (cost=0.00..22906.00 rows=9533 width=12)
>
> Yes, your MV-MCV list should hold one third of all possible (sets
> of) values, so it works fine, I guess. But my original problem
> occurred under the condition that the (single-column) MCVs contain
> under 1% of the possible values. MCV would not work in such cases,
> but the very uniform distribution helps the random assumption to
> work.

Actually, I think the MCV list should contain all the items in this case, 
as it decides that the sample contains all the values present in the data. 
The usual 1D MCV list uses the same logic. But you're right that on a data 
set with more distinct values than fit into the MCV list and a mostly 
uniform distribution, this won't work.
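
(For reference, the 1D code in compute_scalar_stats() makes essentially 
this decision -- paraphrased from memory, so the details may differ 
slightly:)

    /*
     * Paraphrase of the 1D MCV logic in analyze.c (not the exact source):
     * keep *all* tracked values as MCVs, instead of applying the usual
     * frequency cutoff, when the sample appears to have seen every
     * distinct value of the column.
     */
    static bool
    keep_all_tracked_values(int track_cnt, int ndistinct,
                            int toowide_cnt, double stadistinct,
                            int num_mcv)
    {
        return track_cnt == ndistinct &&   /* every sampled value tracked */
               toowide_cnt == 0 &&         /* nothing skipped for width */
               stadistinct > 0 &&          /* ndistinct not scaled w/ rows */
               track_cnt <= num_mcv;       /* and they all fit in the list */
    }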


>
>> $ perl gentbl.pl 200000 | psql postgres
> <takes a while..>
>> postgres=# alter table t1 add statistics (mcv true) on (a, b);
>> postgres=# analyze t1;
>> postgres=# explain analyze select * from t1 where a = 1 and b = 2501;
>> Seq Scan on t1  (cost=0.00..124319.00 rows=1 width=8)
>>                  (actual time=0.051..1250.773 rows=8 loops=1)
>
> The estimate "rows=1" is internally 2.4e-11, 3.33e+11 times
> smaller than the real number. This will result in roughly the
> same order of error for joins. This is because the MV-MCV holds
> too small a part of the domain, and the rest is then calculated
> using the random assumption. This won't be fixed by increasing
> statistics_target to any sane amount.

Yes, the MCV lists don't work well with data sets like this.

>> alter table t drop statistics all;
>> alter table t add statistics (histogram) on (a,b,c);
> ...
>>   Seq Scan on t  (cost=0.00..22906.00 rows=9667 width=12)
>
>> So both the MCV list and the histogram do quite a good job here,
>
> I understand how you calculate selectivity for equality clauses
> using a histogram. And it calculates the result rows as 2.3e-11,
> which is almost the same as MV-MCV; it stems from the same cause
> and thus yields the same result for joins.
>
>> but there are certainly cases when that does not work and the
>> mvcoefficient works better.

+1

> The condition where mv-coef is effective is, as mentioned above,
> where MV-MCV or MV-HISTOGRAM cannot hold a sufficient part of the
> domain. The appropriate combination of MV-MCV and mv-coef would be
> the same as var_eq_(non_)const/eqjoinsel_inner for a single column,
> that is, applying mv-coef to the part of the selectivity
> corresponding to values not in the MV-MCV. I have no idea how to
> combine it with MV-HISTOGRAM right now.
>
>> The current patch does not handle joins, but it's one of the TODO
>> items.
>
> Yes, but the result on very large tables can be deduced from
> the discussion above.
>
>>> I think the result above shows that the multivariate coefficient
>>> significantly improves estimates when correlated columns are
>>> involved.
>>
>> Yes, it looks interesting. I'm wondering what the "failure cases"
>> are, where the coefficient approach does not work. It seems to me it
>> relies on an assumption of consistency for all the ndistinct values.
>> For example, let's assume you have two columns - A and B, each with
>> 1000 distinct values, and that each value in A has 100 matching
>> values in B, so the coefficient is ~10
>>
>>     1,000 * 1,000 / 100,000 = 10
>>
>> Now, let's assume the distribution looks different - with the first
>> 100 values in A matching all 1000 values of B, and the remaining 900
>> values just a single B value each. Then
>>
>>    1,000 * 1,000 / (100,000 + 900) = ~9.9
>>
>> So a very different distribution, but almost the same coefficient.
>>
>> Are there any other assumptions like this?
>
> I think not, for now. Just like the current var_eq_(non_)const and
> eqjoinsel_inner do, since no clue about *the true* distribution is
> available, we have no choice other than to rely on the random
> (uniform-distribution) assumption. And it gives not-so-bad estimates
> for not-so-extreme distributions. It's of course not perfect, but
> good enough.
>
>> Also, does the coefficient work only for equality conditions?
>
> The mvcoef is a parallel of ndistinct (it is a bit of a weird
> expression, though). So I guess it is applicable in the current
> estimation code wherever ndistinct is used; almost all of those
> places look to relate to equality comparison.

ISTM the estimation of GROUP BY might benefit tremendously from this 
statistic. That is, it would help with cardinality estimation of 
analytical queries, etc.

Also, we've only discussed 2-column coefficients. Would it be useful to 
track those coefficients for large groups of columns? For example
     ndistinct(A,B,C)
     --------------------------------------------
     ndistinct(A) * ndistinct(B) * ndistinct(C)

which might work better for queries like
    SELECT a,b,c FROM t GROUP BY a,b,c;
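
To illustrate (a hypothetical helper, not code from the patch), the naive 
independence-based estimate for such a GROUP BY could simply be scaled by 
that coefficient:

    /*
     * Hypothetical illustration only: correct the naive "product of
     * per-column ndistincts" group-count estimate for GROUP BY a, b, c
     * by multiplying it with the coefficient
     *
     *            ndistinct(a,b,c)
     *   ------------------------------------------
     *   ndistinct(a) * ndistinct(b) * ndistinct(c)
     *
     * which is 1.0 for fully independent columns and drops below 1.0
     * for correlated ones.
     */
    static double
    estimate_groups_with_coef(double nd_a, double nd_b, double nd_c,
                              double coef, double nrows)
    {
        double  ngroups = nd_a * nd_b * nd_c;  /* independence assumption */

        if (coef > 0.0 && coef < 1.0)
            ngroups *= coef;                   /* correlated columns */

        return (ngroups > nrows) ? nrows : ngroups;  /* clamp to row count */
    }

For independent columns the coefficient is 1.0 and nothing changes; for 
strongly correlated ones the estimate collapses toward the actual 
ndistinct(a,b,c).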

>>> Would you consider this in your patch? Otherwise I'll move ahead
>>> with this as a separate project from yours, if you don't mind. I
>>> suppose it won't conflict with yours, except perhaps for the user
>>> interface. But eventually they will need some consolidation work.
>>
>> I think it's a neat idea, and I think it might be added to the
>> patch. It would fit in quite nicely, actually - I already have
>> other kinds of stats lined up for addition, but I'm not going to
>> work on that in the near future. It will require changes in some
>> parts of the patch (selecting the stats for a list of clauses), and
>> I'd like to complete the current patch first and then add features
>> in follow-up patches.
>
> I see. Let's work on this for now.

Thanks!

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


