Re: [HACKERS] multivariate statistics (v19)

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: [HACKERS] multivariate statistics (v19)
Дата
Msg-id 026e37cf-319a-fa6f-79c2-d5a2fc98b1b6@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: [HACKERS] multivariate statistics (v19)  (Dilip Kumar <dilipbalaut@gmail.com>)
Список pgsql-hackers
On 01/26/2017 10:43 AM, Dilip Kumar wrote:
>
> histograms
> --------------
> + if (matches[i] == MVSTATS_MATCH_FULL)
> + s += mvhist->buckets[i]->ntuples;
> + else if (matches[i] == MVSTATS_MATCH_PARTIAL)
> + s += 0.5 * mvhist->buckets[i]->ntuples;
>
> Isn't it will be better that take some percentage of the bucket based
> on the number of distinct element for partial matching buckets.
>

I don't think so, for the same reason why ineq_histogram_selectivity() 
in selfuncs.c uses
    binfrac = 0.5;

for partial bucket matches - it provides minimum average error. Even if 
we knew the number of distinct items in the bucket, we have no idea what 
the distribution within the bucket looks like. Maybe 99% of the bucket 
are covered by a single distinct value, maybe all the items are squashed 
on one side of the bucket, etc.

Moreover we don't really know the number of distinct values in the 
bucket - we only know the number of distinct items in the sample, and 
only while building the histogram. I don't think it makes much sense to 
estimate the number of distinct items in a bucket, because the buckets 
contain only very few rows so the estimates would be wildly inaccurate.

>
> +static int
> +update_match_bitmap_histogram(PlannerInfo *root, List *clauses,
> +  int2vector *stakeys,
> +  MVSerializedHistogram mvhist,
> +  int nmatches, char *matches,
> +  bool is_or)
> +{
> + int i;
>
> For each clause we are processing all the buckets, can't we use some
> data structure which can make multi-dimensions information searching
> faster.>

No, we're not processing all buckets for each clause. We're' only 
processing buckets that were not "ruled out" by preceding clauses. 
That's the whole point of the bitmap.

For example for condition (a=1) AND (b=2), the code will first evaluate 
(a=1) on all buckets, and then (b=2) but only on buckets where (a=1) was 
evaluated as true. Similarly for OR clauses.
>
> Something like HTree, RTree, Maybe storing histogram in these formats
> will be difficult?
>

Maybe, but I don't want to do that in the first version. I'm not opposed 
to doing that in the future, if we find out the v1 histograms are not 
efficient (I don't think we will, based on tests I did while working on 
the patch). Support for other histogram implementations is pretty much 
why there is 'type' field in the struct.

For now I think we should stick with the simple implementation.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Stephen Frost
Дата:
Сообщение: Re: [HACKERS] [COMMITTERS] pgsql: test_pg_dump TAP test whitespace cleanup
Следующее
От: Tomas Vondra
Дата:
Сообщение: Re: [HACKERS] multivariate statistics (v19)