Re: Selectivity estimation for inet operators

From: Heikki Linnakangas
Subject: Re: Selectivity estimation for inet operators
Date:
Msg-id: 5423DC8D.3080206@vmware.com
In reply to: Re: Selectivity estimation for inet operators (Emre Hasegeli <emre@hasegeli.com>)
Responses: Re: Selectivity estimation for inet operators (Emre Hasegeli <emre@hasegeli.com>)
List: pgsql-hackers
On 09/07/2014 07:09 PM, Emre Hasegeli wrote:
>>>> I updated the patch to cover semi and anti joins with eqjoinsel_semi().
>>>> I think it is better than returning a constant.
>>>
>>> What you did there is utterly unacceptable from a modularity standpoint;
>>> and considering that the values will be nowhere near right, the argument
>>> that "it's better than returning a constant" seems pretty weak.  I think
>>> you should just take that out again.
>>
>> I will try to come up with a better, data type specific implementation
>> in a week.
>
> New version with semi join estimation function attached.

Thanks. Overall, my impression of this patch is that it works very well. 
But damned if I understood *how* it works :-). There's a lot of 
statistics involved, and it's not easy to see why something is 
multiplied by something else. I'm adding comments as I read through it.

I've gotten to the inet_semi_join_selec function:

> /*
>  * Inet semi join selectivity estimation.
>  */
> static Selectivity
> inet_semi_join_selec(bool mcv2_exists, Datum *mcv2_values, int mcv2_nvalues,
>                      bool his2_exists, Datum *his2_values, int his2_nvalues,
>                      double his2_weight, Datum *constvalue,
>                      FmgrInfo *proc, short opr_order)
> {
>     if (mcv2_exists)
>     {
>         int            i;
>
>         for (i = 0; i < mcv2_nvalues; i++)
>         {
>             if (DatumGetBool(FunctionCall2Coll(proc, DEFAULT_COLLATION_OID,
>                                                *constvalue, mcv2_values[i])))
>                 return 1.0;
>         }
>     }
>
>     /* Do not bother if histogram weight is smaller than 0.1. */
>     if (his2_exists && his2_weight > 0.1)
>     {
>         Selectivity    his_selec;
>
>         his_selec = inet_his_inclusion_selec(his2_values, his2_nvalues,
>                                              constvalue, opr_order);
>
>         if (his_selec > 0)
>             return Min(1.0, his2_weight * his_selec);
>     }
>
>     return 0.0;
> }

This desperately needs a comment at the top of the function explaining 
what it does. Let me try to explain what I think it does:

This function calculates the probability that there is at least one row 
in table B which satisfies the "constant op column" qual. The constant 
is passed as an argument, and for table B, the MCV list and histogram 
are provided. his2_weight is the total number of rows in B that are 
covered by the histogram. For example, if the table has 1000 rows, 10% 
of the rows are in the MCV list, and another 10% are NULLs, his2_weight 
would be 800.
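
To spell the arithmetic out (a standalone sketch using the made-up 
numbers from the example above, not code from the patch):

#include <stdio.h>

int main(void)
{
    /* Made-up example numbers, matching the description above. */
    double nrows = 1000.0;      /* rows in table B */
    double mcv_frac = 0.10;     /* fraction of rows in the MCV list */
    double null_frac = 0.10;    /* fraction of NULL rows */

    /* The histogram covers every row that is neither an MCV nor NULL. */
    double his2_weight = nrows * (1.0 - mcv_frac - null_frac);

    printf("his2_weight = %.0f\n", his2_weight);    /* prints 800 */
    return 0;
}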

First, we check if the constant matches any of the most common values. 
If it does, return 1.0, because then there is surely a match.

Next, we use the histogram to estimate the number of rows in the table 
that match the qual. If it amounts to more than 1 row, we return 1.0. 
If it's between 0.0 and 1.0 rows, we return that number as the probability.


Now, I think that last step is wrong. Firstly, the "Do not bother if 
histogram weight is smaller than 0.1" rule seems bogus. The his2_weight 
is the total number of rows represented by the histogram, so surely it 
can't be less than 1. In fact, it can't really be less than the 
statistics target. Unless maybe the histogram was collected when the 
table was large and the table has since shrunk to only a few rows, but 
that seems like a very bizarre corner case. At the least, the test needs 
more comments explaining what it's all about, but I think we should just 
always use the histogram (if it's available).

Secondly, if we estimate that there is on average 1.0 matching row in 
the table, it does not follow that the probability that at least one row 
matches is 1.0. Assuming a Gaussian distribution with mean 1.0, the 
probability that at least one row matches is 0.5. Assuming a Gaussian 
distribution here isn't quite right - I guess a Poisson distribution 
would be more accurate - but it sure doesn't seem right as it is.

The error isn't very big, and perhaps you don't run into it very often, 
so I'm not sure what the best way to fix it would be. My statistics 
skills are a bit rusty, but I think the appropriate way would be to 
apply the Poisson distribution, with the estimated number of matched 
rows as the mean. The probability of at least one match is then one 
minus the probability of zero matches, i.e. 1 - exp(-mean). It sounds 
like overkill if this case occurs only rarely. But then again, perhaps 
it's not all that rare.
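
Concretely, something like this (a minimal standalone sketch of the 
idea; the function name is mine, not from the patch):

#include <math.h>
#include <stdio.h>

/*
 * If the number of matching rows follows a Poisson distribution with
 * the estimated number of matching rows as the mean, the probability
 * of at least one match is
 *
 *     P(X >= 1) = 1 - P(X = 0) = 1 - exp(-mean)
 */
static double
semi_join_match_probability(double estimated_matches)
{
    return 1.0 - exp(-estimated_matches);
}

int main(void)
{
    /* With a mean of 1.0 matching row, this gives ~0.632, not 1.0. */
    printf("%.3f\n", semi_join_match_probability(1.0));
    return 0;
}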

That said, I can't immediately find a test case where that error would 
matter. I tried this:

create table inettbl1 (a inet);
insert into inettbl1
  select '10.0.0.' || (g % 255) from generate_series(1, 1000) g;
analyze inettbl1;
explain analyze select count(*) from inettbl1
  where a >>= ANY (SELECT a from inettbl1);

The estimate for that is pretty accurate with the current patch: 833 
rows estimated vs. 1000 actual. I'm afraid that if we fixed 
inet_semi_join_selec the way I suggest, the estimate would be smaller, 
i.e. more wrong. Is there something else in the estimates that 
accidentally compensates for this currently?

Thoughts?

- Heikki



