Re: Improve selectivity estimate for range queries

Поиск
Список
Период
Сортировка
От Kyotaro HORIGUCHI
Тема Re: Improve selectivity estimate for range queries
Дата
Msg-id 20181221.122436.238213221.horiguchi.kyotaro@lab.ntt.co.jp
обсуждение исходный текст
Ответ на Improve selectivity estimate for range queries  ("Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp>)
Ответы RE: Improve selectivity estimate for range queries  ("Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp>)
Список pgsql-hackers
Hello.

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
<008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
> In my environment, the selectivity for id > 0 was 0.99990000000000001,
> and the selectivity for id < 10000 was 0.33333333333333331. Then, the
> value of rqlist->hibound and rqlist->lobound are set respectively.
> On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333.  As a result,
> the final selectivity is computed according to DEFAULT_RANGE_INEQ_SEL,
> since the condition of the second if statement was satisfied. However,
> these selectivities were computed according to the statistics, so the
> final selectivity had to be calculated from rqlist->hibound + rqlist->lobound - 1.0.
> My test case might be uncommon, but I think it would cause some problems.

Yeah, its known behavior as the comment just above. If in your
example query, the table weer very large and had an index it
could run faultly an index scan over a fair amount of tuples. But
I'm not sure how much it matters and your example doesn't show
that.

The behvavior discussed here is introduced by this commit.

| commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
| Author: Tom Lane <tgl@sss.pgh.pa.us>
| Date:   Tue Nov 9 00:34:46 2004 +0000
| 
|     Use a hopefully-more-reliable method of detecting default selectivity
|     estimates when combining the estimates for a range query.  As pointed out
|     by Miquel van Smoorenburg, the existing check for an impossible combined
|     result would quite possibly fail to detect one default and one non-default
|     input.  It seems better to use the default range query estimate in such
|     cases.  To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
|     This is a bit ugly because it introduces additional coupling between
|     clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
|     there wasn't plenty already...

> To handle such cases I've thought up of an idea based on a previous commit[1]
> which solved a similar problem related to DEFAULT_NUM_DISTINCT.  Just like
> this modification, I added a flag which shows whether the selectivity

The commit [1] added almost no complexity, but this does. Affects
many functions to introduce tighter coupling between
operator-selectivity functions and clause selectivity functions.
isdefault travells alone too long just to bind remote
functions. We might need more pricipled way to handle that.

> was calculated according to the statistics or not to clauselist_selectivity,
> and used it as a condition of the if statement instead of 
> rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
> I am afraid that changes were more than I had expected, but we can estimate
> selectivity correctly.
> 
> Patch attached.  Do you have any thoughts?

I've just run over the patch, but some have comments.

+            if (isdefault)
+                rqelem->lobound_isdefault = true;

This is taking the disjunction of lobounds ever added, I suppose
it should be the last lobounds' isdefault.

As you see, four other instances of "selec ==/!= DEFAULT_*" exist
in mergejoinsel. Don't they lead to similar kind of problem?


     Selectivity lobound;        /* Selectivity of a var > something clause */
     Selectivity hibound;        /* Selectivity of a var < something clause */
+    bool        lobound_isdefault;    /* lobound is a default selectivity? */
+    bool        hibound_isdefault;    /* hibound is a default selectivity? */
 } RangeQueryClause;

Around the [1], there was a discussion about introducing the
notion of uncertainty to selectivity. The isdefualt is a kind of
uncertainty value indicating '0/100% uncertain'. So my feeling is
saying that it's better that Selectivity has certinty component
then building a selectivity arithmetics involving uncertainty,
though I don't have a clear picture.


> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Kyotaro HORIGUCHI
Дата:
Сообщение: Re: [HACKERS] Macros bundling RELKIND_* conditions
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Tid scan improvements