Thread: Improve selectivity estimate for range queries
Hi,

I found a problem with the selectivity estimate for range queries on master. This is my test case:

postgres=# CREATE TABLE test(id int, val text);
CREATE TABLE
postgres=# INSERT INTO test SELECT i, 'testval' FROM generate_series(0,29999)i;
INSERT 0 30000
postgres=# VACUUM analyze test;
VACUUM
postgres=# EXPLAIN ANALYZE SELECT * FROM test WHERE id > 0 and id < 10000;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..613.00 rows=150 width=12) (actual time=0.044..21.371 rows=9999 loops=1)
   Filter: ((id > 0) AND (id < 10000))
   Rows Removed by Filter: 20001
 Planning Time: 0.099 ms
 Execution Time: 37.142 ms
(5 rows)

Although the actual number of rows was 9999, the estimated number of rows was 150. So I dug in to see what happened and concluded that the following part of clauselist_selectivity caused this problem.

------------
	/*
	 * Now scan the rangequery pair list.
	 */
	while (rqlist != NULL)
	{
		RangeQueryClause *rqnext;

		if (rqlist->have_lobound && rqlist->have_hibound)
		{
			/* Successfully matched a pair of range clauses */
			Selectivity s2;

			/*
			 * Exact equality to the default value probably means the
			 * selectivity function punted.  This is not airtight but should
			 * be good enough.
			 */
			if (rqlist->hibound == DEFAULT_INEQ_SEL ||
				rqlist->lobound == DEFAULT_INEQ_SEL)
			{
				s2 = DEFAULT_RANGE_INEQ_SEL;
			}
			else
			{
				s2 = rqlist->hibound + rqlist->lobound - 1.0;
------------

In my environment, the selectivity for id > 0 was 0.99990000000000001, and the selectivity for id < 10000 was 0.33333333333333331; rqlist->hibound and rqlist->lobound were set accordingly. On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a result, the final selectivity was computed as DEFAULT_RANGE_INEQ_SEL, since the condition of the second if statement was satisfied.
However, these selectivities were computed from the statistics, so the final selectivity should have been calculated as rqlist->hibound + rqlist->lobound - 1.0. My test case might be uncommon, but I think it could cause real problems.

To handle such cases I've come up with an idea based on a previous commit[1] which solved a similar problem related to DEFAULT_NUM_DISTINCT. Following that modification, I added a flag to clauselist_selectivity which shows whether the selectivity was calculated from the statistics or not, and used it as the condition of the if statement instead of rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL. I'm afraid the changes are larger than I had expected, but with them we can estimate the selectivity correctly.

Patch attached. Do you have any thoughts?

[1] https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
Attachments
Hello.

At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
> In my environment, the selectivity for id > 0 was 0.99990000000000001,
> and the selectivity for id < 10000 was 0.33333333333333331. Then, the
> value of rqlist->hibound and rqlist->lobound are set respectively.
> On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a result,
> the final selectivity is computed according to DEFAULT_RANGE_INEQ_SEL,
> since the condition of the second if statement was satisfied. However,
> these selectivities were computed according to the statistics, so the
> final selectivity had to be calculated from rqlist->hibound + rqlist->lobound - 1.0.
> My test case might be uncommon, but I think it would cause some problems.

Yeah, it's a known behavior, as the comment just above that code says. If, in your example query, the table were very large and had an index, the planner could wrongly run an index scan over a fair number of tuples. But I'm not sure how much it matters, and your example doesn't show that.

The behavior discussed here was introduced by this commit.

| commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
| Author: Tom Lane <tgl@sss.pgh.pa.us>
| Date: Tue Nov 9 00:34:46 2004 +0000
|
| Use a hopefully-more-reliable method of detecting default selectivity
| estimates when combining the estimates for a range query. As pointed out
| by Miquel van Smoorenburg, the existing check for an impossible combined
| result would quite possibly fail to detect one default and one non-default
| input. It seems better to use the default range query estimate in such
| cases. To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
| This is a bit ugly because it introduces additional coupling between
| clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
| there wasn't plenty already...
> To handle such cases I've thought up an idea based on a previous commit[1]
> which solved a similar problem related to DEFAULT_NUM_DISTINCT. Just like
> this modification, I added a flag which shows whether the selectivity

The commit [1] added almost no complexity, but this one does. It affects many functions and introduces tighter coupling between operator-selectivity functions and clause-selectivity functions. isdefault travels too far just to connect remote functions. We might need a more principled way to handle that.

> was calculated according to the statistics or not to clauselist_selectivity,
> and used it as a condition of the if statement instead of
> rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
> I am afraid that changes were more than I had expected, but we can estimate
> selectivity correctly.
>
> Patch attached. Do you have any thoughts?

I've just skimmed the patch, but I have some comments.

+		if (isdefault)
+			rqelem->lobound_isdefault = true;

This takes the disjunction of all lobounds ever added; I suppose it should be the last lobound's isdefault.

As you can see, four other instances of "selec ==/!= DEFAULT_*" exist in mergejoinsel. Don't they lead to a similar kind of problem?

 	Selectivity lobound;		/* Selectivity of a var > something clause */
 	Selectivity hibound;		/* Selectivity of a var < something clause */
+	bool		lobound_isdefault;	/* lobound is a default selectivity? */
+	bool		hibound_isdefault;	/* hibound is a default selectivity? */
 } RangeQueryClause;

Around [1], there was a discussion about introducing the notion of uncertainty to selectivity. isdefault is a kind of uncertainty value indicating "0% or 100% uncertain". So my feeling is that it would be better for Selectivity to have a certainty component, and then to build selectivity arithmetic involving uncertainty, though I don't have a clear picture.

> [1]
> https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

Thanks for the comments. I attach the v2 patch.

> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> Sent: Friday, December 21, 2018 12:25 PM
>
> Hello.
>
> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
> <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
> > In my environment, the selectivity for id > 0 was 0.99990000000000001,
> > and the selectivity for id < 10000 was 0.33333333333333331. Then, the
> > value of rqlist->hibound and rqlist->lobound are set respectively.
> > On the other hand, DEFAULT_INEQ_SEL is 0.3333333333333333. As a
> > result, the final selectivity is computed according to
> > DEFAULT_RANGE_INEQ_SEL, since the condition of the second if statement
> > was satisfied. However, these selectivities were computed according to
> > the statistics, so the final selectivity had to be calculated from
> > rqlist->hibound + rqlist->lobound - 1.0.
> > My test case might be uncommon, but I think it would cause some problems.
>
> Yeah, it's a known behavior, as the comment just above that code says. If,
> in your example query, the table were very large and had an index, the
> planner could wrongly run an index scan over a fair number of tuples. But
> I'm not sure how much it matters, and your example doesn't show that.
>
> The behavior discussed here was introduced by this commit.
>
> | commit 547bb4a7f2bdccad9253a99211ce84b3f9de485a
> | Author: Tom Lane <tgl@sss.pgh.pa.us>
> | Date: Tue Nov 9 00:34:46 2004 +0000
> |
> | Use a hopefully-more-reliable method of detecting default selectivity
> | estimates when combining the estimates for a range query. As pointed out
> | by Miquel van Smoorenburg, the existing check for an impossible combined
> | result would quite possibly fail to detect one default and one non-default
> | input. It seems better to use the default range query estimate in such
> | cases. To do so, add a check for an estimate of exactly DEFAULT_INEQ_SEL.
> | This is a bit ugly because it introduces additional coupling between
> | clauselist_selectivity and scalarltsel/scalargtsel, but it's not like
> | there wasn't plenty already...
>
> > To handle such cases I've thought up an idea based on a previous
> > commit[1] which solved a similar problem related to
> > DEFAULT_NUM_DISTINCT. Just like this modification, I added a flag
> > which shows whether the selectivity
>
> The commit [1] added almost no complexity, but this one does. It affects
> many functions and introduces tighter coupling between operator-selectivity
> functions and clause-selectivity functions. isdefault travels too far just
> to connect remote functions. We might need a more principled way to handle
> that.
>
Yes, indeed. But I have not found a better way than this approach yet.

> > was calculated according to the statistics or not to
> > clauselist_selectivity, and used it as a condition of the if statement
> > instead of
> > rqlist->hibound == DEFAULT_INEQ_SEL and rqlist->lobound == DEFAULT_INEQ_SEL.
> > I am afraid that changes were more than I had expected, but we can
> > estimate selectivity correctly.
> >
> > Patch attached. Do you have any thoughts?
>
> I've just skimmed the patch, but I have some comments.
>
> +		if (isdefault)
> +			rqelem->lobound_isdefault = true;
>
> This takes the disjunction of all lobounds ever added; I suppose it should
> be the last lobound's isdefault.
>
Indeed. I fixed it.

> As you can see, four other instances of "selec ==/!= DEFAULT_*" exist in
> mergejoinsel. Don't they lead to a similar kind of problem?
>
I didn't encounter similar problems, but as you said, such problems could occur due to the condition selec != DEFAULT_INEQ_SEL. I changed these conditions into "!isdefault".

>  	Selectivity lobound;		/* Selectivity of a var > something clause */
>  	Selectivity hibound;		/* Selectivity of a var < something clause */
> +	bool		lobound_isdefault;	/* lobound is a default selectivity? */
> +	bool		hibound_isdefault;	/* hibound is a default selectivity? */
> } RangeQueryClause;
>
> Around [1], there was a discussion about introducing the notion of
> uncertainty to selectivity. isdefault is a kind of uncertainty value
> indicating "0% or 100% uncertain". So my feeling is that it would be
> better for Selectivity to have a certainty component, and then to build
> selectivity arithmetic involving uncertainty, though I don't have a clear
> picture.
>
I see. Indeed, if we changed Selectivity so that it includes information about default/non-default, the problems I mentioned would be solved. But I'm afraid that such a modification would have a big impact, since there are lots of functions that calculate selectivity. I don't have a concrete idea either, so I will think about it. Besides that, I'd like to ask the community whether such a modification would be acceptable or not.

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center

> > [1]
> > https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=4c2777d0b733220d9029f78817af8ce
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
Attachments
"Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> writes:
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
>> <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
>>> To handle such cases I've thought up an idea based on a previous
>>> commit[1] which solved a similar problem related to
>>> DEFAULT_NUM_DISTINCT. Just like this modification, I added a flag
>>> which shows whether the selectivity

>> The commit [1] added almost no complexity, but this one does. It affects
>> many functions and introduces tighter coupling between operator-selectivity
>> functions and clause-selectivity functions. isdefault travels too far
>> just to connect remote functions. We might need a more principled
>> way to handle that.

Yeah, I don't think that this approach is a reasonable tradeoff to
eliminate a narrow case. In particular, what's been done here to
clause_selectivity is totally unprincipled and poorly thought out:
you can't add an output parameter that's set in only some cases.
Yet, there's no good way to decide how to set it in many cases.
For instance, what would you do for an AND or OR where some of
the branches are default estimates and some not?

>> Around [1], there was a discussion about introducing the notion of
>> uncertainty to selectivity. isdefault is a kind of uncertainty
>> value indicating "0% or 100% uncertain". So my feeling is that
>> it would be better for Selectivity to have a certainty component, and
>> then to build selectivity arithmetic involving uncertainty, though I
>> don't have a clear picture.

> I see. Indeed, if we changed Selectivity so that it includes
> information about default/non-default, the problems I mentioned
> would be solved. But I'm afraid that such a modification would have
> a big impact, since there are lots of functions that calculate
> selectivity. I don't have a concrete idea either, so I will think
> about it. Besides that, I'd like to ask the community whether such
> a modification would be acceptable or not.

The planner definitely has a lot of problems that could be addressed
with some sort of uncertainty framework ... but it'd be a huge
undertaking, which is why nobody's tried yet.

A smaller-footprint way to fix the immediate problem might be to
change the values of DEFAULT_INEQ_SEL and friends so that they're
even less likely to be matched by accident. That is, instead of
0.3333333333333333, use 0.333333333333342 or some such. It might
seem that that's just moving the problem around, but I think it
might be possible to show that such a value couldn't be computed
by scalarltsel given a histogram with no more than 10000 members.
(I haven't tried to actually prove that, but it seems intuitive
that the set of possible results would be quantized with no more
than about 5 digits precision.)

			regards, tom lane
Hello.

At Fri, 21 Dec 2018 11:50:28 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <28533.1545411028@sss.pgh.pa.us>
> "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> writes:
> > From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> >> At Thu, 20 Dec 2018 17:21:29 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in
> >> <008701d4983d$02e731c0$08b59540$@lab.ntt.co.jp>
> >>> To handle such cases I've thought up an idea based on a previous
> >>> commit[1] which solved a similar problem related to
> >>> DEFAULT_NUM_DISTINCT. Just like this modification, I added a flag
> >>> which shows whether the selectivity
>
> Yeah, I don't think that this approach is a reasonable tradeoff to
> eliminate a narrow case. In particular, what's been done here to
> clause_selectivity is totally unprincipled and poorly thought out:
> you can't add an output parameter that's set in only some cases.
> Yet, there's no good way to decide how to set it in many cases.
> For instance, what would you do for an AND or OR where some of
> the branches are default estimates and some not?
>
> The planner definitely has a lot of problems that could be addressed
> with some sort of uncertainty framework ... but it'd be a huge
> undertaking, which is why nobody's tried yet.

Sure.

> A smaller-footprint way to fix the immediate problem might be to
> change the values of DEFAULT_INEQ_SEL and friends so that they're
> even less likely to be matched by accident. That is, instead of
> 0.3333333333333333, use 0.333333333333342 or some such. It might

Sounds reasonable.

> seem that that's just moving the problem around, but I think it
> might be possible to show that such a value couldn't be computed
> by scalarltsel given a histogram with no more than 10000 members.
> (I haven't tried to actually prove that, but it seems intuitive
> that the set of possible results would be quantized with no more
> than about 5 digits precision.)

FWIW, I got the following result in my environment. It seems different enough if this holds on all supported platforms, though there is still a case where the result of a sequence of arithmetic makes a false match.
x = 0.33333333333333331
 d = 1.000000e+30 match
 d = 1.000000e+31: 0.000000 match
x = 0.33333333333334197
 d = 0.000000e+00 match
 d = 1.000000e+00: 0.000000 match

==== t.c
#include <stdio.h>
#include <math.h>

int
test(double x)
{
	double d = 1.0;
	double d0 = 0;

	fprintf(stderr, "x = %19.17f\n", x);
	while (d != d0)
	{
		double z = floor(d * 3);
		double z1 = z + 1.0;
		double y = d / z;
		double y1 = d / z1;

		/* Check if both sides of d * 3 don't make a match */
		if (y != x && y1 != x)
		{
			fprintf(stderr, " d = %e match\n", d0);
			fprintf(stderr, " d = %e: %f match\n", d);
			return 0;
		}
		d0 = d;
		d = d * 10;
	}
}

int
main(void)
{
	test(0.3333333333333333);
	test(0.333333333333342);
	return 0;
}
====

--
Kyotaro Horiguchi
NTT Open Source Software Center
Mmm.

At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>
> FWIW, I got the following result on my environment. It seems
> different enough if this holds on all supported platforms, though
> there still is a case where the result of a sequence of
> arithmetics makes false match.
>
> x = 0.33333333333333331
>  d = 1.000000e+30 match
>  d = 1.000000e+31: 0.000000 match

Of course the "match" in the last line above is a mistake for "NOT match".

--
Kyotaro Horiguchi
NTT Open Source Software Center
Sigh.. In the first place, the loop should not stop until the maximum exponent.
Sorry, but I don't have time now..
On Tue, Jan 8, 2019 at 16:43 Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Mmm.
>
> At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>
> > FWIW, I got the following result on my environment. It seems
> > different enough if this holds on all supported platforms, though
> > there still is a case where the result of a sequence of
> > arithmetics makes false match.
> >
> > x = 0.33333333333333331
> >  d = 1.000000e+30 match
> >  d = 1.000000e+31: 0.000000 match
>
> Of course the "match" in the last line above is a mistake for "NOT
> match".
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
At Tue, 08 Jan 2019 16:26:38 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20190108.162638.106314087.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello.
>
> At Fri, 21 Dec 2018 11:50:28 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <28533.1545411028@sss.pgh.pa.us>
> > seem that that's just moving the problem around, but I think it
> > might be possible to show that such a value couldn't be computed
> > by scalarltsel given a histogram with no more than 10000 members.
> > (I haven't tried to actually prove that, but it seems intuitive
> > that the set of possible results would be quantized with no more
> > than about 5 digits precision.)

I think we don't need a perfect proof for that. The fact that exactly 1/3 is quite natural and common, but 1/3 + ε is not, would be enough.

> FWIW, I got the following result in my environment. It seems
> different enough if this holds on all supported platforms, though
> there is still a case where the result of a sequence of
> arithmetic makes a false match.

A simple selectivity of a relation theoretically cannot match the epsilon value. (Of course, on *my* environment.)

(0.333..)
binary format: 3f d5 55 55 55 55 55 55
x = 0.333333333333333315
 231 matches, 79 no_matches

(0.3{13}42..)
binary format: 3f d5 55 55 55 55 55 f1
x = 0.333333333333341975
 0 matches, 310 no_matches

(0.3{15}42..)
binary format: 3f d5 55 55 55 55 55 57
x = 0.333333333333333426
 0 matches, 310 no_matches

It seems that 0.3{13}42 is correctly 0.3{15}42, which makes just a two-LSB difference from 1/3. I believe C's handling of such literals is well standardized.

Other DEFAULT_*_SELs are not compared in this way.

The attached small patch fixes the case mentioned in this thread, but I'm not sure where to put a test. A static assertion is not usable. An assertion somewhere, perhaps in clauselist_selectivity, seems somewhat overdone, and I can't find a suitable place in the regression tests.

regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center

#include <stdio.h>
#include <math.h>

int
test(double x)
{
	double d = 1.0;
	double d0 = 0;
	unsigned char *c_x = (unsigned char *) &x;
	int nmatches = 0;
	int nnomatches = 0;
	int i;

	fprintf(stderr, "binary format: ");
	for (i = 7; i >= 0; i--)
		fprintf(stderr, "%s%02x", i < 7 ? " " : "", c_x[i]);
	fprintf(stderr, "\n");
	fprintf(stderr, "x = %20.18f\n", x);

	while (d != d0)
	{
		double z = floor(d * 3);
		double z1 = z + 1.0;
		double y = d / z;
		double y1 = d / z1;

		/* Check if both sides of d * 3 make a match */
		if (y == x || y1 == x)
			nmatches++;
		else
			nnomatches++;
		d0 = d;
		d = d * 10;
	}
	fprintf(stderr, " %d matches, %d no_matches\n", nmatches, nnomatches);
}

int
main(void)
{
	test(0.3333333333333333);
	test(0.333333333333342);
	test(0.33333333333333342);
	return 0;
}

diff --git a/src/backend/optimizer/path/clausesel.c b/src/backend/optimizer/path/clausesel.c
index 3739b9817a..cdeaac22c8 100644
--- a/src/backend/optimizer/path/clausesel.c
+++ b/src/backend/optimizer/path/clausesel.c
@@ -109,7 +109,7 @@ clauselist_selectivity(PlannerInfo *root,
 	ListCell   *l;
 	int			listidx;
 
-	/* 
+	/*
 	 * If there's exactly one clause, just go directly to
 	 * clause_selectivity(). None of what we might do below is relevant.
 	 */
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 5cc4cf15e2..15a8d2402a 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -33,8 +33,12 @@
 /* default selectivity estimate for equalities such as "A = b" */
 #define DEFAULT_EQ_SEL	0.005
 
-/* default selectivity estimate for inequalities such as "A < b" */
-#define DEFAULT_INEQ_SEL  0.3333333333333333
+/*
+ * default selectivity estimate for inequalities such as "A < b"
+ * The last two digits prevent it from making a false match with 1/3 computed
+ * from histogram and/or MCV.
+ */
+#define DEFAULT_INEQ_SEL  0.33333333333333342
 
 /* default selectivity estimate for range inequalities  "A > b AND A < c" */
 #define DEFAULT_RANGE_INEQ_SEL	0.005
On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A smaller-footprint way to fix the immediate problem might be to
> change the values of DEFAULT_INEQ_SEL and friends so that they're
> even less likely to be matched by accident. That is, instead of
> 0.3333333333333333, use 0.333333333333342 or some such. It might
> seem that that's just moving the problem around, but I think it
> might be possible to show that such a value couldn't be computed
> by scalarltsel given a histogram with no more than 10000 members.
> (I haven't tried to actually prove that, but it seems intuitive
> that the set of possible results would be quantized with no more
> than about 5 digits precision.)

That's not a dumb idea, but it seems pretty unprincipled to me, and to be honest I'm kind of surprised that you're not proposing something cleaner. If you admit the need to distinguish between real estimates and last-ditch ones, why shouldn't we have an explicit representation of that, rather than trying to pick a last-ditch value that is unlikely to be confused with a real one?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A smaller-footprint way to fix the immediate problem might be to
>> change the values of DEFAULT_INEQ_SEL and friends so that they're
>> even less likely to be matched by accident. That is, instead of
>> 0.3333333333333333, use 0.333333333333342 or some such.

> That's not a dumb idea, but it seems pretty unprincipled to me, and to
> be honest I'm kind of surprised that you're not proposing something
> cleaner.

The problem is the invasiveness of such a change (large) vs the benefit (not so large). The upthread patch attempted to add a separate signaling path, but it was very incomplete --- and yet both I and Horiguchi-san thought it was already too messy.

Maybe at some point we'll go over to something reasonably principled, like adding confidence intervals to all selectivity estimates. That would be *really* invasive but perhaps would bring enough benefit to justify itself. But the current patch is just attempting to fix one extremely narrow pain point, and if that is all it's doing it should have a commensurately small footprint. So I don't think the submitted patch looks good from a cost/benefit standpoint.

			regards, tom lane
Hi,

Thanks for the comments, and I'm sorry for the late reply.

> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Friday, January 11, 2019 7:04 AM
>
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> A smaller-footprint way to fix the immediate problem might be to
> >> change the values of DEFAULT_INEQ_SEL and friends so that they're
> >> even less likely to be matched by accident. That is, instead of
> >> 0.3333333333333333, use 0.333333333333342 or some such.
>
> > That's not a dumb idea, but it seems pretty unprincipled to me, and to
> > be honest I'm kind of surprised that you're not proposing something
> > cleaner.
>
> The problem is the invasiveness of such a change (large) vs the benefit
> (not so large). The upthread patch attempted to add a separate signaling
> path, but it was very incomplete --- and yet both I and Horiguchi-san
> thought it was already too messy.
>
> Maybe at some point we'll go over to something reasonably principled,
> like adding confidence intervals to all selectivity estimates. That
> would be *really* invasive but perhaps would bring enough benefit to
> justify itself. But the current patch is just attempting to fix one
> extremely narrow pain point, and if that is all it's doing it should
> have a commensurately small footprint. So I don't think the submitted
> patch looks good from a cost/benefit standpoint.
>
Yes, I agree with you. Indeed the patch I attached is insufficient in cost-effectiveness. However, I want to solve cases where real estimates happen to be equal to the default values, such as this one, even though it's a narrow pain point. So I tried distinguishing explicitly between real estimates and the rest, as Robert said.

The idea Tom proposed and Horiguchi-san tried seems reasonable, but I'm concerned whether a range query really cannot match 0.333333333333342 (or some such) by accident in any environment. Is the check Horiguchi-san did enough to prove that?

Best regards,
Yuzuko Hosoya
NTT Open Source Software Center
Hi.

At Fri, 11 Jan 2019 11:36:47 +0900, "Yuzuko Hosoya" <hosoya.yuzuko@lab.ntt.co.jp> wrote in <000001d4a956$806a2ab0$813e8010$@lab.ntt.co.jp>
> Hi,
>
> Thanks for the comments, and I'm sorry for the late reply.
>
> > From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> > Sent: Friday, January 11, 2019 7:04 AM
> >
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > On Fri, Dec 21, 2018 at 11:50 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> A smaller-footprint way to fix the immediate problem might be to
> > >> change the values of DEFAULT_INEQ_SEL and friends so that they're
> > >> even less likely to be matched by accident. That is, instead of
> > >> 0.3333333333333333, use 0.333333333333342 or some such.
> >
> > > That's not a dumb idea, but it seems pretty unprincipled to me, and to
> > > be honest I'm kind of surprised that you're not proposing something
> > > cleaner.
> >
> > The problem is the invasiveness of such a change (large) vs the benefit
> > (not so large). The upthread patch attempted to add a separate signaling
> > path, but it was very incomplete --- and yet both I and Horiguchi-san
> > thought it was already too messy.
> >
> > Maybe at some point we'll go over to something reasonably principled,
> > like adding confidence intervals to all selectivity estimates. That
> > would be *really* invasive but perhaps would bring enough benefit to
> > justify itself. But the current patch is just attempting to fix one
> > extremely narrow pain point, and if that is all it's doing it should
> > have a commensurately small footprint. So I don't think the submitted
> > patch looks good from a cost/benefit standpoint.
> >
> Yes, I agree with you. Indeed the patch I attached is insufficient in
> cost-effectiveness. However, I want to solve cases where real estimates
> happen to be equal to the default values, such as this one, even though
> it's a narrow pain point. So I tried distinguishing explicitly between
> real estimates and the rest, as Robert said.
>
> The idea Tom proposed and Horiguchi-san tried seems reasonable, but I'm
> concerned whether a range query really cannot match 0.333333333333342
> (or some such) by accident in any environment. Is the check
> Horiguchi-san did enough to prove that?

It cannot be perfect in principle, but I think it is reliable enough. This is not principled at all, but it is very effective relative to its complexity, I think.

Actually, i/(i*3+1) for some powers of ten is:

               1/               4: binary format: 3f d0 00 00 00 00 00 00
              10/              31: binary format: 3f d4 a5 29 4a 52 94 a5
             100/             301: binary format: 3f d5 43 30 7a 78 c5 51
            1000/            3001: binary format: 3f d5 53 83 74 70 f1 95
           10000/           30001: binary format: 3f d5 55 26 bb 44 2b a8
          100000/          300001: binary format: 3f d5 55 50 ac 4a 74 6d
         1000000/         3000001: binary format: 3f d5 55 54 de 07 5a 96
        10000000/        30000001: binary format: 3f d5 55 55 49 67 22 6d
       100000000/       300000001: binary format: 3f d5 55 55 54 23 e9 d7
      1000000000/      3000000001: binary format: 3f d5 55 55 55 36 ca 95
     10000000000/     30000000001: binary format: 3f d5 55 55 55 52 47 75
    100000000000/    300000000001: binary format: 3f d5 55 55 55 55 07 25
   1000000000000/   3000000000001: binary format: 3f d5 55 55 55 55 4d 84
  10000000000000/  30000000000001: binary format: 3f d5 55 55 55 55 54 8d
 100000000000000/ 300000000000001: binary format: 3f d5 55 55 55 55 55 41
1000000000000000/3000000000000001: binary format: 3f d5 55 55 55 55 55 53

So, (as Tom said upthread) we can intuitively expect that we are safe up to roughly 10^10 or so for the simple cases.

# Of course, involving the histogram makes things complex, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center