Thread: Weird, bad 0.5% selectivity estimate for a column equal to itself

Weird, bad 0.5% selectivity estimate for a column equal to itself

From:
Josh Berkus
Date:
Folks,

I'm getting something really odd in 9.2.4, where the planner estimates
that the selectivity of a column equal to itself is always exactly 0.5%
(i.e. 0.005X).  I can't figure out where this constant is coming from,
or why it's being applied.

Test case:

create table esttest (
    id int not null primary key,
    state1 int not null default 0,
    state2 int not null default 0,
    state3 int not null default 0
);

insert into esttest (id, state1, state2, state3)
select i,
    (random()*3)::int,
    (random())::int,
    (random()*100)::int
from generate_series (1, 20000)
    as gs(i);

vacuum analyze esttest;

explain analyze
select * from esttest
where state1 = state1;

explain analyze
select * from esttest
where state2 = state2;

explain analyze
select * from esttest
where state3 = state3;

Results of test case:

badestimate=# explain analyze
badestimate-# select * from esttest
badestimate-# where state1 = state1;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on esttest  (cost=0.00..359.00 rows=100 width=16) (actual time=0.009..4.145 rows=20000 loops=1)
   Filter: (state1 = state1)
 Total runtime: 5.572 ms
(3 rows)

badestimate=#
badestimate=# explain analyze
badestimate-# select * from esttest
badestimate-# where state2 = state2;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on esttest  (cost=0.00..359.00 rows=100 width=16) (actual time=0.006..4.166 rows=20000 loops=1)
   Filter: (state2 = state2)
 Total runtime: 5.595 ms
(3 rows)

badestimate=#
badestimate=# explain analyze
badestimate-# select * from esttest
badestimate-# where state3 = state3;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on esttest  (cost=0.00..359.00 rows=100 width=16) (actual time=0.005..4.298 rows=20000 loops=1)
   Filter: (state3 = state3)
 Total runtime: 5.716 ms
(3 rows)




--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Weird, bad 0.5% selectivity estimate for a column equal to itself

From:
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> I'm getting something really odd in 9.2.4, where the planner estimates
> that the selectivity of a column equal to itself is always exactly 0.5%
> (i.e. 0.005X).  I can't figure out where this constant is coming from,
> or why it's being applied.

See DEFAULT_EQ_SEL.  But why exactly do you care?  Surely it's a stupid
query and you should fix it.

            regards, tom lane
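
For reference, DEFAULT_EQ_SEL is the planner's hard-wired fallback of 0.005
for equality clauses it cannot estimate from statistics, and 20000 * 0.005
is exactly the rows=100 figure in the plans above.  As a minimal sketch of
the contrast on the same test table (the exact estimate will vary with the
random() data, so treat the expected number as approximate):

explain analyze
select * from esttest
where state1 = 2;
-- a comparison against a constant consults the column statistics, so the
-- row estimate should track the actual frequency of the value 2 rather
-- than the flat 0.5% default applied to "state1 = state1"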


Re: Weird, bad 0.5% selectivity estimate for a column equal to itself

From:
Josh Berkus
Date:
On 06/21/2013 02:32 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> I'm getting something really odd in 9.2.4, where the planner estimates
>> that the selectivity of a column equal to itself is always exactly 0.5%
>> (i.e. 0.005X).  I can't figure out where this constant is coming from,
>> or why it's being applied.
>
> See DEFAULT_EQ_SEL.

Why is it using that?  We have statistics on the column.  What reason
would it have for using a default estimate?

> But why exactly do you care?  Surely it's a stupid
> query and you should fix it.

(a) that test case is a substantial simplification of a much more complex
query, one which exhibits actual execution time issues because of this
selectivity bug.

(b) that query is also auto-generated by external software, so "just fix
it" isn't as easy as it sounds.

(c) PostgreSQL ought to perform well even on the stupid queries.

Obviously, we're going to code around this for the existing software,
but this is an example of a planner bug which should be on the fix list.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Weird, bad 0.5% selectivity estimate for a column equal to itself

From:
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> On 06/21/2013 02:32 PM, Tom Lane wrote:
>> See DEFAULT_EQ_SEL.

> Why is it using that?  We have statistics on the column.  What reason
> would it have for using a default estimate?

The stats are generally consulted for "Var Op Constant" scenarios.
It doesn't know what to do with "Var Op Var" cases that aren't joins.
As long as we lack cross-column-correlation stats I doubt it'd be very
helpful to try to derive a stats-based number for such cases.  Of
course, "X = X" is a special case, but ...
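
A minimal sketch of such a non-join "Var Op Var" case on the earlier test
table; based on the explanation above it should fall back to the same flat
default, since both sides reference the same relation:

explain analyze
select * from esttest
where state1 = state3;
-- expected: the same rows=100 (0.5%) estimate as the self-comparison,
-- because neither side is a pseudoconstant the planner can pair with the
-- per-column statistics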

>> But why exactly do you care?  Surely it's a stupid
>> query and you should fix it.

> (b) that query is also auto-generated by external software, so "just fix
> it" isn't as easy as it sounds.

Personally, I'll bet lunch that that external software is outright
broken, ie it probably thinks "X = X" is constant true and they found
they could save two lines of code and a few machine cycles by emitting
that rather than not emitting anything.  Of course, the amount of
parsing/planning time wasted in dealing with the useless-and-incorrect
clause exceeds what was saved by multiple orders of magnitude, but hey
it was easy.

It wouldn't take too much new code to get the planner to replace "X = X"
with "X IS NOT NULL", but I think we're probably fixing the wrong piece
of software if we do.

            regards, tom lane
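
A minimal sketch of that rewrite against the earlier test table; since the
esttest columns are declared NOT NULL, their null fraction is 0 and the
planner should estimate essentially all 20000 rows:

explain analyze
select * from esttest
where state1 is not null;
-- equivalent to "state1 = state1" for non-null values, but the planner can
-- estimate IS NOT NULL directly from the column's null fraction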


Re: Weird, bad 0.5% selectivity estimate for a column equal to itself

From:
Josh Berkus
Date:
> Personally, I'll bet lunch that that external software is outright
> broken, ie it probably thinks "X = X" is constant true and they found
> they could save two lines of code and a few machine cycles by emitting
> that rather than not emitting anything.  Of course, the amount of
> parsing/planning time wasted in dealing with the useless-and-incorrect
> clause exceeds what was saved by multiple orders of magnitude, but hey
> it was easy.

Well, it was more in the form of:

tab1.x = COALESCE(tab2.y,tab1.x)

... which some programmer 8 years ago thought would be a cute shorthand for:

tab1.x = tab2.y OR tab2.y IS NULL

Still stupid, sure, but when you're dealing with partly-third-party
legacy software which was ported from MSSQL (which has issues with "IS
NULL"), that's what you get.

> It wouldn't take too much new code to get the planner to replace "X = X"
> with "X IS NOT NULL", but I think we're probably fixing the wrong piece
> of software if we do.

Well, I'd be more satisfied with having a solution for:

WHERE tab1.x = tab1.y

... in general, even if it didn't have correlation stats.  Like, what's
preventing us from using the same selectivity logic we would on a join
for that?  It wouldn't be accurate for highly correlated columns (or for
colX = colX) but it would be a damn sight better than defaultsel. Heck,
even multiplying the two ndistincts together would be an improvement ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: Weird, bad 0.5% selectivity estimate for a column equal to itself

From:
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> Personally, I'll bet lunch that that external software is outright
>> broken, ie it probably thinks "X = X" is constant true and they found
>> they could save two lines of code and a few machine cycles by emitting
>> that rather than not emitting anything.

> Well, it was more in the form of:
> tab1.x = COALESCE(tab2.y,tab1.x)

Hm.  I'm not following how you get from there to complaining about not
being smart about X = X, because that surely ain't the same.

> Well, I'd be more satisfied with having a solution for:
> WHERE tab1.x = tab1.y
> ... in general, even if it didn't have correlation stats.  Like, what's
> preventing us from using the same selectivity logic we would on a join
> for that?

It's a totally different case.  In the join case you expect that each
element of one table will be compared with each element of the other.
In the single-table case, that's exactly *not* what will happen, and
I don't see how you get to anything very useful without knowing
something about the value pairs that actually occur.  As a concrete
example, applying the join selectivity logic would certainly give a
completely wrong answer for X = X, unless there were only one value
occurring in the column.

            regards, tom lane


Re: Weird, bad 0.5% selectivity estimate for a column equal to itself

From:
Josh Berkus
Date:
On 06/25/2013 06:41 PM, Tom Lane wrote:
>> Well, it was more in the form of:
>> tab1.x = COALESCE(tab2.y,tab1.x)
>
> Hm.  I'm not following how you get from there to complaining about not
> being smart about X = X, because that surely ain't the same.

Actually, it was dominated by defaultsel, since tab2.y had a nullfrac of
70%.  It took us a couple days of reducing the bad query plan to figure
out where the bad estimate was coming from.  The real estimate should
have been 0.7 + ( est. tab2.y = tab1.x ), but instead we were getting
0.005 + ( est. tab2.y = tab1.x ), which was throwing the whole query
plan way off ... with an execution time difference of 900X.
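
For anyone tracking down a similar misestimate, the null fraction the
planner is working from is visible in pg_stats (tab2 and y here are the
hypothetical names from the earlier mail):

select attname, null_frac, n_distinct
from pg_stats
where tablename = 'tab2' and attname = 'y';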

> It's a totally different case.  In the join case you expect that each
> element of one table will be compared with each element of the other.
> In the single-table case, that's exactly *not* what will happen, and
> I don't see how you get to anything very useful without knowing
> something about the value pairs that actually occur.

Sure you can.  If you make the assumption that there is 0 correlation,
then you can simply estimate the comparison as between two random
columns.  In the simplest approach, you would multiply the two
ndistincts, so that a column with 3 values would match a column with 10
values 0.033 of the time.
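
A rough sketch of that crude heuristic using the statistics ANALYZE already
gathers (both columns of the test table should have small positive
n_distinct values; in general, n_distinct is stored as a negative fraction
of the row count when the distinct count scales with the table, and would
need converting first):

select 1.0 / (s1.n_distinct * s2.n_distinct) as naive_selectivity
from pg_stats s1, pg_stats s2
where s1.tablename = 'esttest' and s1.attname = 'state1'
  and s2.tablename = 'esttest' and s2.attname = 'state3';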

Now for a real estimator, we'd of course want to use the MCVs and the
histogram to calculate a better estimation; obviously our 3X10 table is
going to match 0% of the time if col1 is [1,2,3] and col2 contains
values from 1000 to 1100.  The MCVs would be particularly valuable here;
if the same MCV appears in both columns, we can multiply the probabilities.

To me, this seems just like estimating a join match against another table,
only simpler.  Of course, a coefficient of correlation would make it even
more accurate, but even without one we can arrive at much better
estimates than defaultsel.
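
The raw material for that kind of MCV-overlap estimate is already collected;
for the test table it can be inspected with:

select attname, n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'esttest'
  and attname in ('state1', 'state3');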

> As a concrete
> example, applying the join selectivity logic would certainly give a
> completely wrong answer for X = X, unless there were only one value
> occurring in the column.

Yeah, I think we'll eventually need to special-case that one.  In the
particular case I ran across, though, using column match estimation
would have still yielded a better result than defaultsel.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com