Discussion: Weird, bad 0.5% selectivity estimate for a column equal to itself
Folks,

I'm getting something really odd in 9.2.4, where the planner estimates
that the selectivity of a column equal to itself is always exactly 0.5%
(i.e. 0.005X).  I can't figure out where this constant is coming from,
or why it's being applied.

Test case:

create table esttest (
    id int not null primary key,
    state1 int not null default 0,
    state2 int not null default 0,
    state3 int not null default 0
);

insert into esttest (id, state1, state2, state3)
select i,
    (random()*3)::int,
    (random())::int,
    (random()*100)::int
from generate_series (1, 20000) as gs(i);

vacuum analyze esttest;

explain analyze
select * from esttest
where state1 = state1;

explain analyze
select * from esttest
where state2 = state2;

explain analyze
select * from esttest
where state3 = state3;

Results of test case:

badestimate=# explain analyze
badestimate-# select * from esttest
badestimate-# where state1 = state1;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on esttest  (cost=0.00..359.00 rows=100 width=16) (actual time=0.009..4.145 rows=20000 loops=1)
   Filter: (state1 = state1)
 Total runtime: 5.572 ms
(3 rows)

badestimate=#
badestimate=# explain analyze
badestimate-# select * from esttest
badestimate-# where state2 = state2;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on esttest  (cost=0.00..359.00 rows=100 width=16) (actual time=0.006..4.166 rows=20000 loops=1)
   Filter: (state2 = state2)
 Total runtime: 5.595 ms
(3 rows)

badestimate=#
badestimate=# explain analyze
badestimate-# select * from esttest
badestimate-# where state3 = state3;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on esttest  (cost=0.00..359.00 rows=100 width=16) (actual time=0.005..4.298 rows=20000 loops=1)
   Filter: (state3 = state3)
 Total runtime: 5.716 ms
(3 rows)

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> I'm getting something really odd in 9.2.4, where the planner estimates
> that the selectivity of a column equal to itself is always exactly 0.5%
> (i.e. 0.005X).  I can't figure out where this constant is coming from,
> or why it's being applied.

See DEFAULT_EQ_SEL.  But why exactly do you care?  Surely it's a stupid
query and you should fix it.

			regards, tom lane
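For reference, DEFAULT_EQ_SEL is defined as 0.005 in src/include/utils/selfuncs.h, which lines up exactly with the rows=100 estimates in the plans above. A minimal sketch of the arithmetic (illustrative Python, not planner code):

```python
# Illustrative arithmetic only, not planner code.  DEFAULT_EQ_SEL is the
# fallback selectivity PostgreSQL uses for "=" when it has no better
# information (0.005 in src/include/utils/selfuncs.h).
DEFAULT_EQ_SEL = 0.005

def estimated_rows(reltuples, selectivity):
    # The planner clamps row estimates to at least 1 (clamp_row_est).
    return max(1, round(reltuples * selectivity))

# 20000 rows in esttest * 0.005 -> the rows=100 seen in every plan above.
print(estimated_rows(20000, DEFAULT_EQ_SEL))
```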
On 06/21/2013 02:32 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> I'm getting something really odd in 9.2.4, where the planner estimates
>> that the selectivity of a column equal to itself is always exactly 0.5%
>> (i.e. 0.005X).  I can't figure out where this constant is coming from,
>> or why it's being applied.
>
> See DEFAULT_EQ_SEL.

Why is it using that?  We have statistics on the column.  What reason
would it have for using a default estimate?

> But why exactly do you care?  Surely it's a stupid
> query and you should fix it.

(a) That test case is a substantial simplification of a much more
complex query, one which exhibits actual execution time issues because
of this selectivity bug.

(b) That query is also auto-generated by external software, so "just fix
it" isn't as easy as it sounds.

(c) PostgreSQL ought to perform well even on the stupid queries.

Obviously, we're going to code around this for the existing software,
but this is an example of a planner bug which should be on the fix list.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> On 06/21/2013 02:32 PM, Tom Lane wrote:
>> See DEFAULT_EQ_SEL.

> Why is it using that?  We have statistics on the column.  What reason
> would it have for using a default estimate?

The stats are generally consulted for "Var Op Constant" scenarios.
It doesn't know what to do with "Var Op Var" cases that aren't joins.
As long as we lack cross-column-correlation stats I doubt it'd be very
helpful to try to derive a stats-based number for such cases.  Of
course, "X = X" is a special case, but ...

>> But why exactly do you care?  Surely it's a stupid
>> query and you should fix it.

> (b) that query is also auto-generated by external software, so "just fix
> it" isn't as easy as it sounds.

Personally, I'll bet lunch that that external software is outright
broken, ie it probably thinks "X = X" is constant true and they found
they could save two lines of code and a few machine cycles by emitting
that rather than not emitting anything.  Of course, the amount of
parsing/planning time wasted in dealing with the useless-and-incorrect
clause exceeds what was saved by multiple orders of magnitude, but hey
it was easy.

It wouldn't take too much new code to get the planner to replace "X = X"
with "X IS NOT NULL", but I think we're probably fixing the wrong piece
of software if we do.

			regards, tom lane
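The "X IS NOT NULL" rewrite mentioned here would let the planner estimate from the column's null fraction, which per-column statistics do track (pg_statistic.stanullfrac). A rough sketch of why that estimate would be exact for the test case (illustrative Python):

```python
# Sketch of the estimate an "X = X" -> "X IS NOT NULL" rewrite would
# enable (illustrative; the planner would read the null fraction from
# pg_statistic.stanullfrac rather than take it as a parameter).
def is_not_null_selectivity(nullfrac):
    # "X IS NOT NULL" is true for exactly the rows whose X isn't null.
    return 1.0 - nullfrac

# The esttest columns are declared NOT NULL, so nullfrac is 0 and the
# estimate becomes all 20000 rows instead of the default-driven 100.
print(round(20000 * is_not_null_selectivity(0.0)))
```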
> Personally, I'll bet lunch that that external software is outright
> broken, ie it probably thinks "X = X" is constant true and they found
> they could save two lines of code and a few machine cycles by emitting
> that rather than not emitting anything.  Of course, the amount of
> parsing/planning time wasted in dealing with the useless-and-incorrect
> clause exceeds what was saved by multiple orders of magnitude, but hey
> it was easy.

Well, it was more in the form of:

tab1.x = COALESCE(tab2.y, tab1.x)

... which some programmer 8 years ago thought would be a cute shorthand
for:

tab1.x = tab2.y OR tab2.y IS NULL

Still stupid, sure, but when you're dealing with partly-third-party
legacy software which was ported from MSSQL (which has issues with "IS
NULL"), that's what you get.

> It wouldn't take too much new code to get the planner to replace "X = X"
> with "X IS NOT NULL", but I think we're probably fixing the wrong piece
> of software if we do.

Well, I'd be more satisfied with having a solution for:

WHERE tab1.x = tab1.y

... in general, even if it didn't have correlation stats.  Like, what's
preventing us from using the same selectivity logic we would on a join
for that?  It wouldn't be accurate for highly correlated columns (or for
colX = colX), but it would be a damn sight better than defaultsel.
Heck, even multiplying the two ndistincts together would be an
improvement ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
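The "multiply the ndistincts" fallback suggested here could be sketched as follows (a hypothetical estimator, not planner code; it assumes zero correlation and uniformly distributed values):

```python
# Hypothetical zero-correlation fallback for "WHERE tab1.x = tab1.y":
# treat the pair as two independent uniform draws, giving selectivity
# 1 / (ndistinct_x * ndistinct_y).  Illustrative only.
def naive_var_eq_var_sel(ndistinct_x, ndistinct_y):
    return 1.0 / (ndistinct_x * ndistinct_y)

DEFAULT_EQ_SEL = 0.005  # the planner's current flat fallback

# A 3-value column against a 10-value column: ~0.033, vs 0.005 today.
print(round(naive_var_eq_var_sel(3, 10), 3))
```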
Josh Berkus <josh@agliodbs.com> writes:
>> Personally, I'll bet lunch that that external software is outright
>> broken, ie it probably thinks "X = X" is constant true and they found
>> they could save two lines of code and a few machine cycles by emitting
>> that rather than not emitting anything.

> Well, it was more in the form of:
> tab1.x = COALESCE(tab2.y,tab1.x)

Hm.  I'm not following how you get from there to complaining about not
being smart about X = X, because that surely ain't the same.

> Well, I'd be more satisfied with having a solution for:
> WHERE tab1.x = tab1.y
> ... in general, even if it didn't have correlation stats.  Like, what's
> preventing us from using the same selectivity logic we would on a join
> for that?

It's a totally different case.  In the join case you expect that each
element of one table will be compared with each element of the other.
In the single-table case, that's exactly *not* what will happen, and
I don't see how you get to anything very useful without knowing
something about the value pairs that actually occur.  As a concrete
example, applying the join selectivity logic would certainly give a
completely wrong answer for X = X, unless there were only one value
occurring in the column.

			regards, tom lane
On 06/25/2013 06:41 PM, Tom Lane wrote:
>> Well, it was more in the form of:
>> tab1.x = COALESCE(tab2.y,tab1.x)
>
> Hm.  I'm not following how you get from there to complaining about not
> being smart about X = X, because that surely ain't the same.

Actually, it was dominated by defaultsel, since tab2.y had a nullfrac of
70%.  It took us a couple days of reducing the bad query plan to figure
out where the bad estimate was coming from.

The real estimate should have been 0.7 + ( est. tab2.y = tab1.x ), but
instead we were getting 0.005 + ( est. tab2.y = tab1.x ), which was
throwing the whole query plan way off ... with an execution time
difference of 900X.

> It's a totally different case.  In the join case you expect that each
> element of one table will be compared with each element of the other.
> In the single-table case, that's exactly *not* what will happen, and
> I don't see how you get to anything very useful without knowing
> something about the value pairs that actually occur.

Sure you can.  If you make the assumption that there is zero
correlation, then you can simply estimate the comparison as between two
random columns.  In the simplest approach, you would multiply the two
ndistincts, so that a column with 3 values would match a column with 10
values 0.033 of the time.

Now for a real estimator, we'd of course want to use the MCVs and the
histogram to calculate a better estimate; obviously our 3x10 table is
going to match 0% of the time if col1 is [1,2,3] and col2 contains
values from 1000 to 1100.  The MCVs would be particularly valuable here;
if the same MCV appears in both columns, we can multiply the
probabilities.

To me, this seems just like estimating on a foreign table match, only
simpler.  Of course, a coefficient of correlation would make it even
more accurate, but even without one we can arrive at much better
estimates than defaultsel.
> As a concrete
> example, applying the join selectivity logic would certainly give a
> completely wrong answer for X = X, unless there were only one value
> occurring in the column.

Yeah, I think we'll eventually need to special-case that one.  In the
particular case I ran across, though, using column match estimation
would have still yielded a better result than defaultsel.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
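The MCV-based refinement Josh describes might look roughly like this (hypothetical data structures, illustrative only; in the real planner, MCV lists and their frequencies live in pg_statistic):

```python
# Hypothetical sketch of the MCV-overlap idea: for each most-common
# value that appears in both columns, multiply its two frequencies
# (assuming the columns are independent), then sum over shared values.
def mcv_var_eq_var_sel(mcv_x, mcv_y):
    # mcv_x, mcv_y: dicts mapping value -> frequency within the column.
    return sum(freq_x * mcv_y[v] for v, freq_x in mcv_x.items() if v in mcv_y)

# Josh's example: col1 is [1,2,3], col2 covers 1000..1100 -> no shared
# values, so the estimate is 0%, where defaultsel would still say 0.5%.
col1 = {1: 1/3, 2: 1/3, 3: 1/3}
col2 = {v: 1/101 for v in range(1000, 1101)}
print(mcv_var_eq_var_sel(col1, col2))
```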