Обсуждение: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions
EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions
От
Merlin Moncure
Дата:
Hello hackers, Observe the following test case (apologies if this is a well understood problem): create temp table foo as select generate_series(1,1000000) id; create index on foo(id); create temp table bar as select id, id % 100000 = 0 as good from generate_series(1,1000000) id; create index on bar(good); analyze foo; analyze bar; explain analyze select * from foo where false or exists (select 1 from bar where good and foo.id = bar.id); -- A explain analyze select * from foo where exists (select 1 from bar where good and foo.id = bar.id); -- B These queries are trivially verified as identical but give very different plans. A gives QUERY PLAN ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────Seq Scanon foo (cost=0.00..4459425.00 rows=500000 width=4) (actual time=13.299..130.271 rows=10 loops=1) Filter: (alternatives: SubPlan 1 or hashed SubPlan 2) Rows Removed by Filter: 999990 SubPlan 1 -> Index Scan using bar_good_idx on bar (cost=0.42..4.45 rows=1 width=0) (never executed) Index Cond: (good = true) Filter: (good AND (foo.id = id)) SubPlan 2 -> Index Scan using bar_good_idx on bar bar_1 (cost=0.42..4.44 rows=1 width=4) (actual time=0.024..0.055 rows=10 loops=1) Index Cond: (good = true) Filter: goodPlanningtime: 0.103 msExecution time: 130.312 ms B gives QUERY PLAN ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────Nested Loop (cost=4.87..12.91 rows=1 width=4) (actual time=0.075..0.161 rows=10 loops=1) -> HashAggregate (cost=4.45..4.46 rows=1 width=4) (actual time=0.058..0.060 rows=10 loops=1) Group Key: bar.id -> Index Scan using bar_good_idx on bar (cost=0.42..4.44 rows=1 width=4) (actual time=0.018..0.045 rows=10 loops=1) Index Cond: (good = true) Filter: good -> Index Only Scan using foo_id_idx on foo (cost=0.42..8.44 rows=1 width=4) (actual time=0.009..0.009 rows=1 loops=10) Index Cond: (id = bar.id) Heap Fetches: 10Planningtime: 0.193 msExecution time: 0.187 ms This is a general problem to OR expressions while AND expressions will generally pass the optimization through. The 'old school' optimization approach is to rewrite the OR expressions to UNION ALL but this can have unpleasant downstream effects on the query in real world scenarios. The question is: can the one time filter logic be expanded such the first query can be functionally be written into the second one? This type of query happens a lot when trying to mix multiple different filtering expressions (a 'filter mode' if you will) in a single query based on a user supplied switch. Food for thought. merlin
On Tue, Jun 21, 2016 at 4:18 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > Observe the following test case (apologies if this is a well > understood problem): > > create temp table foo as select generate_series(1,1000000) id; > create index on foo(id); > > create temp table bar as select id, id % 100000 = 0 as good from > generate_series(1,1000000) id; > create index on bar(good); > > analyze foo; > analyze bar; > > explain analyze select * from foo where false or exists (select 1 from > bar where good and foo.id = bar.id); -- A > explain analyze select * from foo where exists (select 1 from bar > where good and foo.id = bar.id); -- B > > These queries are trivially verified as identical but give very different plans. Right. I suspect wouldn't be very hard to notice the special case of FALSE OR (SOMETHING THAT MIGHT NOT BE FALSE) but I'm not sure that's worth optimizing by itself. A more promising line of attack as it seems to me is to let the planner transform back and forth between this form for the query and the UNION form. Obviously, in this case, the WHERE false branch could then be pruned altogether, but there are lots of other cases where both branches survived. Tom's upper planner pathification stuff makes it much easier to think about how such an optimization might work, but I think it's still not particularly simple to get right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jun 21, 2016 at 4:18 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> explain analyze select * from foo where false or exists (select 1 from >> bar where good and foo.id = bar.id); -- A >> explain analyze select * from foo where exists (select 1 from bar >> where good and foo.id = bar.id); -- B >> >> These queries are trivially verified as identical but give very different plans. > Right. I suspect wouldn't be very hard to notice the special case of > FALSE OR (SOMETHING THAT MIGHT NOT BE FALSE) but I'm not sure that's > worth optimizing by itself. Constant-folding will get rid of the OR FALSE (as well as actually-useful variants of this example). The problem is that that doesn't happen till after we identify semijoins. So the second one gives you a semijoin plan and the first doesn't. This isn't especially easy to improve. Much of the value of doing constant-folding would disappear if we ran it before subquery pullup + join simplification, because in non-stupidly-written queries those are what expose the expression simplification opportunities. We could run it twice but that seems certain to be a dead loser most of the time. > A more promising line of attack as it > seems to me is to let the planner transform back and forth between > this form for the query and the UNION form. Maybe, but neither UNION nor UNION ALL would duplicate the semantics of OR, so there's some handwaving here that I missed. regards, tom lane
On Fri, Jul 1, 2016 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Jun 21, 2016 at 4:18 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >>> explain analyze select * from foo where false or exists (select 1 from >>> bar where good and foo.id = bar.id); -- A >>> explain analyze select * from foo where exists (select 1 from bar >>> where good and foo.id = bar.id); -- B >>> >>> These queries are trivially verified as identical but give very different plans. > >> Right. I suspect wouldn't be very hard to notice the special case of >> FALSE OR (SOMETHING THAT MIGHT NOT BE FALSE) but I'm not sure that's >> worth optimizing by itself. > > Constant-folding will get rid of the OR FALSE (as well as actually-useful > variants of this example). The problem is that that doesn't happen till > after we identify semijoins. So the second one gives you a semijoin plan > and the first doesn't. This isn't especially easy to improve. Much of > the value of doing constant-folding would disappear if we ran it before > subquery pullup + join simplification, because in non-stupidly-written > queries those are what expose the expression simplification opportunities. > We could run it twice but that seems certain to be a dead loser most of > the time. > >> A more promising line of attack as it >> seems to me is to let the planner transform back and forth between >> this form for the query and the UNION form. > > Maybe, but neither UNION nor UNION ALL would duplicate the semantics > of OR, so there's some handwaving here that I missed. SELECT * FROM foo WHERE a = 5 OR a = 4 isn't equivalent to SELECT * FROM foo WHERE a = 5 UNION SELECT * FROM foo WHERE a = 4 ? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Jul 1, 2016 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Maybe, but neither UNION nor UNION ALL would duplicate the semantics >> of OR, so there's some handwaving here that I missed. > SELECT * FROM foo WHERE a = 5 OR a = 4 > isn't equivalent to > SELECT * FROM foo WHERE a = 5 > UNION > SELECT * FROM foo WHERE a = 4 > ? It probably is, but you're assuming that "a" appears in the list of columns being unioned. If you make that just "SELECT b FROM ..." then the latter form gets rid of duplicate b values where the first doesn't. On the other hand, UNION ALL might introduce duplicates not present in the OR query's result. regards, tom lane
Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions
От
Merlin Moncure
Дата:
On Fri, Jul 1, 2016 at 9:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Jul 1, 2016 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Maybe, but neither UNION nor UNION ALL would duplicate the semantics >>> of OR, so there's some handwaving here that I missed. > >> SELECT * FROM foo WHERE a = 5 OR a = 4 >> isn't equivalent to >> SELECT * FROM foo WHERE a = 5 >> UNION >> SELECT * FROM foo WHERE a = 4 >> ? > > It probably is, but you're assuming that "a" appears in the list of > columns being unioned. If you make that just "SELECT b FROM ..." > then the latter form gets rid of duplicate b values where the first > doesn't. On the other hand, UNION ALL might introduce duplicates > not present in the OR query's result. Yeah. Also, even if you could parse out those cases, it's major optimization fence. Consider if you have an ORDER BY clause here: SELECT FROM foo WHERE a OR b ORDER BY c; ... by pushing inside a union, you're going to be in trouble in real world cases. That's just a mess and it would add a lot of runtime analysis of the alternative paths. It's hard for me to believe rewriting is easier and simpler than rewriting 'false OR x' to 'x'. I also thing that constant folding strategies are going to render much more sensible output to EXPLAIN. FYI, The query is something along the lines of SELECT * FROM foo WHERE ('a' = 'a' AND EXISTS ...) OR ('a' = 'b' AND EXISTS ...) OR ('a' = 'c' AND EXISTS ...) ...where the left side of the equality is a parameterized 'filter mode' flag. That way the query can introduce filtering behaviors without doing dynamic acrobatics. merlin
On Fri, Jul 1, 2016 at 10:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Jul 1, 2016 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Maybe, but neither UNION nor UNION ALL would duplicate the semantics >>> of OR, so there's some handwaving here that I missed. > >> SELECT * FROM foo WHERE a = 5 OR a = 4 >> isn't equivalent to >> SELECT * FROM foo WHERE a = 5 >> UNION >> SELECT * FROM foo WHERE a = 4 >> ? > > It probably is, but you're assuming that "a" appears in the list of > columns being unioned. If you make that just "SELECT b FROM ..." > then the latter form gets rid of duplicate b values where the first > doesn't. On the other hand, UNION ALL might introduce duplicates > not present in the OR query's result. Right, so, significant query transformations are non-trivial. But the point is that with the upper planification stuff, I think it is possible, at least in some cases, that we could consider reordering set operations with scan/join planning, just as we've previously talked about reordering grouping stages relative to scan/join planning. The details are undeniably hard to get right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jul 1, 2016 at 10:20 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > Yeah. Also, even if you could parse out those cases, it's major > optimization fence. Consider if you have an ORDER BY clause here: > > SELECT FROM foo WHERE a OR b ORDER BY c; > > ... by pushing inside a union, you're going to be in trouble in real > world cases. That's just a mess and it would add a lot of runtime > analysis of the alternative paths. It's hard for me to believe > rewriting is easier and simpler than rewriting 'false OR x' to 'x'. I > also thing that constant folding strategies are going to render much > more sensible output to EXPLAIN. I don't think that it's easier and simpler and didn't intend to say otherwise. I do think that I've run across LOTS of queries over the years where rewriting OR using UNION ALL was a lot faster, and I think that case is more likely to occur in practice than FALSE OR WHATEVER. But, I'm just throwing out opinions to see what sticks here; I'm not deeply invested in this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions
От
Merlin Moncure
Дата:
On Fri, Jul 1, 2016 at 10:27 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jul 1, 2016 at 10:20 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> Yeah. Also, even if you could parse out those cases, it's major >> optimization fence. Consider if you have an ORDER BY clause here: >> >> SELECT FROM foo WHERE a OR b ORDER BY c; >> >> ... by pushing inside a union, you're going to be in trouble in real >> world cases. That's just a mess and it would add a lot of runtime >> analysis of the alternative paths. It's hard for me to believe >> rewriting is easier and simpler than rewriting 'false OR x' to 'x'. I >> also thing that constant folding strategies are going to render much >> more sensible output to EXPLAIN. > > I don't think that it's easier and simpler and didn't intend to say > otherwise. I do think that I've run across LOTS of queries over the > years where rewriting OR using UNION ALL was a lot faster, and I think > that case is more likely to occur in practice than FALSE OR WHATEVER. > But, I'm just throwing out opinions to see what sticks here; I'm not > deeply invested in this. Sure (I didn't put you on that position, just thinking out loud). The problem with UNION ALL is that it's only safe to do so when you know for sure the both sides of the partition are non-overlapping. The author of the query often knows this going in but for the planner it's not so simple to figure out in many cases. If there's a subset of cases. UNION sans ALL is probably a dead end on performance grounds. This hinges on Tom's earlier statements, "Much of the value of doing constant-folding would disappear if we ran it before subquery pullup + join simplification, because in non-stupidly-written queries those are what expose the expression simplification opportunities." and, especially, "We could run it twice but that seems certain to be a dead loser most of the time." It's pretty easy to craft a query where you're on the winning side, but what's the worst case of doing two pass...is constant folding a non trivial fraction of planning time? merlin
Tom, all, * Tom Lane (tgl@sss.pgh.pa.us) wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Tue, Jun 21, 2016 at 4:18 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > >> explain analyze select * from foo where false or exists (select 1 from > >> bar where good and foo.id = bar.id); -- A > >> explain analyze select * from foo where exists (select 1 from bar > >> where good and foo.id = bar.id); -- B > >> > >> These queries are trivially verified as identical but give very different plans. > > > Right. I suspect wouldn't be very hard to notice the special case of > > FALSE OR (SOMETHING THAT MIGHT NOT BE FALSE) but I'm not sure that's > > worth optimizing by itself. > > Constant-folding will get rid of the OR FALSE (as well as actually-useful > variants of this example). The problem is that that doesn't happen till > after we identify semijoins. So the second one gives you a semijoin plan > and the first doesn't. This isn't especially easy to improve. Much of > the value of doing constant-folding would disappear if we ran it before > subquery pullup + join simplification, because in non-stupidly-written > queries those are what expose the expression simplification opportunities. > We could run it twice but that seems certain to be a dead loser most of > the time. While it might be a loser most of the time to run it twice, I have to agree that it's pretty unfortunate that we don't handle this case in a more sane way. I looked a bit into pull_up_sublinks() and it doens't look like there's an easy way to realize this case there without going through the full effort of constant-folding. One approach that I'm wondering about is to do constant folding first and then track if we introduce a case where additional constant folding might help and only perform it again in those cases. Thanks! Stephen
On Fri, Jul 1, 2016 at 12:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > Sure (I didn't put you on that position, just thinking out loud). The > problem with UNION ALL is that it's only safe to do so when you know > for sure the both sides of the partition are non-overlapping. The > author of the query often knows this going in but for the planner it's > not so simple to figure out in many cases. If there's a subset of > cases. UNION sans ALL is probably a dead end on performance grounds. I'm not sure about that. It's certainly true that things are much more likely to work out when you can prove that UNION ALL is sufficient, because now you avoid de-duplication. But if the number of output rows is really small, it might work out anyway. I mean, consider this: SELECT * FROM enormous WHERE rarely_one = 1 OR EXISTS (SELECT 1 FROM tiny WHERE tiny.x = enormous.x) As written, you're not going to be able to answer this query without scanning a full scan of the enormous table. If you rewrite it to use UNION, then the first half can be implemented with an index scan or a bitmap index scan, and the second half can be implemented with a nested loop over the tiny table with an inner index scan on the enormous table. The fact that you have to deduplicate the results may be a small price to pay for avoiding an enormous scan. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions
От
Alvaro Herrera
Дата:
Merlin Moncure wrote: > It's pretty easy to craft a query where you're on the winning side, > but what's the worst case of doing two pass...is constant folding a > non trivial fraction of planning time? One thing that has been suggested is to re-examine the plan after planning is done, and if execution time is estimated to be large (FSVO), then run a second planning pass with more expensive optimizations enabled to try and find better plans. The guiding principle would be to continue to very quickly find good enough plans for frequent/small/simple queries, but spend more planning effort on more complex ones where execution is likely to take much longer than planning time. So doing constant-folding twice would be enabled for the second planning pass. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions
От
Merlin Moncure
Дата:
On Fri, Jul 1, 2016 at 11:45 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Merlin Moncure wrote: > >> It's pretty easy to craft a query where you're on the winning side, >> but what's the worst case of doing two pass...is constant folding a >> non trivial fraction of planning time? > > One thing that has been suggested is to re-examine the plan after > planning is done, and if execution time is estimated to be large (FSVO), > then run a second planning pass with more expensive optimizations > enabled to try and find better plans. The guiding principle would be to > continue to very quickly find good enough plans for > frequent/small/simple queries, but spend more planning effort on more > complex ones where execution is likely to take much longer than planning > time. > > So doing constant-folding twice would be enabled for the second planning > pass. I like this idea. Maybe a GUC controlling the cost based cutoff (with 0 meaning, "assume the worst and plan the hard way first"). merlin