Обсуждение: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

Поиск
Список
Период
Сортировка

EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

От
Merlin Moncure
Дата:
Hello hackers,

Observe the following test case (apologies if this is a well
understood problem):

create temp table foo as select generate_series(1,1000000) id;
create index on foo(id);

create temp table bar as select id, id % 100000 = 0 as good from
generate_series(1,1000000) id;
create index on bar(good);

analyze foo;
analyze bar;

explain analyze select * from foo where false or exists (select 1 from
bar where good and foo.id = bar.id);  -- A
explain analyze select * from foo where exists (select 1 from bar
where good and foo.id = bar.id);  -- B

These queries are trivially verified as identical but give very different plans.
A gives                                                         QUERY PLAN

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────Seq
Scanon foo  (cost=0.00..4459425.00 rows=500000 width=4) (actual
 
time=13.299..130.271 rows=10 loops=1)  Filter: (alternatives: SubPlan 1 or hashed SubPlan 2)  Rows Removed by Filter:
999990 SubPlan 1    ->  Index Scan using bar_good_idx on bar  (cost=0.42..4.45 rows=1
 
width=0) (never executed)          Index Cond: (good = true)          Filter: (good AND (foo.id = id))  SubPlan 2    ->
Index Scan using bar_good_idx on bar bar_1  (cost=0.42..4.44
 
rows=1 width=4) (actual time=0.024..0.055 rows=10 loops=1)          Index Cond: (good = true)          Filter:
goodPlanningtime: 0.103 msExecution time: 130.312 ms
 

B gives                                                         QUERY PLAN

───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────Nested
Loop (cost=4.87..12.91 rows=1 width=4) (actual
 
time=0.075..0.161 rows=10 loops=1)  ->  HashAggregate  (cost=4.45..4.46 rows=1 width=4) (actual
time=0.058..0.060 rows=10 loops=1)        Group Key: bar.id        ->  Index Scan using bar_good_idx on bar
(cost=0.42..4.44
rows=1 width=4) (actual time=0.018..0.045 rows=10 loops=1)              Index Cond: (good = true)              Filter:
good ->  Index Only Scan using foo_id_idx on foo  (cost=0.42..8.44
 
rows=1 width=4) (actual time=0.009..0.009 rows=1 loops=10)        Index Cond: (id = bar.id)        Heap Fetches:
10Planningtime: 0.193 msExecution time: 0.187 ms
 

This is a general problem to OR expressions while AND expressions will
generally pass the optimization through.   The 'old school'
optimization approach is to rewrite the OR expressions to UNION ALL
but this can have unpleasant downstream effects on the query in real
world scenarios.  The question is: can the one time filter logic be
expanded such the first query can be functionally be written into the
second one?  This type of query happens a lot when trying to mix
multiple different filtering expressions (a 'filter mode' if you will)
in a single query based on a user supplied switch.  Food for thought.

merlin

Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

От
Robert Haas
Дата:
On Tue, Jun 21, 2016 at 4:18 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Observe the following test case (apologies if this is a well
> understood problem):
>
> create temp table foo as select generate_series(1,1000000) id;
> create index on foo(id);
>
> create temp table bar as select id, id % 100000 = 0 as good from
> generate_series(1,1000000) id;
> create index on bar(good);
>
> analyze foo;
> analyze bar;
>
> explain analyze select * from foo where false or exists (select 1 from
> bar where good and foo.id = bar.id);  -- A
> explain analyze select * from foo where exists (select 1 from bar
> where good and foo.id = bar.id);  -- B
>
> These queries are trivially verified as identical but give very different plans.

Right.  I suspect wouldn't be very hard to notice the special case of
FALSE OR (SOMETHING THAT MIGHT NOT BE FALSE) but I'm not sure that's
worth optimizing by itself.  A more promising line of attack as it
seems to me is to let the planner transform back and forth between
this form for the query and the UNION form.  Obviously, in this case,
the WHERE false branch could then be pruned altogether, but there are
lots of other cases where both branches survived.  Tom's upper planner
pathification stuff makes it much easier to think about how such an
optimization might work, but I think it's still not particularly
simple to get right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 21, 2016 at 4:18 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> explain analyze select * from foo where false or exists (select 1 from
>> bar where good and foo.id = bar.id);  -- A
>> explain analyze select * from foo where exists (select 1 from bar
>> where good and foo.id = bar.id);  -- B
>> 
>> These queries are trivially verified as identical but give very different plans.

> Right.  I suspect wouldn't be very hard to notice the special case of
> FALSE OR (SOMETHING THAT MIGHT NOT BE FALSE) but I'm not sure that's
> worth optimizing by itself.

Constant-folding will get rid of the OR FALSE (as well as actually-useful
variants of this example).  The problem is that that doesn't happen till
after we identify semijoins.  So the second one gives you a semijoin plan
and the first doesn't.  This isn't especially easy to improve.  Much of
the value of doing constant-folding would disappear if we ran it before
subquery pullup + join simplification, because in non-stupidly-written
queries those are what expose the expression simplification opportunities.
We could run it twice but that seems certain to be a dead loser most of
the time.

> A more promising line of attack as it
> seems to me is to let the planner transform back and forth between
> this form for the query and the UNION form.

Maybe, but neither UNION nor UNION ALL would duplicate the semantics
of OR, so there's some handwaving here that I missed.
        regards, tom lane



Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

От
Robert Haas
Дата:
On Fri, Jul 1, 2016 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jun 21, 2016 at 4:18 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> explain analyze select * from foo where false or exists (select 1 from
>>> bar where good and foo.id = bar.id);  -- A
>>> explain analyze select * from foo where exists (select 1 from bar
>>> where good and foo.id = bar.id);  -- B
>>>
>>> These queries are trivially verified as identical but give very different plans.
>
>> Right.  I suspect wouldn't be very hard to notice the special case of
>> FALSE OR (SOMETHING THAT MIGHT NOT BE FALSE) but I'm not sure that's
>> worth optimizing by itself.
>
> Constant-folding will get rid of the OR FALSE (as well as actually-useful
> variants of this example).  The problem is that that doesn't happen till
> after we identify semijoins.  So the second one gives you a semijoin plan
> and the first doesn't.  This isn't especially easy to improve.  Much of
> the value of doing constant-folding would disappear if we ran it before
> subquery pullup + join simplification, because in non-stupidly-written
> queries those are what expose the expression simplification opportunities.
> We could run it twice but that seems certain to be a dead loser most of
> the time.
>
>> A more promising line of attack as it
>> seems to me is to let the planner transform back and forth between
>> this form for the query and the UNION form.
>
> Maybe, but neither UNION nor UNION ALL would duplicate the semantics
> of OR, so there's some handwaving here that I missed.

SELECT * FROM foo WHERE a = 5 OR a = 4

isn't equivalent to

SELECT * FROM foo WHERE a = 5
UNION
SELECT * FROM foo WHERE a = 4

?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jul 1, 2016 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe, but neither UNION nor UNION ALL would duplicate the semantics
>> of OR, so there's some handwaving here that I missed.

> SELECT * FROM foo WHERE a = 5 OR a = 4
> isn't equivalent to
> SELECT * FROM foo WHERE a = 5
> UNION
> SELECT * FROM foo WHERE a = 4
> ?

It probably is, but you're assuming that "a" appears in the list of
columns being unioned.  If you make that just "SELECT b FROM ..."
then the latter form gets rid of duplicate b values where the first
doesn't.  On the other hand, UNION ALL might introduce duplicates
not present in the OR query's result.
        regards, tom lane



Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

От
Merlin Moncure
Дата:
On Fri, Jul 1, 2016 at 9:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jul 1, 2016 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Maybe, but neither UNION nor UNION ALL would duplicate the semantics
>>> of OR, so there's some handwaving here that I missed.
>
>> SELECT * FROM foo WHERE a = 5 OR a = 4
>> isn't equivalent to
>> SELECT * FROM foo WHERE a = 5
>> UNION
>> SELECT * FROM foo WHERE a = 4
>> ?
>
> It probably is, but you're assuming that "a" appears in the list of
> columns being unioned.  If you make that just "SELECT b FROM ..."
> then the latter form gets rid of duplicate b values where the first
> doesn't.  On the other hand, UNION ALL might introduce duplicates
> not present in the OR query's result.

Yeah.  Also, even if you could parse out those cases, it's major
optimization fence.  Consider if you have an ORDER BY clause here:

SELECT FROM foo WHERE a OR b ORDER BY c;

... by pushing inside a union, you're going to be in trouble in real
world cases.  That's just a mess and it would add a lot of runtime
analysis of the alternative paths.  It's hard for me to believe
rewriting is easier and simpler than rewriting 'false OR x' to 'x'.  I
also thing that constant folding strategies are going to render much
more sensible output to EXPLAIN.

FYI, The query is something along the lines of
SELECT * FROM foo
WHERE ('a' = 'a' AND EXISTS ...) OR ('a' = 'b' AND EXISTS ...) OR ('a' = 'c' AND EXISTS ...)

...where the left side of the equality is a parameterized 'filter
mode' flag.  That way the query can introduce filtering behaviors
without doing dynamic acrobatics.

merlin



Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

От
Robert Haas
Дата:
On Fri, Jul 1, 2016 at 10:11 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Jul 1, 2016 at 9:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Maybe, but neither UNION nor UNION ALL would duplicate the semantics
>>> of OR, so there's some handwaving here that I missed.
>
>> SELECT * FROM foo WHERE a = 5 OR a = 4
>> isn't equivalent to
>> SELECT * FROM foo WHERE a = 5
>> UNION
>> SELECT * FROM foo WHERE a = 4
>> ?
>
> It probably is, but you're assuming that "a" appears in the list of
> columns being unioned.  If you make that just "SELECT b FROM ..."
> then the latter form gets rid of duplicate b values where the first
> doesn't.  On the other hand, UNION ALL might introduce duplicates
> not present in the OR query's result.

Right, so, significant query transformations are non-trivial.  But the
point is that with the upper planification stuff, I think it is
possible, at least in some cases, that we could consider reordering
set operations with scan/join planning, just as we've previously
talked about reordering grouping stages relative to scan/join
planning.

The details are undeniably hard to get right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

От
Robert Haas
Дата:
On Fri, Jul 1, 2016 at 10:20 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Yeah.  Also, even if you could parse out those cases, it's major
> optimization fence.  Consider if you have an ORDER BY clause here:
>
> SELECT FROM foo WHERE a OR b ORDER BY c;
>
> ... by pushing inside a union, you're going to be in trouble in real
> world cases.  That's just a mess and it would add a lot of runtime
> analysis of the alternative paths.  It's hard for me to believe
> rewriting is easier and simpler than rewriting 'false OR x' to 'x'.  I
> also thing that constant folding strategies are going to render much
> more sensible output to EXPLAIN.

I don't think that it's easier and simpler and didn't intend to say
otherwise.  I do think that I've run across LOTS of queries over the
years where rewriting OR using UNION ALL was a lot faster, and I think
that case is more likely to occur in practice than FALSE OR WHATEVER.
But, I'm just throwing out opinions to see what sticks here; I'm not
deeply invested in this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

От
Merlin Moncure
Дата:
On Fri, Jul 1, 2016 at 10:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jul 1, 2016 at 10:20 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> Yeah.  Also, even if you could parse out those cases, it's major
>> optimization fence.  Consider if you have an ORDER BY clause here:
>>
>> SELECT FROM foo WHERE a OR b ORDER BY c;
>>
>> ... by pushing inside a union, you're going to be in trouble in real
>> world cases.  That's just a mess and it would add a lot of runtime
>> analysis of the alternative paths.  It's hard for me to believe
>> rewriting is easier and simpler than rewriting 'false OR x' to 'x'.  I
>> also thing that constant folding strategies are going to render much
>> more sensible output to EXPLAIN.
>
> I don't think that it's easier and simpler and didn't intend to say
> otherwise.  I do think that I've run across LOTS of queries over the
> years where rewriting OR using UNION ALL was a lot faster, and I think
> that case is more likely to occur in practice than FALSE OR WHATEVER.
> But, I'm just throwing out opinions to see what sticks here; I'm not
> deeply invested in this.

Sure (I didn't put you on that position, just thinking out loud).  The
problem with UNION ALL is that it's only safe to do so when you know
for sure the both sides of the partition are non-overlapping.  The
author of the query often knows this going in but for the planner it's
not so simple to figure out in many cases.  If there's a subset of
cases.   UNION sans ALL is probably a dead end on performance grounds.

This hinges on Tom's earlier statements, "Much of
the value of doing constant-folding would disappear if we ran it before
subquery pullup + join simplification, because in non-stupidly-written
queries those are what expose the expression simplification opportunities."

and, especially, "We could run it twice but that seems certain to be a
dead loser most of
the time."

It's pretty easy to craft a query where you're on the winning side,
but what's the worst case of doing two pass...is constant folding a
non trivial fraction of planning time?

merlin



Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

От
Stephen Frost
Дата:
Tom, all,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Jun 21, 2016 at 4:18 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> >> explain analyze select * from foo where false or exists (select 1 from
> >> bar where good and foo.id = bar.id);  -- A
> >> explain analyze select * from foo where exists (select 1 from bar
> >> where good and foo.id = bar.id);  -- B
> >>
> >> These queries are trivially verified as identical but give very different plans.
>
> > Right.  I suspect wouldn't be very hard to notice the special case of
> > FALSE OR (SOMETHING THAT MIGHT NOT BE FALSE) but I'm not sure that's
> > worth optimizing by itself.
>
> Constant-folding will get rid of the OR FALSE (as well as actually-useful
> variants of this example).  The problem is that that doesn't happen till
> after we identify semijoins.  So the second one gives you a semijoin plan
> and the first doesn't.  This isn't especially easy to improve.  Much of
> the value of doing constant-folding would disappear if we ran it before
> subquery pullup + join simplification, because in non-stupidly-written
> queries those are what expose the expression simplification opportunities.
> We could run it twice but that seems certain to be a dead loser most of
> the time.

While it might be a loser most of the time to run it twice, I have to
agree that it's pretty unfortunate that we don't handle this case in a
more sane way.  I looked a bit into pull_up_sublinks() and it doens't
look like there's an easy way to realize this case there without going
through the full effort of constant-folding.

One approach that I'm wondering about is to do constant folding first
and then track if we introduce a case where additional constant folding
might help and only perform it again in those cases.

Thanks!

Stephen

Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

От
Robert Haas
Дата:
On Fri, Jul 1, 2016 at 12:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Sure (I didn't put you on that position, just thinking out loud).  The
> problem with UNION ALL is that it's only safe to do so when you know
> for sure the both sides of the partition are non-overlapping.  The
> author of the query often knows this going in but for the planner it's
> not so simple to figure out in many cases.  If there's a subset of
> cases.   UNION sans ALL is probably a dead end on performance grounds.

I'm not sure about that.  It's certainly true that things are much
more likely to work out when you can prove that UNION ALL is
sufficient, because now you avoid de-duplication.  But if the number
of output rows is really small, it might work out anyway.  I mean,
consider this:

SELECT * FROM enormous WHERE rarely_one = 1 OR EXISTS (SELECT 1 FROM
tiny WHERE tiny.x = enormous.x)

As written, you're not going to be able to answer this query without
scanning a full scan of the enormous table.  If you rewrite it to use
UNION, then the first half can be implemented with an index scan or a
bitmap index scan, and the second half can be implemented with a
nested loop over the tiny table with an inner index scan on the
enormous table.  The fact that you have to deduplicate the results may
be a small price to pay for avoiding an enormous scan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

От
Alvaro Herrera
Дата:
Merlin Moncure wrote:

> It's pretty easy to craft a query where you're on the winning side,
> but what's the worst case of doing two pass...is constant folding a
> non trivial fraction of planning time?

One thing that has been suggested is to re-examine the plan after
planning is done, and if execution time is estimated to be large (FSVO),
then run a second planning pass with more expensive optimizations
enabled to try and find better plans.  The guiding principle would be to
continue to very quickly find good enough plans for
frequent/small/simple queries, but spend more planning effort on more
complex ones where execution is likely to take much longer than planning
time.

So doing constant-folding twice would be enabled for the second planning
pass.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: EXISTS clauses not being optimized in the face of 'one time pass' optimizable expressions

От
Merlin Moncure
Дата:
On Fri, Jul 1, 2016 at 11:45 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Merlin Moncure wrote:
>
>> It's pretty easy to craft a query where you're on the winning side,
>> but what's the worst case of doing two pass...is constant folding a
>> non trivial fraction of planning time?
>
> One thing that has been suggested is to re-examine the plan after
> planning is done, and if execution time is estimated to be large (FSVO),
> then run a second planning pass with more expensive optimizations
> enabled to try and find better plans.  The guiding principle would be to
> continue to very quickly find good enough plans for
> frequent/small/simple queries, but spend more planning effort on more
> complex ones where execution is likely to take much longer than planning
> time.
>
> So doing constant-folding twice would be enabled for the second planning
> pass.

I like this idea.  Maybe a GUC controlling the cost based cutoff (with
0 meaning, "assume the worst and plan the hard way first").

merlin