Обсуждение: Overhead of union versus union all

Поиск
Список
Период
Сортировка

Overhead of union versus union all

От
Tim Keitt
Дата:
I am combining query results that I know are disjoint. I'm wondering
how much overhead there is in calling union versus union all. (Just
curious really; I can't see a reason not to use union all.) (cc me
please; not subscribed...)

THK

--
Timothy H. Keitt
http://www.keittlab.org/

Re: Overhead of union versus union all

От
Alvaro Herrera
Дата:
Tim Keitt wrote:
> I am combining query results that I know are disjoint. I'm wondering
> how much overhead there is in calling union versus union all. (Just
> curious really; I can't see a reason not to use union all.)

UNION needs to uniquify the output, for which it plasters an additional
sort step, whereas UNION ALL does not need to uniquify its output and
thus it can avoid the sort step.  Using UNION ALL is recommended
wherever possible.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Overhead of union versus union all

От
Adam Rich
Дата:
Tim Keitt wrote:
> I am combining query results that I know are disjoint. I'm wondering
> how much overhead there is in calling union versus union all. (Just
> curious really; I can't see a reason not to use union all.) (cc me
> please; not subscribed...)
>
> THK
>


I think you can test this one yourself pretty easily.  Just run the two
queries with "explain analyze".  Union All should run in about the sum
of the separate queries.  Plain Union will always be slower, because it
takes the same results from "union all" and runs them through an extra
sort/distinct or hash step.  In my tests, on a query with 600,000 rows,
the Plain Union took about 3x as long to complete.



Re: Overhead of union versus union all

От
Bruce Momjian
Дата:
Alvaro Herrera wrote:
> Tim Keitt wrote:
> > I am combining query results that I know are disjoint. I'm wondering
> > how much overhead there is in calling union versus union all. (Just
> > curious really; I can't see a reason not to use union all.)
>
> UNION needs to uniquify the output, for which it plasters an additional
> sort step, whereas UNION ALL does not need to uniquify its output and
> thus it can avoid the sort step.  Using UNION ALL is recommended
> wherever possible.

Yep, ideally UNION ALL would be the default behavior, but that standard
requires otherwise.  Many people don't know that UNION has an extra
SORT/UNIQUE step.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

От
Scott Bailey
Дата:
Alvaro Herrera wrote:
> Tim Keitt wrote:
>> I am combining query results that I know are disjoint. I'm wondering
>> how much overhead there is in calling union versus union all. (Just
>> curious really; I can't see a reason not to use union all.)
>
> UNION needs to uniquify the output, for which it plasters an additional
> sort step, whereas UNION ALL does not need to uniquify its output and
> thus it can avoid the sort step.  Using UNION ALL is recommended
> wherever possible.
>

I think I read somewhere that as of 8.4 it no longer required the sort
step, due to the improvements in hashing. Here it is

http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Re: Overhead of union versus union all

От
Bruce Momjian
Дата:
Scott Bailey wrote:
> Alvaro Herrera wrote:
> > Tim Keitt wrote:
> >> I am combining query results that I know are disjoint. I'm wondering
> >> how much overhead there is in calling union versus union all. (Just
> >> curious really; I can't see a reason not to use union all.)
> >
> > UNION needs to uniquify the output, for which it plasters an additional
> > sort step, whereas UNION ALL does not need to uniquify its output and
> > thus it can avoid the sort step.  Using UNION ALL is recommended
> > wherever possible.
> >
>
> I think I read somewhere that as of 8.4 it no longer required the sort
> step, due to the improvements in hashing. Here it is
>
> http://wiki.postgresql.org/wiki/WhatsNew84#Performance

Oh, yea, hashing is used in some cases rather than sort.  I assume sort
is still used if the hash exceeds workmem size.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

От
Scott Marlowe
Дата:
On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:
> Scott Bailey wrote:
>> Alvaro Herrera wrote:
>> > Tim Keitt wrote:
>> >> I am combining query results that I know are disjoint. I'm wondering
>> >> how much overhead there is in calling union versus union all. (Just
>> >> curious really; I can't see a reason not to use union all.)
>> >
>> > UNION needs to uniquify the output, for which it plasters an additional
>> > sort step, whereas UNION ALL does not need to uniquify its output and
>> > thus it can avoid the sort step.  Using UNION ALL is recommended
>> > wherever possible.
>> >
>>
>> I think I read somewhere that as of 8.4 it no longer required the sort
>> step, due to the improvements in hashing. Here it is
>>
>> http://wiki.postgresql.org/wiki/WhatsNew84#Performance
>
> Oh, yea, hashing is used in some cases rather than sort.  I assume sort
> is still used if the hash exceeds workmem size.

The important point being that it's still more expensive than a plain
union all thought, right?

Re: Overhead of union versus union all

От
Bruce Momjian
Дата:
Scott Marlowe wrote:
> On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:
> > Scott Bailey wrote:
> >> Alvaro Herrera wrote:
> >> > Tim Keitt wrote:
> >> >> I am combining query results that I know are disjoint. I'm wondering
> >> >> how much overhead there is in calling union versus union all. (Just
> >> >> curious really; I can't see a reason not to use union all.)
> >> >
> >> > UNION needs to uniquify the output, for which it plasters an additional
> >> > sort step, whereas UNION ALL does not need to uniquify its output and
> >> > thus it can avoid the sort step. ?Using UNION ALL is recommended
> >> > wherever possible.
> >> >
> >>
> >> I think I read somewhere that as of 8.4 it no longer required the sort
> >> step, due to the improvements in hashing. Here it is
> >>
> >> http://wiki.postgresql.org/wiki/WhatsNew84#Performance
> >
> > Oh, yea, hashing is used in some cases rather than sort. ?I assume sort
> > is still used if the hash exceeds workmem size.
>
> The important point being that it's still more expensive than a plain
> union all thought, right?

Yep.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

От
Simon Riggs
Дата:
On Thu, 2009-07-09 at 20:41 -0600, Scott Marlowe wrote:
> On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:
> > Scott Bailey wrote:
> >> Alvaro Herrera wrote:
> >> > Tim Keitt wrote:
> >> >> I am combining query results that I know are disjoint. I'm wondering
> >> >> how much overhead there is in calling union versus union all. (Just
> >> >> curious really; I can't see a reason not to use union all.)
> >> >
> >> > UNION needs to uniquify the output, for which it plasters an additional
> >> > sort step, whereas UNION ALL does not need to uniquify its output and
> >> > thus it can avoid the sort step.  Using UNION ALL is recommended
> >> > wherever possible.
> >> >
> >> I think I read somewhere that as of 8.4 it no longer required the sort
> >> step, due to the improvements in hashing. Here it is
> >>
> >> http://wiki.postgresql.org/wiki/WhatsNew84#Performance
> >
> > Oh, yea, hashing is used in some cases rather than sort.  I assume sort
> > is still used if the hash exceeds workmem size.
>
> The important point being that it's still more expensive than a plain
> union all thought, right?

I think it should be possible to use predtest theorem proving to discard
the sort/hash step in cases where we can prove the sets are disjoint.
Often there are top-level quals that can be compared in the WHERE
clauses of the sub-queries, so a shallow search could be quite
profitable in allowing us to rewrite a UNION into a UNION ALL.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Overhead of union versus union all

От
Bruce Momjian
Дата:
Simon Riggs wrote:
>
> On Thu, 2009-07-09 at 20:41 -0600, Scott Marlowe wrote:
> > On Thu, Jul 9, 2009 at 7:58 PM, Bruce Momjian<bruce@momjian.us> wrote:
> > > Scott Bailey wrote:
> > >> Alvaro Herrera wrote:
> > >> > Tim Keitt wrote:
> > >> >> I am combining query results that I know are disjoint. I'm wondering
> > >> >> how much overhead there is in calling union versus union all. (Just
> > >> >> curious really; I can't see a reason not to use union all.)
> > >> >
> > >> > UNION needs to uniquify the output, for which it plasters an additional
> > >> > sort step, whereas UNION ALL does not need to uniquify its output and
> > >> > thus it can avoid the sort step.  Using UNION ALL is recommended
> > >> > wherever possible.
> > >> >
> > >> I think I read somewhere that as of 8.4 it no longer required the sort
> > >> step, due to the improvements in hashing. Here it is
> > >>
> > >> http://wiki.postgresql.org/wiki/WhatsNew84#Performance
> > >
> > > Oh, yea, hashing is used in some cases rather than sort.  I assume sort
> > > is still used if the hash exceeds workmem size.
> >
> > The important point being that it's still more expensive than a plain
> > union all thought, right?
>
> I think it should be possible to use predtest theorem proving to discard
> the sort/hash step in cases where we can prove the sets are disjoint.
> Often there are top-level quals that can be compared in the WHERE
> clauses of the sub-queries, so a shallow search could be quite
> profitable in allowing us to rewrite a UNION into a UNION ALL.

I assume we would still need the distinct removal step;  we just avoid
the sort/hash.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

От
Simon Riggs
Дата:
On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:

> > I think it should be possible to use predtest theorem proving to
> discard
> > the sort/hash step in cases where we can prove the sets are
> disjoint.
> > Often there are top-level quals that can be compared in the WHERE
> > clauses of the sub-queries, so a shallow search could be quite
> > profitable in allowing us to rewrite a UNION into a UNION ALL.
>
> I assume we would still need the distinct removal step;  we just avoid
> the sort/hash.

I mean it seems possible to prove that the distinct removal step is not
necessary, by proving that the various sub-queries are already disjoint.
It's a common manual optimization, so automating it seems a reasonable
future goal.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Overhead of union versus union all

От
Bruce Momjian
Дата:
Simon Riggs wrote:
>
> On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:
>
> > > I think it should be possible to use predtest theorem proving to
> > discard
> > > the sort/hash step in cases where we can prove the sets are
> > disjoint.
> > > Often there are top-level quals that can be compared in the WHERE
> > > clauses of the sub-queries, so a shallow search could be quite
> > > profitable in allowing us to rewrite a UNION into a UNION ALL.
> >
> > I assume we would still need the distinct removal step;  we just avoid
> > the sort/hash.
>
> I mean it seems possible to prove that the distinct removal step is not
> necessary, by proving that the various sub-queries are already disjoint.
> It's a common manual optimization, so automating it seems a reasonable
> future goal.

I am confused what sub-queries produce _distinct_ output.  I know there
are some that produce _ordered_ output.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

От
Simon Riggs
Дата:
On Fri, 2009-07-10 at 09:28 -0400, Bruce Momjian wrote:
> Simon Riggs wrote:
> >
> > On Fri, 2009-07-10 at 08:59 -0400, Bruce Momjian wrote:
> >
> > > > I think it should be possible to use predtest theorem proving to
> > > discard
> > > > the sort/hash step in cases where we can prove the sets are
> > > disjoint.
> > > > Often there are top-level quals that can be compared in the WHERE
> > > > clauses of the sub-queries, so a shallow search could be quite
> > > > profitable in allowing us to rewrite a UNION into a UNION ALL.
> > >
> > > I assume we would still need the distinct removal step;  we just avoid
> > > the sort/hash.
> >
> > I mean it seems possible to prove that the distinct removal step is not
> > necessary, by proving that the various sub-queries are already disjoint.
> > It's a common manual optimization, so automating it seems a reasonable
> > future goal.
>
> I am confused what sub-queries produce _distinct_ output.  I know there
> are some that produce _ordered_ output.

None, that was not my point.

If you have a query like this

 Select ..., status, ...
 ...
 where status = '1'
 union
 Select ..., status, ...
 ...
 where status = '2';

or a query like this

 Select '1', ...
 ...
 union
 Select '2', ...
 ...
 ;

or a query like this

 Select '1', ...
 ...
 union
 Select status, ...
 ...
 where status != '1';
 ;

then it is clear that we could automatically prove that the the distinct
step is redundant and so we could either hash or sort. This is the same
as replacing the UNION with UNION ALL.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Overhead of union versus union all

От
Bruce Momjian
Дата:
Simon Riggs wrote:
> or a query like this
>
>  Select '1', ...
>  ...
>  union
>  Select status, ...
>  ...
>  where status != '1';
>  ;
>
> then it is clear that we could automatically prove that the the distinct
> step is redundant and so we could either hash or sort. This is the same
> as replacing the UNION with UNION ALL.

In the last example, how do you know that status != '1' produces unique
output?  I assumed UNION gave distinct for the entire output, not just
remove duplicates from the two UNION branches;  that's how Postgres
behaves now:

    test=> SELECT 1 UNION (SELECT 2 UNION ALL SELECT 2);
     ?column?
    ----------
            1
            2
    (2 rows)

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Overhead of union versus union all

От
Simon Riggs
Дата:
On Fri, 2009-07-10 at 09:46 -0400, Bruce Momjian wrote:
> Simon Riggs wrote:
> > or a query like this
> >
> >  Select '1', ...
> >  ...
> >  union
> >  Select status, ...
> >  ...
> >  where status != '1';
> >  ;
> >
> > then it is clear that we could automatically prove that the the distinct
> > step is redundant and so we could either hash or sort. This is the same
> > as replacing the UNION with UNION ALL.
>
> In the last example, how do you know that status != '1' produces unique
> output?

You don't. I was assuming that you could already prove that each
subquery was distinct in itself.

It's one for the TODO, that's all. I see it often, but I'm not planning
to work on the code for this myself.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support


Re: Overhead of union versus union all

От
Jeff Davis
Дата:
On Fri, 2009-07-10 at 14:22 +0100, Simon Riggs wrote:
> I mean it seems possible to prove that the distinct removal step is not
> necessary, by proving that the various sub-queries are already disjoint.
> It's a common manual optimization, so automating it seems a reasonable
> future goal.

There are even simpler cases that postgresql can't optimize. Consider:

-- foo has a primary key
SELECT * FROM foo UNION SELECT * FROM foo;

That's logically equivalent to:

SELECT * FROM foo;

But postgresql will add a sort anyway.

There are lots of optimizations along these lines. They seem obscure,
but these optimizations become much more useful when using views or
complex queries where the same table appears multiple times. For
instance, if you have two views that are projections of the same table,
then, you join the views together, you can optimize away the join in
some cases, and just scan the original table.

I think a lot of these optimizations depend on knowing which tables (or
subqueries) are relations in the relational theory sense; i.e.
unordered, distinct, and have no NULLs in the relevant attributes.

Regards,
    Jeff Davis


Re: Overhead of union versus union all

От
Greg Stark
Дата:
On Fri, Jul 10, 2009 at 6:37 PM, Jeff Davis<pgsql@j-davis.com> wrote:
>
> -- foo has a primary key
> SELECT * FROM foo UNION SELECT * FROM foo;
>
> That's logically equivalent to:
>
> SELECT * FROM foo;
>
> But postgresql will add a sort anyway.


Well no, it's equivalent to SELECT DISTINCT * FROM foo;


--
greg
http://mit.edu/~gsstark/resume.pdf

Re: Overhead of union versus union all

От
Jeff Davis
Дата:
On Fri, 2009-07-10 at 18:47 +0100, Greg Stark wrote:
> > -- foo has a primary key

> Well no, it's equivalent to SELECT DISTINCT * FROM foo;

I think you missed that "foo" has a primary key.

Regards,
    Jeff Davis


Re: Overhead of union versus union all

От
Scott Marlowe
Дата:
On Fri, Jul 10, 2009 at 11:47 AM, Greg Stark<gsstark@mit.edu> wrote:
> On Fri, Jul 10, 2009 at 6:37 PM, Jeff Davis<pgsql@j-davis.com> wrote:
>>
>> -- foo has a primary key
>> SELECT * FROM foo UNION SELECT * FROM foo;
>>
>> That's logically equivalent to:
>>
>> SELECT * FROM foo;
>>
>> But postgresql will add a sort anyway.
>
>
> Well no, it's equivalent to SELECT DISTINCT * FROM foo;

And honestly, I'd rather see development effort go into making complex
queries run faster (the things like bitmap indexes on disk etc) rather
than optimising things that i can optimise myself.