Thread: Group by more efficient than distinct?

Group by more efficient than distinct?

From
Francisco Reyes
Date:
I am trying to get a distinct set of rows from 2 tables.
After looking at someone else's query I noticed they were doing a group by
to obtain the unique list.

After comparing on multiple machines with several tables, it seems using
group by to obtain a distinct list is substantially faster than using
select distinct.

Is there any disadvantage of using "group by" to obtain a unique list?

On a small dataset the difference was about 20 percent.

Group by
 HashAggregate  (cost=369.61..381.12 rows=1151 width=8) (actual
time=76.641..85.167 rows=2890 loops=1)

Distinct
 Unique  (cost=1088.23..1174.53 rows=1151 width=8) (actual
time=90.516..140.123 rows=2890 loops=1)

Although I don't have the numbers here with me, a similar result was
obtained against a query that would return 100,000 rows: a 20% and greater
speed difference between "group by" and "select distinct".
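
For reference, the two equivalent query shapes being compared would look
roughly like this (the table and column names below are made up for
illustration; the real queries are not shown in this thread):

-- hypothetical two-table example; names are placeholders
SELECT DISTINCT o.customer_id, o.order_date
FROM orders o
JOIN order_items i ON i.order_id = o.order_id;

-- the equivalent GROUP BY form
SELECT o.customer_id, o.order_date
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.customer_id, o.order_date;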

Re: Group by more efficient than distinct?

From
Thomas Pundt
Date:
On Friday, 18 April 2008, Francisco Reyes wrote:
| I am trying to get a distinct set of rows from 2 tables.
| After looking at someone else's query I noticed they were doing a group by
| to obtain the unique list.
|
| After comparing on multiple machines with several tables, it seems using
| group by to obtain a distinct list is substantially faster than using
| select distinct.
|
| Is there any disadvantage of using "group by" to obtain a unique list?

Searching the archives suggests that the code related to "group by" is much
newer than the one related to "distinct" and thus might benefit from more
optimization paths.

Ciao,
Thomas

--
Thomas Pundt <thomas.pundt@rp-online.de> ---- http://rp-online.de/ ----

Re: Group by more efficient than distinct?

From
Gregory Stark
Date:
"Francisco Reyes" <lists@stringsutils.com> writes:

> Is there any disadvantage of using "group by" to obtain a unique list?
>
> On a small dataset the difference was about 20 percent.
>
> Group by
> HashAggregate  (cost=369.61..381.12 rows=1151 width=8) (actual
> time=76.641..85.167 rows=2890 loops=1)

HashAggregate needs to store more values in memory at the same time so it's
not a good plan if you have a lot of distinct values.

But the planner knows that, and so as long as your work_mem is set to a
reasonable size (keeping in mind each sort or other operation feels free to
use that much itself, even if there are several in the query) and the rows
estimate is reasonably accurate -- here it's mediocre but not dangerously bad
-- then if the planner is picking it, it's probably a good idea.
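
A quick way to see what the planner picks under the current work_mem is to
run EXPLAIN ANALYZE on both forms; a sketch follows, where "some_table" and
"some_col" are placeholders and the SET is session-local, for experimentation
only:

SHOW work_mem;                 -- the per-sort/per-hash memory budget
SET work_mem = '64MB';         -- try a larger budget for this session only

EXPLAIN ANALYZE
SELECT DISTINCT some_col FROM some_table;            -- look for Unique + Sort

EXPLAIN ANALYZE
SELECT some_col FROM some_table GROUP BY some_col;   -- look for HashAggregate

RESET work_mem;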

I'm not sure but I think there are cases where the DISTINCT method wins too.
This is basically a bug; in an ideal world both queries would generate
precisely the same plans, since they're equivalent. It's just not a high
priority since we have so many more interesting improvements competing for
time.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!

Re: Group by more efficient than distinct?

From
PFC
Date:
On Fri, 18 Apr 2008 11:36:02 +0200, Gregory Stark <stark@enterprisedb.com>
wrote:

> "Francisco Reyes" <lists@stringsutils.com> writes:
>
>> Is there any disadvantage of using "group by" to obtain a unique list?
>>
>> On a small dataset the difference was about 20 percent.
>>
>> Group by
>> HashAggregate  (cost=369.61..381.12 rows=1151 width=8) (actual
>> time=76.641..85.167 rows=2890 loops=1)

    Basically:

    - If you process up to some percentage of your RAM worth of data, hashing
is going to be a lot faster
    - If the size of the hash grows larger than your RAM, hashing will fail
miserably and sorting will be much faster since PG's disksort is really
good
    - GROUP BY knows this and acts accordingly
    - DISTINCT doesn't know this, it only knows sorting, so it sorts
    - If you need DISTINCT x ORDER BY x, sorting may be faster too (depending
on the % of distinct rows)
    - If you need DISTINCT ON, well, you're stuck with the Sort
    - So, for the time being, you can replace DISTINCT with GROUP BY (see the
sketch below)...
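
A minimal sketch of that rewrite, with placeholder names (the thread does not
show the real schema):

-- instead of:
SELECT DISTINCT a, b FROM t;

-- write:
SELECT a, b FROM t GROUP BY a, b;

-- DISTINCT ON has no such direct GROUP BY equivalent, so e.g. this one
-- keeps its Sort:
SELECT DISTINCT ON (a) a, b FROM t ORDER BY a, b;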

Re: Group by more efficient than distinct?

From
Francisco Reyes
Date:
Gregory Stark writes:

> HashAggregate needs to store more values in memory at the same time so it's
> not a good plan if you have a lot of distinct values.

So far the resulting number of rows was in the thousands, the source
data was in the hundreds of thousands, and the group by was faster.

When you say "a lot of distinct values", do you mean unique values as part of
the result data set?

In other words, will the HashAggregate store the resulting rows in memory, or
will it be used for processing the source rows?

> But the planner knows that and so as long as your work_mem is set to a
> reasonable size (keeping in mind each sort or other operation feels free to

If I make sure a table has been vacuum analyzed, will it be reasonable to
trust EXPLAIN to see whether distinct or group by is better?  I started
a new job and still don't have a good feeling for the sizes or distributions
of the data. Soon I will be given access to the test DB so I will be able to
do counts and explore the data without affecting production.

Re: Group by more efficient than distinct?

From
Francisco Reyes
Date:
PFC writes:

>- If you process up to some percentage of your RAM worth of data, hashing
> is going to be a lot faster

Thanks for the excellent breakdown and explanation. I will try and get sizes
of the tables in question and how much memory the machines have.

>     - If you need DISTINCT ON, well, you're stuck with the Sort
>     - So, for the time being, you can replace DISTINCT with GROUP BY...

I have seen a few of those already in some code (new job..), so for those is
it a matter of having a good disk subsystem?

Re: Group by more efficient than distinct?

From
Luke Lonergan
Date:
Hi Francisco,

Generally, PG sorting is much slower than hash aggregation for performing the distinct operation.  There may be small sizes where this isn’t true, but for large amounts of data (in-memory or not), hash agg (used most often, but not always, by GROUP BY) is faster.

We’ve implemented a special optimization to PG sorting that does the distinct processing within the sort, instead of afterward, but it’s limited to some small-ish number (10,000) of distinct values due to its use of a memory- and processing-intensive heap.

So, you’re better off using GROUP BY and making sure that the planner is using hash agg to do the work.
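
One way to confirm which strategy the planner chose, and to compare it against
the sort-based plan, is to toggle hash aggregation for a single session; the
query below is a placeholder, for illustration only:

-- look for a HashAggregate node at the top of the plan
EXPLAIN ANALYZE SELECT col1, col2 FROM some_table GROUP BY col1, col2;

-- for testing only: force the sort-based plan and compare timings
SET enable_hashagg = off;
EXPLAIN ANALYZE SELECT col1, col2 FROM some_table GROUP BY col1, col2;
RESET enable_hashagg;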

- Luke


On 4/17/08 8:46 PM, "Francisco Reyes" <lists@stringsutils.com> wrote:

I am trying to get a distinct set of rows from 2 tables.
After looking at someone else's query I noticed they were doing a group by
to obtain the unique list.

After comparing on multiple machines with several tables, it seems using
group by to obtain a distinct list is substantially faster than using
select distinct.

Is there any disadvantage of using "group by" to obtain a unique list?

On a small dataset the difference was about 20 percent.

Group by
 HashAggregate  (cost=369.61..381.12 rows=1151 width=8) (actual
time=76.641..85.167 rows=2890 loops=1)

Distinct
 Unique  (cost=1088.23..1174.53 rows=1151 width=8) (actual
time=90.516..140.123 rows=2890 loops=1)

Although I don't have the numbers here with me, a similar result was
obtained against a query that would return 100,000 rows: a 20% and greater
speed difference between "group by" and "select distinct".


Re: Group by more efficient than distinct?

From
PFC
Date:
On Sun, 20 Apr 2008 17:15:36 +0200, Francisco Reyes
<lists@stringsutils.com> wrote:

> PFC writes:
>
>> - If you process up to some percentage of your RAM worth of data,
>> hashing  is going to be a lot faster
>
> Thanks for the excellent breakdown and explanation. I will try and get
> sizes of the tables in question and how much memory the machines have.

    Actually, the memory used by the hash depends on the number of distinct
values, not the number of rows which are processed...
    Consider:

SELECT a GROUP BY a
SELECT a,count(*) GROUP BY a

    In both cases the hash only holds distinct values. So if you have 1
million rows to process but only 10 distinct values of "a", the hash will
only contain those 10 values (and the counts), so it will be very small
and fast, and it will absorb a huge seq scan without problem. If, however, you
have (say) 100 million distinct values for a, using a hash would be a bad
idea. As usual, divide the size of your RAM by the number of concurrent
connections or something.
    Note that "a" could be a column, several columns, anything; the size of
the hash will be proportional to the number of distinct values, i.e. the
number of rows returned by the query, not the number of rows processed
(read) by the query. The same goes for hash joins etc.; that's why, when you
join a very small table to a large one, Postgres likes to use a seq scan +
hash join on the small table.
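
A rough way to see this for yourself (a throwaway sketch, not from the
thread): many rows but few distinct values keeps the hash tiny.

-- 1 million rows but only 10 distinct values of "a"
CREATE TEMP TABLE demo AS
  SELECT (i % 10) AS a FROM generate_series(1, 1000000) AS s(i);
ANALYZE demo;

-- the HashAggregate only has to keep ~10 entries, however many rows it reads
EXPLAIN ANALYZE SELECT a FROM demo GROUP BY a;
EXPLAIN ANALYZE SELECT a, count(*) FROM demo GROUP BY a;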


>>     - If you need DISTINCT ON, well, you're stuck with the Sort
>>     - So, for the time being, you can replace DISTINCT with GROUP BY...
>
> I have seen a few of those already in some code (new job..), so for those
> is it a matter of having a good disk subsystem?

    Depends on your RAM; sorting in RAM is always faster than sorting on disk,
of course, unless you eat all the RAM and trash the other processes.
Tradeoffs...



Re: Group by more efficient than distinct?

From
Mark Mielke
Date:
PFC wrote:
>     Actually, the memory used by the hash depends on the number of
> distinct values, not the number of rows which are processed...
>     Consider :
>
> SELECT a GROUP BY a
> SELECT a,count(*) GROUP BY a
>
>     In both cases the hash only holds distinct values. So if you have
> 1 million rows to process but only 10 distinct values of "a", the hash
> will only contain those 10 values (and the counts), so it will be very
> small and fast, it will absorb a huge seq scan without problem. If
> however, you have (say) 100 million distinct values for a, using a
> hash would be a bad idea. As usual, divide the size of your RAM by the
> number of concurrent connections or something.
>     Note that "a" could be a column, several columns, anything, the
> size of the hash will be proportional to the number of distinct
> values, ie. the number of rows returned by the query, not the number
> of rows processed (read) by the query. Same with hash joins etc,
> that's why when you join a very small table to a large one Postgres
> likes to use seq scan + hash join on the small table.

This surprises me - hash values are lossy, so it must still need to
confirm against the real list of values, which at a minimum should
require references to the rows to check against?

Is PostgreSQL doing something beyond my imagination? :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>


Re: Group by more efficient than distinct?

From
Mark Mielke
Date:
Mark Mielke wrote:
> PFC wrote:
>>     Actually, the memory used by the hash depends on the number of
>> distinct values, not the number of rows which are processed...
>>     Consider :
>>
>> SELECT a GROUP BY a
>> SELECT a,count(*) GROUP BY a
>>
>>     In both cases the hash only holds distinct values. So if you have
>> 1 million rows to process but only 10 distinct values of "a", the
>> hash will only contain those 10 values (and the counts), so it will
>> be very small and fast, it will absorb a huge seq scan without
>> problem. If however, you have (say) 100 million distinct values for
>> a, using a hash would be a bad idea. As usual, divide the size of
>> your RAM by the number of concurrent connections or something.
>>     Note that "a" could be a column, several columns, anything, the
>> size of the hash will be proportional to the number of distinct
>> values, ie. the number of rows returned by the query, not the number
>> of rows processed (read) by the query. Same with hash joins etc,
>> that's why when you join a very small table to a large one Postgres
>> likes to use seq scan + hash join on the small table.
>
> This surprises me - hash values are lossy, so it must still need to
> confirm against the real list of values, which at a minimum should
> require references to the rows to check against?
>
> Is PostgreSQL doing something beyond my imagination? :-)

Hmmm... You did say distinct values, so I can see how that would work
for distinct. What about seq scan + hash join, though? To complete the
join, wouldn't it need to have a reference to each of the rows to join
against? If there are 20 distinct values and 200 rows in the small table
- wouldn't it need 200 references to be stored?

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>


Re: Group by more efficient than distinct?

From
Matthew Wakeling
Date:
On Mon, 21 Apr 2008, Mark Mielke wrote:
> This surprises me - hash values are lossy, so it must still need to confirm
> against the real list of values, which at a minimum should require references
> to the rows to check against?
>
> Is PostgreSQL doing something beyond my imagination? :-)

Not too far beyond your imagination, I hope.

It's simply your assumption that the hash table is lossy. Sure, hash
values are lossy, but a hash table isn't. Postgres stores in memory not
only the hash values, but the rows they refer to as well, having checked
them all on disc beforehand. That way, it doesn't need to look up anything
on disc for that branch of the join again, and it has a rapid in-memory
lookup for each row.

Matthew

--
X's book explains this very well, but, poor bloke, he did the Cambridge Maths
Tripos...                               -- Computer Science Lecturer

Re: Group by more efficient than distinct?

From
Mark Mielke
Date:
Matthew Wakeling wrote:
> On Mon, 21 Apr 2008, Mark Mielke wrote:
>> This surprises me - hash values are lossy, so it must still need to
>> confirm against the real list of values, which at a minimum should
>> require references to the rows to check against?
>>
>> Is PostgreSQL doing something beyond my imagination? :-)
>
> Not too far beyond your imagination, I hope.
>
> It's simply your assumption that the hash table is lossy. Sure, hash
> values are lossy, but a hash table isn't. Postgres stores in memory
> not only the hash values, but the rows they refer to as well, having
> checked them all on disc beforehand. That way, it doesn't need to look
> up anything on disc for that branch of the join again, and it has a
> rapid in-memory lookup for each row.

I said hash *values* are lossy. I did not say hash table is lossy.

The poster I responded to said that the memory required for a hash join
was relative to the number of distinct values, not the number of rows.
They gave an example of millions of rows, but only a few distinct
values. Above, you agree with me that it would include the rows (or
at least references to the rows) as well. If it stores rows, or
references to rows, then memory *is* relative to the number of rows, and
millions of records would require millions of rows (or row references).

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>


Re: Group by more efficient than distinct?

From
Matthew Wakeling
Date:
On Tue, 22 Apr 2008, Mark Mielke wrote:
> The poster I responded to said that the memory required for a hash join was
> relative to the number of distinct values, not the number of rows. They gave
> an example of millions of rows, but only a few distinct values. Above, you
> agree with me that it would include the rows (or at least references to
> the rows) as well. If it stores rows, or references to rows, then memory *is*
> relative to the number of rows, and millions of records would require
> millions of rows (or row references).

Yeah, I think we're talking at cross-purposes, due to hash tables being
used in two completely different places in Postgres. Firstly, you have
hash joins, where Postgres loads the references to the actual rows, and
puts those in the hash table. For that situation, you want a small number
of rows. Secondly, you have hash aggregates, where Postgres stores an
entry for each "group" in the hash table, and does not store the actual
rows. For that situation, you can have a bazillion individual rows, but
only a small number of distinct groups.
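
To make the two cases concrete, here is a sketch with made-up tables showing
where each kind of hash shows up in a plan:

-- hash join: the smaller relation is loaded into a hash table, one entry per
-- row (the row data itself is kept in memory), keyed on the join column
EXPLAIN SELECT * FROM big_table b JOIN small_table s USING (id);
-- typically: Hash Join, with a Hash node over a Seq Scan on small_table

-- hash aggregate: one entry per group, no matter how many rows are scanned
EXPLAIN SELECT category, count(*) FROM big_table GROUP BY category;
-- typically: HashAggregate over a Seq Scan on big_table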

Matthew

--
First law of computing:  Anything can go wro
sig: Segmentation fault.  core dumped.

Re: Group by more efficient than distinct?

From
Mark Mielke
Date:
Matthew Wakeling wrote:
> On Tue, 22 Apr 2008, Mark Mielke wrote:
>> The poster I responded to said that the memory required for a hash
>> join was relative to the number of distinct values, not the number of
>> rows. They gave an example of millions of rows, but only a few
>> distinct values. Above, you agree with me that it would include
>> the rows (or at least references to the rows) as well. If it stores
>> rows, or references to rows, then memory *is* relative to the number
>> of rows, and millions of records would require millions of rows (or
>> row references).
>
> Yeah, I think we're talking at cross-purposes, due to hash tables
> being used in two completely different places in Postgres. Firstly,
> you have hash joins, where Postgres loads the references to the actual
> rows, and puts those in the hash table. For that situation, you want a
> small number of rows. Secondly, you have hash aggregates, where
> Postgres stores an entry for each "group" in the hash table, and does
> not store the actual rows. For that situation, you can have a
> bazillion individual rows, but only a small number of distinct groups.

That makes sense with my reality. :-)

Thanks,
mark

--
Mark Mielke <mark@mielke.cc>