Обсуждение: select DISTINCT

Поиск
Список
Период
Сортировка

select DISTINCT

От
pg noob
Дата:

Hi all,

I'm curious about some of the query estimates that I'm seeing with queries that use DISTINCT.
I am using postgres 8.4.13

I did a couple of quick tests, and found that PostgreSQL seems to do some expensive work to
return DISTINCT rows.  This is contrary to what I was expecting because I expected that
if it knows that it is building a distinct result set that it would maintain
some kind of ordering as it built the results to be able to quickly throw
away duplicates without any real overhead to the query.  This is different than an ORDER BY
which requires knowledge of all rows in the table before it can determine the
ordering.

I would think that distinct is something that can be determined even from a partial result set.

But what I find is that there is a significant difference between a count vs. a count distinct:

db=# explain analyze select count(column1) from table1;
                                                          QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual time=5265.143..5265.143 rows=1 loops=1)
   ->  Seq Scan on table1 (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.005..3392.991 rows=16163849 loops=1)
 Total runtime: 5265.170 ms
(3 rows)

db=# explain analyze select count(distinct(column1)) from table1;
                                                          QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual time=21828.644..21828.644 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.005..2742.391 rows=16163937 loops=1)
 Total runtime: 21828.736 ms
(3 rows)

Somehow the estimated query plan cost of the two queries is _exactly_ the same but the actual
time takes 4 times longer for one than the other to execute.
And it is the query returning less rows in the result which takes more time.
That says that there is work being done to return distinct results that is not accounted for by the
query plan.
(I tried this several times on different systems because I could not believe it.
But it seems consistent.)

But it gets a bit stranger than that.
According to the docs,
“DISTINCT ON column will eliminate all duplicates in the specified column; this is equivalent to using GROUP BY column.”

Note that it says that using distinct is EQUIVALENT to using GROUP BY.
And yet, look at the execution time difference of DISTINCT vs. GROUP BY:

db=# explain analyze select count(distinct(column1)) from table1;
                                                          QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual time=18596.606..18596.607 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.003..2391.644 rows=16151368 loops=1)
 Total runtime: 18596.631 ms
(3 rows)

db=# explain analyze select count(foo.column1) from (select column1 from table1 group by column1) as foo;
 
                                                             QUERY PLAN                                                            
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631419.39..631419.40 rows=1 width=8) (actual time=4954.187..4954.188 rows=1 loops=1)
   ->  HashAggregate  (cost=631397.31..631407.12 rows=981 width=8) (actual time=4953.477..4954.044 rows=2389 loops=1)
         ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.004..2050.133 rows=16151805 loops=1)
 Total runtime: 4954.223 ms
(4 rows)

The GROUP BY performs much better than DISTINCT even though both these two queries return the exact same count result.

In the above examples, column1 is type bigint and it is not indexed.

Any ideas why this is?  Or what I am doing wrong?

Thank you.


Re: select DISTINCT

От
Kevin Grittner
Дата:
pg noob <pgnube@gmail.com> wrote:

> The GROUP BY performs much better than DISTINCT even though both
> these two queries return the exact same count result.

No, GROUP BY performs much better than count(DISTINCT colname).

To confirm that this isn't something that has changed in the four
years since 8.4 was released, I used the latest source code.

test=# explain analyze select count(column1) from table1;
                                                        QUERY
PLAN                                                         

---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=273571.41..273571.42 rows=1 width=4) (actual time=1704.584..1704.584 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.356..777.089 rows=16163937
loops=1)
 Total runtime: 1704.624 ms
(3 rows)

It is no surprise that just incrementing a counter for each row
scanned is faster than maintaining a collection of values seen so
far, looking up each value to see whether it is in the collection,
and inserting it if not, or sorting the list of values and counting
transitions.  Either method of getting a count of distinct values
is going to be a lot slower than just counting rows.

For your use of DISTINCT within the aggregate function, I removed
the unnecessary parentheses for clarity.  It's the same plan and
speed either way.

test=# explain analyze select count(distinct column1) from table1;
                                                        QUERY
PLAN                                                         

---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=273571.41..273571.42 rows=1 width=4) (actual time=10553.480..10553.480 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.017..880.412 rows=16163937
loops=1)
 Total runtime: 10553.517 ms
(3 rows)

Note that the above is very different from using DISTINCT outside
of the aggregate function:

test=# explain analyze select count(*) from (select distinct column1 from table1) foo;
                                                           QUERY
PLAN                                                            

---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=273627.68..273627.70 rows=1 width=0) (actual time=2637.148..2637.148 rows=1 loops=1)
   ->  HashAggregate  (cost=273571.41..273596.42 rows=2501 width=4) (actual time=2636.867..2637.034 rows=2501 loops=1)
         ->  Seq Scan on table1  (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.027..802.792 rows=16163937
loops=1)
 Total runtime: 2637.213 ms
(4 rows)

Which is the same plan as the GROUP BY:

test=# explain analyze select count(foo.column1) from (select column1 from table1 group by column1) as foo;
                                                           QUERY
PLAN                                                            

---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=273627.68..273627.70 rows=1 width=4) (actual time=2649.196..2649.196 rows=1 loops=1)
   ->  HashAggregate  (cost=273571.41..273596.42 rows=2501 width=4) (actual time=2648.903..2649.073 rows=2501 loops=1)
         ->  Seq Scan on table1  (cost=0.00..233161.53 rows=16163953 width=4) (actual time=0.029..807.184 rows=16163937
loops=1)
 Total runtime: 2649.264 ms
(4 rows)

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: select DISTINCT

От
Jeff Janes
Дата:
On Friday, September 6, 2013, pg noob wrote:

Hi all,

I'm curious about some of the query estimates that I'm seeing with queries that use DISTINCT.
I am using postgres 8.4.13

I did a couple of quick tests, and found that PostgreSQL seems to do some expensive work to
return DISTINCT rows.  This is contrary to what I was expecting because I expected that
if it knows that it is building a distinct result set that it would maintain
some kind of ordering as it built the results to be able to quickly throw
away duplicates without any real overhead to the query.

There is no cost-free way of doing that.   You either have to sort, or hash, or walk in index order, and none of those things are free.

...
 
But it gets a bit stranger than that.
According to the docs,
“DISTINCT ON column will eliminate all duplicates in the specified column; this is equivalent to using GROUP BY column.”

DISTINCT ON is not the same thing as DISTINCT, which is not the same as count(distinct ...)


 

Note that it says that using distinct is EQUIVALENT to using GROUP BY.
And yet, look at the execution time difference of DISTINCT vs. GROUP BY:

db=# explain analyze select count(distinct(column1)) from table1;
                                                          QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual time=18596.606..18596.607 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.003..2391.644 rows=16151368 loops=1)
 Total runtime: 18596.631 ms
(3 rows)

db=# explain analyze select count(foo.column1) from (select column1 from table1 group by column1) as foo;

Compare that to:

explain analyze select count(*) from (select DISTINCT column1 from table1) as foo;
 

Any ideas why this is?  Or what I am doing wrong?

The code that executes "count(distinct ....)" has never been taught how to use hash aggregation, whereas DISTINCT and GROUP BY have been.  It always sorts, which is often much slower than a hash, and also often much less memory efficient.   I speculate that this is because the implementation of count(distinct ...) is really ancient code and never been brought up to date, either about hashing or about more thorough EXPLAIN estimates--I have been meaning to dig into it but haven't gotten around to it yet.

You can get the increased performance by just spelling it in the more verbose way, but be aware that count (distinct ...) deals with NULL differently than select count(*) from (select DISTINCT...)  does, so they are not exactly identical.
 
Cheers,

Jeff

Re: select DISTINCT

От
pg noob
Дата:
Thank you Kevin and Jeff for the responses.
These are very helpful.


On Fri, Sep 6, 2013 at 10:48 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Friday, September 6, 2013, pg noob wrote:

Hi all,

I'm curious about some of the query estimates that I'm seeing with queries that use DISTINCT.
I am using postgres 8.4.13

I did a couple of quick tests, and found that PostgreSQL seems to do some expensive work to
return DISTINCT rows.  This is contrary to what I was expecting because I expected that
if it knows that it is building a distinct result set that it would maintain
some kind of ordering as it built the results to be able to quickly throw
away duplicates without any real overhead to the query.

There is no cost-free way of doing that.   You either have to sort, or hash, or walk in index order, and none of those things are free.

...
 
But it gets a bit stranger than that.
According to the docs,
“DISTINCT ON column will eliminate all duplicates in the specified column; this is equivalent to using GROUP BY column.”

DISTINCT ON is not the same thing as DISTINCT, which is not the same as count(distinct ...)


 

Note that it says that using distinct is EQUIVALENT to using GROUP BY.
And yet, look at the execution time difference of DISTINCT vs. GROUP BY:

db=# explain analyze select count(distinct(column1)) from table1;
                                                          QUERY PLAN                                                         
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=631397.32..631397.33 rows=1 width=8) (actual time=18596.606..18596.607 rows=1 loops=1)
   ->  Seq Scan on table1  (cost=0.00..591411.65 rows=15994265 width=8) (actual time=0.003..2391.644 rows=16151368 loops=1)
 Total runtime: 18596.631 ms
(3 rows)

db=# explain analyze select count(foo.column1) from (select column1 from table1 group by column1) as foo;

Compare that to:

explain analyze select count(*) from (select DISTINCT column1 from table1) as foo;
 

Any ideas why this is?  Or what I am doing wrong?

The code that executes "count(distinct ....)" has never been taught how to use hash aggregation, whereas DISTINCT and GROUP BY have been.  It always sorts, which is often much slower than a hash, and also often much less memory efficient.   I speculate that this is because the implementation of count(distinct ...) is really ancient code and never been brought up to date, either about hashing or about more thorough EXPLAIN estimates--I have been meaning to dig into it but haven't gotten around to it yet.

You can get the increased performance by just spelling it in the more verbose way, but be aware that count (distinct ...) deals with NULL differently than select count(*) from (select DISTINCT...)  does, so they are not exactly identical.
 
Cheers,

Jeff