Обсуждение: select count(distinct ...) is slower than select distinct in about 5x

Поиск
Список
Период
Сортировка

select count(distinct ...) is slower than select distinct in about 5x

От
jacket41142
Дата:
Hi

I'm just trying about PostgreSQL, I create a database "test" with a table "t1":

test=> \d t1
                            Table "public.t1"
 Column  |  Type   |                      Modifiers                     
---------+---------+-----------------------------------------------------
 col_id  | integer | not null default nextval('t1_col_id_seq'::regclass)
 col_int | integer |
Indexes:
    "t1_pkey" PRIMARY KEY, btree (col_id)
    "t1_col_int_idx" btree (col_int)

There are 10001000 rows in that table, and basically col_int of each row are filled with random data within range [0,1024].

One strange thing I found is that:

test=> select distinct col_int from t1;
Time: 1258.627 ms
test=> select distinct col_int from t1;
Time: 1264.667 ms
test=> select distinct col_int from t1;
Time: 1261.805 ms

If I use "group by":

test=> select distinct col_int from t1 group by col_int;
Time: 1180.617 ms
test=> select distinct col_int from t1 group by col_int;
Time: 1179.849 ms
test=> select distinct col_int from t1 group by col_int;
Time: 1177.936 ms

So the performance difference is not very large.
But when I do that:

test=> select count(distinct col_int) from t1;
 count
-------
  1025
(1 row)

Time: 7367.476 ms
test=> select count(distinct col_int) from t1;
 count
-------
  1025
(1 row)

Time: 6946.233 ms
test=> select count(distinct col_int) from t1;
 count
-------
  1025
(1 row)

Time: 7386.969 ms
test=> select count(distinct col_int) from t1;
 count
-------
  1025
(1 row)


The speed is straightly worse! But it's doesn't make sense. Since we can just make a temporary table or subquery and count for all rows. The performance should be almost the same.
So do you have any idea about this? Or maybe PostgreSQL's query planner can be improved for this kinds of query?

By the way, if I use a subquery to replace the above query, here's what I got:

test=> select count(*) from (select distinct col_int from t1) as tmp;
 count
-------
  1025
(1 row)

Time: 1267.468 ms
test=> select count(*) from (select distinct col_int from t1) as tmp;
 count
-------
  1025
(1 row)

Time: 1257.327 ms
test=> select count(*) from (select distinct col_int from t1) as tmp;
 count
-------
  1025
(1 row)

Time: 1258.189 ms


OK, this workaround works. But I just think if postgres can improve with this kind of query, it will be better. Also I'm not sure about other similar scenarios that will cause similar problem.

The following is the output of "explain analyze ...":
test=> explain analyze select distinct col_int from t1;
                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39034.653..39037.239 rows=1025 loops=1)
   ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.041..19619.931 rows=10001000 loops=1)
 Total runtime: 39039.136 ms
(3 rows)

Time: 39103.622 ms
test=> explain analyze select distinct col_int from t1 group by col_int;
                                                          QUERY PLAN                                                          
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=169280.86..169291.11 rows=1025 width=4) (actual time=39062.417..39064.882 rows=1025 loops=1)
   ->  HashAggregate  (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39058.136..39060.303 rows=1025 loops=1)
         ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.024..19439.482 rows=10001000 loops=1)
 Total runtime: 39066.896 ms
(4 rows)

Time: 39067.198 ms
test=> explain analyze select count(distinct col_int) from t1;
                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=169268.05..169268.06 rows=1 width=4) (actual time=45994.120..45994.123 rows=1 loops=1)
   ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.025..19599.950 rows=10001000 loops=1)
 Total runtime: 45994.154 ms
(3 rows)

Time: 45994.419 ms

test=> explain analyze select count(*) from (select distinct col_int from t1) as tmp;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=169291.11..169291.12 rows=1 width=0) (actual time=39050.598..39050.600 rows=1 loops=1)
   ->  HashAggregate  (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39046.814..39048.742 rows=1025 loops=1)
         ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.035..19616.631 rows=10001000 loops=1)
 Total runtime: 39050.634 ms
(4 rows)

Time: 39050.896 ms

P.S. I have already use "analyze verbose t1;" several times so the database should already be optimized. But I'm just new to PostgreSQL.

The environment I use is:
PostgreSQL 9.3.1 (postgresql-9.3 9.3.1-1.pgdg12.4+1) on local machine (actually a vbox VM, but when I try the test, I didn't run something very different or some heavy program for all queries. And the result seems consistent.)
Ubuntu 12.04 LTS
I followed the installation step in Quickstart section of http://wiki.postgresql.org/wiki/Apt

best regards,
jacket41142

Re: select count(distinct ...) is slower than select distinct in about 5x

От
Kevin Grittner
Дата:
jacket41142 <jacket41142@gmail.com> wrote:

> [ subject issue in detail ]

Please review this thread:


http://www.postgresql.org/message-id/flat/CAPNY-2Utce-c+kNTwsMCbAk58=9mYeAeViTXT9LO7r1k77jukw@mail.gmail.com#CAPNY-2Utce-c+kNTwsMCbAk58=9mYeAeViTXT9LO7r1k77jukw@mail.gmail.com

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: select count(distinct ...) is slower than select distinct in about 5x

От
Jeff Janes
Дата:
On Tue, Dec 10, 2013 at 9:28 AM, jacket41142 <jacket41142@gmail.com> wrote:
 

test=> select distinct col_int from t1 group by col_int;
Time: 1177.936 ms

So the performance difference is not very large.
But when I do that:

test=> select count(distinct col_int) from t1;
 count
-------
  1025
(1 row)

Time: 7367.476 ms


count(distinct ...) always sorts, rather than using a hash, to do its work.  I don't think that there is any fundamental reason that it could not be changed to allow it to use hashing, it just hasn't been done yet.  It is complicated by the fact that you can have multiple count() expressions in the same query which demand sorting/grouping on different columns.

Cheers,

Jeff