Обсуждение: select count(distinct ...) is slower than select distinct in about 5x
Hi
I'm just trying about PostgreSQL, I create a database "test" with a table "t1":test=> \d t1
Table "public.t1"
Column | Type | Modifiers
---------+---------+-----------------------------------------------------
col_id | integer | not null default nextval('t1_col_id_seq'::regclass)
col_int | integer |
Indexes:
"t1_pkey" PRIMARY KEY, btree (col_id)
"t1_col_int_idx" btree (col_int)
test=> select distinct col_int from t1;
Time: 1258.627 ms
test=> select distinct col_int from t1;
Time: 1264.667 ms
test=> select distinct col_int from t1;
Time: 1261.805 ms
test=> select distinct col_int from t1 group by col_int;
Time: 1180.617 ms
test=> select distinct col_int from t1 group by col_int;
Time: 1179.849 ms
test=> select distinct col_int from t1 group by col_int;
Time: 1177.936 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 row)
Time: 7367.476 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 row)
Time: 6946.233 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 row)
Time: 7386.969 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 row)
test=> select count(*) from (select distinct col_int from t1) as tmp;
count
-------
1025
(1 row)
Time: 1267.468 ms
test=> select count(*) from (select distinct col_int from t1) as tmp;
count
-------
1025
(1 row)
Time: 1257.327 ms
test=> select count(*) from (select distinct col_int from t1) as tmp;
count
-------
1025
(1 row)
Time: 1258.189 ms
test=> explain analyze select distinct col_int from t1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39034.653..39037.239 rows=1025 loops=1)
-> Seq Scan on t1 (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.041..19619.931 rows=10001000 loops=1)
Total runtime: 39039.136 ms
(3 rows)
Time: 39103.622 ms
test=> explain analyze select distinct col_int from t1 group by col_int;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=169280.86..169291.11 rows=1025 width=4) (actual time=39062.417..39064.882 rows=1025 loops=1)
-> HashAggregate (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39058.136..39060.303 rows=1025 loops=1)
-> Seq Scan on t1 (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.024..19439.482 rows=10001000 loops=1)
Total runtime: 39066.896 ms
(4 rows)
Time: 39067.198 ms
test=> explain analyze select count(distinct col_int) from t1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169268.05..169268.06 rows=1 width=4) (actual time=45994.120..45994.123 rows=1 loops=1)
-> Seq Scan on t1 (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.025..19599.950 rows=10001000 loops=1)
Total runtime: 45994.154 ms
(3 rows)
Time: 45994.419 ms
test=> explain analyze select count(*) from (select distinct col_int from t1) as tmp;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169291.11..169291.12 rows=1 width=0) (actual time=39050.598..39050.600 rows=1 loops=1)
-> HashAggregate (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39046.814..39048.742 rows=1025 loops=1)
-> Seq Scan on t1 (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.035..19616.631 rows=10001000 loops=1)
Total runtime: 39050.634 ms
(4 rows)
Time: 39050.896 ms
The environment I use is:
PostgreSQL 9.3.1 (postgresql-9.3 9.3.1-1.pgdg12.4+1) on local machine (actually a vbox VM, but when I try the test, I didn't run something very different or some heavy program for all queries. And the result seems consistent.)
Ubuntu 12.04 LTS
I followed the installation step in Quickstart section of http://wiki.postgresql.org/wiki/Apt
jacket41142
jacket41142 <jacket41142@gmail.com> wrote: > [ subject issue in detail ] Please review this thread: http://www.postgresql.org/message-id/flat/CAPNY-2Utce-c+kNTwsMCbAk58=9mYeAeViTXT9LO7r1k77jukw@mail.gmail.com#CAPNY-2Utce-c+kNTwsMCbAk58=9mYeAeViTXT9LO7r1k77jukw@mail.gmail.com -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Dec 10, 2013 at 9:28 AM, jacket41142 <jacket41142@gmail.com> wrote:
But when I do that:So the performance difference is not very large.test=> select distinct col_int from t1 group by col_int;
Time: 1177.936 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 row)
Time: 7367.476 ms
count(distinct ...) always sorts, rather than using a hash, to do its work. I don't think that there is any fundamental reason that it could not be changed to allow it to use hashing, it just hasn't been done yet. It is complicated by the fact that you can have multiple count() expressions in the same query which demand sorting/grouping on different columns.
Cheers,
Jeff