Обсуждение: GROUP BY vs DISTINCT

Поиск
Список
Период
Сортировка

GROUP BY vs DISTINCT

От
Brian Herlihy
Дата:
I have a question about the following.  The table has an index on (clicked at time zone 'PST').  I am using postgres
8.1.3

Actually, I think I answered my own question already.  But I want to confirm - Is the GROUP BY faster because it
doesn'thave to sort results, whereas DISTINCT must produce sorted results?  This wasn't clear to me from the
documentation. If it's true, then I could save considerable time by using GROUP BY where I have been using DISTINCT in
thepast.  Usually I simply want a count of the distinct values, and there is no need to sort for that. 

I'm also interested in the count(distinct) case at the bottom.  The cost estimate seems similar to the GROUP BY, but
theactual cost is much higher. 

The table is insert-only and was analyzed before running these queries.  The domain column being aggregated has around
16kdistinct values, and there are 780k rows in total (for the entire table, not the slice being selected in these
queries).

Thanks,
Brian


live:parking=> explain analyze SELECT domain
  FROM parked_redirects
  WHERE (clicked at time zone 'PST') >= '2006-12-17'
   AND (clicked at time zone 'PST')
   < '2006-12-18'::timestamp without time zone + '1 day'::interval
   GROUP BY domain;

                                                                                               QUERY PLAN
                                        

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=19078.50..19085.29 rows=679 width=18) (actual time=709.309..717.096 rows=14526 loops=1)
   ->  Index Scan using parked_redirects_pst on parked_redirects  (cost=0.01..17846.82 rows=492672 width=18) (actual
time=0.073..406.510rows=504972 loops=1) 
         Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND
(timezone('PST'::text,clicked) < '2006-12-19 00:00:00'::timestamp without time zone)) 
 Total runtime: 719.810 ms
(4 rows)

live:parking=> explain analyze SELECT DISTINCT domain
   FROM parked_redirects
  WHERE (clicked at time zone 'PST') >= '2006-12-17'
   AND (clicked at time zone 'PST')
   < '2006-12-18'::timestamp without time zone + '1 day'::interval;

                                                                                                  QUERY PLAN
                                        

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=64433.98..66897.52 rows=679 width=18) (actual time=15329.904..15647.849 rows=14526 loops=1)
   ->  Sort  (cost=64433.98..65665.75 rows=492709 width=18) (actual time=15329.901..15511.479 rows=504972 loops=1)
         Sort Key: "domain"
         ->  Index Scan using parked_redirects_pst on parked_redirects  (cost=0.01..17847.41 rows=492709 width=18)
(actualtime=0.068..519.696 rows=504972 loops=1) 
               Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND
(timezone('PST'::text,clicked) < '2006-12-19 00:00:00'::timestamp without time zone)) 
 Total runtime: 15666.863 ms
(6 rows)

live:parking=> explain analyze SELECT count(DISTINCT domain)
   FROM parked_redirects
  WHERE (clicked at time zone 'PST') >= '2006-12-17'
   AND (clicked at time zone 'PST')
   < '2006-12-18'::timestamp without time zone + '1 day'::interval;

                                                                                               QUERY PLAN
                                        

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=19107.20..19107.21 rows=1 width=18) (actual time=11380.530..11380.531 rows=1 loops=1)
   ->  Index Scan using parked_redirects_pst on parked_redirects  (cost=0.01..17873.67 rows=493412 width=18) (actual
time=0.022..347.473rows=504972 loops=1) 
         Index Cond: ((timezone('PST'::text, clicked) >= '2006-12-17 00:00:00'::timestamp without time zone) AND
(timezone('PST'::text,clicked) < '2006-12-19 00:00:00'::timestamp without time zone)) 
 Total runtime: 11384.923 ms
(4 rows)




Re: GROUP BY vs DISTINCT

От
"Steinar H. Gunderson"
Дата:
On Tue, Dec 19, 2006 at 11:19:39PM -0800, Brian Herlihy wrote:
> Actually, I think I answered my own question already.  But I want to
> confirm - Is the GROUP BY faster because it doesn't have to sort results,
> whereas DISTINCT must produce sorted results?  This wasn't clear to me from
> the documentation.  If it's true, then I could save considerable time by
> using GROUP BY where I have been using DISTINCT in the past.  Usually I
> simply want a count of the distinct values, and there is no need to sort
> for that.

You are right; at the moment, GROUP BY is more intelligent than DISTINCT,
even if they have to compare the same columns. This is, as always, something
that could be improved in a future release, TTBOMK.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: GROUP BY vs DISTINCT

От
"Peter Childs"
Дата:
On 20/12/06, Steinar H. Gunderson <sgunderson@bigfoot.com> wrote:
> On Tue, Dec 19, 2006 at 11:19:39PM -0800, Brian Herlihy wrote:
> > Actually, I think I answered my own question already.  But I want to
> > confirm - Is the GROUP BY faster because it doesn't have to sort results,
> > whereas DISTINCT must produce sorted results?  This wasn't clear to me from
> > the documentation.  If it's true, then I could save considerable time by
> > using GROUP BY where I have been using DISTINCT in the past.  Usually I
> > simply want a count of the distinct values, and there is no need to sort
> > for that.
>
> You are right; at the moment, GROUP BY is more intelligent than DISTINCT,
> even if they have to compare the same columns. This is, as always, something
> that could be improved in a future release, TTBOMK.
>
> /* Steinar */

Oh so thats why group by is nearly always quicker than distinct. I
always thought distinct was just short hand for "group by same columns
as I've just selected"
Is it actually in the sql spec to sort in a distinct or could we just
get the parser to rewrite distinct into group by and hence remove the
extra code a different way of doing it must mean.?

Peter.

Re: GROUP BY vs DISTINCT

От
Tom Lane
Дата:
"Peter Childs" <peterachilds@gmail.com> writes:
> Is it actually in the sql spec to sort in a distinct

No.  PG's code that supports GROUP BY is newer and smarter than the code
that supports DISTINCT, is all.  One of the things on the to-do list is
to revise DISTINCT so it can also consider hash-based implementations.
The hard part of that is not to break DISTINCT ON ...

            regards, tom lane