Обсуждение: AGG_PLAIN thinks sorts are free

Поиск
Список
Период
Сортировка

AGG_PLAIN thinks sorts are free

От
Jeff Janes
Дата:
AGG_PLAIN sometimes does sorts, but it thinks they are free.  Also, under explain analyze it does not explicitly report whether the sort was external or not, nor report the disk or memory usage, the way other sorts do.  I don't know if those two things are related or not.  

This behavior seems to be ancient, at least back to 8.4.

Does someone more familiar with this part of the code know if this is a simple oversight or a fundamental design issue?

Here is a test case, in which adding a "distinct" increases the run time 500% but doesn't change the estimate at all:

create table foo as select (random()*1000000)::int as val from generate_series(1,20000000);

analyze foo;


explain (analyze,buffers) select count(distinct val) from foo;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=338497.20..338497.21 rows=1 width=4) (actual time=28185.597..28185.598 rows=1 loops=1)
   Buffers: shared hit=192 read=88304, temp read=112326 written=112326
   I/O Timings: read=200.810
   ->  Seq Scan on foo  (cost=0.00..288496.96 rows=20000096 width=4) (actual time=0.040..2192.281 rows=20000000 loops=1)
         Buffers: shared hit=192 read=88304
         I/O Timings: read=200.810
 Total runtime: 28185.628 ms

explain (analyze,buffers) select count(val) from foo;
                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=338497.20..338497.21 rows=1 width=4) (actual time=4230.892..4230.892 rows=1 loops=1)
   Buffers: shared hit=224 read=88272
   I/O Timings: read=145.003
   ->  Seq Scan on foo  (cost=0.00..288496.96 rows=20000096 width=4) (actual time=0.098..2002.396 rows=20000000 loops=1)
         Buffers: shared hit=224 read=88272
         I/O Timings: read=145.003
 Total runtime: 4230.948 ms

Cheers,

Jeff

Re: AGG_PLAIN thinks sorts are free

От
Tom Lane
Дата:
Jeff Janes <jeff.janes@gmail.com> writes:
> AGG_PLAIN sometimes does sorts, but it thinks they are free.  Also, under
> explain analyze it does not explicitly report whether the sort was external
> or not, nor report the disk or memory usage, the way other sorts do.  I
> don't know if those two things are related or not.

DISTINCT (and also ORDER BY) properties of aggregates are implemented
at runtime; the planner doesn't really do anything about them, except
suppress the choice it might otherwise make of using hashed aggregation.
Since the behavior is entirely local to the Agg plan node, it's also
not visible to the EXPLAIN ANALYZE machinery.

Arguably we should have the planner add on some cost factor for such
aggregates, but that would have no effect whatever on the current level
of plan, and could only be useful if this was a subquery whose cost
would affect choices in an outer query level.  Which is a case that's
pretty few and far between AFAIK (do you have a real-world example where
it matters?).  So it's something that hasn't gotten to the top of
anybody's to-do list.

An arguably more useful thing to do would be to integrate this behavior
into the planner more completely, so that (for instance) if only one
aggregate had ORDER BY then we would make the underlying query produce
that order instead of implementing a sort locally in the Agg node.
That hasn't risen to the top of the to-do list either, as yet.
        regards, tom lane



Re: AGG_PLAIN thinks sorts are free

От
Ashutosh Bapat
Дата:



On Fri, Jul 19, 2013 at 8:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Jeff Janes <jeff.janes@gmail.com> writes:
> AGG_PLAIN sometimes does sorts, but it thinks they are free.  Also, under
> explain analyze it does not explicitly report whether the sort was external
> or not, nor report the disk or memory usage, the way other sorts do.  I
> don't know if those two things are related or not.

DISTINCT (and also ORDER BY) properties of aggregates are implemented
at runtime; the planner doesn't really do anything about them, except
suppress the choice it might otherwise make of using hashed aggregation.
Since the behavior is entirely local to the Agg plan node, it's also
not visible to the EXPLAIN ANALYZE machinery.

Arguably we should have the planner add on some cost factor for such
aggregates, but that would have no effect whatever on the current level
of plan, and could only be useful if this was a subquery whose cost
would affect choices in an outer query level.  Which is a case that's
pretty few and far between AFAIK (do you have a real-world example where
it matters?).  So it's something that hasn't gotten to the top of
anybody's to-do list.

An arguably more useful thing to do would be to integrate this behavior
into the planner more completely, so that (for instance) if only one
aggregate had ORDER BY then we would make the underlying query produce
that order instead of implementing a sort locally in the Agg node.

Slight modification would be *all* aggregates having same ORDER BY clause. I think it would be easy given that we already influence the sortedness of underlying query result in planner.
 
That hasn't risen to the top of the to-do list either, as yet.

                        regards, tom lane


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



--
Best Wishes,
Ashutosh Bapat
EntepriseDB Corporation
The Postgres Database Company

Re: AGG_PLAIN thinks sorts are free

От
Jeff Janes
Дата:
On Thu, Jul 18, 2013 at 8:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> AGG_PLAIN sometimes does sorts, but it thinks they are free.  Also, under
>> explain analyze it does not explicitly report whether the sort was external
>> or not, nor report the disk or memory usage, the way other sorts do.  I
>> don't know if those two things are related or not.
>
> DISTINCT (and also ORDER BY) properties of aggregates are implemented
> at runtime; the planner doesn't really do anything about them, except
> suppress the choice it might otherwise make of using hashed aggregation.
> Since the behavior is entirely local to the Agg plan node, it's also
> not visible to the EXPLAIN ANALYZE machinery.

Couldn't a hash aggregate be superior to a sort one (for the distinct,
not the order by)?

> Arguably we should have the planner add on some cost factor for such
> aggregates, but that would have no effect whatever on the current level
> of plan, and could only be useful if this was a subquery whose cost
> would affect choices in an outer query level.  Which is a case that's
> pretty few and far between AFAIK (do you have a real-world example where
> it matters?).

Not that I know of.  It is mainly an analytical headache.  I'm trying
to figure out why the planner makes the choices it does on more
complex queries, but one of the component queries I'm trying to build
it up from suddenly falls into this plan, where I can't see the
estimated costs and can't use "set enable_*"  to shift it away from
that into a more transparent one.


Thanks for the explanation.

Jeff



Re: AGG_PLAIN thinks sorts are free

От
Tom Lane
Дата:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Thu, Jul 18, 2013 at 8:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> DISTINCT (and also ORDER BY) properties of aggregates are implemented
>> at runtime; the planner doesn't really do anything about them, except
>> suppress the choice it might otherwise make of using hashed aggregation.

> Couldn't a hash aggregate be superior to a sort one (for the distinct,
> not the order by)?

If it worked at all, it might be superior, but since it doesn't, it ain't.

This isn't really a matter of lack of round tuits, but a deliberate
judgment that it's probably not worth the trouble.  Per the comment in
choose_hashed_grouping:
   /*    * Executor doesn't support hashed aggregation with DISTINCT or ORDER BY    * aggregates.    (Doing so would
implystoring *all* the input values in    * the hash table, and/or running many sorts in parallel, either of which    *
seemslike a certain loser.)    */
 
        regards, tom lane