Push down Aggregates below joins

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Push down Aggregates below joins
Дата
Msg-id c165b72e-8dbb-2a24-291f-113aeb67b76a@iki.fi
обсуждение исходный текст
Ответы Re: Push down Aggregates below joins  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: Push down Aggregates below joins  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: Push down Aggregates below joins  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Список pgsql-hackers
Currently, the planner always first decides the scan/join order, and 
adds Group/Agg nodes on top of the joins. Sometimes it would be legal, 
and beneficial, to perform the aggregation below a join. I've been 
hacking on a patch to allow that.

For example:

create temp table a (id int4 primary key);
create temp table b (id int4);

insert into a select g from generate_series(1, 1000) g;
insert into b select g/10 from generate_series(1, 10000) g;
analyze a,b;

explain select b.id from a, b where a.id = b.id group by b.id;

Currently, you get a plan like this:

                               QUERY PLAN
-----------------------------------------------------------------------
  HashAggregate  (cost=323.64..333.65 rows=1001 width=4)
    Group Key: b.id
    ->  Hash Join  (cost=27.50..298.66 rows=9990 width=4)
          Hash Cond: (b.id = a.id)
          ->  Seq Scan on b  (cost=0.00..145.00 rows=10000 width=4)
          ->  Hash  (cost=15.00..15.00 rows=1000 width=4)
                ->  Seq Scan on a  (cost=0.00..15.00 rows=1000 width=4)
(7 rows)

With the patch, you get a plan like this:

                                QUERY PLAN
-------------------------------------------------------------------------
  Hash Join  (cost=192.52..221.27 rows=9990 width=4)
    Hash Cond: (a.id = b.id)
    ->  Seq Scan on a  (cost=0.00..15.00 rows=1000 width=4)
    ->  Hash  (cost=180.01..180.01 rows=1001 width=4)
          ->  HashAggregate  (cost=170.00..180.01 rows=1001 width=4)
                Group Key: b.id
                ->  Seq Scan on b  (cost=0.00..145.00 rows=10000 width=4)
(7 rows)


This is faster, because you need to join fewer rows. (That's not 
reflected in the cost estimates above; cost estimation is not done 
properly in this prototype yet.)

Implementation
--------------

Move the responsibility of planning aggregates from the "upper stages", 
in grouping_planner(), into scan/join planning, in query_planner().

In query_planner(), after building the RelOptInfo for a scan or join 
rel, also build a grouped RelOptInfo to shadow each RelOptInfo (if 
aggregation can be done at that rel). The grouped RelOptInfo is stored 
in a new 'grouped_rel' field in the parent RelOptInfo.

A grouped rel holds Paths where the grouping/aggregation is already
performed at that node, or below it. For a base rel, it represents 
performing the aggregation on top of the scan, i.e. the Paths are like 
Agg(Scan). For a grouped join rel, the paths look like an Agg(Join(A, 
B)), or Join(Agg(A), B).

The first three of the attached patches just move existing code around. 
The fourth patch contains the actual feature.

This is still a rough prototype, but any thoughts on the general approach?

- Heikki

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Stephen Frost
Дата:
Сообщение: Re: [PATCH] Include application_name in "connection authorized" logmessage
Следующее
От: Tom Lane
Дата:
Сообщение: Re: ERROR: ORDER/GROUP BY expression not found in targetlist