Thread: pgsql: Implement Eager Aggregation

From: Richard Guo
Date:
Implement Eager Aggregation

Eager aggregation is a query optimization technique that partially
pushes aggregation past a join, and finalizes it once all the
relations are joined.  Eager aggregation may reduce the number of
input rows to the join and thus could result in a better overall plan.
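
As an illustration only (this is not planner code), the following
Python sketch compares aggregating after a join with pushing a partial
aggregate below it.  The tables orders and customers, their columns,
and both function names are hypothetical, and customer_id is assumed
to be unique in customers.

```python
from collections import defaultdict

# Hypothetical toy tables:
#   orders(customer_id, amount), customers(customer_id, region)
orders = [(1, 10), (1, 20), (2, 5), (2, 7), (2, 8), (3, 1)]
customers = [(1, "east"), (2, "west"), (3, "east")]


def aggregate_after_join(orders, customers):
    """Conventional plan: join first, then SUM(amount) GROUP BY region."""
    region_of = dict(customers)
    totals = defaultdict(int)
    join_input_rows = 0
    for customer_id, amount in orders:
        join_input_rows += 1  # every orders row is fed to the join
        if customer_id in region_of:
            totals[region_of[customer_id]] += amount
    return dict(totals), join_input_rows


def eager_aggregate(orders, customers):
    """Eager plan: partial SUM per join key, join, then finalize."""
    partial = defaultdict(int)
    for customer_id, amount in orders:
        partial[customer_id] += amount  # partial aggregation below the join
    region_of = dict(customers)
    totals = defaultdict(int)
    for customer_id, partial_sum in partial.items():
        if customer_id in region_of:
            totals[region_of[customer_id]] += partial_sum  # final aggregation
    return dict(totals), len(partial)


late_result, late_rows = aggregate_after_join(orders, customers)
eager_result, eager_rows = eager_aggregate(orders, customers)
```

Both variants return the same totals per region, but the eager variant
feeds one row per customer_id into the join instead of one row per
order.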

In the current planner architecture, the separation between the
scan/join planning phase and the post-scan/join phase means that
aggregation steps are not visible when constructing the join tree,
limiting the planner's ability to exploit aggregation-aware
optimizations.  To implement eager aggregation, we collect information
about aggregate functions in the targetlist and HAVING clause, along
with grouping expressions from the GROUP BY clause, and store it in
the PlannerInfo node.  During the scan/join planning phase, this
information is used to evaluate each base or join relation to
determine whether eager aggregation can be applied.  If applicable, we
create a separate RelOptInfo, referred to as a grouped relation, to
represent the partially-aggregated version of the relation and
generate grouped paths for it.

Grouped relation paths can be generated in two ways.  The first method
involves adding sorted and hashed partial aggregation paths on top of
the non-grouped paths.  To limit planning time, we only consider the
cheapest or suitably-sorted non-grouped paths in this step.
Alternatively, grouped paths can be generated by joining a grouped
relation with a non-grouped relation.  Joining two grouped relations
is currently not supported.

To further limit planning time, we currently adopt a strategy where
partial aggregation is pushed only to the lowest feasible level in the
join tree where it provides a significant reduction in row count.
This strategy also helps ensure that all grouped paths for the same
grouped relation produce the same set of rows, which is important to
support a fundamental assumption of the planner.
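
The "lowest feasible level" rule can be pictured with the sketch
below.  This message does not say how a "significant" reduction is
measured, so the rows-per-group ratio, the min_reduction threshold and
its value, and the Candidate type are all assumptions made for
illustration.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str      # hypothetical relation label
    level: int     # 1 = base relation, higher = larger join
    rows: float    # estimated rows before partial aggregation
    groups: float  # estimated groups after partial aggregation


def lowest_useful_level(candidates, min_reduction=8.0) -> Optional[Candidate]:
    """Return the lowest-level candidate whose estimated reduction is large.

    min_reduction is an arbitrary illustrative threshold, not a value
    taken from the commit.
    """
    for cand in sorted(candidates, key=lambda c: c.level):
        if cand.groups > 0 and cand.rows / cand.groups >= min_reduction:
            return cand
    return None


choice = lowest_useful_level([
    Candidate("t1", 1, 1000.0, 900.0),      # barely reduces rows: skipped
    Candidate("t1 t2", 2, 5000.0, 100.0),   # 50x reduction: chosen
])
```

Stopping at the first qualifying level means that each grouped
relation is aggregated at exactly one place in the join tree.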

For the partial aggregation that is pushed down to a non-aggregated
relation, we need to consider all expressions from this relation that
are involved in upper join clauses and include them in the grouping
keys, using compatible operators.  This guarantees that an
aggregated row from the partial aggregation matches the other side of
the join if and only if each row in the partial group does.  In other
words, all rows within the same partial group share the same
"destiny", which is crucial for maintaining correctness.
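
The "same destiny" requirement can be checked on toy data.  In the
sketch below the row layout (join_key, category, value), the set
other_side_keys, and the helper names are hypothetical.  Grouping by
category alone mixes rows that match the join with rows that do not,
while adding the join key to the grouping keys avoids this.

```python
from collections import defaultdict

# Hypothetical rows of the relation being partially aggregated:
#   (join_key, category, value)
rows = [(1, "a", 10), (2, "a", 20), (2, "b", 5), (3, "b", 7)]
# Join keys present on the other side of the join.
other_side_keys = {2, 3}


def partial_groups(rows, key_fn):
    """Split rows into partial groups according to key_fn."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row)
    return dict(groups)


def shares_destiny(group):
    """True if every row in the group joins, or none of them does."""
    matches = {row[0] in other_side_keys for row in group}
    return len(matches) == 1


by_category = partial_groups(rows, lambda r: r[1])
by_category_and_join_key = partial_groups(rows, lambda r: (r[1], r[0]))

# Groups whose rows disagree about whether they join.
unsafe = [key for key, grp in by_category.items() if not shares_destiny(grp)]
safe = all(shares_destiny(grp) for grp in by_category_and_join_key.values())
```

Here group "a" is unsafe when grouping by category only, because its
row with join key 1 finds no join partner while its row with join key
2 does.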

One restriction is that we cannot push partial aggregation down to a
relation that is in the nullable side of an outer join, because the
NULL-extended rows produced by the outer join would not be available
when we perform the partial aggregation, while with a
non-eager-aggregation plan these rows are available for the top-level
aggregation.  Pushing partial aggregation in this case may result in
the rows being grouped differently than expected, or produce incorrect
values from the aggregate functions.
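
A small sketch of the outer-join problem, using COUNT(*) over a LEFT
JOIN b.  The tables a(k) and b(k) and both function names are
hypothetical, and the missing partial count is modeled as None and
treated as contributing nothing to the final count, which is an
assumption of this sketch.

```python
from collections import Counter

# Hypothetical tables: a(k) is the preserved side, b(k) the nullable side.
a_keys = [1, 2]
b_keys = [1, 1]  # no row of b matches a.k = 2


def count_after_left_join(a_keys, b_keys):
    """Reference plan: a LEFT JOIN b ON a.k = b.k, COUNT(*) GROUP BY a.k."""
    counts = Counter()
    for a in a_keys:
        matches = [b for b in b_keys if b == a]
        # An unmatched row still yields one NULL-extended output row.
        counts[a] += len(matches) if matches else 1
    return dict(counts)


def count_with_pushed_partial(a_keys, b_keys):
    """Invalid plan: partial COUNT(*) on b below the left join."""
    partial = Counter(b_keys)  # partial count per b.k
    counts = {}
    for a in a_keys:
        # For an unmatched row the partial count is missing (None), so
        # the final aggregation has nothing to add for it.
        partial_count = partial.get(a)
        counts[a] = counts.get(a, 0) + (partial_count or 0)
    return counts


reference = count_after_left_join(a_keys, b_keys)
pushed = count_with_pushed_partial(a_keys, b_keys)
```

The reference plan counts the NULL-extended row for a.k = 2, while the
pushed-down variant never sees it, so the two results differ for that
group.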

If we have generated a grouped relation for the topmost join relation,
we finalize its paths at the end.  The final paths will compete in the
usual way with paths built from regular planning.

The patch was originally proposed by Antonin Houska in 2017.  This
commit reworks various important aspects and rewrites most of the
current code.  However, the original patch and reviews were very
useful.

Author: Richard Guo <guofenglinux@gmail.com>
Author: Antonin Houska <ah@cybertec.at> (in an older version)
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Jian He <jian.universality@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Reviewed-by: Matheus Alcantara <matheusssilv97@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me> (in an older version)
Reviewed-by: Andy Fan <zhihuifan1213@163.com> (in an older version)
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> (in an older version)
Discussion: https://postgr.es/m/CAMbWs48jzLrPt1J_00ZcPZXWUQKawQOFE8ROc-ADiYqsqrpBNw@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/8e11859102f947e6145acdd809e5cdcdfbe90fa5

Modified Files
--------------
contrib/postgres_fdw/expected/postgres_fdw.out    |   49 +-
doc/src/sgml/config.sgml                          |   31 +
src/backend/optimizer/README                      |  110 ++
src/backend/optimizer/geqo/geqo_eval.c            |   21 +-
src/backend/optimizer/path/allpaths.c             |  467 +++++-
src/backend/optimizer/path/joinrels.c             |  193 +++
src/backend/optimizer/plan/initsplan.c            |  370 +++++
src/backend/optimizer/plan/planmain.c             |    9 +
src/backend/optimizer/plan/planner.c              |  124 +-
src/backend/optimizer/util/appendinfo.c           |   51 +
src/backend/optimizer/util/relnode.c              |  650 ++++++++
src/backend/utils/misc/guc_parameters.dat         |   16 +
src/backend/utils/misc/postgresql.conf.sample     |    2 +
src/include/nodes/pathnodes.h                     |  117 ++
src/include/optimizer/pathnode.h                  |    4 +
src/include/optimizer/paths.h                     |    4 +
src/include/optimizer/planmain.h                  |    1 +
src/test/regress/expected/collate.icu.utf8.out    |   32 +-
src/test/regress/expected/eager_aggregate.out     | 1714 +++++++++++++++++++++
src/test/regress/expected/join.out                |   12 +-
src/test/regress/expected/partition_aggregate.out |    2 +
src/test/regress/expected/sysviews.out            |    3 +-
src/test/regress/parallel_schedule                |    2 +-
src/test/regress/sql/eager_aggregate.sql          |  380 +++++
src/test/regress/sql/partition_aggregate.sql      |    2 +
src/tools/pgindent/typedefs.list                  |    3 +
26 files changed, 4293 insertions(+), 76 deletions(-)