Re: Combining Aggregates

Поиск
Список
Период
Сортировка
От Robert Haas
Тема Re: Combining Aggregates
Дата
Msg-id CA+TgmoaezOewn6nob53SgA6n011JMHC5GMa_yCqXfYghTSjWdA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Combining Aggregates  (David Rowley <david.rowley@2ndquadrant.com>)
Ответы Re: Combining Aggregates  (David Rowley <david.rowley@2ndquadrant.com>)
Список pgsql-hackers
On Wed, Jan 20, 2016 at 8:32 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> At the moment I think everything which will use this is queued up behind the
> pathification of the grouping planner which Tom is working on. I think
> naturally Parallel Aggregate makes sense to work on first, given all the
> other parallel stuff in this release. I plan on working on that that by
> either assisting Haribabu, or... whatever else it takes.

Great!

> The other two usages which I have thought of are;
>
> 1) Aggregating before UNION ALL, which might be fairly simple after the
> grouping planner changes, as it may just be a matter of considering another
> "grouping path" which partially aggregates before the UNION ALL, and
> performs the final grouping stage after UNION ALL. At this stage it's hard
> to say how that will work as I'm not sure how far changes to the grouping
> planner will go. Perhaps Tom can comment?

I hope he will, but in the meantime let me ask how this does us any
good.  UNION ALL turns into an Append node.  Pushing aggregation
through an Append doesn't make anything faster AFAICS *unless* you can
further optimize beginning at that point.  For example, if one or both
sides of the Append node have a Gather under them, and you can push
the partial aggregation down into the Gather; or if you hit a Foreign
Scan and can have it do partial aggregation on the remote side, you
win.  But if you just have:

Finalize Aggregate
-> Append -> Partial Aggregate   -> Thing One -> Partial Aggregate   -> Thing Two

Instead of:

Aggregate
-> Append -> Thing One -> Thing Two

...then I don't see the point, at least not for a single-group
Aggregate or HashAggregate.  For a GroupAggregate, Thing One and Thing
Two might need to be sorted, and sorting two or more smaller data sets
might be faster than sorting one larger data set, but I can't see us
winning anything very big here.

To be clear, I'm not saying we shouldn't do this.  I just don't think
it does much *by itself*.  If you combine it with other optimizations
that let the aggregation be pushed further down the plan tree, it
could win big.

> 2) Group before join. e.g select p.description,sum(s.qty) from sale s inner
> join s.product_id = p.product_id group by s.product_id group by
> p.description;  I have a partial patch which implements this, although I was
> a bit stuck on if I should invent the concept of "GroupingPaths", or just
> inject alternative subquery relations which are already grouped by the
> correct clause.  The problem with "GroupingPaths" was down to the row
> estimates currently come from the RelOptInfo and is set in
> set_baserel_size_estimates() which always assumes the ungrouped number of
> rows, which is not what's needed if the grouping is already performed. I was
> holding off to see how Tom does this in the grouping planner changes.

Your sample query has two GROUP BY clauses; I think you meant to
include only the second one.  This seems like it can win big, because
it's possible that joining first means joining against a much larger
relation, whereas aggregating first might reduce "sale" to a volume of
data that fits in work_mem, after which you can hash join.  On the
other hand, if you add WHERE p.description LIKE '%snuffleupagus%' then
the aggregate-first strategy figures to lose, assuming things are
indexed properly.

On the substantive question you raise, I hope Tom will weigh in here.
However, I'll give you my take.  It seems to me that you are likely to
run into problems not only with the row counts but also with target
lists and width estimates.  All of those things change once you
aggregate.  It seems clear that you are going to need a GroupingPath
here, but it also seems really clear that you can't take a path where
some aggregation has been done and use add_path to toss it on the pile
for the same RelOptInfo as unaggregated paths to which it is not
comparable.  Right now, there's a RelOptInfo corresponding to each
subset of the joinrels for which we build paths; but in this new
world, I think you'll end up with multiple joinrels for each
RelOptInfo, one per

In your example above, you'd end up with RelOptInfos for (1) {s} with
no aggregation, (2) {p} with no aggregation, (3) {s p} with no
aggregation, (4) {s} aggregated by s.product_id and (5) {s p}
aggregated by p.description.  You can generate paths for (5) either by
joining (2) to (4) or by aggregating (3), and the cheapest path for
(5) is the overall winner.  It seems a bit tricky to determine which
aggregations are worth considering, and how to represent that in the
RelOptInfo, but overall that feels like the right idea.

I'm not sure how much this is going to overlap with the work Tom is
already doing.  I think his principal goal is to let the planning of a
subquery return multiple candidate paths.  That's almost certain to
involve creating something like GroupingPath, though I can't predict
the details, and I suspect it's also likely that he'll create some new
RelOptInfo that represents the output of the grouping stage.  However,
I'm not sure that'll have the aggregation that has been performed
represented in any principled way that would lend itself to having
more than one RelOptInfo of that kind.  That's probably another
problem altogether.

However, my Tom-predictive capabilities are far from perfect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Konstantin Knizhnik
Дата:
Сообщение: Re: Batch update of indexes
Следующее
От: Robert Haas
Дата:
Сообщение: Re: Combining Aggregates