Re: Combining Aggregates

Поиск
Список
Период
Сортировка
От David Rowley
Тема Re: Combining Aggregates
Дата
Msg-id CAKJS1f9POjTnjybdtYrG2awKnsuRoPbabwkLvisej9LV=HMnLQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Combining Aggregates  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: Combining Aggregates  (David Rowley <david.rowley@2ndquadrant.com>)
Список pgsql-hackers
On 9 December 2015 at 06:18, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Dec 3, 2015 at 12:01 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 27 July 2015 at 04:58, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> This patch seems sane to me, as far as it goes. However, there's no
>> planner or executor code to use the aggregate combining for anything. I'm
>> not a big fan of dead code, I'd really like to see something to use this.
>
> I've attached an updated version of the patch. The main change from last
> time is that I've added executor support and exposed this to the planner via
> two new parameters in make_agg().
>
> I've also added EXPLAIN support, this will display "Partial
> [Hash|Group]Aggregate" for cases where the final function won't be called
> and displays "Finalize [Hash|Group]Aggregate" when combining states and
> finalizing aggregates.
>
> This patch is currently intended for foundation work for parallel
> aggregation.

Given Tom's dislike of executor nodes that do more than one thing, I
fear he's not going to be very happy about combining Aggregate,
PartialAggregate, FinalizeAggregate, GroupAggregate,
PartialGroupAggregate, FinalizeGroupAggregate, HashAggregate,
PartialHashAggregate, and FinalizeHashAggregate under one roof.
However, it's not at all obvious to me what would be any better.
nodeAgg.c is already a very big file, and this patch adds a
surprisingly small amount of code to it.


Hmm yes. I read that too. It's a tricky one. I'm not sure where the line is and when we go over it. At least nodeAgg.c does not need any additional work again for parallel enabling... This should be all that's required for that.
 
I think the parameter in CREATE AGGREGATE ought to be called
COMBINEFUNC rather than CFUNC.  There are a lot of English words that
begin with C, and it's not self-evident which one is meant.


Good idea, I've changed this in the attached.
 
"iif performing the final aggregate stage we'll check the qual" should
probably say "If" with a capital letter rather than "iif" without one.


While building the test code I ended up deciding it's best to not change this part at all and just always check the qual. In the case of partial aggregation I think it should just be up to the planner not to pass the HAVING clause as the node's qual, and only pass this when Finalizing the aggregates. Seem fair?
 
I think it would be useful to have a test patch associated with this
patch that generates a FinalizeAggregate + PartialAggregate combo
instead of an aggregate, and run that through the regression tests.
There will obviously be EXPLAIN differences, but in theory nothing
else should blow up.  Of course, such tests will become more
meaningful as we add more combine-functions, but this would at least
be a start.


I've been working on this, and I've attached what I've got so far. The plans will look something like:

# explain select sum(a) from test;
                             QUERY PLAN
--------------------------------------------------------------------
 Finalize Aggregate  (cost=18.53..18.54 rows=1 width=4)
   ->  Partial Aggregate  (cost=18.51..18.52 rows=1 width=4)
         ->  Seq Scan on test  (cost=0.00..16.01 rows=1001 width=4)

I seem to have got all of this working with the test patch, and the only regression tests which failed were due to EXPLAIN output changing, rather than results changing. I *was* all quite happy with it until I thought that I'd better write a aggregate combine function which has an INTERNAL state. I had thought that I could get this working by insisting that the combine function either update the existing state, or in the case there's no existing state, just write all the values from the combining state into a newly created state which is allocated in the correct memory context. I tried this by implementing a combinefn for SUM(NUMERIC) and all seems to work fine for hash aggregates, but falls flat for sorted or plain aggregates. 

# select x,sum(x::numeric) from generate_series(1,3) x(x) group by x;
 x | sum
---+-----
 1 |   1
 3 |   3
 2 |   2
(3 rows)

This one works ok using hash aggregate.

But sorted and plain aggregates fail:

# select sum(x::numeric) from generate_series(1,3) x(x);
ERROR:  invalid memory alloc request size 18446744072025250716

The reason that this happens is down to the executor always thinking that Aggref returns the aggtype, but in cases where we're not finalizing the aggregate the executor needs to know that we're actually returning aggtranstype instead. Hash aggregates appear to work as the trans value is just stuffed into a hash table, but with plain and sorted aggregates this gets formed into a Tuple again, and forming a tuple naturally requires the types to be set correctly! I'm not sure exactly what the best way to fix this will be. I've hacked something up in the attached test patch which gets around the problem by adding a new aggtranstype to Aggref and also an 'aggskipfinalize'  field which I manually set to true in a bit of a hacky way inside the grouping planner. Then in exprType() for Aggref I conditionally return the aggtype or aggtranstype based on the aggskipfinalize setting.  This is most likely not the way to properly fix this, but I'm running out of steam today to think of how it should be done, so I'm currently very open to ideas on this.

I should also mention that the setrefs stuff in the test patch is borrowed from Haribabu's Parallel Aggregate patch. I'm not quite clear on which patch that part should go into at the moment.

Generally, I think that this patch will be excellent infrastructure
for parallel aggregate and I think we should try to get it committed
soon.

Thanks.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: YUriy Zhuravlev
Дата:
Сообщение: Small fix in pg_rewind (redundant declaration)
Следующее
От: David Rowley
Дата:
Сообщение: Re: Performance improvement for joins where outer side is unique