Обсуждение: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg

Поиск
Список
Период
Сортировка
It seems CREATE  AGGREGATE was expanded in 9.6 to support parallelization of
aggregate functions using transitions, with the addition of serialfunc and
deserialfunc to the aggregate definitions.

https://www.postgresql.org/docs/10/static/sql-createaggregate.html

I was looking at the PostgreSQL 10 source code for some example usages of
this and was hoping that array_agg and string_agg would support the feature.
At a cursory glance, it seems they do not use this.
Examples I see that do support it are the average and standard deviation
functions.

Is there a reason for this or it just wasn't gotten to?


Thanks,
Regina






On Tue, Jun 06, 2017 at 01:52:45PM -0400, Regina Obe wrote:
> It seems CREATE  AGGREGATE was expanded in 9.6 to support parallelization of
> aggregate functions using transitions, with the addition of serialfunc and
> deserialfunc to the aggregate definitions.
> 
> https://www.postgresql.org/docs/10/static/sql-createaggregate.html
> 
> I was looking at the PostgreSQL 10 source code for some example usages of
> this and was hoping that array_agg and string_agg would support the feature.
> At a cursory glance, it seems they do not use this.
> Examples I see that do support it are the average and standard deviation
> functions.
> 
> Is there a reason for this or it just wasn't gotten to?

I'd bet on lack of tuits.  Anything with text has to deal with
collation issues, etc., that may make this trickier than it first
appears.

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



On 2017-06-06 12:23:49 -0700, David Fetter wrote:
> On Tue, Jun 06, 2017 at 01:52:45PM -0400, Regina Obe wrote:
> > It seems CREATE  AGGREGATE was expanded in 9.6 to support parallelization of
> > aggregate functions using transitions, with the addition of serialfunc and
> > deserialfunc to the aggregate definitions.
> > 
> > https://www.postgresql.org/docs/10/static/sql-createaggregate.html
> > 
> > I was looking at the PostgreSQL 10 source code for some example usages of
> > this and was hoping that array_agg and string_agg would support the feature.
> > At a cursory glance, it seems they do not use this.
> > Examples I see that do support it are the average and standard deviation
> > functions.
> > 
> > Is there a reason for this or it just wasn't gotten to?

I'd suggest trying to write a parallel version of them ;).  Shouldn't be
too hard.


> I'd bet on lack of tuits.  Anything with text has to deal with
> collation issues, etc., that may make this trickier than it first
> appears.

I don't see how collations makes things more complicated here.

- Andres



On 6/6/17 13:52, Regina Obe wrote:
> It seems CREATE  AGGREGATE was expanded in 9.6 to support parallelization of
> aggregate functions using transitions, with the addition of serialfunc and
> deserialfunc to the aggregate definitions.
> 
> https://www.postgresql.org/docs/10/static/sql-createaggregate.html
> 
> I was looking at the PostgreSQL 10 source code for some example usages of
> this and was hoping that array_agg and string_agg would support the feature.

I'm not sure how you would parallelize these, since in most uses you
want to have a deterministic output order.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



On 2017-06-06 23:32:53 -0400, Peter Eisentraut wrote:
> On 6/6/17 13:52, Regina Obe wrote:
> > It seems CREATE  AGGREGATE was expanded in 9.6 to support parallelization of
> > aggregate functions using transitions, with the addition of serialfunc and
> > deserialfunc to the aggregate definitions.
> > 
> > https://www.postgresql.org/docs/10/static/sql-createaggregate.html
> > 
> > I was looking at the PostgreSQL 10 source code for some example usages of
> > this and was hoping that array_agg and string_agg would support the feature.
> 
> I'm not sure how you would parallelize these, since in most uses you
> want to have a deterministic output order.

Unless you specify ORDER BY you don't really have that anyway, consider
hash-aggregation.  If you want deterministic order, you really need an
ORDER BY inside the aggregate.

- Andres



> On 6/6/17 13:52, Regina Obe wrote:
>> It seems CREATE  AGGREGATE was expanded in 9.6 to support
>> parallelization of aggregate functions using transitions, with the
>> addition of serialfunc and deserialfunc to the aggregate definitions.
>>
>> https://www.postgresql.org/docs/10/static/sql-createaggregate.html
>>
>> I was looking at the PostgreSQL 10 source code for some example usages
>> of this and was hoping that array_agg and string_agg would support the feature.

> I'm not sure how you would parallelize these, since in most uses you want to have a deterministic output order.

>--
> Peter Eisentraut              http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Good point.  If that's the reason it wasn't done, that's good just wasn't sure.

But if you didn't have an ORDER BY in your aggregate usage, and you did have those transition functions, it shouldn't
beany different from any other use case right? 
I imagine you are right that most folks who use array_agg and string_agg usually combine it with array_agg(... ORDER BY
..)

My main reason for asking is that most of the PostGIS geometry and raster aggregate functions use transitions and were
patternedafter array agg. 

In the case of PostGIS the sorting is done internally and really only to expedite take advantage of things like
cascadedunion algorithms.  That is always done though (so even if each worker does it on just it's batch that's still
betterthan having only one worker). 
So I think it's still very beneficial to break into separate jobs since in the end the gather, will have  say 2 biggish
geometriesor 2 biggish rasters to union if you have 2 workers which is still better than having a million smallish
geometries/rastersto union 

Split Union

Worker 1:

Parallel agg (internal sort geoms by box)  - Union

Worker 2:
Parallel Agg (internal sort geoms )  - Union


Gather  Union(union, union) internal sort.


Thanks,
Regina







Andres Freund <andres@anarazel.de> writes:
> On 2017-06-06 23:32:53 -0400, Peter Eisentraut wrote:
>> I'm not sure how you would parallelize these, since in most uses you
>> want to have a deterministic output order.

> Unless you specify ORDER BY you don't really have that anyway, consider
> hash-aggregation.  If you want deterministic order, you really need an
> ORDER BY inside the aggregate.

Hash aggregation does not destroy the property that array_agg/string_agg
will produce results whose components appear in the order that the
subquery emitted them in.  It only causes the various aggregate results
in a GROUP BY query to themselves appear in random order.

Now you could argue that the subquery might've gotten parallelized and
emitted its outputs in some random order, so doing the same thing one
level further up changes nothing.  But you can't defend this on this
basis that it was historically unpredictable, because it wasn't.
        regards, tom lane



On 2017-06-07 00:03:15 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-06-06 23:32:53 -0400, Peter Eisentraut wrote:
> >> I'm not sure how you would parallelize these, since in most uses you
> >> want to have a deterministic output order.
> 
> > Unless you specify ORDER BY you don't really have that anyway, consider
> > hash-aggregation.  If you want deterministic order, you really need an
> > ORDER BY inside the aggregate.
> 
> Hash aggregation does not destroy the property that array_agg/string_agg
> will produce results whose components appear in the order that the
> subquery emitted them in.  It only causes the various aggregate results
> in a GROUP BY query to themselves appear in random order.

Whoa, I obviously should stop working tonight.  I think it's still a
hugely useful to parallelize such aggregates - it might be worthwhile to
have two versions of array_agg, one with a serial/combinefunc and one
without...

Greetings,

Andres Freund



On Tue, Jun 6, 2017 at 3:23 PM, David Fetter <david@fetter.org> wrote:
> I'd bet on lack of tuits.

I expect that was part of it.  Another thing to consider is that, for
numeric aggregates, the transition values don't generally get larger
as you aggregate, but for something like string_agg(), they will.
It's not clear how much work we'll really save by parallelizing that
sort of thing.  Maybe it will be great?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



On Wed, Jun 7, 2017 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 6, 2017 at 3:23 PM, David Fetter <david@fetter.org> wrote:
>> I'd bet on lack of tuits.
>
> I expect that was part of it.  Another thing to consider is that, for
> numeric aggregates, the transition values don't generally get larger
> as you aggregate, but for something like string_agg(), they will.
> It's not clear how much work we'll really save by parallelizing that
> sort of thing.  Maybe it will be great?

+1, I was thinking about the same. There might be some cases when the
output of array_agg/string_agg is not a lot wider but the underlying
scans are large e.g. having clause containing another aggregate and
very small group sizes. I am not sure how frequent are such usecases.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Hi,

On 6/7/17 5:52 AM, Regina Obe wrote:
>> On 6/6/17 13:52, Regina Obe wrote:
>>> It seems CREATE  AGGREGATE was expanded in 9.6 to support
>>> parallelization of aggregate functions using transitions, with the
>>> addition of serialfunc and deserialfunc to the aggregate definitions.
>>>
>>> https://www.postgresql.org/docs/10/static/sql-createaggregate.html
>>>
>>> I was looking at the PostgreSQL 10 source code for some example usages
>>> of this and was hoping that array_agg and string_agg would support the feature.
> 
>> I'm not sure how you would parallelize these, since in most uses
>> you want to have a deterministic output order.
> 
>> -- 
>> Peter Eisentraut              http://www.2ndQuadrant.com/
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> Good point.  If that's the reason it wasn't done, that's good just wasn't sure.
> 
> But if you didn't have an ORDER BY in your aggregate usage, and you
> did have those transition functions, it shouldn't be any different from
> any other use case right?
> I imagine you are right that most folks who use array_agg and
> string_agg usually combine it with array_agg(... ORDER BY ..)
> 

I think that TL had in mind is something like
    SELECT array_agg(x) FROM (       SELECT x FROM bar ORDER BY y    ) foo;

i.e. a subquery producing the data in predictable order.
>
> My main reason for asking is that most of the PostGIS geometry and
> raster aggregate functions use transitions and were patterned after
> array agg.
> 

> In the case of PostGIS the sorting is done internally and really
> only to expedite take advantage of things like cascaded union
> algorithms.
> That is always done though (so even if each worker does it on just it's
> batch that's still better than having only one worker).
> So I think it's still very beneficial to break into separate jobs
> since in the end the gather, will have say 2 biggish geometries or 2
> biggish rasters to union if you have 2 workers which is still better
> than having a million smallish geometries/rasters to union
I'm not sure I got your point correctly, but if you can (for example) 
sort the per-worker results as part of the "serialize" function, and 
benefit from that while combining that in the gather, then sure, that 
should be a huge win.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services