Обсуждение: [GENERAL] Are multiple array_aggs going to be in the same order?

Поиск
Список
Период
Сортировка

[GENERAL] Are multiple array_aggs going to be in the same order?

От
Guyren Howe
Дата:
If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs, will the ARRAY_AGGs be guaranteed to have entries in the same (ie corresponding) order?

eg

SELECT
u.name,
ARRAY_AGG(o.order_date) AS order_dates,
ARRAY_AGG(o.order_total) AS order_totals
FROM
user u JOIN
orders o USING (user_id)
GROUP BY
u.user_id

Re: [GENERAL] Are multiple array_aggs going to be in the same order?

От
Merlin Moncure
Дата:
On Sun, Apr 9, 2017 at 4:27 PM, Guyren Howe <guyren@gmail.com> wrote:
>
> If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs, will the ARRAY_AGGs be guaranteed to have entries in the
same(ie corresponding) order? 
>
> eg
>
> SELECT
> u.name,
> ARRAY_AGG(o.order_date) AS order_dates,
> ARRAY_AGG(o.order_total) AS order_totals
> FROM
> user u JOIN
> orders o USING (user_id)
> GROUP BY
> u.user_id


It is unsafe to rely on aggregation order unless specified -- you can
add ORDER BY to the aggregation clause.

merlin


Re: [GENERAL] Are multiple array_aggs going to be in the same order?

От
Tom Lane
Дата:
Merlin Moncure <mmoncure@gmail.com> writes:
> On Sun, Apr 9, 2017 at 4:27 PM, Guyren Howe <guyren@gmail.com> wrote:
>> If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs, will the ARRAY_AGGs be guaranteed to have entries in the
same(ie corresponding) order? 
>>
>> eg
>>
>> SELECT
>> u.name,
>> ARRAY_AGG(o.order_date) AS order_dates,
>> ARRAY_AGG(o.order_total) AS order_totals
>> FROM
>> user u JOIN
>> orders o USING (user_id)
>> GROUP BY
>> u.user_id

> It is unsafe to rely on aggregation order unless specified -- you can
> add ORDER BY to the aggregation clause.

You definitely can't assume anything about the order in which the FROM
clause will deliver rows, but I think that's not quite what the question
was.  If I read it right, the OP wants to be sure that the two aggregate
functions will see the data in the *same* unspecified order.  I think
that's a pretty safe assumption.  The server would have to go way
out of its way to do differently, and it doesn't.

            regards, tom lane


Re: [GENERAL] Are multiple array_aggs going to be in the same order?

От
Paul Jungwirth
Дата:
On 04/10/2017 09:33 AM, Merlin Moncure wrote:
> On Sun, Apr 9, 2017 at 4:27 PM, Guyren Howe <guyren@gmail.com> wrote:
>> If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs,
 >> will the ARRAY_AGGs be guaranteed to have entries in the
 >> same (ie corresponding) order?
> It is unsafe to rely on aggregation order unless specified --
 > you can add ORDER BY to the aggregation clause.

Just thought I'd add that if you do this:

     SELECT  u.name,
             ARRAY_AGG(o.order_date ORDER BY o.id) AS order_dates,
             ARRAY_AGG(o.order_total ORDER BY o.id) AS order_totals
     ...

Then you can check EXPLAIN and should see that Postgres is only doing
one sort, not two, so there is no performance cost. (Of course for more
complicated queries you might want to double check what EXPLAIN intends
to do.)

But something I've wondered myself is how well Postgres handles sorting
already-mostly-sorted lists? I tried diving into the code, starting from
here:

https://doxygen.postgresql.org/tuplesort_8c_source.html

but I couldn't answer the question myself. The comments say that as long
as the tuples all fit in work_mem, it uses qsort, but the code appears
to call qsort_ssup or qsort_tuple, whose definitions I couldn't find. (I
looks like they are from qsort_tuple.c which is generated by a Perl
script.) I know that qsort(3) is not necessarily quicksort, despite the
name. Does anyone know what algorithm Postgres uses? It seems like the
mostly-already-sorted case would happen a lot, so I'm curious if
Postgres pays a performance cost there?

Thanks,
Paul



Re: [GENERAL] Are multiple array_aggs going to be in the same order?

От
Merlin Moncure
Дата:
On Mon, Apr 10, 2017 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> On Sun, Apr 9, 2017 at 4:27 PM, Guyren Howe <guyren@gmail.com> wrote:
>>> If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs, will the ARRAY_AGGs be guaranteed to have entries in the
same(ie corresponding) order? 
>>>
>>> eg
>>>
>>> SELECT
>>> u.name,
>>> ARRAY_AGG(o.order_date) AS order_dates,
>>> ARRAY_AGG(o.order_total) AS order_totals
>>> FROM
>>> user u JOIN
>>> orders o USING (user_id)
>>> GROUP BY
>>> u.user_id
>
>> It is unsafe to rely on aggregation order unless specified -- you can
>> add ORDER BY to the aggregation clause.
>
> You definitely can't assume anything about the order in which the FROM
> clause will deliver rows, but I think that's not quite what the question
> was.  If I read it right, the OP wants to be sure that the two aggregate
> functions will see the data in the *same* unspecified order.  I think
> that's a pretty safe assumption.  The server would have to go way
> out of its way to do differently, and it doesn't.

Sure, but isn't it fair to consider that an implementation artifact?
If his code depends on that ordering being the same across aggregate
functions, and the SQL standard doesn't specify that (I guess it
might, but I'm skeptical), he ought to specify that for clarify at the
very least.

merlin


Re: [GENERAL] Are multiple array_aggs going to be in the same order?

От
"David G. Johnston"
Дата:
On Mon, Apr 10, 2017 at 11:02 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Mon, Apr 10, 2017 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> On Sun, Apr 9, 2017 at 4:27 PM, Guyren Howe <guyren@gmail.com> wrote:
>>> If I do a SELECT with a GROUP_BY and multiple ARRAY_AGGs, will the ARRAY_AGGs be guaranteed to have entries in the same (ie corresponding) order?
>>>
>>> eg
>>>
>>> SELECT
>>> u.name,
>>> ARRAY_AGG(o.order_date) AS order_dates,
>>> ARRAY_AGG(o.order_total) AS order_totals
>>> FROM
>>> user u JOIN
>>> orders o USING (user_id)
>>> GROUP BY
>>> u.user_id
>
>> It is unsafe to rely on aggregation order unless specified -- you can
>> add ORDER BY to the aggregation clause.
>
> You definitely can't assume anything about the order in which the FROM
> clause will deliver rows, but I think that's not quite what the question
> was.  If I read it right, the OP wants to be sure that the two aggregate
> functions will see the data in the *same* unspecified order.  I think
> that's a pretty safe assumption.  The server would have to go way
> out of its way to do differently, and it doesn't.

Sure, but isn't it fair to consider that an implementation artifact?
If his code depends on that ordering being the same across aggregate
functions, and the SQL standard doesn't specify that (I guess it
might, but I'm skeptical), he ought to specify that for clarify at the
very least.

So, the presence of ORDER BY in the aggregate function call is a PostgreSQL extension...

It seems reasonable to declare that the order of the values in the generated array match whatever order the FROM clause supplies the rows.  If that is not acceptable a PostgreSQL-specific ORDER BY modifier can be added which will cause an additional sort-and-scan of the input relation to occur (optimized across multiple column invocations when possible).  Thus two aggregate functions w/o an ORDER BY will always see the source rows in the same order.

SELECT array_agg(v)
FROM (
SELECT * FROM (VALUES (3),(1),(2)) vals (v) ORDER BY 1
) AS src

The only real confusion is whether a query like the above is guaranteed to supply rows to the outer select target list in order.  If that is indeed the case then the overall behavior seems quite reasonable to explicitly define like above.

Dave


Re: [GENERAL] Are multiple array_aggs going to be in the same order?

От
Tom Lane
Дата:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Mon, Apr 10, 2017 at 11:02 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> Sure, but isn't it fair to consider that an implementation artifact?

> So, the presence of ORDER BY in the aggregate function call is a PostgreSQL
> extension...

> It seems reasonable to declare that the order of the values in the
> generated array match whatever order the FROM clause supplies the rows.  If
> that is not acceptable a PostgreSQL-specific ORDER BY modifier can be added
> which will cause an additional sort-and-scan of the input relation to occur
> (optimized across multiple column invocations when possible).

Yes, and in fact we documented the ORDER-BY-in-subselect solution back
before we had the ORDER-BY-in-aggregate feature.  I don't remember exactly
where, but I'm sure it's still described somewhere.  So it is documented
behavior that an aggregate without its own ORDER BY will see the rows in
whatever order the FROM clause supplies them.

I'm not very keen on recommending that the OP insert an ORDER BY into
each aggregate call, because that would cause a separate sort for each
aggregate (unless someone's improved that recently while I wasn't
looking).

            regards, tom lane


Re: [GENERAL] Are multiple array_aggs going to be in the same order?

От
Merlin Moncure
Дата:
On Mon, Apr 10, 2017 at 1:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "David G. Johnston" <david.g.johnston@gmail.com> writes:
>> On Mon, Apr 10, 2017 at 11:02 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> Sure, but isn't it fair to consider that an implementation artifact?
>
>> So, the presence of ORDER BY in the aggregate function call is a PostgreSQL
>> extension...
>
>> It seems reasonable to declare that the order of the values in the
>> generated array match whatever order the FROM clause supplies the rows.  If
>> that is not acceptable a PostgreSQL-specific ORDER BY modifier can be added
>> which will cause an additional sort-and-scan of the input relation to occur
>> (optimized across multiple column invocations when possible).
>
> Yes, and in fact we documented the ORDER-BY-in-subselect solution back
> before we had the ORDER-BY-in-aggregate feature.  I don't remember exactly
> where, but I'm sure it's still described somewhere.  So it is documented
> behavior that an aggregate without its own ORDER BY will see the rows in
> whatever order the FROM clause supplies them.

The documentation is a bit ambiguous on the topic TBH.    Via
https://www.postgresql.org/docs/9.6/static/functions-aggregate.html:

"The aggregate functions array_agg, json_agg, jsonb_agg,
json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
similar user-defined aggregate functions, produce meaningfully
different result values depending on the order of the input values.
This ordering is unspecified by default, but can be controlled by
writing an ORDER BY clause within the aggregate call, as shown in
Section 4.2.7. Alternatively, supplying the input values from a sorted
subquery will usually work. For example:

SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;

Beware that this approach can fail if the outer query level contains
additional processing, such as a join, because that might cause the
subquery's output to be reordered before the aggregate is computed."

If you think the behavior ought to be defined to work that way, that's
fine by me.   A small documentation fix could clarify that, I think.

merlin


Re: [GENERAL] Are multiple array_aggs going to be in the same order?

От
Paul Jungwirth
Дата:
On 04/10/2017 11:35 AM, Tom Lane wrote:
> I'm not very keen on recommending that the OP insert an ORDER BY into
> each aggregate call, because that would cause a separate sort for each
> aggregate (unless someone's improved that recently while I wasn't
> looking).

I mentioned this in my other email, upon a second look I was misreading
the EXPLAIN output. (The sort was for the GROUP BY, not the individual
ORDER BYs.) Do aggregate function sorts even show up in EXPLAIN? I can't
seem to find any. For example in this try the sorts are just for
grouping and joining:

db=> explain select c.id, array_agg(e.item order by e.id),
array_agg(e.payee order by e.id) from expense_categories c join expenses
e on e.expense_category_id = c.id group by c.id;
                                       QUERY PLAN

---------------------------------------------------------------------------------------
  GroupAggregate  (cost=223.44..285.14 rows=16 width=30)
    ->  Merge Join  (cost=223.44..264.44 rows=2728 width=30)
          Merge Cond: (c.id = e.expense_category_id)
          ->  Sort  (cost=1.48..1.52 rows=16 width=4)
                Sort Key: c.id
                ->  Seq Scan on expense_categories c  (cost=0.00..1.16
rows=16 width=4)
          ->  Sort  (cost=221.96..228.78 rows=2728 width=30)
                Sort Key: e.expense_category_id
                ->  Seq Scan on expenses e  (cost=0.00..66.28 rows=2728
width=30)
(9 rows)

Paul



Re: [GENERAL] Are multiple array_aggs going to be in the same order?

От
Merlin Moncure
Дата:
On Mon, Apr 10, 2017 at 1:43 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Mon, Apr 10, 2017 at 1:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Yes, and in fact we documented the ORDER-BY-in-subselect solution back
>> before we had the ORDER-BY-in-aggregate feature.  I don't remember exactly
>> where, but I'm sure it's still described somewhere.  So it is documented
>> behavior that an aggregate without its own ORDER BY will see the rows in
>> whatever order the FROM clause supplies them.
>
> The documentation is a bit ambiguous on the topic TBH.    Via
> https://www.postgresql.org/docs/9.6/static/functions-aggregate.html:
>
> "The aggregate functions array_agg, json_agg, jsonb_agg,
> json_object_agg, jsonb_object_agg, string_agg, and xmlagg, as well as
> similar user-defined aggregate functions, produce meaningfully
> different result values depending on the order of the input values.
> This ordering is unspecified by default, but can be controlled by
> writing an ORDER BY clause within the aggregate call, as shown in
> Section 4.2.7. Alternatively, supplying the input values from a sorted
> subquery will usually work.
<snip>

What do you think about injecting the following lingo?

... depending on the order of the input values. This ordering is
unspecified by default, but can be controlled by ...

would be changed to

... depending on the order of the input values. Input value ordering
will be consistent across multiple order dependent aggregate functions
across the same grouping but is otherwise unspecified by default as
long as all there is no explicit ordering for any aggregate function
in the grouping.  Furthermore, input value ordering can be controlled
by ...

merlin


Re: [GENERAL] Are multiple array_aggs going to be in the same order?

От
Tom Lane
Дата:
Paul Jungwirth <pj@illuminatedcomputing.com> writes:
> On 04/10/2017 11:35 AM, Tom Lane wrote:
>> I'm not very keen on recommending that the OP insert an ORDER BY into
>> each aggregate call, because that would cause a separate sort for each
>> aggregate (unless someone's improved that recently while I wasn't
>> looking).

> I mentioned this in my other email, upon a second look I was misreading
> the EXPLAIN output. (The sort was for the GROUP BY, not the individual
> ORDER BYs.) Do aggregate function sorts even show up in EXPLAIN?

They don't.  It's not a plannable behavior, just something that's
hard-wired into nodeAgg.c.  I'd like to see that improved someday,
but I don't think it's really high on anyone's priority list.

            regards, tom lane