Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17
| From | Frank Heikens |
|---|---|
| Subject | Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17 |
| Date | |
| Msg-id | 540A0AD2-7CAA-4535-82F7-FC9055437B20@elevarq.com |
| In reply to | Re: Significant performance issues with array_agg() + HashAggregate plans on Postgres 17 (Tom Lane <tgl@sss.pgh.pa.us>) |
| List | pgsql-performance |
Hi,
I've been investigating a performance problem with user-defined aggregates
that use array_append as the transition function (SFUNC = array_append,
STYPE = bigint[]). These aggregates become catastrophically slow under
HashAggregate when many groups are present, as the reproducer below demonstrates.
Root cause
----------
hash_agg_check_limits() calls MemoryContextMemAllocated(ctx, true) after
every new group addition. The recursive flag causes it to traverse all
child memory contexts, which is O(C) per call.
array_append creates an expanded-array object per group, each owning a
private AllocSet child context. With N groups the walk therefore visits
O(N) child contexts per call, so the total traversal cost across all group
additions is O(N^2). This completely dominates the actual aggregate
computation once the group count reaches tens of thousands.
Reproducer (Tom's test case)
----------------------------
CREATE AGGREGATE array_agg(
BASETYPE = bigint, SFUNC = array_append,
STYPE = bigint[], INITCOND = '{}');
CREATE TABLE array_agg_test(
product_id bigint NOT NULL,
region_id bigint NOT NULL,
available boolean NOT NULL);
INSERT INTO array_agg_test
SELECT generate_series(1, 50000),
(ARRAY[1,2,3,4])[floor(random()*4)+1], true;
-- (repeat 3 more times with region arrays [11..14], [111..114], [1111..1114])
VACUUM ANALYZE array_agg_test;
SET work_mem = '200MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT product_id, array_agg(region_id)
FROM array_agg_test GROUP BY product_id;
-- 50K groups, ~4 rows per group
-- Without patch: ~5 seconds
-- With patch: ~150 ms
-- Built-in array_agg: ~100 ms
Fix
---
The attached patch throttles the recursive memory check: once the group
count exceeds 1024, the full check only runs every 1024th new group.
Below 1024 groups, behavior is unchanged.
This caps the spill-detection latency to at most 1024 groups' worth of
memory. For typical work_mem settings, that's a few megabytes of
potential overshoot -- well within the existing tolerance of the check,
which was already described as "imperfect" in its own comments.
The patch is purely local to hash_agg_check_limits() in nodeAgg.c:
about 15 lines of functional change plus comments.
Benchmarks (PG 19devel, debug build, -O0)
------------------------------------------
Tom Lane reproducer (50K groups):
custom array_agg 200MB: 4.6 s -> 152 ms (30x)
custom array_agg 4MB: 388 ms -> 283 ms (no regression)
built-in array_agg: 102 ms -> 103 ms (unchanged)
Synthetic (100K groups x 50 elems):
custom 256MB: 35.0 s -> 2.9 s (12x)
custom 4MB: 6.1 s -> 5.9 s (unchanged)
Extreme (500K groups x 2 elems):
custom 256MB: 672 s -> 1.7 s (395x)
Non-array aggregates (count, sum, string_agg): unchanged
Built-in array_agg: unchanged
Spill behavior (batches, disk usage): unchanged
Regards,
Frank Heikens
> On Apr 2, 2026, at 10:38 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Scott Carey <scott.carey@algonomy.com> writes:
>> I have discovered the root cause.
>> This database is old. It pre-dates Postgres 8.4 which introduced
>> array_agg. Apparently, in some version prior to 8.4 array_agg was added
>> as a user function, defined as below for bigint:
>
>> create AGGREGATE array_agg(
>> BASETYPE = bigint,
>> SFUNC = array_append,
>> STYPE = bigint[],
>> INITCOND = '{}'
>> );
>
>> So if you create a test database and run the previous test, performance
>> will be fine and the query will be fast. Then run:
>> create AGGREGATE array_agg(BASETYPE = bigint, SFUNC = array_append,STYPE =
>> bigint[], INITCOND = '{}');
>
>> It will be slow and reproduce this behavior.
>
> Thank you for running that to ground! I confirm your results that v13
> and up are far slower for this example than v12 was.
>
>> Why would this run so much more slowly after updating from postgres 12 to
>> 17? It is a user defined aggregate, although maybe not as optimized as
>> the intrinsic one it shouldn't behave this way.
>
> I did some bisecting using the attached simplified test case, and found
> that the query execution time jumps from circa 60ms to circa 7500ms here:
>
> 1f39bce021540fde00990af55b4432c55ef4b3c7 is the first bad commit
> commit 1f39bce021540fde00990af55b4432c55ef4b3c7
> Author: Jeff Davis <jdavis@postgresql.org>
> Date: Wed Mar 18 15:42:02 2020 -0700
>
> Disk-based Hash Aggregation.
>
> While performing hash aggregation, track memory usage when adding new
> groups to a hash table. If the memory usage exceeds work_mem, enter
> "spill mode".
>
> (Times quoted are on a Mac M4 Pro, but in assert-enabled builds so
> maybe not directly comparable to production.)
>
> I'm bemused as to why: the test case has work_mem set high enough that
> we shouldn't be triggering spill mode, so why did this change affect
> it at all?
>
> regards, tom lane
>
> CREATE AGGREGATE array_agg(
> BASETYPE = bigint,
> SFUNC = array_append,
> STYPE = bigint[],
> INITCOND = '{}'
> );
>
> drop table if exists array_agg_test;
>
> create table array_agg_test(product_id bigint not null, region_id bigint
> not null, available boolean not null);
>
> insert into array_agg_test (product_id, region_id, available) SELECT
> generate_series(1, 50000) as product_id,
> (ARRAY[1,2,3,4])[floor(random()*4)+1] as region_id, true as available;
>
> insert into array_agg_test (product_id, region_id, available) SELECT
> generate_series(1, 50000) as product_id,
> (ARRAY[11,12,13,14])[floor(random()*4)+1] as region_id, true as available;
>
> insert into array_agg_test (product_id, region_id, available) SELECT
> generate_series(1, 50000) as product_id,
> (ARRAY[111,112,113,114])[floor(random()*4)+1] as region_id, true as
> available;
>
> insert into array_agg_test (product_id, region_id, available) SELECT
> generate_series(1, 50000) as product_id,
> (ARRAY[1111,1112,1113,1114])[floor(random()*4)+1] as region_id, true as
> available;
>
> vacuum analyze array_agg_test;
>
> \set ECHO all
>
> -- set hash_mem_multiplier = 2;
> set work_mem = '200MB';
>
> explain (analyze, buffers) select product_id, array_agg(region_id) from
> array_agg_test group by product_id;