Thread: Query on partitioned table needs memory n_partitions * work_mem


Query on partitioned table needs memory n_partitions * work_mem

From:
Dimitrios Apostolou
Date:
Hello list,

I have a table with 1000 partitions on PostgreSQL 16.
I notice that a fairly complicated query of the form:

SELECT ... GROUP BY ... LIMIT ...

causes the postgres backend process to grow insanely fast, and the
kernel OOM killer to kill it rather soon.
It seems it tries to allocate at least 1000 * work_mem.

If I reduce the amount of work_mem, I can control the outcome and avoid
the crash, but this is suboptimal.
I have parallel plans disabled (max_parallel_workers_per_gather=0).
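
For completeness, the session-level knobs involved look roughly like this
(illustrative values, not a recommendation):

SET max_parallel_workers_per_gather = 0;   -- no parallel plans
SET work_mem = '4MB';                      -- smaller per-node budget avoids the OOM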

To add a bit more info on the execution plan, I believe the relevant part
is the 1000 HashAggregate nodes under Append:

->  Append
   ->  HashAggregate
     ->  Seq Scan
   -> ... 1000 more hashagg+seqscans


Is this allocation pattern (work_mem * n_partitions) expected under any
scenario? I can't find it documented.  AFAIU the backend should allocate
up to (depth_of_execution_plan * work_mem) (putting aside the
hash_mem_multiplier and the parallel workers).

NOTE: after having written the above message, it occurred to me that I have
enable_partitionwise_aggregate=on. Turning it off fixes the issue and
makes the query faster too! Expected behaviour or bug?
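
(In case it helps anyone hitting the same thing, this is what I mean, per
session; a minimal sketch:)

SHOW enable_partitionwise_aggregate;       -- currently 'on' here
SET enable_partitionwise_aggregate = off;  -- workaround for this session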

Thank you in advance,
Dimitris

P.S. In the meantime I'm trying to reduce the query and the table schema,
      in order to submit a precise bug report with repro instructions.




Re: Query on partitioned table needs memory n_partitions * work_mem

From:
David Rowley
Date:
On Thu, 11 Jul 2024 at 13:19, Dimitrios Apostolou <jimis@gmx.net> wrote:
> I have a table with 1000 partitions on PostgreSQL 16.
> I notice that a fairly complicated query of the form:
>
> SELECT ... GROUP BY ... LIMIT ...
>
> causes the postgres backend process to grow insanely fast, and the
> kernel OOM killer to kill it rather soon.
> It seems it tries to allocate at least 1000 * work_mem.

> ->  Append
>    ->  HashAggregate
>      ->  Seq Scan
>    -> ... 1000 more hashagg+seqscans
>
>
> Is this allocation pattern (work_mem * n_partitions) expected under any
> scenario? I can't find it documented.  AFAIU the backend should allocate
> up to (depth_of_execution_plan * work_mem) (putting aside the
> hash_mem_multiplier and the parallel workers).

Not depth of execution plan. It relates to the number of nodes in the
plan which allocate work_mem or work_mem * hash_mem_multiplier.
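
As a rough worst-case estimate with the numbers you mentioned (this is just
arithmetic, not an exact accounting; hash_mem_multiplier defaults to 2.0 on
v15 and later):

-- ~1000 HashAggregate nodes, each allowed work_mem * hash_mem_multiplier
-- before it spills:
SELECT pg_size_pretty(1000
                      * pg_size_bytes(current_setting('work_mem'))
                      * current_setting('hash_mem_multiplier')::numeric)
       AS worst_case_hash_memory;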

There is some documentation in [1]:

"Note that a complex query might perform several sort and hash
operations at the same time"

Also, see the warning about execution time memory in [2].

> NOTE: after having written the above message, it occurred to me that I have
> enable_partitionwise_aggregate=on. Turning it off fixes the issue and
> makes the query faster too! Expected behaviour or bug?

enable_partitionwise_aggregate=on causes this query to perform an
aggregate per partition.  If your GROUP BY clause values are
distributed evenly throughout all partitions then you might find it's
not much slower to execute the query with
enable_partitionwise_aggregate=off.

It's understandable that how PostgreSQL uses work_mem isn't ideal
here, but unfortunately, that's the state of affairs, currently. You
might want to reconsider your enable_partitionwise_aggregate setting
and/or how many partitions you have.
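
(If useful, the number of leaf partitions involved can be checked directly;
a minimal sketch, where the table name is a placeholder:)

-- PostgreSQL 12+:
SELECT count(*) FROM pg_partition_tree('your_partitioned_table') WHERE isleaf;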

David

[1] https://www.postgresql.org/docs/current/runtime-config-resource.html
[2] https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-BEST-PRACTICES



Re: Query on partitioned table needs memory n_partitions * work_mem

From:
Dimitrios Apostolou
Date:
Thank you for the feedback.

So I've managed to reduce the query to a rather simple one:

SELECT
     workitem_n, test_executable_n,
     bool_or(test_resulttype_n IN (2,3))
FROM
     test_runs_raw
GROUP BY
     workitem_n, test_executable_n
LIMIT 10;


The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n). All the
columns are various integer types. There is an index on workitem_n.
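
In case it helps to picture it, the shape is roughly the following (column
types and partition bounds here are approximations, not the real DDL):

CREATE TABLE test_runs_raw (
     workitem_n         integer NOT NULL,
     test_executable_n  integer,
     test_resulttype_n  smallint
) PARTITION BY RANGE (workitem_n);

-- ~1000 partitions like these, one per range of workitem_n:
CREATE TABLE test_runs_raw__part_max20k
     PARTITION OF test_runs_raw FOR VALUES FROM (0) TO (20000);
CREATE TABLE test_runs_raw__part_max40k
     PARTITION OF test_runs_raw FOR VALUES FROM (20000) TO (40000);
-- ...

CREATE INDEX ON test_runs_raw (workitem_n);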



On Thu, 11 Jul 2024, David Rowley wrote:

> On Thu, 11 Jul 2024 at 13:19, Dimitrios Apostolou <jimis@gmx.net> wrote:
>> I have a table with 1000 partitions on PostgreSQL 16.
>> I notice that a fairly complicated query of the form:
>>
>> SELECT ... GROUP BY ... LIMIT ...
>>
>> causes the postgres backend process to grow insanely fast, and the
>> kernel OOM killer to kill it rather soon.
>> It seems it tries to allocate at least 1000 * work_mem.
>
>> ->  Append
>>    ->  HashAggregate
>>      ->  Seq Scan
>>    -> ... 1000 more hashagg+seqscans
>>
>>
>> Is this allocation pattern (work_mem * n_partitions) expected under any
>> scenario? I can't find it documented.  AFAIU the backend should allocate
>> up to (depth_of_execution_plan * work_mem) (putting aside the
>> hash_mem_multiplier and the parallel workers).
>
> Not depth of execution plan. It relates to the number of nodes in the
> plan which allocate work_mem or work_mem * hash_mem_multiplier.
>
> There is some documentation in [1]:
>
> "Note that a complex query might perform several sort and hash
> operations at the same time"

The latest query is not complex at all, and I don't see it doing 1000s of
operations at the same time. By "number of nodes", do you mean adding up all
HashAggregate nodes under the Append node? Here is part of the EXPLAIN
ANALYZE output:

  Limit  (cost=0.01..28.00 rows=10 width=7) (actual time=43120.466..43292.246 rows=10 loops=1)
    Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, (bool_or((test_runs_raw.test_resulttype_n = ANY ('{2,3}'::integer[]))))
    Buffers: shared hit=96 read=883975
    I/O Timings: shared read=16284.731
    ->  Append  (cost=0.01..3416299633.71 rows=1220556171 width=7) (actual time=42968.794..43139.855 rows=10 loops=1)
          Buffers: shared hit=96 read=883975
          I/O Timings: shared read=16284.731
          ->  HashAggregate  (cost=0.01..0.02 rows=1 width=7) (actual time=10.662..10.663 rows=0 loops=1)
                Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, bool_or((test_runs_raw.test_resulttype_n = ANY ('{2,3}'::integer[])))
                Group Key: test_runs_raw.workitem_n, test_runs_raw.test_executable_n
                Batches: 1  Memory Usage: 24kB
                ->  Seq Scan on public.test_runs_raw__part_max20k test_runs_raw  (cost=0.00..0.00 rows=1 width=8) (actual time=9.960..9.961 rows=0 loops=1)
                      Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, test_runs_raw.test_resulttype_n
          ->  HashAggregate  (cost=0.01..0.02 rows=1 width=7) (actual time=1.913..1.914 rows=0 loops=1)
                Output: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n, bool_or((test_runs_raw_1.test_resulttype_n = ANY ('{2,3}'::integer[])))
                Group Key: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n
                Batches: 1  Memory Usage: 24kB
                ->  Seq Scan on public.test_runs_raw__part_max40k test_runs_raw_1  (cost=0.00..0.00 rows=1 width=8) (actual time=1.031..1.031 rows=0 loops=1)
                      Output: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n, test_runs_raw_1.test_resulttype_n
[ .... 1000s of similar HashAggregate nodes ... ]
Settings: temp_buffers = '32MB', work_mem = '32MB', effective_io_concurrency = '300', max_parallel_workers_per_gather = '0', enable_hashjoin = 'off', enable_partitionwise_join = 'on', enable_partitionwise_aggregate = 'on', random_page_cost = '1.1', effective_cache_size = '6GB', from_collapse_limit = '24', join_collapse_limit = '24'
  Planning:
    Buffers: shared hit=377
  Planning Time: 1503.800 ms
  Execution Time: 56515.185 ms
(5382 rows)

Memory usage on each HashAggregate is logged as 24kB (many HashAggregates
are missing that info though). I guess the EXPLAIN output is missing some
important part of the allocations here, since I'm seeing MBs of allocations
per node.

I can't help but see this as a bug. I see many issues:

* postgres is not reading from partitions in parallel, but one after the
   other. It shouldn't need all this memory simultaneously.

* The memory is unnecessarily allocated early on, before any partitions
   are actually aggregated. I know this because I/O is slow on this device
   and the table sizes are huge, it's simply not possible that postgres
   went through all partitions and blew up the memory. That would take
   hours, but the OOM happens seconds after I start the query.

* The memory is not only allocated by the planner, but it's actually
   accessed. Libc's malloc() has no problem allocating gigabytes more than
   what I have available, growing the VSZ memory size without swapping out
   a single byte. That's because my kernel was set to overcommit.  The fact
   that postgres grows its RSS memory too, swaps out other processes and
   finally gets killed is because it actually writes to that memory!
   I wonder what it writes, if it hasn't read any of the data on the
   partitions. :-)
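
(For anyone wanting to inspect the same thing without waiting for the OOM: a
breakdown like the dump below can also be requested on demand from a second
session on PostgreSQL 14+; the pid is of course a placeholder.)

-- Writes the target backend's memory context statistics to the server log:
SELECT pg_log_backend_memory_contexts(12345);  -- 12345 = pid of the backend running the query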

For your consideration, here is also a memory dump from when the backend
consumes all the memory (work_mem increased to 256MB), and I've disabled
overcommit in the kernel:


TopMemoryContext: 573520 total in 8 blocks; 214096 free (27 chunks); 359424 used
   TopTransactionContext: 8192 total in 1 blocks; 7752 free (1 chunks); 440 used
   Record information cache: 8192 total in 1 blocks; 1616 free (0 chunks); 6576 used
   RegexpCacheMemoryContext: 1024 total in 1 blocks; 760 free (0 chunks); 264 used
     RegexpMemoryContext: 44032 total in 5 blocks; 36400 free (17 chunks); 7632 used: ^(test_runs_raw)$
   Btree proof lookup cache: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used
   Operator lookup cache: 24576 total in 2 blocks; 10792 free (3 chunks); 13784 used
   TableSpace cache: 8192 total in 1 blocks; 2128 free (0 chunks); 6064 used
   Type information cache: 24368 total in 2 blocks; 2648 free (0 chunks); 21720 used
   RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
   MessageContext: 16777216 total in 12 blocks; 2924704 free (11 chunks); 13852512 used
     partition directory: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used
   Operator class cache: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used
   smgr relation table: 524288 total in 7 blocks; 231192 free (28 chunks); 293096 used
   PgStat Shared Ref Hash: 50224 total in 2 blocks; 688 free (0 chunks); 49536 used
   PgStat Shared Ref: 90112 total in 14 blocks; 3648 free (14 chunks); 86464 used
   PgStat Pending: 294912 total in 39 blocks; 193864 free (788 chunks); 101048 used
   TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks); 264 used
   Portal hash: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used
   TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used
     PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used: <unnamed>
       ExecutorState: 8462432 total in 13 blocks; 3316568 free (9 chunks); 5145864 used
         HashAgg meta context: 8192 total in 1 blocks; 7720 free (0 chunks); 472 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         HashAgg meta context: 100671536 total in 2 blocks; 5528 free (0 chunks); 100666008 used
           ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         HashAgg meta context: 201334832 total in 2 blocks; 5528 free (0 chunks); 201329304 used
           ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
         ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
[...]
         2907 more child contexts containing 13752609344 total in 3787 blocks; 25715912 free (0 chunks); 13726893432 used
-- NOTE: the total number above is the culprit for the OOM.

   Relcache by OID: 131072 total in 5 blocks; 38744 free (13 chunks); 92328 used
   CacheMemoryContext: 47147632 total in 599 blocks; 7315600 free (2 chunks); 39832032 used
     relation rules: 16384 total in 5 blocks; 2272 free (1 chunks); 14112 used: pg_settings
     index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_trigger_oid_index
     index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_trigger_tgconstraint_index
     index info: 1024 total in 1 blocks; 40 free (0 chunks); 984 used: pg_publication_rel_prpubid_index
     index info: 2048 total in 2 blocks; 728 free (2 chunks); 1320 used: pg_policy_polrelid_polname_index
     index info: 1024 total in 1 blocks; 72 free (0 chunks); 952 used: pg_policy_oid_index
     relation rules: 16384 total in 5 blocks; 7144 free (1 chunks); 9240 used: pg_roles
     index info: 2048 total in 2 blocks; 544 free (1 chunks); 1504 used: pg_inherits_relid_seqno_index
     index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_tablespace_spcname_index
     index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_constraint_contypid_index
     index info: 2048 total in 2 blocks; 688 free (1 chunks); 1360 used: pg_constraint_conname_nsp_index
     index info: 2048 total in 2 blocks; 912 free (0 chunks); 1136 used: pg_constraint_conparentid_index
     index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_attrdef_oid_index
     index info: 2048 total in 2 blocks; 688 free (1 chunks); 1360 used: pg_attrdef_adrelid_adnum_index
     index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg_am_oid_index
     index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_am_name_index
     index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_rewrite_oid_index
     CachedPlanSource: 4096 total in 3 blocks; 688 free (0 chunks); 3408 used: SELECT * FROM pg_catalog.pg_rewrite WHERE ev_class = $1 AND rulename = $2
       CachedPlanQuery: 8192 total in 4 blocks; 4016 free (1 chunks); 4176 used
     SPI Plan: 1024 total in 1 blocks; 600 free (0 chunks); 424 used
     index info: 2048 total in 2 blocks; 656 free (2 chunks); 1392 used: pg_class_tblspc_relfilenode_index
     index info: 2048 total in 2 blocks; 616 free (1 chunks); 1432 used: pg_toast_2619_index
     index info: 2048 total in 2 blocks; 728 free (2 chunks); 1320 used: test_runs_raw__part_max20000k_pkey
[...]
     1017 more child contexts containing 2142696 total in 2014 blocks; 684544 free (1928 chunks); 1458152 used
   WAL record construction: 50200 total in 2 blocks; 6376 free (0 chunks); 43824 used
   PrivateRefCount: 8192 total in 1 blocks; 2648 free (0 chunks); 5544 used
   MdSmgr: 131072 total in 5 blocks; 22552 free (1 chunks); 108520 used
   LOCALLOCK hash: 262144 total in 6 blocks; 59376 free (23 chunks); 202768 used
   GUCMemoryContext: 24576 total in 2 blocks; 9440 free (4 chunks); 15136 used
     GUC hash table: 32768 total in 3 blocks; 12704 free (5 chunks); 20064 used
   Timezones: 104112 total in 2 blocks; 2648 free (0 chunks); 101464 used
   ErrorContext: 8192 total in 1 blocks; 7928 free (2 chunks); 264 used
Grand total: 16246728816 bytes in 6875 blocks; 41833616 free (3045 chunks); 16204895200 used



>
> Also, see the warning about execution time memory in [2].
>
>> NOTE: after having written the above message, it occurred to me that I have
>> enable_partitionwise_aggregate=on. Turning it off fixes the issue and
>> makes the query faster too! Expected behaviour or bug?
>
> enable_partitionwise_aggregate=on causes this query to perform an
> aggregate per partition.  If your GROUP BY clause values are
> distributed evenly throughout all partitions then you might find it's
> not much slower to execute the query with
> enable_partitionwise_aggregate=off.

The values of the first GROUP BY column (workitem_n) are not evenly
distributed; in fact the partitions are split on that column. The rest of
the columns indeed are.

Thank you, I'll turn off this flag globally to avoid such behaviour.
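
(Concretely, something like the following, to persist it server-wide:)

ALTER SYSTEM SET enable_partitionwise_aggregate = off;
SELECT pg_reload_conf();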

>
> It's understandable that how PostgreSQL uses work_mem isn't ideal
> here, but unfortunately, that's the state of affairs, currently. You
> might want to reconsider your enable_partitionwise_aggregate setting
> and/or how many partitions you have.

Having wasted a long time on this, the minimum I can do is submit a
documentation patch. Under enable_partitionwise_aggregate, something like
"WARNING: it can increase the memory usage by at least
n_partitions * work_mem". How do I move forward with such a patch? Pointers
would be appreciated. :-)


Thank you,
Dimitris


> [1] https://www.postgresql.org/docs/current/runtime-config-resource.html
> [2] https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-BEST-PRACTICES
>



Re: Query on partitioned table needs memory n_partitions * work_mem

From:
Tom Lane
Date:
Dimitrios Apostolou <jimis@gmx.net> writes:
> The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n).

So don't do that.  Adding partitions is not cost-free.

            regards, tom lane



Re: Query on partitioned table needs memory n_partitions * work_mem

From:
Dimitrios Apostolou
Date:
On Thu, 11 Jul 2024, Tom Lane wrote:

> Dimitrios Apostolou <jimis@gmx.net> writes:
>> The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n).
>
> So don't do that.  Adding partitions is not cost-free.
>

I understand that, they also add an administrative cost that I'd rather
avoid. But I ended up adding all these partitions because of performance
issues on a multi-billion rows table. There is probably some message from
me on this list a couple of years ago.

At the moment I have a work-around. I'm thankful that everyone is willing
to provide workarounds to all potential issues/bugs I have presented, but
unfortunately workarounds are not fixes, one will hit the same wall again
at some point.

My current concern is **reporting my findings responsibly**. I want to
provide as much data needed to pinpoint the issue, so that the developers
know exactly what's going on. Having right data is half the fix.

A way to track the issue would be nice. I might revisit it and even try to
submit a patch. I wonder how the postgres development community is
tracking all these issues, I've even started forgetting the ones I have
found, and I'm sure I have previously reported (on this list) a couple of
should-be-easy issues that would be ideal for beginners.


Regards,
Dimitris




Re: Query on partitioned table needs memory n_partitions * work_mem

From:
"David G. Johnston"
Date:
On Thursday, July 11, 2024, Dimitrios Apostolou <jimis@gmx.net> wrote:
> I wonder how the postgres development community is
> tracking all these issues, I've even started forgetting the ones I have
> found, and I'm sure I have previously reported (on this list) a couple of
> should-be-easy issues that would be ideal for beginners.

David J.

Re: Query on partitioned table needs memory n_partitions * work_mem

From:
David Rowley
Date:
On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou <jimis@gmx.net> wrote:
> I can't help but see this as a bug. I see many issues:
>
> * postgres is not reading from partitions in parallel, but one after the
>    other. It shouldn't need all this memory simultaneously.

I don't know for Hash Aggregate, but for nodes like Sort, we still
hold onto the tuplestore after returning the last tuple as a rescan
might want to read those tuples again. There's also a mark/restore
that might want to rewind a little to match up to the next outer tuple
of a Merge Join.

It might be possible to let go of the memory sooner in plans when
returning the final tuple means we'll never need the memory again, but
that would require figuring out all the cases where that could happen
and ensuring we don't ever release memory when it's required again.

> * The memory is unnecessarily allocated early on, before any partitions
>    are actually aggregated. I know this because I/O is slow on this device
>    and the table sizes are huge, it's simply not possible that postgres
>    went through all partitions and blew up the memory. That would take
>    hours, but the OOM happens seconds after I start the query.

That's interesting. Certainly, there is some memory allocated during
executor startup, but that amount should be fairly small.  Are you
able to provide a self-contained test case that shows the memory
blowing up before execution begins?

> Having wasted a long time on this, the minimum I can do is submit a
> documentation patch. Under enable_partitionwise_aggregate, something like
> "WARNING: it can increase the memory usage by at least
> n_partitions * work_mem". How do I move forward with such a patch? Pointers
> would be appreciated. :-)

I think mentioning something about this in enable_partitionwise_join
and enable_partitionwise_aggregate is probably wise. I'll propose a
patch on pgsql-hackers.

David



Re: Query on partitioned table needs memory n_partitions * work_mem

From:
Dimitrios Apostolou
Date:
On Thu, 18 Jul 2024, David Rowley wrote:

> On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou <jimis@gmx.net> wrote:
>
>> * The memory is unnecessarily allocated early on, before any partitions
>>    are actually aggregated. I know this because I/O is slow on this device
>>    and the table sizes are huge, it's simply not possible that postgres
>>    went through all partitions and blew up the memory. That would take
>>    hours, but the OOM happens seconds after I start the query.
>
> That's interesting. Certainly, there is some memory allocated during
> executor startup, but that amount should be fairly small.  Are you
> able to provide a self-contained test case that shows the memory
> blowing up before execution begins?

I'm trying hard to create a self-contained way to reproduce the issue.
It's not easy; the behaviour is a bit unstable. So far I see high memory
usage (8-10GB) but not enough to trigger the OOM killer (it actually OOM'ed
once, but it was so uncontrollable that I didn't manage to measure it and
reproduce it again later).



-- I create a table with 2k partitions.

CREATE TABLE   partitioned_table1(
     run_n       bigint   GENERATED ALWAYS AS IDENTITY,
     workitem_n  integer  NOT NULL,
     label_n     smallint,
     result      smallint NOT NULL,
     PRIMARY KEY(workitem_n, run_n)
) PARTITION BY RANGE(workitem_n);

DO $$
for i in range(0, 2000):
     stmt = f'''
         CREATE TABLE part_max{i+1}M
             PARTITION OF partitioned_table1
             FOR VALUES FROM ({i*1000*1000}) TO ({(i+1) * 1000*1000})
     '''
     plpy.execute(stmt)
$$ LANGUAGE plpython3u;


-- I insert random data. First I insert to all partitions, 1M rows each:

DO $$
for i in range(0, 2000):
     stmt = f'''
         INSERT INTO partitioned_table1(workitem_n, label_n, result)
             SELECT
                     j-j%4,
                     CAST(random()*1000 AS INTEGER),
                     CAST(random()*3    AS INTEGER)
                 FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j
     '''
     plpy.info(stmt)
     plpy.execute(stmt)
     plpy.commit()
$$ LANGUAGE plpython3u;


-- Disable parallel execution and group aggregate:

SET SESSION max_parallel_workers_per_gather TO 0;
SET SESSION enable_incremental_sort TO off;

SET SESSION work_mem TO '8MB';


-- Now the following query should do a HashAggregate:

SELECT
      workitem_n, label_n, bool_or(result IN (2,3))
FROM
      partitioned_table1
GROUP BY
      workitem_n, label_n
LIMIT 10;
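
-- (Aside: to watch the backend's RSS from the OS side while the query runs,
-- I grab the backend's pid first and follow it with top/ps.)
SELECT pg_backend_pid();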


-- How much was the RSS of the backend while the previous query was
-- running? Not that high. But if we insert some million rows to the
-- 1st partition, then it will be much higher.

DO $$
for i in range(0,2000):
     stmt = f'''
         INSERT INTO partitioned_table1(workitem_n, label_n, result)
             SELECT
                     j%1000000,
                     CAST(random()*20000 AS INTEGER),
                     CAST(random()*4    AS INTEGER)
                 FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j
     '''
     plpy.info(stmt)
     plpy.execute(stmt)
     plpy.commit()
$$ LANGUAGE plpython3u;


-- Now that same previous query consumes between 8GB and 10GB RSS. The
-- more data I insert (to all partitions?), the more memory the query
-- takes.


Overall:

* I don't see the RSS memory usage (8GB) growing proportionally to work_mem
   as I expected. If I increase work_mem from 4MB to 8MB then I see double the
   RSS usage (from ~4GB to ~8GB), but if I increase it further the
   difference is minuscule and no OOM happens.

* Instead I notice RSS memory usage growing slowly while I insert more and more data
   to the table (especially into the 1st partition I think).

* Finally I don't see the memory being freed by the backend after the
   SELECT finishes. The system is relieved only when I disconnect psql and
   the backend dies. Not sure if that's by design or not; see the sketch below.
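
Related to that last point: from the same session, after the SELECT returns,
the backend's own memory contexts can be listed (pg_backend_memory_contexts
view, available since PostgreSQL 14); a minimal sketch:

SELECT name, ident, pg_size_pretty(total_bytes) AS total
FROM pg_backend_memory_contexts
ORDER BY total_bytes DESC
LIMIT 10;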

>
>> Having wasted a long time on this, the minimum I can do is submit a
>> documentation patch. Under enable_partitionwise_aggregate, something like
>> "WARNING: it can increase the memory usage by at least
>> n_partitions * work_mem". How do I move forward with such a patch? Pointers
>> would be appreciated. :-)
>
> I think mentioning something about this in enable_partitionwise_join
> and enable_partitionwise_aggregate is probably wise. I'll propose a
> patch on pgsql-hackers.

David and Ashutosh, thank you both for your interest in improving the
documentation. Unfortunately I'm no longer sure what exactly is going on
here; I don't understand how the memory is growing. One thing
I can verify is that it's definitely caused by partitioning: I have
another similar huge table but unpartitioned, and no such issues show up.
Maybe someone with knowledge of the HashAggregate algorithm and
partitioning can throw some ideas in.


Regards,
Dimitris




Re: Query on partitioned table needs memory n_partitions * work_mem

From:
Dimitrios Apostolou
Date:
OK, I reproduced the OOM, with only 200 partitions as opposed to the 2K I
mentioned before. I believe the keys to reproducing it are:

1. Write millions of rows to *all* partitions
2. Run ANALYSE so that the planner knows about the sizes


Here are the corrected steps. Let me know if you manage to reproduce.
Kind reminder to set vm.overcommit_memory to 2 before you lose your
system, as happened to me. :-)


CREATE TABLE   partitioned_table1(
     run_n       bigint   GENERATED ALWAYS AS IDENTITY,
     workitem_n  integer  NOT NULL,
     label_n     smallint,
     result      smallint NOT NULL,
     PRIMARY KEY(workitem_n, run_n)
) PARTITION BY RANGE(workitem_n);

-- Create 200 partitions

DO $$
for i in range(0, 200):
     start = i     * 10 * 1000 * 1000
     end   = (i+1) * 10 * 1000 * 1000
     stmt = f'''
         CREATE TABLE part{i}
             PARTITION OF partitioned_table1
             FOR VALUES FROM ({start}) TO ({end})
     '''
     plpy.execute(stmt)
$$ LANGUAGE plpython3u;

-- Insert 20M rows per partition. This will take a while but I don't know
-- of a way to speed it up.

DO $$
for i in range(0, 2000):
     stmt = f'''
         INSERT INTO partitioned_table1(workitem_n, label_n, result)
             SELECT
                     j,
                     CAST(random()*1000 AS INTEGER),
                     CAST(random()*4    AS INTEGER)
                 FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j
     '''
     plpy.info(stmt)
     plpy.execute(stmt)
     plpy.commit()
$$ LANGUAGE plpython3u;

-- Important to analyse! and set the right settings...

ANALYSE partitioned_table1;
SET SESSION max_parallel_workers_per_gather TO 0;
SET SESSION enable_incremental_sort         TO off;
SET SESSION enable_partitionwise_aggregate  TO on;
SET SESSION enable_partitionwise_join       TO on;

-- And then reproduce the issue:

\timing on
SET SESSION work_mem TO '4MB';

SELECT
       workitem_n, label_n, bool_or(result IN (2,3))
FROM
       partitioned_table1
GROUP BY
       workitem_n, label_n
LIMIT 10;


Repeat that last query while doubling work_mem every time. You'll see
the backend process growing to an immense size. In my case, my 16GB of RAM
were exhausted once work_mem was set to '128MB'.
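
(I.e. something along these lines, re-running the same statement each time:)

SET work_mem TO '8MB';
SELECT workitem_n, label_n, bool_or(result IN (2,3))
FROM partitioned_table1 GROUP BY workitem_n, label_n LIMIT 10;

SET work_mem TO '16MB';
SELECT workitem_n, label_n, bool_or(result IN (2,3))
FROM partitioned_table1 GROUP BY workitem_n, label_n LIMIT 10;

-- ... keep doubling work_mem and repeating the SELECT ...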

Remarkably, the fastest execution is the one with the smallest work_mem,
where HashAggregate splits the work into many batches ("partitions" of the
hash table). As work_mem grows, execution becomes slower and slower (as long
as it stays within the limits of RAM, of course; once it goes to swap, all
measurements are off).

Let me know if you manage to reproduce the issue!

Dimitris