Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

From Craig Milhiser
Subject Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date
Msg-id CA+wnhO2sHcUOfi6_gvaoPMk=2DsKUOWkDjdUtGxJOguQOXDvOQ@mail.gmail.com
In response to Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker  (Andrei Lepikhov <lepihov@gmail.com>)
Responses Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
List pgsql-bugs
On Oct 8, Andrei Lepikhov wrote:
> I'm a bit confused: the thread subject named ' invalid DSA memory alloc
> request size ...', but you write about issue with OOM killer. It is two
> different issues, which one do you have exactly?


I started with the invalid DSA memory alloc error. I was asked to try the experimental patch referenced above. Then, with the patch applied, I got an OOM, but only when running in parallel. As you will see below, there was an OOM, but I do not believe it is the query.

Thanks for the push on the OOM. I should have run this test earlier.

v17.0 with an out-of-the-box Postgres configuration.
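
For reference, the settings most relevant to memory use and parallelism were all left at their defaults. A quick way to confirm that (just a sketch; these are stock settings and catalogs, nothing specific to my setup):

select name, setting, unit
from pg_settings
where name in ('work_mem', 'hash_mem_multiplier', 'max_parallel_workers_per_gather');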

I ran a new test on an instance with 512 GiB of memory. After I applied the patch, the invalid DSA memory allocation error was not reproduced. With max_parallel_workers_per_gather = 0, the query took ~9.5 seconds and used < 1 GiB of memory. With max_parallel_workers_per_gather = 2, the query used ~170 GiB of memory, wrote ~70 GB of temp files, and ran for more than 1 hour until I ran out of disk space.

With the patch, I moved from an invalid DSA memory allocation request of ~2 GB to using ~170 GiB of RAM and 70+ GB of temp files, but only when using 2 parallel workers per gather.

The new test:

This morning I increased the machine size from 32 GiB to 512 GiB of RAM.

With the patch applied and max_parallel_workers_per_gather = 0, the query completed in ~9.5 seconds at steady state. While it was running I captured memory usage. I had run the query a few times beforehand to get the buffers loaded.

                total        used        free      shared  buff/cache   available
Mem:           493Gi       3.5Gi       484Gi       142Mi       8.4Gi       489Gi
Swap:             0B          0B          0B

With the patch applied and max_parallel_workers_per_gather = 2, the query ran for more than 1 hour. During that time, memory settled at:
               total        used        free      shared  buff/cache   available
Mem:           493Gi       178Gi       209Gi       1.9Gi       110Gi       314Gi
Swap:             0B          0B          0B

Then the machine ran out of disk space: ERROR:  could not write to file "base/pgsql_tmp/pgsql_tmp4942.1.fileset/o1859485of2097152.p0.0": No space left on device
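
For anyone reproducing this, the temp-file spill can be watched from another session while the query runs; a rough sketch using the stock statistics view:

select datname, temp_files, pg_size_pretty(temp_bytes) as temp_bytes
from pg_stat_database
where datname = current_database();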

I captured top as well during the run.
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4951 postgres  20   0   46.8g  45.5g   1.9g D   5.6   9.2   2:40.40 postgres
   4942 postgres  20   0   68.9g  65.6g   1.9g D   5.3  13.3   3:25.35 postgres
   4952 postgres  20   0   68.4g  65.2g   1.9g D   5.3  13.2   3:07.43 postgres
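
For completeness, those PIDs can be tied back to the leader and its parallel workers with something like the following (a sketch; leader_pid is a standard pg_stat_activity column):

select pid, leader_pid, backend_type, state
from pg_stat_activity
where pid in (4942, 4951, 4952);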

After rebooting:
 df -H
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        266G  197G   70G  74% /

As you mentioned, there are string aggregations. I ran with parallel = 0 and did some analysis. The aggregations do not seem to be producing anything out of line.

select max(length(groups)), sum(length(groups)), max(length(groupnames)), sum(length(groupnames)) from milhiser_test;
 max |   sum   | max |   sum
-----+---------+-----+----------
 143 | 6557620 | 499 | 22790616
(1 row)
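
To double-check that no single group's aggregated string is outsized (rather than just the totals above), something like the following could be run; group_key is a placeholder, since the real grouping columns are redacted:

select group_key,
       sum(length(groups)) as groups_len,
       sum(length(groupnames)) as groupnames_len
from milhiser_test
group by group_key
order by groupnames_len desc
limit 10;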

Perhaps this is a different problem from the "invalid DSA memory alloc" error. The patch might have addressed that problem, and this is a separate issue. Going from < 1 GiB to ~170 GiB of memory plus ~70 GB of temp files when moving from parallel = 0 to 2 suggests something is off.


Summary before this test:
Before the patch linked above, I was receiving "ERROR:  invalid DSA memory alloc request size 1879048192" when I ran the query with max_parallel_workers_per_gather = 2. 

Before the patch, with max_parallel_workers_per_gather = 0, the query worked in ~10 seconds at steady state.

I applied the patch to the v17.0 source, rebuilt, and the tests passed.

With max_parallel_workers_per_gather = 0, the query worked in ~9.5 seconds at steady state and used < 1 GiB of memory.

With max_parallel_workers_per_gather = 2, the machine ran out of memory. That was a 32 GiB machine; when running without parallelism, ~30 GiB of memory was free.

Thanks

On Tue, Oct 8, 2024 at 5:16 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 10/7/24 18:42, Craig Milhiser wrote:
>
>     On Oct 1, 2024 Andrei Lepikhov wrote
>
>  > Can you provide an explain of this query?
>
> Apologies for the delay. I have been travelling since Wednesday night.
> Thanks for your help and time with this issue.
>
> Below is the query, with specific values redacted. An explain with
> max_parallel_workers_per_gather = 2 and explain analyze
> max_parallel_workers_per_gather = 0.
I'm a bit confused: the thread subject named ' invalid DSA memory alloc
request size ...', but you write about issue with OOM killer. It is two
different issues, which one do you have exactly?

OOM killer can be explained easily, because I see huge string_agg
aggregate - workers can utilise memory more intensively. For now,
explain of an Aggregate node don't show information about factual sort
operation of each aggregate and memory consumption.

--
regards, Andrei Lepikhov
