Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker

From Craig Milhiser
Subject Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date
Msg-id CA+wnhO2sHcUOfi6_gvaoPMk=2DsKUOWkDjdUtGxJOguQOXDvOQ@mail.gmail.com
In response to Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker  (Andrei Lepikhov <lepihov@gmail.com>)
Responses Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
List pgsql-bugs
On Oct 8, Andrei Lepikhov wrote:
> I'm a bit confused: the thread subject named ' invalid DSA memory alloc
> request size ...', but you write about issue with OOM killer. It is two
> different issues, which one do you have exactly?


I started with the invalid DSA memory alloc error. I was asked to try the experimental patch referenced above. Then, with the patch applied, I got an OOM, but only when running in parallel. As you will see below, there was an OOM, but I do not believe it is the query.

Thanks for the push on the OOM. I should have run this test earlier.

v17.0 with an out-of-the-box Postgres configuration.
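
For reference, the settings most relevant to memory use and parallelism were all left at their defaults. A quick way to confirm that (just a sketch; these are stock settings and catalogs, nothing specific to my setup):

select name, setting, unit
from pg_settings
where name in ('work_mem', 'hash_mem_multiplier', 'max_parallel_workers_per_gather');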

I ran a new test on an instance with 512 GiB of memory. After I applied the patch, the invalid DSA memory allocation error was not reproduced. With max_parallel_workers_per_gather = 0, the query took ~9.5 seconds and used < 1 GiB of memory. With max_parallel_workers_per_gather = 2, the query used ~170 GiB of memory, wrote ~70 GB of temp files, and ran for more than 1 hour until I ran out of disk space.

With the patch, I moved from an invalid DSA memory allocation request of ~2 GB to using ~170 GiB of RAM and 70+ GB of temp files, but only when using 2 parallel workers per gather.

The new test:

This morning I increased the machine size from 32 GiB to 512 GiB of RAM.

With the patch applied and max_parallel_workers_per_gather = 0, the query completed in ~9.5 seconds at steady state. While it was running I captured memory usage. I had run the query a few times beforehand to get the buffers loaded.

                total        used        free      shared  buff/cache   available
Mem:           493Gi       3.5Gi       484Gi       142Mi       8.4Gi       489Gi
Swap:             0B          0B          0B

With the patch applied and max_parallel_workers_per_gather = 2, the query ran for more than 1 hour. During that time, memory settled at:
               total        used        free      shared  buff/cache   available
Mem:           493Gi       178Gi       209Gi       1.9Gi       110Gi       314Gi
Swap:             0B          0B          0B

Then the machine ran out of disk space: ERROR:  could not write to file "base/pgsql_tmp/pgsql_tmp4942.1.fileset/o1859485of2097152.p0.0": No space left on device
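
For anyone reproducing this, the temp-file spill can be watched from another session while the query runs; a rough sketch using the stock statistics view:

select datname, temp_files, pg_size_pretty(temp_bytes) as temp_bytes
from pg_stat_database
where datname = current_database();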

I captured top as well during the run.
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4951 postgres  20   0   46.8g  45.5g   1.9g D   5.6   9.2   2:40.40 postgres
   4942 postgres  20   0   68.9g  65.6g   1.9g D   5.3  13.3   3:25.35 postgres
   4952 postgres  20   0   68.4g  65.2g   1.9g D   5.3  13.2   3:07.43 postgres
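
For completeness, those PIDs can be tied back to the leader and its parallel workers with something like the following (a sketch; leader_pid is a standard pg_stat_activity column):

select pid, leader_pid, backend_type, state
from pg_stat_activity
where pid in (4942, 4951, 4952);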

After rebooting:
 df -H
Filesystem       Size  Used Avail Use% Mounted on
/dev/root        266G  197G   70G  74% /

As you mentioned, there are string aggregations. I ran with parallel = 0 and did some analysis. The aggregations do not seem to be producing anything out of line.

select max(length(groups)), sum(length(groups)), max(length(groupnames)), sum(length(groupnames)) from milhiser_test;
 max |   sum   | max |   sum
-----+---------+-----+----------
 143 | 6557620 | 499 | 22790616
(1 row)
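
To double-check that no single group's aggregated string is outsized (rather than just the totals above), something like the following could be run; group_key is a placeholder, since the real grouping columns are redacted:

select group_key,
       sum(length(groups)) as groups_len,
       sum(length(groupnames)) as groupnames_len
from milhiser_test
group by group_key
order by groupnames_len desc
limit 10;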

Perhaps this is a different problem from the "invalid DSA memory alloc" error. The patch might have addressed that problem, and this is a separate issue. Going from < 1 GiB to ~170 GiB of memory plus ~70 GB of temp files when moving from parallel = 0 to 2 suggests something is off.


Summary before this test:
Before the patch linked above, I was receiving "ERROR:  invalid DSA memory alloc request size 1879048192" when I ran the query with max_parallel_workers_per_gather = 2. 

Before the patch, with max_parallel_workers_per_gather = 0, the query worked in ~10 seconds at steady state.

I applied the patch to the v17.0 source, rebuilt, and the tests passed.

With max_parallel_workers_per_gather = 0, the query worked in ~9.5 seconds at steady state and used < 1 GiB of memory.

With max_parallel_workers_per_gather = 2, the machine ran out of memory. That was a 32 GiB machine; when running without parallelism, ~30 GiB of memory was free.

Thanks

On Tue, Oct 8, 2024 at 5:16 AM Andrei Lepikhov <lepihov@gmail.com> wrote:
On 10/7/24 18:42, Craig Milhiser wrote:
>
>     On Oct 1, 2024 Andrei Lepikhov wrote
>
>  > Can you provide an explain of this query?
>
> Apologies for the delay. I have been travelling since Wednesday night.
> Thanks for your help and time with this issue.
>
> Below is the query, with specific values redacted. An explain with
> max_parallel_workers_per_gather = 2 and explain analyze
> max_parallel_workers_per_gather = 0.
I'm a bit confused: the thread subject named ' invalid DSA memory alloc
request size ...', but you write about issue with OOM killer. It is two
different issues, which one do you have exactly?

OOM killer can be explained easily, because I see huge string_agg
aggregate - workers can utilise memory more intensively. For now,
explain of an Aggregate node don't show information about factual sort
operation of each aggregate and memory consumption.

--
regards, Andrei Lepikhov
