Thread: PG18 GIN parallel index build crash - invalid memory alloc request size
Testing PostgreSQL 18.0 on Debian from PGDG repo: 18.0-1.pgdg12+3 with PostGIS 3.6.0+dfsg-2.pgdg12+1. Running the osm2pgsql workload to load the entire OSM Planet data set in my home lab system.
I found a weird crash in the recently adjusted parallel GIN index build code. Two parallel workers spawn, one of them crashes, and then everything terminates. This is one of the last steps in OSM loading, and I can reproduce it just by running the one statement again:
gis=# CREATE INDEX ON "public"."planet_osm_polygon" USING GIN (tags);
ERROR: invalid memory alloc request size 1113001620
I see that this area of the code was being triaged during early beta time back in May, so it may need another round.
The table is 215 GB. The server has 128GB and only 1/3 of it is nailed down, so there's plenty of RAM available.
Settings include:
work_mem=1GB
maintenance_work_mem=20GB
shared_buffers=48GB
max_parallel_workers_per_gather = 8
Log files show a number of similarly big allocations succeeding before that point; here's an example:
LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp161831.0.fileset/0.1", size 1073741824
STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING BTREE (osm_id)
ERROR: invalid memory alloc request size 1137667788
STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING GIN (tags)
CONTEXT: parallel worker
And another one to show the size at crash is a little different each time:
ERROR: Database error: ERROR: invalid memory alloc request size 1115943018
Hooked into the error message and it gave this stack trace:
#0 errfinish (filename=0x5646de247420 "./build/../src/backend/utils/mmgr/mcxt.c",
lineno=1174, funcname=0x5646de2477d0 <__func__.3> "MemoryContextSizeFailure")
at ./build/../src/backend/utils/error/elog.c:476
#1 0x00005646ddb4ae9c in MemoryContextSizeFailure (
context=context@entry=0x56471ce98c90, size=size@entry=1136261136,
flags=flags@entry=0) at ./build/../src/backend/utils/mmgr/mcxt.c:1174
#2 0x00005646de05898d in MemoryContextCheckSize (flags=0, size=1136261136,
context=0x56471ce98c90) at ./build/../src/include/utils/memutils_internal.h:172
#3 MemoryContextCheckSize (flags=0, size=1136261136, context=0x56471ce98c90)
at ./build/../src/include/utils/memutils_internal.h:167
#4 AllocSetRealloc (pointer=0x7f34f558b040, size=1136261136, flags=0)
at ./build/../src/backend/utils/mmgr/aset.c:1203
#5 0x00005646ddb701c8 in GinBufferStoreTuple (buffer=0x56471cee0d10,
tup=0x7f34dfdd2030) at ./build/../src/backend/access/gin/gininsert.c:1497
#6 0x00005646ddb70503 in _gin_process_worker_data (progress=<optimized out>,
worker_sort=0x56471cf13638, state=0x7ffc288b0200)
at ./build/../src/backend/access/gin/gininsert.c:1926
#7 _gin_parallel_scan_and_build (state=state@entry=0x7ffc288b0200,
ginshared=ginshared@entry=0x7f4168a5d360,
sharedsort=sharedsort@entry=0x7f4168a5d300, heap=heap@entry=0x7f41686e5280,
index=index@entry=0x7f41686e4738, sortmem=<optimized out>,
progress=<optimized out>) at ./build/../src/backend/access/gin/gininsert.c:2046
#8 0x00005646ddb71ebf in _gin_parallel_build_main (seg=<optimized out>,
toc=0x7f4168a5d000) at ./build/../src/backend/access/gin/gininsert.c:2159
#9 0x00005646ddbdf882 in ParallelWorkerMain (main_arg=<optimized out>)
at ./build/../src/backend/access/transam/parallel.c:1563
#10 0x00005646dde40670 in BackgroundWorkerMain (startup_data=<optimized out>,
startup_data_len=<optimized out>)
at ./build/../src/backend/postmaster/bgworker.c:843
#11 0x00005646dde42a45 in postmaster_child_launch (
child_type=child_type@entry=B_BG_WORKER, child_slot=320,
startup_data=startup_data@entry=0x56471cdbc8f8,
startup_data_len=startup_data_len@entry=1472, client_sock=client_sock@entry=0x0)
at ./build/../src/backend/postmaster/launch_backend.c:290
#12 0x00005646dde44265 in StartBackgroundWorker (rw=0x56471cdbc8f8)
at ./build/../src/backend/postmaster/postmaster.c:4157
#13 maybe_start_bgworkers () at ./build/../src/backend/postmaster/postmaster.c:4323
#14 0x00005646dde45b13 in LaunchMissingBackgroundProcesses ()
at ./build/../src/backend/postmaster/postmaster.c:3397
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1717
#16 0x00005646dde47f6d in PostmasterMain (argc=argc@entry=5,
argv=argv@entry=0x56471cd66dc0)
at ./build/../src/backend/postmaster/postmaster.c:1400
#17 0x00005646ddb4d56c in main (argc=5, argv=0x56471cd66dc0)
at ./build/../src/backend/main/main.c:227
I've frozen my testing at the spot where I can reproduce the problem. I was going to try dropping m_w_m next and turning off the parallel execution. I didn't want to touch anything until after asking if there's more data that should be collected from a crashing instance.
--
Greg Smith, Software Engineering
Snowflake - Where Data Does More
gregory.smith@snowflake.com
Hi,

On 10/24/25 05:03, Gregory Smith wrote:
> Testing PostgreSQL 18.0 on Debian from PGDG repo: 18.0-1.pgdg12+3 with
> PostGIS 3.6.0+dfsg-2.pgdg12+1. Running the osm2pgsql workload to load
> the entire OSM Planet data set in my home lab system.
>
> ...
>
> gis=# CREATE INDEX ON "public"."planet_osm_polygon" USING GIN (tags);
> ERROR: invalid memory alloc request size 1113001620
>
> ...
>
> Settings include:
> work_mem=1GB
> maintenance_work_mem=20GB
> shared_buffers=48GB
> max_parallel_workers_per_gather = 8

Hmmm, I wonder if the m_w_m is high enough to confuse the trimming logic
in some way. Can you try if using smaller m_w_m (maybe 128MB-256MB)
makes the issue go away?

> Log files show a number of similarly big allocations succeeding before
> that point; here's an example:
>
> LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp161831.0.fileset/0.1", size 1073741824
> STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING BTREE (osm_id)
> ERROR: invalid memory alloc request size 1137667788
> STATEMENT: CREATE INDEX ON "public"."planet_osm_polygon" USING GIN (tags)
> CONTEXT: parallel worker

But that btree allocation is exactly 1GB, which is the palloc limit. And
IIRC the tuplesort code is doing palloc_huge, so that's probably why it
works fine. While the GIN code does a plain repalloc(), so it's subject
to the MaxAllocSize limit.

> And another one to show the size at crash is a little different each time:
> ERROR: Database error: ERROR: invalid memory alloc request size 1115943018
>
> Hooked into the error message and it gave this stack trace:
>
> ...
> #4 AllocSetRealloc (pointer=0x7f34f558b040, size=1136261136, flags=0)
>     at ./build/../src/backend/utils/mmgr/aset.c:1203
> #5 0x00005646ddb701c8 in GinBufferStoreTuple (buffer=0x56471cee0d10,
>     tup=0x7f34dfdd2030) at ./build/../src/backend/access/gin/gininsert.c:1497
> ...
>
> I've frozen my testing at the spot where I can reproduce the problem. I
> was going to try dropping m_w_m next and turning off the parallel
> execution. I didn't want to touch anything until after asking if
> there's more data that should be collected from a crashing instance.

Hmm, so it's failing on the repalloc in GinBufferStoreTuple(), which is
merging the "GinTuple" into an in-memory buffer. I'll take a closer look
once I get back from pgconf.eu, but I guess I failed to consider that
the "parts" may be large enough to exceed MaxAlloc.

The code tries to flush the "frozen" part of the TID lists, the part
that can no longer change, but I think with m_w_m this large it could
happen that the first two buffers are already too large (and the
trimming happens only after the fact).

Can you show the contents of buffer and tup? I'm especially interested
in these fields:

  buffer->nitems
  buffer->maxitems
  buffer->nfrozen
  tup->nitems

If I'm right, I think there are two ways to fix this:

(1) apply the trimming earlier, i.e. try to freeze + flush before
    actually merging the data (essentially, update nfrozen earlier)

(2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple

Or we probably should do both.

regards

--
Tomas Vondra
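For anyone following along at home, here is a minimal sketch, not the actual gininsert.c code, of the allocator distinction being drawn above and what fix (2) amounts to. The helper name grow_tid_buffer() is hypothetical; repalloc(), repalloc_huge(), MaxAllocSize and ItemPointerData are the real backend definitions.

#include "postgres.h"

#include "storage/itemptr.h"
#include "utils/memutils.h"

/*
 * Hypothetical helper: grow an in-memory TID array, switching to the
 * huge allocator once the request crosses the 1GB threshold.
 */
static ItemPointer
grow_tid_buffer(ItemPointer items, Size nitems)
{
    Size    newsize = nitems * sizeof(ItemPointerData);

    /*
     * Plain repalloc() raises "invalid memory alloc request size" for
     * anything above MaxAllocSize (0x3fffffff, just under 1GB), which
     * is what the parallel worker hit at ~1.1GB.  repalloc_huge() only
     * enforces the far larger MaxAllocHugeSize cap.
     */
    if (newsize > MaxAllocSize)
        return (ItemPointer) repalloc_huge(items, newsize);

    return (ItemPointer) repalloc(items, newsize);
}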
On Fri, Oct 24, 2025 at 8:38 AM Tomas Vondra <tomas@vondra.me> wrote:
Hmmm, I wonder if the m_w_m is high enough to confuse the trimming logic
in some way. Can you try if using smaller m_w_m (maybe 128MB-256MB)
makes the issue go away?
The index builds at up to 4GB of m_w_m. 5GB and above crashes.
Now that I know roughly where the limits are, that de-escalates things a bit. The sort of customers deploying a month after release should be OK with just knowing to be careful about high m_w_m settings on PG18 until a fix is ready.
Hope everyone is enjoying Latvia. My obscure music collection includes a band from there I used to see in the NYC area, The Quags; https://www.youtube.com/watch?v=Bg3P4736CxM
Can you show the contents of buffer and tup? I'm especially interested
in these fields:
buffer->nitems
buffer->maxitems
buffer->nfrozen
tup->nitems
I'll see if I can grab that data at the crash point.
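Here's roughly how that data could be grabbed with gdb attached to the parallel worker, the same way the backtrace above was captured. The exact frame number for GinBufferStoreTuple is an assumption, so check what bt reports:

(gdb) break MemoryContextSizeFailure
(gdb) continue
      ... rerun the CREATE INDEX and wait for the breakpoint ...
(gdb) bt
(gdb) frame 4
(gdb) print buffer->nitems
(gdb) print buffer->maxitems
(gdb) print buffer->nfrozen
(gdb) print tup->nitems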
FYI for anyone who wants to replicate this: if you have a system with 128GB+ of RAM you could probably recreate the test case. Just have to download the Planet file and run osm2pgsql with the overly tweaked settings I use. I've published all the details of how I run this regression test now:
Settings: https://github.com/gregs1104/pgbent/tree/main/conf/18/conf.d
Script setup: https://github.com/gregs1104/pgbent/blob/main/wl/osm-import
Test runner: https://github.com/gregs1104/pgbent/blob/main/util/osm-importer
Parse results: https://github.com/gregs1104/pgbent/blob/main/util/pgbench-init-parse
If I'm right, I think there are two ways to fix this:
(1) apply the trimming earlier, i.e. try to freeze + flush before
actually merging the data (essentially, update nfrozen earlier)
(2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple
Or we probably should do both.
Sounds like (2) is probably mandatory and (1) is good hygiene.
On 10/24/25 22:22, Gregory Smith wrote:
> On Fri, Oct 24, 2025 at 8:38 AM Tomas Vondra <tomas@vondra.me> wrote:
>
>> Hmmm, I wonder if the m_w_m is high enough to confuse the trimming logic
>> in some way. Can you try if using smaller m_w_m (maybe 128MB-256MB)
>> makes the issue go away?
>
> The index builds at up to 4GB of m_w_m. 5GB and above crashes.
>
> ...
>
> Hope everyone is enjoying Latvia. My obscure music collection includes
> a band from there I used to see in the NYC area, The Quags;
> https://www.youtube.com/watch?v=Bg3P4736CxM

Nice!

>> Can you show the contents of buffer and tup? I'm especially interested
>> in these fields:
>> buffer->nitems
>> buffer->maxitems
>> buffer->nfrozen
>> tup->nitems
>
> I'll see if I can grab that data at the crash point.
>
> FYI for anyone who wants to replicate this: if you have a system with
> 128GB+ of RAM you could probably recreate the test case. Just have to
> download the Planet file and run osm2pgsql with the overly tweaked
> settings I use. I've published all the details of how I run this
> regression test now:
>
> Settings: https://github.com/gregs1104/pgbent/tree/main/conf/18/conf.d
> Script setup: https://github.com/gregs1104/pgbent/blob/main/wl/osm-import
> Test runner: https://github.com/gregs1104/pgbent/blob/main/util/osm-importer
> Parse results: https://github.com/gregs1104/pgbent/blob/main/util/pgbench-init-parse

I did reproduce this using OSM, although I used different settings, but
that only affects loading. Setting maintenance_work_mem=20GB is more
than enough to trigger the error during parallel index build.

So I don't need the data.

>> If I'm right, I think there are two ways to fix this:
>> (1) apply the trimming earlier, i.e. try to freeze + flush before
>>     actually merging the data (essentially, update nfrozen earlier)
>> (2) use repalloc_huge (and palloc_huge) in GinBufferStoreTuple
>> Or we probably should do both.
>
> Sounds like (2) is probably mandatory and (1) is good hygiene.

Yes, (2) is mandatory to fix this, and it's also sufficient. See the
attached fix. I'll clean this up and push soon.

AFAICS (1) is not really needed. I was concerned we might end up with
each worker producing a TID buffer close to maintenance_work_mem, and
then the leader would have to use twice as much memory when merging.
But it turns out I already thought about that, and the workers use a
fair share of maintenance_work_mem, not a new limit. So they produce
smaller chunks, and those should not exceed maintenance_work_mem when
merging.

I tried "freezing" the existing buffer more eagerly (before merging the
tuple), but that made no difference. The workers produce data with a
lot of overlaps (simply because that's how the parallel builds divide
data), and the amount of trimmed data is tiny. Something like 10k TIDs
from a buffer of 1M TIDs. So a tiny difference, and it'd still fail.

I'm not against maybe experimenting with this, but it's going to be a
master-only thing, not for backpatching.

Maybe we should split the data into smaller chunks while building
tuples in ginFlushBuildState. That'd probably allow enforcing the
memory limit more strictly, because we sometimes hold multiple copies
of the TIDs arrays. But that's for master too.

regards

--
Tomas Vondra
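As a rough illustration of the "fair share" described above (simplified, not the exact PG18 code; the function name is hypothetical and maintenance_work_mem is the real GUC, in KB): each participant in a parallel index build gets maintenance_work_mem divided by the number of participants, rather than the full budget each.

#include "postgres.h"

#include "miscadmin.h"

/*
 * Simplified sketch: each of nparticipants (workers plus the leader, if
 * it participates) sorts within maintenance_work_mem / nparticipants KB,
 * so the per-worker TID chunks stay well below the full m_w_m budget.
 */
static int
participant_sort_mem(int nparticipants)
{
    return maintenance_work_mem / nparticipants;
}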
On 10/26/25 16:16, Tomas Vondra wrote:
> On 10/24/25 22:22, Gregory Smith wrote:
>> ...
>>
>> Sounds like (2) is probably mandatory and (1) is good hygiene.
>
> Yes, (2) is mandatory to fix this, and it's also sufficient. See the
> attached fix. I'll clean this up and push soon.
>
> ...
>
> Maybe we should split the data into smaller chunks while building
> tuples in ginFlushBuildState. That'd probably allow enforcing the
> memory limit more strictly, because we sometimes hold multiple copies
> of the TIDs arrays. But that's for master too.

I spoke too soon, apparently :-(

(2) is not actually a fix. It does fix some cases of invalid alloc size
failures, but the following call to ginMergeItemPointers() can hit that
too, because it does palloc() internally.

I didn't notice this before because of the other experimental changes,
and because it seems to depend on which of the OSM indexes is being
built, with how many workers, etc.

I was a bit puzzled how come we don't hit this with serial builds too,
because that calls ginMergeItemPointers() too. I guess that's just
luck, because with serial builds we're likely flushing the TID list in
smaller chunks, appending to an existing tuple. And it seems unlikely
to cross the alloc limit for any of those. But for parallel builds
we're pretty much guaranteed to see all TIDs for a key at once.

I see two ways to fix this:

a) Do the (re)palloc_huge change, but then also change the palloc call
   in ginMergeItemPointers. I'm not sure if we want to change the
   existing function, or create a static copy in gininsert.c with this
   tweak (it doesn't need anything else, so it's not that bad).

b) Do the data splitting in ginFlushBuildState, so that workers don't
   generate chunks larger than MaxAllocSize/nworkers (for any key). The
   leader then merges at most one chunk per worker at a time, so it
   still fits into the alloc limit.

Both seem to work. I like (a) more, because it's more consistent with
how I understand m_w_m. It's weird to say "use up to 20GB of memory"
and then the system overrides that with "1GB". I don't think it affects
performance, though.

I'll experiment with this a bit more, I just wanted to mention that the
fix I posted earlier does not actually fix the issue.

I also wonder how far we are from hitting the uint32 limits. AFAICS
with m_w_m=24GB we might end up with too many elements, even with
serial index builds. It'd have to be a quite weird data set, though.

regards

--
Tomas Vondra
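To make option (b) concrete, here is a sketch with hypothetical names; flush_key_in_chunks() and write_gin_tuple() stand in for the real ginFlushBuildState machinery, while MaxAllocSize, Min() and ItemPointerData are the real backend definitions. The worker emits the TIDs accumulated for one key in slices of at most MaxAllocSize / nworkers bytes, so the leader can merge one slice per worker without any single allocation crossing the limit.

#include "postgres.h"

#include "storage/itemptr.h"
#include "utils/memutils.h"

/* stand-in for building a GinTuple from a slice and feeding the tuplesort */
static void write_gin_tuple(ItemPointerData *tids, int64 ntids);

static void
flush_key_in_chunks(ItemPointerData *tids, int64 ntids, int nworkers)
{
    /* largest slice any worker may emit for a single key, in TIDs */
    int64   max_items = (MaxAllocSize / nworkers) / sizeof(ItemPointerData);
    int64   offset = 0;

    while (offset < ntids)
    {
        int64   nchunk = Min(max_items, ntids - offset);

        /* hand this slice of the key's TID list to the worker's sort */
        write_gin_tuple(&tids[offset], nchunk);
        offset += nchunk;
    }
}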