Discussion: BUG #18334: Segfault when running a query with parallel workers
The following bug has been logged on the website:

Bug reference:      18334
Logged by:          Marcin Barczyński
Email address:      mba.ogolny@gmail.com
PostgreSQL version: 13.13
Operating system:   Ubuntu 22.04.3 LTS
Description:

Obfuscated query:

WITH dt1 AS (
    SELECT right(d.p, -length('STR1') - 1) || 'STR4' || f.n AS p1
    FROM dc d
    INNER JOIN fc f ON f.pid = d.id AND f.vid = d.vid
    WHERE f.vid = func1('STR2')
      AND d.aids && ARRAY[(SELECT id FROM dc WHERE p = 'STR1' AND vid = func1('STR2'))]
      AND right(d.p, -length('STR1') - 1) || 'STR4' || f.n != ''
), dt2 AS (
    SELECT d.p || 'STR4' || f.n AS p2
    FROM dc d
    INNER JOIN fc f ON f.pid = d.id AND f.vid = d.vid
    WHERE f.vid = func1('STR3')
      AND d.aids && ARRAY[(SELECT id FROM dc WHERE p = '' AND vid = func1('STR3'))]
      AND d.p || 'STR4' || f.n != ''
)
SELECT dt2.p2
FROM dt1
RIGHT OUTER JOIN dt2 ON p1 = p2
WHERE p1 IS NULL;

Log messages:

2024-02-03 09:16:33.798 EST [3261686-102] app= LOG:  background worker "parallel worker" (PID 2387431) was terminated by signal 11: Segmentation fault
2024-02-03 09:16:33.798 EST [3261686-103] app= DETAIL:  Failed process was running: set max_parallel_workers=8; set work_mem='20GB';

Backtrace:

#0  0x0000557ba04345ac in dsa_get_address (area=0x557ba22e9668, dp=<optimized out>) at utils/mmgr/./build/../src/backend/utils/mmgr/dsa.c:955
#1  0x0000557ba014ec21 in ExecParallelHashNextTuple (tuple=0x7fc42a891560, hashtable=0x557ba233dcb8) at executor/./build/../src/backend/executor/nodeHash.c:3272
#2  ExecParallelScanHashBucket (hjstate=0x557ba22fdf28, econtext=0x557ba22fddf0) at executor/./build/../src/backend/executor/nodeHash.c:2059
#3  0x0000557ba01514b5 in ExecHashJoinImpl (parallel=<optimized out>, pstate=<optimized out>) at executor/./build/../src/backend/executor/nodeHashjoin.c:455
#4  ExecParallelHashJoin (pstate=<optimized out>) at executor/./build/../src/backend/executor/nodeHashjoin.c:637
#5  0x0000557ba013547d in ExecProcNodeInstr (node=0x557ba22fdf28) at executor/./build/../src/backend/executor/execProcnode.c:467
#6  0x0000557ba012b03d in ExecProcNode (node=0x557ba22fdf28) at executor/./build/../src/include/executor/executor.h:248
#7  ExecutePlan (execute_once=<optimized out>, dest=0x557ba2281a78, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x557ba22fdf28, estate=0x557ba22c1008) at executor/./build/../src/backend/executor/execMain.c:1632
#8  standard_ExecutorRun (queryDesc=0x557ba22d17c0, direction=<optimized out>, count=0, execute_once=<optimized out>) at executor/./build/../src/backend/executor/execMain.c:350
#9  0x00007fc42a976f25 in pgss_ExecutorRun (queryDesc=0x557ba22d17c0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1045
#10 0x00007fc42e5d56d2 in explain_ExecutorRun (queryDesc=0x557ba22d17c0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/auto_explain/auto_explain.c:334
#11 0x0000557ba0131ba9 in ExecutorRun (execute_once=true, count=<optimized out>, direction=ForwardScanDirection, queryDesc=0x557ba22d17c0) at executor/./build/../src/backend/executor/execMain.c:292
#12 ParallelQueryMain (seg=seg@entry=0x557ba2239b18, toc=toc@entry=0x7fc42a890000) at executor/./build/../src/backend/executor/execParallel.c:1448
#13 0x0000557b9fff010e in ParallelWorkerMain (main_arg=<optimized out>) at access/transam/./build/../src/backend/access/transam/parallel.c:1494
#14 0x0000557ba0231ada in StartBackgroundWorker () at postmaster/./build/../src/backend/postmaster/bgworker.c:890
#15 0x0000557ba0241ffe in do_start_bgworker (rw=<optimized out>) at postmaster/./build/../src/backend/postmaster/postmaster.c:5896
#16 maybe_start_bgworkers () at postmaster/./build/../src/backend/postmaster/postmaster.c:6121
#17 0x0000557ba024224d in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster/./build/../src/backend/postmaster/postmaster.c:5281
#18 <signal handler called>
#19 0x00007fc42d65959d in __GI___select (nfds=nfds@entry=8, readfds=readfds@entry=0x7ffda2d1ba20, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7ffda2d1b980) at ../sysdeps/unix/sysv/linux/select.c:69
#20 0x0000557ba02433d6 in ServerLoop () at postmaster/./build/../src/backend/postmaster/postmaster.c:1706
#21 0x0000557ba02450e5 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster/./build/../src/backend/postmaster/postmaster.c:1415
#22 0x0000557b9ff5a017 in main (argc=5, argv=0x557ba2121300) at main/./build/../src/backend/main/main.c:210

It happens non-deterministically but frequently in our environment. I have a core dump and will gladly send additional info if needed.
PG Bug reporting form <noreply@postgresql.org> writes:
> Log messages:
> 2024-02-03 09:16:33.798 EST [3261686-102] app= LOG:  background worker
> "parallel worker" (PID 2387431) was terminated by signal 11: Segmentation
> fault
> 2024-02-03 09:16:33.798 EST [3261686-103] app= DETAIL:  Failed process was
> running: set max_parallel_workers=8; set work_mem='20GB';

It's hard to do anything with just the query.  Can you put together a
self-contained test case, including table definitions and some sample
data?  (The data most likely could be dummy generated data.)  It would
also be useful to know what non-default settings you are using.

			regards, tom lane
On Tue, Feb 6, 2024 at 2:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> PG Bug reporting form <noreply@postgresql.org> writes:
> > Log messages:
> > 2024-02-03 09:16:33.798 EST [3261686-102] app= LOG:  background worker
> > "parallel worker" (PID 2387431) was terminated by signal 11: Segmentation
> > fault
> > 2024-02-03 09:16:33.798 EST [3261686-103] app= DETAIL:  Failed process was
> > running: set max_parallel_workers=8; set work_mem='20GB';
>
> It's hard to do anything with just the query.  Can you put together a
> self-contained test case, including table definitions and some sample
> data?  (The data most likely could be dummy generated data.)

No, not really. This issue happens on a production machine, and a large
volume of data (terabytes) is likely the cause of the error.

Regards,
Marcin Barczyński
On Wed, Feb 7, 2024 at 12:30 AM Marcin Barczyński <mba.ogolny@gmail.com> wrote:
> On Tue, Feb 6, 2024 at 2:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > It's hard to do anything with just the query.  Can you put together a
> > self-contained test case, including table definitions and some sample
> > data?  (The data most likely could be dummy generated data.)
>
> No, not really. This issue happens on a production machine, and a large
> volume of data (terabytes) is likely the cause of the error.

Hi,

Could you please show EXPLAIN ANALYZE for the query?  In gdb from that
core, can you please show "info proc mappings", and in frame 0 "print
*area", and in frame 1, "print *tuple" and "print *hashtable"?
Hi Thomas,

On Sun, Feb 11, 2024 at 10:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Could you please show EXPLAIN ANALYZE for the query?  In gdb from that
> core, can you please show "info proc mappings", and in frame 0 "print
> *area", and in frame 1, "print *tuple" and "print *hashtable"?

I'm sorry for my late reply. It happened again, and I'm pasting the info
you requested from the core dump. PostgreSQL 13.15.

Stack trace:

#0  0x000056134d5bb011 in dsa_free (area=0x56134e07d718, dp=<optimized out>) at utils/mmgr/./build/../src/backend/utils/mmgr/dsa.c:840
840     utils/mmgr/./build/../src/backend/utils/mmgr/dsa.c: No such file or directory.
(gdb) bt
#0  0x000056134d5bb011 in dsa_free (area=0x56134e07d718, dp=<optimized out>) at utils/mmgr/./build/../src/backend/utils/mmgr/dsa.c:840
#1  0x000056134d2d6a0c in ExecHashTableDetachBatch (hashtable=hashtable@entry=0x56134e154540) at executor/./build/../src/backend/executor/nodeHash.c:3181
#2  0x000056134d2d821a in ExecParallelHashJoinNewBatch (hjstate=0x56134e087b48) at executor/./build/../src/backend/executor/nodeHashjoin.c:1131
#3  ExecHashJoinImpl (parallel=<optimized out>, pstate=<optimized out>) at executor/./build/../src/backend/executor/nodeHashjoin.c:590
#4  ExecParallelHashJoin (pstate=<optimized out>) at executor/./build/../src/backend/executor/nodeHashjoin.c:637
#5  0x000056134d2bbffd in ExecProcNodeInstr (node=0x56134e087b48) at executor/./build/../src/backend/executor/execProcnode.c:467
#6  0x000056134d2b1bbd in ExecProcNode (node=0x56134e087b48) at executor/./build/../src/include/executor/executor.h:248
#7  ExecutePlan (execute_once=<optimized out>, dest=0x56134dfe1fe8, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x56134e087b48, estate=0x56134e087858) at executor/./build/../src/backend/executor/execMain.c:1632
#8  standard_ExecutorRun (queryDesc=0x56134e0783e0, direction=<optimized out>, count=0, execute_once=<optimized out>) at executor/./build/../src/backend/executor/execMain.c:350
#9  0x00007f3a734c9f25 in pgss_ExecutorRun (queryDesc=0x56134e0783e0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1045
#10 0x00007f3a771296d2 in explain_ExecutorRun (queryDesc=0x56134e0783e0, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at ./build/../contrib/auto_explain/auto_explain.c:334
#11 0x000056134d2b8729 in ExecutorRun (execute_once=true, count=<optimized out>, direction=ForwardScanDirection, queryDesc=0x56134e0783e0) at executor/./build/../src/backend/executor/execMain.c:292
#12 ParallelQueryMain (seg=seg@entry=0x56134df98db8, toc=toc@entry=0x7f321dfa4000) at executor/./build/../src/backend/executor/execParallel.c:1448
#13 0x000056134d1767ce in ParallelWorkerMain (main_arg=<optimized out>) at access/transam/./build/../src/backend/access/transam/parallel.c:1494
#14 0x000056134d3b981a in StartBackgroundWorker () at postmaster/./build/../src/backend/postmaster/bgworker.c:890
#15 0x000056134d3c963e in do_start_bgworker (rw=<optimized out>) at postmaster/./build/../src/backend/postmaster/postmaster.c:5896
#16 maybe_start_bgworkers () at postmaster/./build/../src/backend/postmaster/postmaster.c:6121
#17 0x000056134d3c988d in sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster/./build/../src/backend/postmaster/postmaster.c:5281
#18 <signal handler called>
#19 0x00007f3a761ac59d in __GI___select (nfds=nfds@entry=8, readfds=readfds@entry=0x7fff97c44720, writefds=writefds@entry=0x0, exceptfds=exceptfds@entry=0x0, timeout=timeout@entry=0x7fff97c44680) at ../sysdeps/unix/sysv/linux/select.c:69
#20 0x000056134d3caa16 in ServerLoop () at postmaster/./build/../src/backend/postmaster/postmaster.c:1706
#21 0x000056134d3cc725 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster/./build/../src/backend/postmaster/postmaster.c:1415
#22 0x000056134d0e0377 in main (argc=5, argv=0x56134de8d300) at main/./build/../src/backend/main/main.c:210

(gdb) info proc mappings
Mapped address spaces:
    Start Addr       End Addr       Size     Offset  objfile
0x56134cfab000 0x56134d068000    0xbd000        0x0  /usr/lib/postgresql/13/bin/postgres
0x56134d068000 0x56134d60b000   0x5a3000    0xbd000  /usr/lib/postgresql/13/bin/postgres
0x56134d60b000 0x56134d827000   0x21c000   0x660000  /usr/lib/postgresql/13/bin/postgres
0x56134d827000 0x56134d845000    0x1e000   0x87b000  /usr/lib/postgresql/13/bin/postgres
0x56134d845000 0x56134d854000     0xf000   0x899000  /usr/lib/postgresql/13/bin/postgres
0x7f2e9599e000 0x7f2f1599e000 0x80000000        0x0  /dev/shm/PostgreSQL.940706000

(gdb) print *area
$1 = {control = 0x7f321dfa4500, mapping_pinned = false, segment_maps = {
    {segment = 0x0, mapped_address = 0x7f321dfa4500 "", header = 0x7f321dfa4500, fpm = 0x7f321dfa5d20, pagemap = 0x7f321dfa6168},
    {segment = 0x56134dfa1ec8, mapped_address = 0x7f3216cd8000 "", header = 0x7f3216cd8000, fpm = 0x7f3216cd8038, pagemap = 0x7f3216cd8480},
    {segment = 0x56134dfa1f18, mapped_address = 0x7f31f6bd7000 "", header = 0x7f31f6bd7000, fpm = 0x7f31f6bd7038, pagemap = 0x7f31f6bd7480},
    {segment = 0x56134dfa2078, mapped_address = 0x7f30d60a6000 "", header = 0x7f30d60a6000, fpm = 0x7f30d60a6038, pagemap = 0x7f30d60a6480},
    {segment = 0x56134dfa2118, mapped_address = 0x7f30d58a6000 "", header = 0x7f30d58a6000, fpm = 0x7f30d58a6038, pagemap = 0x7f30d58a6480},
    {segment = 0x56134dfa20c8, mapped_address = 0x7f30d5ca6000 "", header = 0x7f30d5ca6000, fpm = 0x7f30d5ca6038, pagemap = 0x7f30d5ca6480},
    {segment = 0x56134dfa2168, mapped_address = 0x7f30d50a6000 "", header = 0x7f30d50a6000, fpm = 0x7f30d50a6038, pagemap = 0x7f30d50a6480},
    {segment = 0x56134dfa21b8, mapped_address = 0x7f30d449e000 "", header = 0x7f30d449e000, fpm = 0x7f30d449e038, pagemap = 0x7f30d449e480},
    {segment = 0x56134dfa2208, mapped_address = 0x7f30d2c90000 "", header = 0x7f30d2c90000, fpm = 0x7f30d2c90038, pagemap = 0x7f30d2c90480},
    {segment = 0x56134dfa2258, mapped_address = 0x7f30cfc76000 "", header = 0x7f30cfc76000, fpm = 0x7f30cfc76038, pagemap = 0x7f30cfc76480},
    {segment = 0x56134ee12048, mapped_address = 0x7f307599e000 "", header = 0x7f307599e000, fpm = 0x7f307599e038, pagemap = 0x7f307599e480},
    {segment = 0x56134ee11ff8, mapped_address = 0x7f307b9d0000 "", header = 0x7f307b9d0000, fpm = 0x7f307b9d0038, pagemap = 0x7f307b9d0480},
    {segment = 0x56134ee11fa8, mapped_address = 0x7f3087a32000 "", header = 0x7f3087a32000, fpm = 0x7f3087a32038, pagemap = 0x7f3087a32480},
    {segment = 0x56134dfa2dd8, mapped_address = 0x7f309faf4000 "", header = 0x7f309faf4000, fpm = 0x7f309faf4038, pagemap = 0x7f309faf4480},
    {segment = 0x56134dfa1fb8, mapped_address = 0x7f30d62d3000 "", header = 0x7f30d62d3000, fpm = 0x7f30d62d3038, pagemap = 0x7f30d62d3480},
    {segment = 0x56134dfa1f68, mapped_address = 0x7f31365d5000 "", header = 0x7f31365d5000, fpm = 0x7f31365d5038, pagemap = 0x7f31365d5480},
    {segment = 0x56134ee12098, mapped_address = 0x7f306599e000 "", header = 0x7f306599e000, fpm = 0x7f306599e038, pagemap = 0x7f306599e480},
    {segment = 0x56134ee120e8, mapped_address = 0x7f305599e000 "", header = 0x7f305599e000, fpm = 0x7f305599e038, pagemap = 0x7f305599e480},
    {segment = 0x56134ee12138, mapped_address = 0x7f303599e000 "", header = 0x7f303599e000, fpm = 0x7f303599e038, pagemap = 0x7f303599e480},
    {segment = 0x56134ee12188, mapped_address = 0x7f301599e000 "", header = 0x7f301599e000, fpm = 0x7f301599e038, pagemap = 0x7f301599e480},
    {segment = 0x56134ee121d8, mapped_address = 0x7f2fd599e000 "", header = 0x7f2fd599e000, fpm = 0x7f2fd599e038, pagemap = 0x7f2fd599e480},
    {segment = 0x56134ee12228, mapped_address = 0x7f2f9599e000 "", header = 0x7f2f9599e000, fpm = 0x7f2f9599e038, pagemap = 0x7f2f9599e480},
    {segment = 0x56134ee12278, mapped_address = 0x7f2f1599e000 "", header = 0x7f2f1599e000, fpm = 0x7f2f1599e038, pagemap = 0x7f2f1599e480},
    {segment = 0x56134ee122c8, mapped_address = 0x7f2e9599e000 "", header = 0x7f2e9599e000, fpm = 0x7f2e9599e038, pagemap = 0x7f2e9599e480},
    {segment = 0x0, mapped_address = 0x0, header = 0x0, fpm = 0x0, pagemap = 0x0} <repeats 1000 times>},
  high_segment_index = 23, freed_segment_counter = 0}

(gdb) frame 1
(gdb) print *hashtable
$2 = {nbuckets = 67108864, log2_nbuckets = 26, nbuckets_original = 67108864, nbuckets_optimal = 67108864, log2_nbuckets_optimal = 26,
  buckets = {unshared = 0x7f31f6cd8000, shared = 0x7f31f6cd8000}, keepNulls = false, skewEnabled = false, skewBucket = 0x0,
  skewBucketLen = 0, nSkewBuckets = 0, skewBucketNums = 0x0, nbatch = 1, curbatch = 0, nbatch_original = 1, nbatch_outstart = 1,
  growEnabled = true, totalTuples = 65785362, partialTuples = 5057580, skewTuples = 0, innerBatchFile = 0x0, outerBatchFile = 0x0,
  outer_hashfunctions = 0x56134e1e04b8, inner_hashfunctions = 0x56134e1e0508, hashStrict = 0x56134e1e0558, collations = 0x56134e1e0570,
  spaceUsed = 0, spaceAllowed = 13958643712, spacePeak = 0, spaceUsedSkew = 0, spaceAllowedSkew = 279172874,
  hashCxt = 0x56134e1e03a0, batchCxt = 0x56134e1e23b0, chunks = 0x0, current_chunk = 0x0, area = 0x56134e07d718,
  parallel_state = 0x7f321dfa4400, batches = 0x56134e1e07f8, current_chunk_shared = 0}

This is the code where the crash happened
(https://github.com/postgres/postgres/blob/8e5faba4b918ba6142339c8f55eaa4eb99776a03/src/backend/utils/mmgr/dsa.c#L835-L840):

	/* Locate the object, span and pool. */
	segment_map = get_segment_by_index(area, DSA_EXTRACT_SEGMENT_NUMBER(dp));
	pageno = DSA_EXTRACT_OFFSET(dp) / FPM_PAGE_SIZE;
	span_pointer = segment_map->pagemap[pageno];
	span = dsa_get_address(area, span_pointer);
	superblock = dsa_get_address(area, span->start);

(gdb) print *segment_map
$4 = {segment = 0x56134dfa2dd8, mapped_address = 0x7f309faf4000 "", header = 0x7f309faf4000, fpm = 0x7f309faf4038, pagemap = 0x7f309faf4480}
(gdb) print pageno
$5 = 196979
(gdb) print span_pointer
$6 = 0

It looks like when `span_pointer` is 0, `span` is NULL, and `span->start` causes the segfault.
`span_pointer` is 0 because all of `segment_map->pagemap` is zeros:

(gdb) print segment_map->pagemap[0]
$10 = 0
(gdb) print segment_map->pagemap[1]
$11 = 0
(gdb) print segment_map->pagemap[2]
$12 = 0
(gdb) print segment_map->pagemap[265]
$14 = 0
(gdb) print segment_map->pagemap[187387]
$15 = 0
(gdb) print segment_map->pagemap[196979]
$16 = 0

Regards,
Marcin Barczyński
On Thu, May 23, 2024 at 11:59 PM Marcin Barczyński <mba.ogolny@gmail.com> wrote:
> (gdb) print *segment_map
> $4 = {segment = 0x56134dfa2dd8, mapped_address = 0x7f309faf4000 "",
> header = 0x7f309faf4000, fpm = 0x7f309faf4038, pagemap = 0x7f309faf4480}
>
> (gdb) print pageno
> $5 = 196979

Hmm.  Page 196979 is at an offset of around 769MB within the segment
(pages here are 4k).  What does segment_map->segment->mapped_size show?
It's OK for the pagemap to contain zeroes, but it should contain
non-zero values for pages that contain the start of an allocated
object.  The actual dsa_pointer has been optimised out, but it should be
visible from frame #1 as batch->chunks.  I think its higher 24 bits
should contain 13 (the element of area->segment_maps that seems to
correspond to the above), and its lower 40 bits should contain that
number, ~769MB.

The things that are unusually high so far in your emails are worker
count and work_mem, so it can make quite large hash tables, in your
case up to 13GB.  Perhaps there is a silly arithmetic/type problem
around large numbers somewhere (perhaps somewhere near 4GB+ segments,
but I don't expect segment #13 to be very large IIRC).  But then that
would fail more often, I think...  It seems to be rare/intermittent,
and yet you don't have any batching or re-bucketing in your problem
(nbatch and nbuckets have their original values), so a lot of the more
complex parts of the PHJ code are not in play here.

Hmm.  I wondered if the tricky edge case where a segment gets unmapped
and then remapped in the same slot could be leading to segment
confusion.  That does involve a bit of memory-order footwork.  What CPU
architecture is this?  But alas I can't come up with any case where
that could go wrong even if there is an unknown bug in that area,
because the no-rebatching, no-rebucketing case doesn't free anything
until the end, when it frees everything (i.e. it never frees something
and then allocates, a requirement for slot re-use).
On Fri, May 24, 2024 at 12:45 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I wondered if the tricky edge case where a segment gets unmapped and
> then remapped in the same slot could be leading to segment confusion.
> That does involve a bit of memory-order footwork.  What CPU
> architecture is this?  But alas I can't come up with any case where
> that could go wrong even if there is an unknown bug in that area,
> because the no-rebatching, no-rebucketing case doesn't free anything
> until the end, when it frees everything (i.e. it never frees something
> and then allocates, a requirement for slot re-use).

... but if I'm missing something there, it might be a clue visible
from gdb if area->control->freed_segment_counter (the one in shared
memory) and area->freed_segment_counter (the one in this backend) have
different values, if your core captured the segments.
Thank you for looking into this.

On Fri, May 24, 2024 at 3:33 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> What does segment_map->segment->mapped_size show?

(gdb) print *(segment_map->segment)
$3 = {node = {prev = 0x56134ee11fa8, next = 0x56134dfa2258}, resowner = 0x56134df98a98, handle = 2051931009,
  control_slot = 30, impl_private = 0x0, mapped_address = 0x7f309faf4000, mapped_size = 806887424,
  on_detach = {head = {next = 0x0}}}

> The actual dsa_pointer has been optimised out but
> should be visible from frame #1 as batch->chunks.

(gdb) frame 1
(gdb) print *batch
$4 = {buckets = 0, batch_barrier = {mutex = 0 '\000', phase = 0, participants = 0, arrived = 0, elected = 0,
  static_party = false, condition_variable = {mutex = 0 '\000', wakeup = {head = 0, tail = 0}}},
  chunks = 0, size = 0, estimated_size = 0, ntuples = 0, old_ntuples = 0, space_exhausted = false}

> What CPU architecture is this?

x64, AMD EPYC 9374F

> ... but if I'm missing something there, it might be a clue visible
> from gdb if area->control->freed_segment_counter (the one in shared
> memory) and area->freed_segment_counter (the one in this backend) have
> different values, if your core captured the segments.
(gdb) p *area->control
$1 = {segment_header = {magic = 0, usable_pages = 0, size = 0, prev = 0, next = 0, bin = 0, freed = false}, handle = 0,
  segment_handles = {0 <repeats 1024 times>}, segment_bins = {0 <repeats 16 times>},
  pools = {{lock = {tranche = 0, state = {value = 0}, waiters = {head = 0, tail = 0}}, spans = {0, 0, 0, 0}} <repeats 38 times>},
  total_segment_size = 0, max_total_segment_size = 0, high_segment_index = 0, refcnt = 0, pinned = false,
  freed_segment_counter = 0, lwlock_tranche_id = 0, lock = {tranche = 0, state = {value = 0}, waiters = {head = 0, tail = 0}}}

(gdb) p *area
$2 = {control = 0x7f321dfa4500, mapping_pinned = false, segment_maps = {
    {segment = 0x0, mapped_address = 0x7f321dfa4500 "", header = 0x7f321dfa4500, fpm = 0x7f321dfa5d20, pagemap = 0x7f321dfa6168},
    {segment = 0x56134dfa1ec8, mapped_address = 0x7f3216cd8000 "", header = 0x7f3216cd8000, fpm = 0x7f3216cd8038, pagemap = 0x7f3216cd8480},
    {segment = 0x56134dfa1f18, mapped_address = 0x7f31f6bd7000 "", header = 0x7f31f6bd7000, fpm = 0x7f31f6bd7038, pagemap = 0x7f31f6bd7480},
    {segment = 0x56134dfa2078, mapped_address = 0x7f30d60a6000 "", header = 0x7f30d60a6000, fpm = 0x7f30d60a6038, pagemap = 0x7f30d60a6480},
    {segment = 0x56134dfa2118, mapped_address = 0x7f30d58a6000 "", header = 0x7f30d58a6000, fpm = 0x7f30d58a6038, pagemap = 0x7f30d58a6480},
    {segment = 0x56134dfa20c8, mapped_address = 0x7f30d5ca6000 "", header = 0x7f30d5ca6000, fpm = 0x7f30d5ca6038, pagemap = 0x7f30d5ca6480},
    {segment = 0x56134dfa2168, mapped_address = 0x7f30d50a6000 "", header = 0x7f30d50a6000, fpm = 0x7f30d50a6038, pagemap = 0x7f30d50a6480},
    {segment = 0x56134dfa21b8, mapped_address = 0x7f30d449e000 "", header = 0x7f30d449e000, fpm = 0x7f30d449e038, pagemap = 0x7f30d449e480},
    {segment = 0x56134dfa2208, mapped_address = 0x7f30d2c90000 "", header = 0x7f30d2c90000, fpm = 0x7f30d2c90038, pagemap = 0x7f30d2c90480},
    {segment = 0x56134dfa2258, mapped_address = 0x7f30cfc76000 "", header = 0x7f30cfc76000, fpm = 0x7f30cfc76038, pagemap = 0x7f30cfc76480},
    {segment = 0x56134ee12048, mapped_address = 0x7f307599e000 "", header = 0x7f307599e000, fpm = 0x7f307599e038, pagemap = 0x7f307599e480},
    {segment = 0x56134ee11ff8, mapped_address = 0x7f307b9d0000 "", header = 0x7f307b9d0000, fpm = 0x7f307b9d0038, pagemap = 0x7f307b9d0480},
    {segment = 0x56134ee11fa8, mapped_address = 0x7f3087a32000 "", header = 0x7f3087a32000, fpm = 0x7f3087a32038, pagemap = 0x7f3087a32480},
    {segment = 0x56134dfa2dd8, mapped_address = 0x7f309faf4000 "", header = 0x7f309faf4000, fpm = 0x7f309faf4038, pagemap = 0x7f309faf4480},
    {segment = 0x56134dfa1fb8, mapped_address = 0x7f30d62d3000 "", header = 0x7f30d62d3000, fpm = 0x7f30d62d3038, pagemap = 0x7f30d62d3480},
    {segment = 0x56134dfa1f68, mapped_address = 0x7f31365d5000 "", header = 0x7f31365d5000, fpm = 0x7f31365d5038, pagemap = 0x7f31365d5480},
    {segment = 0x56134ee12098, mapped_address = 0x7f306599e000 "", header = 0x7f306599e000, fpm = 0x7f306599e038, pagemap = 0x7f306599e480},
    {segment = 0x56134ee120e8, mapped_address = 0x7f305599e000 "", header = 0x7f305599e000, fpm = 0x7f305599e038, pagemap = 0x7f305599e480},
    {segment = 0x56134ee12138, mapped_address = 0x7f303599e000 "", header = 0x7f303599e000, fpm = 0x7f303599e038, pagemap = 0x7f303599e480},
    {segment = 0x56134ee12188, mapped_address = 0x7f301599e000 "", header = 0x7f301599e000, fpm = 0x7f301599e038, pagemap = 0x7f301599e480},
    {segment = 0x56134ee121d8, mapped_address = 0x7f2fd599e000 "", header = 0x7f2fd599e000, fpm = 0x7f2fd599e038, pagemap = 0x7f2fd599e480},
    {segment = 0x56134ee12228, mapped_address = 0x7f2f9599e000 "", header = 0x7f2f9599e000, fpm = 0x7f2f9599e038, pagemap = 0x7f2f9599e480},
    {segment = 0x56134ee12278, mapped_address = 0x7f2f1599e000 "", header = 0x7f2f1599e000, fpm = 0x7f2f1599e038, pagemap = 0x7f2f1599e480},
    {segment = 0x56134ee122c8, mapped_address = 0x7f2e9599e000 "", header = 0x7f2e9599e000, fpm = 0x7f2e9599e038, pagemap = 0x7f2e9599e480},
    {segment = 0x0, mapped_address = 0x0, header = 0x0, fpm = 0x0, pagemap = 0x0} <repeats 1000 times>},
  high_segment_index = 23, freed_segment_counter = 0}

I hope this sheds some light on the issue.

Best regards,
Marcin Barczyński
Hello!

If it would make things easier, I can share the core dump.

On Fri, May 24, 2024 at 10:26 AM Marcin Barczyński <mba.ogolny@gmail.com> wrote:
> Thank you for looking into this.