Discussion: BUG #17705: Segmentation fault in BufFileLoadBuffer


BUG #17705: Segmentation fault in BufFileLoadBuffer

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17705
Logged by:          Vahur Sinijärv
Email address:      vahursi@gmail.com
PostgreSQL version: 13.9
Operating system:   RHEL 8.7
Description:

Hi!

We started having random segmentation faults with our PostgreSQL 13.4 server
running on RHEL 8.7. It has since been upgraded to 13.9, but the issue
persists. The database is fairly large, about 250 GB on disk.

I got a core dump of one of the crashes and it shows a SIGSEGV in
BufFileLoadBuffer. I tried to investigate this a little, and it seems the
reason is that ExecHashJoinGetSavedTuple reads {0, 0} as the header,
meaning the hashvalue is 0 and the tuple length is 0. Line 1277 in
nodeHashjoin.c subtracts sizeof(uint32) from 0 and passes the result as
size to BufFileRead(). GDB shows size=18446744073627287632 at frame #1,
which is not ((uint64_t) -4) but -82263984. I think this is because
BufFileRead decrements the 'size' parameter by the number of bytes read,
so apparently it has read 82263980 bytes, overwriting the BufFile struct
passed to BufFileLoadBuffer. Its files field now contains ASCII data
instead of a pointer, and file->files[file->curFile] causes the SIGSEGV.
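
Replaying that arithmetic in Python shows the numbers line up (these are
just the values from the core dump):

# 0 - sizeof(uint32), computed in 64-bit unsigned arithmetic:
size = (0 - 4) % 2**64
print(size)                     # 18446744073709551612, i.e. (uint64_t) -4
# BufFileRead subtracts the number of bytes it managed to read; reading
# 82263980 bytes gives exactly the value GDB shows in frame #1:
print(size - 82263980)          # 18446744073627287632
print(size - 82263980 - 2**64)  # -82263984, the same value read as signed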

Why it read {0, 0} as the saved tuple header, or what could have written
these zeroes there, I could not find out...

Stack trace from core dump:
Core was generated by `postgres: mmx mmx ********(32806) INSERT
                      '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000077db27 in BufFileLoadBuffer (file=0x7c15548) at
buffile.c:430
430             thisfile = file->files[file->curFile];
(gdb) bt
#0  0x000000000077db27 in BufFileLoadBuffer (file=0x7c15548) at
buffile.c:430
#1  BufFileRead (file=file@entry=0x7c15548, ptr=0x7c15c68,
ptr@entry=0x2da1cbc, size=18446744073627287632)
    at buffile.c:554
#2  0x000000000065b40c in ExecHashJoinGetSavedTuple (file=0x7c15548,
hashvalue=hashvalue@entry=0x7ffe58d22b4c,
    tupleSlot=0x2d6a3b8, hjstate=0x2df8d88) at nodeHashjoin.c:1277
#3  0x000000000065ba96 in ExecHashJoinOuterGetTuple
(hashvalue=0x7ffe58d22b4c, hjstate=0x2df8d88, outerNode=0x2db57f8)
    at nodeHashjoin.c:866
#4  ExecHashJoinImpl (parallel=false, pstate=0x2df8d88) at
nodeHashjoin.c:356
#5  ExecHashJoin (pstate=0x2df8d88) at nodeHashjoin.c:591
#6  0x000000000065bef5 in ExecProcNode (node=0x2df8d88) at
../../../src/include/executor/executor.h:248
#7  ExecHashJoinOuterGetTuple (hashvalue=0x7ffe58d22bdc, hjstate=0x2df8ae8,
outerNode=0x2df8d88) at nodeHashjoin.c:826
#8  ExecHashJoinImpl (parallel=false, pstate=0x2df8ae8) at
nodeHashjoin.c:356
#9  ExecHashJoin (pstate=0x2df8ae8) at nodeHashjoin.c:591
#10 0x000000000065bef5 in ExecProcNode (node=0x2df8ae8) at
../../../src/include/executor/executor.h:248
#11 ExecHashJoinOuterGetTuple (hashvalue=0x7ffe58d22c6c, hjstate=0x2df8848,
outerNode=0x2df8ae8) at nodeHashjoin.c:826
#12 ExecHashJoinImpl (parallel=false, pstate=0x2df8848) at
nodeHashjoin.c:356
#13 ExecHashJoin (pstate=0x2df8848) at nodeHashjoin.c:591
#14 0x000000000065bef5 in ExecProcNode (node=0x2df8848) at
../../../src/include/executor/executor.h:248
#15 ExecHashJoinOuterGetTuple (hashvalue=0x7ffe58d22cfc, hjstate=0x2df85e8,
outerNode=0x2df8848) at nodeHashjoin.c:826
#16 ExecHashJoinImpl (parallel=false, pstate=0x2df85e8) at
nodeHashjoin.c:356
#17 ExecHashJoin (pstate=0x2df85e8) at nodeHashjoin.c:591
#18 0x0000000000666209 in ExecProcNode (node=0x2df85e8) at
../../../src/include/executor/executor.h:248
#19 ExecModifyTable (pstate=0x2df7648) at nodeModifyTable.c:2178
#20 0x000000000063b153 in ExecProcNode (node=0x2df7648) at
../../../src/include/executor/executor.h:248
#21 ExecutePlan (execute_once=<optimized out>, dest=0xd41480 <donothingDR>,
direction=<optimized out>, numberTuples=0,
    sendTuples=<optimized out>, operation=CMD_INSERT,
use_parallel_mode=<optimized out>, planstate=0x2df7648,
    estate=0x2df7218) at execMain.c:1632
#22 standard_ExecutorRun (queryDesc=0x2c2a188, direction=<optimized out>,
count=0, execute_once=<optimized out>)
    at execMain.c:350
#23 0x00000000007aee02 in ProcessQuery (plan=<optimized out>,
sourceText=0x2df1688 "", params=0x0, queryEnv=0x0,
    dest=0xd41480 <donothingDR>, qc=0x7ffe58d231e0) at pquery.c:160
#24 0x00000000007af86b in PortalRunMulti (portal=portal@entry=0x2bc97e8,
isTopLevel=isTopLevel@entry=true,
    setHoldSnapshot=setHoldSnapshot@entry=false, dest=0xd41480
<donothingDR>, dest@entry=0x2b58978,
    altdest=0xd41480 <donothingDR>, altdest@entry=0x2b58978,
qc=qc@entry=0x7ffe58d231e0) at pquery.c:1271
#25 0x00000000007afbda in PortalRun (portal=0x2bc97e8,
count=9223372036854775807, isTopLevel=<optimized out>,
    run_once=<optimized out>, dest=0x2b58978, altdest=0x2b58978,
qc=0x7ffe58d231e0) at pquery.c:788
#26 0x00000000007ad016 in PostgresMain (argc=<optimized out>,
argv=<optimized out>, dbname=<optimized out>,
    username=<optimized out>) at postgres.c:2185
#27 0x000000000072e4ed in BackendRun (port=0x2b786b0, port=0x2b786b0) at
postmaster.c:4550
#28 BackendStartup (port=0x2b786b0) at postmaster.c:4234
#29 ServerLoop () at postmaster.c:1739
#30 0x000000000072f460 in PostmasterMain (argc=argc@entry=15,
argv=argv@entry=0x2b50b00) at postmaster.c:1412
#31 0x0000000000486dd0 in main (argc=15, argv=0x2b50b00) at main.c:210
(gdb) print file
$1 = (BufFile *) 0x7c15548
(gdb) print file->curOffset
$2 = 0
(gdb) print file->files
$3 = (File *) 0x3038333631332c30
(gdb) print file->curFile
$4 = -819986428
(gdb)
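
Incidentally, the corrupted files pointer decodes to ASCII digits and a
comma when its bytes are read back in little-endian order, consistent with
tuple data having overwritten the struct. A quick check in Python, using
the value printed above:

import struct

# On x86-64 the struct's bytes are little-endian, so pack the bogus
# pointer value back into its in-memory byte order:
print(struct.pack("<Q", 0x3038333631332c30))  # b'0,316380'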


Re: BUG #17705: Segmentation fault in BufFileLoadBuffer

From
Thomas Munro
Date:
On Sun, Dec 4, 2022 at 10:57 AM PG Bug reporting form
<noreply@postgresql.org> wrote:
> We started having random segmentation faults with our PostgreSQL 13.4 server
> running on RHEL 8.7. It has since been upgraded to 13.9, but the issue
> persists. The database is fairly large, about 250 GB on disk.
>
> I got a core dump of one of the crashes and it shows a SIGSEGV in
> BufFileLoadBuffer. I tried to investigate this a little, and it seems the
> reason is that ExecHashJoinGetSavedTuple reads {0, 0} as the header,
> meaning the hashvalue is 0 and the tuple length is 0. Line 1277 in
> nodeHashjoin.c subtracts sizeof(uint32) from 0 and passes the result as
> size to BufFileRead(). GDB shows size=18446744073627287632 at frame #1,
> which is not ((uint64_t) -4) but -82263984. I think this is because
> BufFileRead decrements the 'size' parameter by the number of bytes read,
> so apparently it has read 82263980 bytes, overwriting the BufFile struct
> passed to BufFileLoadBuffer. Its files field now contains ASCII data
> instead of a pointer, and file->files[file->curFile] causes the SIGSEGV.
>
> Why it read {0, 0} as the saved tuple header, or what could have written
> these zeroes there, I could not find out...

Are you able to reproduce this on demand?  Can you get your hands on
the temporary file(s) it's reading?  How large is it/are they?
Perhaps we could write a little Python/whatever script to read the
tuples back one at a time until it hits this {0, 0} header to confirm
that it's definitely there, i.e. the bad header has actually been
written out, which would help narrow down the location of the bug.
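
Something along these lines might do it (a rough sketch only, assuming the
on-disk layout described in the report: a uint32 hash value followed by a
MinimalTuple whose leading uint32 is its total length, including the
length word itself; the file path is a placeholder):

import struct
import sys

# Walk a hash join spill file tuple by tuple and stop at a {0, 0} header.
path = sys.argv[1]  # e.g. base/pgsql_tmp/pgsql_tmpNNNNN.NN

with open(path, "rb") as f:
    ntuples = 0
    while True:
        header = f.read(8)
        if len(header) < 8:
            print("clean EOF after", ntuples, "tuples")
            break
        hashvalue, t_len = struct.unpack("=II", header)
        if hashvalue == 0 and t_len == 0:
            print("{0, 0} header at offset", f.tell() - 8,
                  "after", ntuples, "tuples")
            break
        if t_len < 4:
            print("implausible tuple length", t_len, "at offset", f.tell() - 8)
            break
        f.seek(t_len - 4, 1)  # skip the rest of the MinimalTuple
        ntuples += 1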



Re: BUG #17705: Segmentation fault in BufFileLoadBuffer

From
Vahur Sinijärv
Date:
Hi!

Where are these files normally created, and what names do they have? We may still have them from the last crash.

Vahur

> On 5. Dec 2022, at 05:41, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Sun, Dec 4, 2022 at 10:57 AM PG Bug reporting form
> <noreply@postgresql.org> wrote:
>> We started having random segmentation faults with our PostgreSQL 13.4 server
>> running on RHEL 8.7. It has since been upgraded to 13.9, but the issue
>> persists. The database is fairly large, about 250 GB on disk.
>>
>> I got a core dump of one of the crashes and it shows a SIGSEGV in
>> BufFileLoadBuffer. I tried to investigate this a little, and it seems the
>> reason is that ExecHashJoinGetSavedTuple reads {0, 0} as the header,
>> meaning the hashvalue is 0 and the tuple length is 0. Line 1277 in
>> nodeHashjoin.c subtracts sizeof(uint32) from 0 and passes the result as
>> size to BufFileRead(). GDB shows size=18446744073627287632 at frame #1,
>> which is not ((uint64_t) -4) but -82263984. I think this is because
>> BufFileRead decrements the 'size' parameter by the number of bytes read,
>> so apparently it has read 82263980 bytes, overwriting the BufFile struct
>> passed to BufFileLoadBuffer. Its files field now contains ASCII data
>> instead of a pointer, and file->files[file->curFile] causes the SIGSEGV.
>>
>> Why it read {0, 0} as the saved tuple header, or what could have written
>> these zeroes there, I could not find out...
>
> Are you able to reproduce this on demand?  Can you get your hands on
> the temporary file(s) it's reading?  How large is it/are they?
> Perhaps we could write a little Python/whatever script to read the
> tuples back one at a time until it hits this {0, 0} header to confirm
> that it's definitely there, i.e. the bad header has actually been
> written out, which would help narrow down the location of the bug.



Re: BUG #17705: Segmentation fault in BufFileLoadBuffer

From
Thomas Munro
Date:
On Mon, Dec 5, 2022 at 7:24 PM Vahur Sinijärv <vahursi@icloud.com> wrote:
> Where are these files normally created, and what names do they have? We may still have them from the last crash.

Something like base/pgsql_tmp/pgsql_tmp21949.19, where 21949 is a PID
(print MyProcPid in your core) and 19 is a counter that keeps going up.
It may be hard to figure out which one it is, though: a single hash join
uses some power-of-two number of logical spill files, and if any of them
exceeded 1GB it would spill over into another file with a hard-to-predict
suffix (for historical reasons; 90s computers couldn't all handle large
files), so it's hard to see which ones go together.  Hopefully your files
didn't exceed 1GB, so you'll see a set of files all < 1GB created around
the same time, and none of them are 'continuations' that are hard to
stitch together.  Then you could start at the beginning of each file and
walk tuple by tuple to see if you find the {0, 0} header.
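
If it helps, a rough sketch for listing candidates, assuming the default
data directory layout (the directory and PID below are placeholders):

import glob
import os

datadir = "/var/lib/pgsql/13/data"  # placeholder: your data directory
pid = 21949                         # placeholder: MyProcPid from the core

# List this backend's spill files under pgsql_tmp, oldest first, so files
# created around the same time can be grouped together.
pattern = os.path.join(datadir, "base", "pgsql_tmp", "pgsql_tmp%d.*" % pid)
for path in sorted(glob.glob(pattern), key=os.path.getmtime):
    st = os.stat(path)
    print(int(st.st_mtime), st.st_size, path)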

Those files get cleaned up when you restart, though.