Обсуждение: BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349

Поиск
Список
Период
Сортировка

BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      17512
Logged by:          Aleš Zelený
Email address:      zeleny.ales@gmail.com
PostgreSQL version: 14.2
Operating system:   CentOS 7
Description:

Version information:
OS: CentOS Linux release 7.9.2009 (Core)
PostgresSQL (RPM installation): PostgreSQL 14.2 on x86_64-pc-linux-gnu,
compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit

PostgreSQL Log:
2022-06-06 21:23:21.147 UTC [1647] 629e7049.66f, 1, 0, 0, , [local],
[unknown], [unknown], [unknown], 00000,  LOG:  connection received:
host=[local]
2022-06-06 21:23:21.148 UTC [1647] 629e7049.66f, 2, 0, 0, 5/7, [local],
sdp_prod, zelenya, [unknown], 00000, authentication LOG:  connection
authenticated: identity="zelenya" method=peer (/pgsql/sdp/cluster/14
/pg_hba.conf:85)
2022-06-06 21:23:21.148 UTC [1647] 629e7049.66f, 3, 0, 0, 5/7, [local],
sdp_prod, zelenya, [unknown], 00000, authentication LOG:  connection
authorized: user=zelenya database=sdp_prod application_name=psql

2022-06-06 21:24:00.695 UTC [1124] 629e7004.464, 22, 0, 0, , , , , , 00000,
LOG:  server process (PID 1647) was terminated by signal 11: Segmentation
fault
2022-06-06 21:24:00.695 UTC [1124] 629e7004.464, 23, 0, 0, , , , , , 00000,
DETAIL:  Failed process was running: explain (analyze, buffers)

2022-06-06 21:24:00.695 UTC [1124] 629e7004.464, 24, 0, 0, , , , , , 00000,
LOG:  terminating any other active server processes
2022-06-06 21:24:00.806 UTC [1124] 629e7004.464, 25, 0, 0, , , , , , 00000,
LOG:  all server processes terminated; reinitializing
2022-06-06 21:24:01.912 UTC [1655] 629e7071.677, 1, 0, 0, , , , , , 00000,
LOG:  database system was interrupted; last known up at 2022-06-06 21:22:12
UTC
2022-06-06 21:24:02.003 UTC [1655] 629e7071.677, 2, 0, 0, , , , , , 42501,
LOG:  could not open directory "pg_tblspc/140428/lost+found": Permission
denied
2022-06-06 21:24:02.021 UTC [1655] 629e7071.677, 3, 0, 0, , , , , , 00000,
LOG:  database system was not properly shut down; automatic recovery in
progress
2022-06-06 21:24:02.028 UTC [1655] 629e7071.677, 4, 0, 0, , , , , , 00000,
LOG:  redo starts at 121C/190000A0
2022-06-06 21:24:02.028 UTC [1655] 629e7071.677, 5, 0, 0, , , , , , 00000,
LOG:  invalid record length at 121C/190000D8: wanted 24, got 0
2022-06-06 21:24:02.028 UTC [1655] 629e7071.677, 6, 0, 0, , , , , , 00000,
LOG:  redo done at 121C/190000A0 system usage: CPU: user: 0.00 s, system:
0.00 s, elapsed: 0.00 s
2022-06-06 21:24:02.039 UTC [1655] 629e7071.677, 7, 0, 0, , , , , , 00000,
LOG:  checkpoint starting: end-of-recovery immediate
2022-06-06 21:24:02.208 UTC [1655] 629e7071.677, 8, 0, 0, , , , , , 00000,
LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1
removed, 0 recycled; write=0.153 s, sync=0.001 s, total=0.1
72 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB,
estimate=0 kB
2022-06-06 21:24:02.227 UTC [1124] 629e7004.464, 26, 0, 0, , , , , , 00000,
LOG:  database system is ready to accept connections

The process crash with SIGSEV is reproducible on production, core dump was
enabled to get more details:


-bash-4.2$ gdb /usr/pgsql-14/bin/postgres
core.postmaster.19645.ip-172-31-1-64.us-east-2.compute.internal.1654522585
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-14/bin/postgres...Reading symbols from
/usr/lib/debug/usr/pgsql-14/bin/postgres.debug...done.
done.

warning: core file may not match specified executable file.
[New LWP 19645]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: sdp01: zelenya sdp_prod [local] EXPLAIN
'.
Program terminated with signal 11, Segmentation fault.
#0  remove_cache_entry (entry=<optimized out>, mstate=<optimized out>) at
nodeMemoize.c:349
349    nodeMemoize.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install
audit-libs-2.8.5-4.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64
cyrus-sasl-lib-2.1.26-24.el7_9.x86_64 elfutils-libelf-0.176-5.el7.x86_64
elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-325.el7_9.x86_64
keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-51.el7_9.x86_64
libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64
libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64
libedit-3.0-12.20121213cvs.el7.x86_64 libffi-3.0.13-19.el7.x86_64
libgcc-4.8.5-44.el7.x86_64 libgcrypt-1.5.3-14.el7.x86_64
libgpg-error-1.12-3.el7.x86_64 libicu-50.2-4.el7_7.x86_64
libselinux-2.5-15.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64
libxml2-2.9.1-6.el7_9.6.x86_64 llvm5.0-libs-5.0.1-7.el7.x86_64
lz4-1.8.3-1.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64
nspr-4.32.0-1.el7_9.x86_64 nss-3.67.0-4.el7_9.x86_64
nss-softokn-freebl-3.67.0-3.el7_9.x86_64 nss-util-3.67.0-1.el7_9.x86_64
openldap-2.4.44-25.el7_9.x86_64 openssl-libs-1.0.2k-24.el7_9.x86_64
pam-1.1.8-23.el7.x86_64 pcre-8.32-17.el7.x86_64
pg_qualstats_14-2.0.4-1.rhel7.x86_64 pg_stat_kcache_14-2.2.0-2.rhel7.x86_64
pg_wait_sampling_14-1.1.3-1.rhel7.x86_64 powa_14-4.1.4-1.rhel7.x86_64
systemd-libs-219-78.el7_9.5.x86_64 xz-libs-5.2.2-1.el7.x86_64
zlib-1.2.7-19.el7_9.x86_64

(gdb) bt
#0  remove_cache_entry (entry=<optimized out>, mstate=<optimized out>) at
nodeMemoize.c:349
#1  cache_reduce_memory (mstate=mstate@entry=0x2d0a2a0,
specialkey=specialkey@entry=0x122614e8) at nodeMemoize.c:468
#2  0x0000000000669cce in cache_store_tuple (mstate=mstate@entry=0x2d0a2a0,
slot=slot@entry=0x2d0a7e0) at nodeMemoize.c:635
#3  0x000000000066a0ab in ExecMemoize (pstate=0x2d0a2a0) at
nodeMemoize.c:850
#4  0x0000000000649b28 in ExecProcNodeInstr (node=0x2d0a2a0) at
execProcnode.c:479
#5  0x00000000006710f5 in ExecProcNode (node=0x2d0a2a0) at
../../../src/include/executor/executor.h:257
#6  ExecNestLoop (pstate=0x2c70a38) at nodeNestloop.c:160
#7  0x0000000000649b28 in ExecProcNodeInstr (node=0x2c70a38) at
execProcnode.c:479
#8  0x00000000006735a6 in ExecProcNode (node=0x2c70a38) at
../../../src/include/executor/executor.h:257
#9  ExecSort (pstate=0x2c70828) at nodeSort.c:108
#10 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c70828) at
execProcnode.c:479
#11 0x0000000000653b19 in ExecProcNode (node=0x2c70828) at
../../../src/include/executor/executor.h:257
#12 fetch_input_tuple (aggstate=aggstate@entry=0x2c701c0) at nodeAgg.c:581
#13 0x00000000006568f2 in agg_retrieve_direct (aggstate=0x2c701c0) at
nodeAgg.c:2364
#14 ExecAgg (pstate=0x2c701c0) at nodeAgg.c:2179
#15 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c701c0) at
execProcnode.c:479
#16 0x0000000000670fe9 in ExecProcNode (node=0x2c701c0) at
../../../src/include/executor/executor.h:257
#17 ExecNestLoop (pstate=0x2c70020) at nodeNestloop.c:109
#18 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c70020) at
execProcnode.c:479
#19 0x00000000006735a6 in ExecProcNode (node=0x2c70020) at
../../../src/include/executor/executor.h:257
#20 ExecSort (pstate=0x2c6fe10) at nodeSort.c:108
#21 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c6fe10) at
execProcnode.c:479
#22 0x0000000000653b19 in ExecProcNode (node=0x2c6fe10) at
../../../src/include/executor/executor.h:257
#23 fetch_input_tuple (aggstate=aggstate@entry=0x2c6f838) at nodeAgg.c:581
#24 0x00000000006568f2 in agg_retrieve_direct (aggstate=0x2c6f838) at
nodeAgg.c:2364
#25 ExecAgg (pstate=0x2c6f838) at nodeAgg.c:2179
#26 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c6f838) at
execProcnode.c:479
#27 0x00000000006683b3 in ExecProcNode (node=0x2c6f838) at
../../../src/include/executor/executor.h:257
#28 ExecLimit (pstate=0x2c6f548) at nodeLimit.c:96
#29 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c6f548) at
execProcnode.c:479
#30 0x0000000000643152 in ExecProcNode (node=0x2c6f548) at
../../../src/include/executor/executor.h:257
#31 ExecutePlan (execute_once=<optimized out>, dest=0xa40d00 <donothingDR>,
direction=<optimized out>, numberTuples=0, sendTuples=true,
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x2c6f548, estate=0x2c6f018) at execMain.c:1551
#32 standard_ExecutorRun (queryDesc=0x2ea0188, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:361
#33 0x00007f06ba3e461d in pgss_ExecutorRun (queryDesc=0x2ea0188,
direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at
pg_stat_statements.c:1003
#34 0x00007f06b9fd44c1 in pgsk_ExecutorRun () from
/usr/pgsql-14/lib/pg_stat_kcache.so
#35 0x00007f06b9dcbfd1 in pgqs_ExecutorRun () from
/usr/pgsql-14/lib/pg_qualstats.so
#36 0x00000000005d8842 in ExplainOnePlan
(plannedstmt=plannedstmt@entry=0x2e9cbe8, into=into@entry=0x0,
es=es@entry=0x342d228,
    queryString=queryString@entry=0x2c72f58 "explain (analyze, buffers)\n",
' ' <repeats 24 times>, "WITH daily_tech AS (\n", ' ' <repeats 28 times>,
"SELECT\n", ' ' <repeats 33 times>, "\n", ' ' <repeats 32 times>,
"MAX(dmd_current0.split_adj_"...,
    params=params@entry=0x0, queryEnv=<optimized out>,
planduration=planduration@entry=0x7ffc3d75ca60, bufusage=0x7ffc3d75cae0) at
explain.c:593
#37 0x00000000005d8bdd in ExplainOneQuery (query=<optimized out>,
cursorOptions=<optimized out>, into=0x0, es=0x342d228,
    queryString=0x2c72f58 "explain (analyze, buffers)\n", ' ' <repeats 24
times>, "WITH daily_tech AS (\n", ' ' <repeats 28 times>, "SELECT\n", ' '
<repeats 33 times>, "\n", ' ' <repeats 32 times>,
"MAX(dmd_current0.split_adj_"..., params=0x0, queryEnv=0x0)
    at explain.c:410
#38 0x00000000005d9285 in ExplainQuery (pstate=pstate@entry=0x2c77230,
stmt=stmt@entry=0x319b468, params=params@entry=0x0,
dest=dest@entry=0x2c771a0) at explain.c:281
#39 0x00000000007bc575 in standard_ProcessUtility
(pstmt=pstmt@entry=0x2e296b0,
    queryString=queryString@entry=0x2c72f58 "explain (analyze, buffers)\n",
' ' <repeats 24 times>, "WITH daily_tech AS (\n", ' ' <repeats 28 times>,
"SELECT\n", ' ' <repeats 33 times>, "\n", ' ' <repeats 32 times>,
"MAX(dmd_current0.split_adj_"...,
    readOnlyTree=<optimized out>,
context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0,
queryEnv=queryEnv@entry=0x0, dest=dest@entry=0x2c771a0,
qc=qc@entry=0x7ffc3d75cfb0) at utility.c:862
#40 0x00007f06ba3e60a0 in pgss_ProcessUtility (pstmt=0x2e296b0,
    queryString=0x2c72f58 "explain (analyze, buffers)\n", ' ' <repeats 24
times>, "WITH daily_tech AS (\n", ' ' <repeats 28 times>, "SELECT\n", ' '
<repeats 33 times>, "\n", ' ' <repeats 32 times>,
"MAX(dmd_current0.split_adj_"..., readOnlyTree=<optimized out>,
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x2c771a0, qc=0x7ffc3d75cfb0) at pg_stat_statements.c:1135
#41 0x00000000007ba7aa in PortalRunUtility (portal=portal@entry=0x2bef9d8,
pstmt=0x2e296b0, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=true, dest=dest@entry=0x2c771a0,
qc=qc@entry=0x7ffc3d75cfb0) at pquery.c:1155
#42 0x00000000007bab11 in FillPortalStore (portal=portal@entry=0x2bef9d8,
isTopLevel=isTopLevel@entry=true) at pquery.c:1028
#43 0x00000000007bae8b in PortalRun (portal=<optimized out>,
count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized
out>, dest=0x2e29790, altdest=0x2e29790, qc=0x7ffc3d75d190) at
pquery.c:760
#44 0x00000000007b6be7 in exec_simple_query (
    query_string=0x2c72f58 "explain (analyze, buffers)\n", ' ' <repeats 24
times>, "WITH daily_tech AS (\n", ' ' <repeats 28 times>, "SELECT\n", ' '
<repeats 33 times>, "\n", ' ' <repeats 32 times>,
"MAX(dmd_current0.split_adj_"...) at postgres.c:1214
#45 0x00000000007b8063 in PostgresMain (argc=argc@entry=1,
argv=argv@entry=0x7ffc3d75d560, dbname=0x2bb78e0 "sdp_prod",
username=<optimized out>) at postgres.c:4486
#46 0x000000000048d7ba in BackendRun (port=<optimized out>, port=<optimized
out>) at postmaster.c:4530
#47 BackendStartup (port=0x2ba6060) at postmaster.c:4252
#48 ServerLoop () at postmaster.c:1745
#49 0x0000000000735692 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x2b75280) at postmaster.c:1417
#50 0x000000000048e8d2 in main (argc=3, argv=0x2b75280) at main.c:209

A small isolated test case is not available so far, the issue occurs on
select involving 450GiB table. Currently, production DB is being restored to
a separated machine to allow future testing/investigation without affecting
other database users.

Thanks for any advice on working around this issue or how to collect better
diagnostic data / additional helpful information.

Kind regards Ales Zeleny


Re: BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349

От
David Rowley
Дата:
Thanks for reporting this.

On Tue, 7 Jun 2022 at 13:21, PG Bug reporting form
<noreply@postgresql.org> wrote:
> Program terminated with signal 11, Segmentation fault.
> #0  remove_cache_entry (entry=<optimized out>, mstate=<optimized out>) at
> nodeMemoize.c:349

The relevant line in 14.2 is:

MemoizeKey *key = entry->key;

So entry must be NULL here.

cache_reduce_memory() just removes cache entries starting at the head
of the LRU. Given a correctly behaving hash function and equality
function I can't quite see how we could have something in the LRU list
that's not also stored in the hash table.  The only two functions that
make changes to the hash table and LRU list are remove_cache_entry(),
cache_lookup() and cache_purge_all(). The latter of those 3 does not
really seem like a candidate for the hash table and list getting out
of sync given that it just creates an empty table and empty list.
That makes me suspect that either the hash function or equality
function for the data types in the cache key are misbehaving.

Can you show us the EXPLAIN output for the problem query? Or at the
very least, the relevant "Cache Key" lines.

And can you also show the psql \d output for the tables which are
mentioned in the cache key?

I'm currently thinking that the Assert(entry != NULL) in
cache_reduce_memory() should probably be a runtime check rather than
an Assert. But let's wait to see if we can confirm that something
weird is going on with the cache key data type.

David



Re: BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349

От
Aleš Zelený
Дата:
Hello,

The Cache Key plan part is:
                                       ->  Memoize  (cost=0.43..0.84 rows=20 width=8)
                                             Cache Key: os1.date, os1.instrument_id
                                             ->  Index Only Scan using outstanding_shares_instrument_id_date_key on outstanding_shares os2  (cost=0.42..0.83 rows=20 width=8)
                                                   Index Cond: ((instrument_id = os1.instrument_id) AND (date > os1.date))

And the columns are:
                                   Table "outstanding_shares"
    Column     |  Type   | Collation | Nullable |                         Default                        
---------------+---------+-----------+----------+---------------------------------------------------------
 instrument_id | integer |           |          |
 date          | date    |           | not null |

I have to check whether I can share full table structure, query and the query plan. In the meantime, the database was restored on a test machine, so I'll try to reproduce the issue.

Kind regards Ales

út 7. 6. 2022 v 5:17 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:
Thanks for reporting this.

On Tue, 7 Jun 2022 at 13:21, PG Bug reporting form
<noreply@postgresql.org> wrote:
> Program terminated with signal 11, Segmentation fault.
> #0  remove_cache_entry (entry=<optimized out>, mstate=<optimized out>) at
> nodeMemoize.c:349

The relevant line in 14.2 is:

MemoizeKey *key = entry->key;

So entry must be NULL here.

cache_reduce_memory() just removes cache entries starting at the head
of the LRU. Given a correctly behaving hash function and equality
function I can't quite see how we could have something in the LRU list
that's not also stored in the hash table.  The only two functions that
make changes to the hash table and LRU list are remove_cache_entry(),
cache_lookup() and cache_purge_all(). The latter of those 3 does not
really seem like a candidate for the hash table and list getting out
of sync given that it just creates an empty table and empty list.
That makes me suspect that either the hash function or equality
function for the data types in the cache key are misbehaving.

Can you show us the EXPLAIN output for the problem query? Or at the
very least, the relevant "Cache Key" lines.

And can you also show the psql \d output for the tables which are
mentioned in the cache key?

I'm currently thinking that the Assert(entry != NULL) in
cache_reduce_memory() should probably be a runtime check rather than
an Assert. But let's wait to see if we can confirm that something
weird is going on with the cache key data type.

David

Re: BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349

От
Aleš Zelený
Дата:
Hello,

I was able to reproduce the issue on the database restored to a test server.

So far, it looks that the query I've been using to prove the SIGSEV works fine when I disabled jit by setting

jit=off

The jit package is postgresql14-llvmjit-14.2-1PGDG.rhel7.x86_64

Having it reproducible on the test machine, I can test whatever you instruct me to do to help to identify the root cause.

Kind regards Ales Zeleny

út 7. 6. 2022 v 10:29 odesílatel Aleš Zelený <zeleny.ales@gmail.com> napsal:
Hello,

The Cache Key plan part is:
                                       ->  Memoize  (cost=0.43..0.84 rows=20 width=8)
                                             Cache Key: os1.date, os1.instrument_id
                                             ->  Index Only Scan using outstanding_shares_instrument_id_date_key on outstanding_shares os2  (cost=0.42..0.83 rows=20 width=8)
                                                   Index Cond: ((instrument_id = os1.instrument_id) AND (date > os1.date))

And the columns are:
                                   Table "outstanding_shares"
    Column     |  Type   | Collation | Nullable |                         Default                        
---------------+---------+-----------+----------+---------------------------------------------------------
 instrument_id | integer |           |          |
 date          | date    |           | not null |

I have to check whether I can share full table structure, query and the query plan. In the meantime, the database was restored on a test machine, so I'll try to reproduce the issue.

Kind regards Ales

út 7. 6. 2022 v 5:17 odesílatel David Rowley <dgrowleyml@gmail.com> napsal:
Thanks for reporting this.

On Tue, 7 Jun 2022 at 13:21, PG Bug reporting form
<noreply@postgresql.org> wrote:
> Program terminated with signal 11, Segmentation fault.
> #0  remove_cache_entry (entry=<optimized out>, mstate=<optimized out>) at
> nodeMemoize.c:349

The relevant line in 14.2 is:

MemoizeKey *key = entry->key;

So entry must be NULL here.

cache_reduce_memory() just removes cache entries starting at the head
of the LRU. Given a correctly behaving hash function and equality
function I can't quite see how we could have something in the LRU list
that's not also stored in the hash table.  The only two functions that
make changes to the hash table and LRU list are remove_cache_entry(),
cache_lookup() and cache_purge_all(). The latter of those 3 does not
really seem like a candidate for the hash table and list getting out
of sync given that it just creates an empty table and empty list.
That makes me suspect that either the hash function or equality
function for the data types in the cache key are misbehaving.

Can you show us the EXPLAIN output for the problem query? Or at the
very least, the relevant "Cache Key" lines.

And can you also show the psql \d output for the tables which are
mentioned in the cache key?

I'm currently thinking that the Assert(entry != NULL) in
cache_reduce_memory() should probably be a runtime check rather than
an Assert. But let's wait to see if we can confirm that something
weird is going on with the cache key data type.

David

Re: BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349

От
Tom Lane
Дата:
> So far, it looks that the query I've been using to prove the SIGSEV works
> fine when I disabled jit by setting
> jit=off

Oh!  So it's (probably) not a Memoize problem, it's an LLVM problem.

> The jit package is postgresql14-llvmjit-14.2-1PGDG.rhel7.x86_64

Hm.  What LLVM version have you got on that machine?  (And I wonder
whether it's the same as what Devrim built this PGDG package against.)

            regards, tom lane



Re: BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349

От
Aleš Zelený
Дата:
Hello,

here are the installed llvm packages:

llvm-3.4.2-9.el7.x86_64
llvm5.0-libs-5.0.1-7.el7.x86_64
llvm5.0-5.0.1-7.el7.x86_64
llvm-libs-3.4.2-9.el7.x86_64
postgresql14-llvmjit-14.2-1PGDG.rhel7.x86_64

Kind regards Ales Zeleny

út 7. 6. 2022 v 21:07 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
> So far, it looks that the query I've been using to prove the SIGSEV works
> fine when I disabled jit by setting
> jit=off

Oh!  So it's (probably) not a Memoize problem, it's an LLVM problem.

> The jit package is postgresql14-llvmjit-14.2-1PGDG.rhel7.x86_64

Hm.  What LLVM version have you got on that machine?  (And I wonder
whether it's the same as what Devrim built this PGDG package against.)

                        regards, tom lane

Re: BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349

От
David Rowley
Дата:
On Wed, 8 Jun 2022 at 07:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > So far, it looks that the query I've been using to prove the SIGSEV works
> > fine when I disabled jit by setting
> > jit=off
>
> Oh!  So it's (probably) not a Memoize problem, it's an LLVM problem.

I wonder if Memoize should do a better job of handling lookup
failures.  Before I knew the problem was with built-in datatypes, I
suspected this might have been caused by some extension with a broken
hash or equality function.

I decided to try writing a broken datatype by messing with the hash
function in [1] by patching it with the attached
bogus_fixeddecimal_hashfunc.diff then doing:

# create extension fixeddecimal;
# create table t1 (a fixeddecimal not null);
# insert into t1 select x%100 from generate_Series(1,10000)x;
# create index on t1(a);
# set work_mem=64;
# explain analyze select * from t1 inner join t1 t2 on t1.a=t2.a where t1.a<=2;
server closed the connection unexpectedly

If I patch it with the attached rebustify_memoize_code.patch, I get:

# explain analyze select * from t1 inner join t1 t2 on t1.a=t2.a where t1.a<=2;
ERROR:  unable to find memoization table entry

My thoughts are that we likely should make this code more robust,
despite the bug not being related to a broken data type. I'm only just
over the fence on that one though. Any thoughts?

David

[1] https://github.com/2ndQuadrant/fixeddecimal

Вложения

Re: BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349

От
Tom Lane
Дата:
David Rowley <dgrowleyml@gmail.com> writes:
> server closed the connection unexpectedly

Ugh.

> If I patch it with the attached rebustify_memoize_code.patch, I get:
> # explain analyze select * from t1 inner join t1 t2 on t1.a=t2.a where t1.a<=2;
> ERROR:  unable to find memoization table entry

> My thoughts are that we likely should make this code more robust,
> despite the bug not being related to a broken data type. I'm only just
> over the fence on that one though. Any thoughts?

Strong +1, but please s/unable to/could not/, per message style
guidelines.

            regards, tom lane



Re: BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349

От
David Rowley
Дата:
On Wed, 8 Jun 2022 at 11:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > My thoughts are that we likely should make this code more robust,
> > despite the bug not being related to a broken data type. I'm only just
> > over the fence on that one though. Any thoughts?
>
> Strong +1, but please s/unable to/could not/, per message style
> guidelines.

Thanks for the review.  I've now pushed the adjusted patch.

David



Re: BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349

От
Aleš Zelený
Дата:
Hi all,

is there something, on top of the llvm packages, I shall test or provide to help the root cause be analyzed?
If I understand the patch properly, I'll avoid SIGSEV, but still the query with jit=on will fail, except (which is a significant improvement!) the server will not be restarted.

Thanks Ales

út 7. 6. 2022 v 22:26 odesílatel Aleš Zelený <zeleny.ales@gmail.com> napsal:
Hello,

here are the installed llvm packages:

llvm-3.4.2-9.el7.x86_64
llvm5.0-libs-5.0.1-7.el7.x86_64
llvm5.0-5.0.1-7.el7.x86_64
llvm-libs-3.4.2-9.el7.x86_64
postgresql14-llvmjit-14.2-1PGDG.rhel7.x86_64

Kind regards Ales Zeleny

út 7. 6. 2022 v 21:07 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
> So far, it looks that the query I've been using to prove the SIGSEV works
> fine when I disabled jit by setting
> jit=off

Oh!  So it's (probably) not a Memoize problem, it's an LLVM problem.

> The jit package is postgresql14-llvmjit-14.2-1PGDG.rhel7.x86_64

Hm.  What LLVM version have you got on that machine?  (And I wonder
whether it's the same as what Devrim built this PGDG package against.)

                        regards, tom lane