Thread: BUG #17512: Process running query fails with SIGSEV - nodeMemoize.c:349
From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17512
Logged by:          Aleš Zelený
Email address:      zeleny.ales@gmail.com
PostgreSQL version: 14.2
Operating system:   CentOS 7
Description:

Version information:
OS: CentOS Linux release 7.9.2009 (Core)
PostgreSQL (RPM installation): PostgreSQL 14.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit

PostgreSQL Log:
2022-06-06 21:23:21.147 UTC [1647] 629e7049.66f, 1, 0, 0, , [local], [unknown], [unknown], [unknown], 00000, LOG: connection received: host=[local]
2022-06-06 21:23:21.148 UTC [1647] 629e7049.66f, 2, 0, 0, 5/7, [local], sdp_prod, zelenya, [unknown], 00000, authentication LOG: connection authenticated: identity="zelenya" method=peer (/pgsql/sdp/cluster/14/pg_hba.conf:85)
2022-06-06 21:23:21.148 UTC [1647] 629e7049.66f, 3, 0, 0, 5/7, [local], sdp_prod, zelenya, [unknown], 00000, authentication LOG: connection authorized: user=zelenya database=sdp_prod application_name=psql
2022-06-06 21:24:00.695 UTC [1124] 629e7004.464, 22, 0, 0, , , , , , 00000, LOG: server process (PID 1647) was terminated by signal 11: Segmentation fault
2022-06-06 21:24:00.695 UTC [1124] 629e7004.464, 23, 0, 0, , , , , , 00000, DETAIL: Failed process was running: explain (analyze, buffers)
2022-06-06 21:24:00.695 UTC [1124] 629e7004.464, 24, 0, 0, , , , , , 00000, LOG: terminating any other active server processes
2022-06-06 21:24:00.806 UTC [1124] 629e7004.464, 25, 0, 0, , , , , , 00000, LOG: all server processes terminated; reinitializing
2022-06-06 21:24:01.912 UTC [1655] 629e7071.677, 1, 0, 0, , , , , , 00000, LOG: database system was interrupted; last known up at 2022-06-06 21:22:12 UTC
2022-06-06 21:24:02.003 UTC [1655] 629e7071.677, 2, 0, 0, , , , , , 42501, LOG: could not open directory "pg_tblspc/140428/lost+found": Permission denied
2022-06-06 21:24:02.021 UTC [1655] 629e7071.677, 3, 0, 0, , , , , , 00000, LOG: database system was not properly shut down; automatic recovery in progress
2022-06-06 21:24:02.028 UTC [1655] 629e7071.677, 4, 0, 0, , , , , , 00000, LOG: redo starts at 121C/190000A0
2022-06-06 21:24:02.028 UTC [1655] 629e7071.677, 5, 0, 0, , , , , , 00000, LOG: invalid record length at 121C/190000D8: wanted 24, got 0
2022-06-06 21:24:02.028 UTC [1655] 629e7071.677, 6, 0, 0, , , , , , 00000, LOG: redo done at 121C/190000A0 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2022-06-06 21:24:02.039 UTC [1655] 629e7071.677, 7, 0, 0, , , , , , 00000, LOG: checkpoint starting: end-of-recovery immediate
2022-06-06 21:24:02.208 UTC [1655] 629e7071.677, 8, 0, 0, , , , , , 00000, LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.153 s, sync=0.001 s, total=0.172 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
2022-06-06 21:24:02.227 UTC [1124] 629e7004.464, 26, 0, 0, , , , , , 00000, LOG: database system is ready to accept connections

The process crash with SIGSEGV is reproducible on production; a core dump was enabled to get more details:

-bash-4.2$ gdb /usr/pgsql-14/bin/postgres core.postmaster.19645.ip-172-31-1-64.us-east-2.compute.internal.1654522585
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/pgsql-14/bin/postgres...Reading symbols from /usr/lib/debug/usr/pgsql-14/bin/postgres.debug...done.
done.
warning: core file may not match specified executable file.
[New LWP 19645]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: sdp01: zelenya sdp_prod [local] EXPLAIN '.
Program terminated with signal 11, Segmentation fault.
#0 remove_cache_entry (entry=<optimized out>, mstate=<optimized out>) at nodeMemoize.c:349
349 nodeMemoize.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install audit-libs-2.8.5-4.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-lib-2.1.26-24.el7_9.x86_64 elfutils-libelf-0.176-5.el7.x86_64 elfutils-libs-0.176-5.el7.x86_64 glibc-2.17-325.el7_9.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-51.el7_9.x86_64 libattr-2.4.46-13.el7.x86_64 libcap-2.22-11.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 libcom_err-1.42.9-19.el7.x86_64 libedit-3.0-12.20121213cvs.el7.x86_64 libffi-3.0.13-19.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libgcrypt-1.5.3-14.el7.x86_64 libgpg-error-1.12-3.el7.x86_64 libicu-50.2-4.el7_7.x86_64 libselinux-2.5-15.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 libxml2-2.9.1-6.el7_9.6.x86_64 llvm5.0-libs-5.0.1-7.el7.x86_64 lz4-1.8.3-1.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 nspr-4.32.0-1.el7_9.x86_64 nss-3.67.0-4.el7_9.x86_64 nss-softokn-freebl-3.67.0-3.el7_9.x86_64 nss-util-3.67.0-1.el7_9.x86_64 openldap-2.4.44-25.el7_9.x86_64 openssl-libs-1.0.2k-24.el7_9.x86_64 pam-1.1.8-23.el7.x86_64 pcre-8.32-17.el7.x86_64 pg_qualstats_14-2.0.4-1.rhel7.x86_64 pg_stat_kcache_14-2.2.0-2.rhel7.x86_64 pg_wait_sampling_14-1.1.3-1.rhel7.x86_64 powa_14-4.1.4-1.rhel7.x86_64 systemd-libs-219-78.el7_9.5.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0 remove_cache_entry (entry=<optimized out>, mstate=<optimized out>) at nodeMemoize.c:349
#1 cache_reduce_memory (mstate=mstate@entry=0x2d0a2a0, specialkey=specialkey@entry=0x122614e8) at nodeMemoize.c:468
#2 0x0000000000669cce in cache_store_tuple (mstate=mstate@entry=0x2d0a2a0, slot=slot@entry=0x2d0a7e0) at nodeMemoize.c:635
#3 0x000000000066a0ab in ExecMemoize (pstate=0x2d0a2a0) at nodeMemoize.c:850
#4 0x0000000000649b28 in ExecProcNodeInstr (node=0x2d0a2a0) at execProcnode.c:479
#5 0x00000000006710f5 in ExecProcNode (node=0x2d0a2a0) at ../../../src/include/executor/executor.h:257
#6 ExecNestLoop (pstate=0x2c70a38) at nodeNestloop.c:160
#7 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c70a38) at execProcnode.c:479
#8 0x00000000006735a6 in ExecProcNode (node=0x2c70a38) at ../../../src/include/executor/executor.h:257
#9 ExecSort (pstate=0x2c70828) at nodeSort.c:108
#10 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c70828) at execProcnode.c:479
#11 0x0000000000653b19 in ExecProcNode (node=0x2c70828) at ../../../src/include/executor/executor.h:257
#12 fetch_input_tuple (aggstate=aggstate@entry=0x2c701c0) at nodeAgg.c:581
#13 0x00000000006568f2 in agg_retrieve_direct (aggstate=0x2c701c0) at nodeAgg.c:2364
#14 ExecAgg (pstate=0x2c701c0) at nodeAgg.c:2179
#15 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c701c0) at execProcnode.c:479
#16 0x0000000000670fe9 in ExecProcNode (node=0x2c701c0) at ../../../src/include/executor/executor.h:257
#17 ExecNestLoop (pstate=0x2c70020) at nodeNestloop.c:109
#18 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c70020) at execProcnode.c:479
#19 0x00000000006735a6 in ExecProcNode (node=0x2c70020) at ../../../src/include/executor/executor.h:257
#20 ExecSort (pstate=0x2c6fe10) at nodeSort.c:108
#21 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c6fe10) at execProcnode.c:479
#22 0x0000000000653b19 in ExecProcNode (node=0x2c6fe10) at ../../../src/include/executor/executor.h:257
#23 fetch_input_tuple (aggstate=aggstate@entry=0x2c6f838) at nodeAgg.c:581
#24 0x00000000006568f2 in agg_retrieve_direct (aggstate=0x2c6f838) at nodeAgg.c:2364
#25 ExecAgg (pstate=0x2c6f838) at nodeAgg.c:2179
#26 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c6f838) at execProcnode.c:479
#27 0x00000000006683b3 in ExecProcNode (node=0x2c6f838) at ../../../src/include/executor/executor.h:257
#28 ExecLimit (pstate=0x2c6f548) at nodeLimit.c:96
#29 0x0000000000649b28 in ExecProcNodeInstr (node=0x2c6f548) at execProcnode.c:479
#30 0x0000000000643152 in ExecProcNode (node=0x2c6f548) at ../../../src/include/executor/executor.h:257
#31 ExecutePlan (execute_once=<optimized out>, dest=0xa40d00 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=true, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2c6f548, estate=0x2c6f018) at execMain.c:1551
#32 standard_ExecutorRun (queryDesc=0x2ea0188, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:361
#33 0x00007f06ba3e461d in pgss_ExecutorRun (queryDesc=0x2ea0188, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at pg_stat_statements.c:1003
#34 0x00007f06b9fd44c1 in pgsk_ExecutorRun () from /usr/pgsql-14/lib/pg_stat_kcache.so
#35 0x00007f06b9dcbfd1 in pgqs_ExecutorRun () from /usr/pgsql-14/lib/pg_qualstats.so
#36 0x00000000005d8842 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x2e9cbe8, into=into@entry=0x0, es=es@entry=0x342d228, queryString=queryString@entry=0x2c72f58 "explain (analyze, buffers)\n", ' ' <repeats 24 times>, "WITH daily_tech AS (\n", ' ' <repeats 28 times>, "SELECT\n", ' ' <repeats 33 times>, "\n", ' ' <repeats 32 times>, "MAX(dmd_current0.split_adj_"..., params=params@entry=0x0, queryEnv=<optimized out>, planduration=planduration@entry=0x7ffc3d75ca60, bufusage=0x7ffc3d75cae0) at explain.c:593
#37 0x00000000005d8bdd in ExplainOneQuery (query=<optimized out>, cursorOptions=<optimized out>, into=0x0, es=0x342d228, queryString=0x2c72f58 "explain (analyze, buffers)\n", ' ' <repeats 24 times>, "WITH daily_tech AS (\n", ' ' <repeats 28 times>, "SELECT\n", ' ' <repeats 33 times>, "\n", ' ' <repeats 32 times>, "MAX(dmd_current0.split_adj_"..., params=0x0, queryEnv=0x0) at explain.c:410
#38 0x00000000005d9285 in ExplainQuery (pstate=pstate@entry=0x2c77230, stmt=stmt@entry=0x319b468, params=params@entry=0x0, dest=dest@entry=0x2c771a0) at explain.c:281
#39 0x00000000007bc575 in standard_ProcessUtility (pstmt=pstmt@entry=0x2e296b0, queryString=queryString@entry=0x2c72f58 "explain (analyze, buffers)\n", ' ' <repeats 24 times>, "WITH daily_tech AS (\n", ' ' <repeats 28 times>, "SELECT\n", ' ' <repeats 33 times>, "\n", ' ' <repeats 32 times>, "MAX(dmd_current0.split_adj_"..., readOnlyTree=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0x2c771a0, qc=qc@entry=0x7ffc3d75cfb0) at utility.c:862
#40 0x00007f06ba3e60a0 in pgss_ProcessUtility (pstmt=0x2e296b0, queryString=0x2c72f58 "explain (analyze, buffers)\n", ' ' <repeats 24 times>, "WITH daily_tech AS (\n", ' ' <repeats 28 times>, "SELECT\n", ' ' <repeats 33 times>, "\n", ' ' <repeats 32 times>, "MAX(dmd_current0.split_adj_"..., readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2c771a0, qc=0x7ffc3d75cfb0) at pg_stat_statements.c:1135
#41 0x00000000007ba7aa in PortalRunUtility (portal=portal@entry=0x2bef9d8, pstmt=0x2e296b0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=true, dest=dest@entry=0x2c771a0, qc=qc@entry=0x7ffc3d75cfb0) at pquery.c:1155
#42 0x00000000007bab11 in FillPortalStore (portal=portal@entry=0x2bef9d8, isTopLevel=isTopLevel@entry=true) at pquery.c:1028
#43 0x00000000007bae8b in PortalRun (portal=<optimized out>, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x2e29790, altdest=0x2e29790, qc=0x7ffc3d75d190) at pquery.c:760
#44 0x00000000007b6be7 in exec_simple_query (query_string=0x2c72f58 "explain (analyze, buffers)\n", ' ' <repeats 24 times>, "WITH daily_tech AS (\n", ' ' <repeats 28 times>, "SELECT\n", ' ' <repeats 33 times>, "\n", ' ' <repeats 32 times>, "MAX(dmd_current0.split_adj_"...) at postgres.c:1214
#45 0x00000000007b8063 in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffc3d75d560, dbname=0x2bb78e0 "sdp_prod", username=<optimized out>) at postgres.c:4486
#46 0x000000000048d7ba in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4530
#47 BackendStartup (port=0x2ba6060) at postmaster.c:4252
#48 ServerLoop () at postmaster.c:1745
#49 0x0000000000735692 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x2b75280) at postmaster.c:1417
#50 0x000000000048e8d2 in main (argc=3, argv=0x2b75280) at main.c:209

A small isolated test case is not available so far; the issue occurs on a SELECT involving a 450 GiB table. Currently, the production DB is being restored to a separate machine to allow further testing and investigation without affecting other database users.

Thanks for any advice on working around this issue or on how to collect better diagnostic data or other helpful information.

Kind regards
Ales Zeleny
Thanks for reporting this.

On Tue, 7 Jun 2022 at 13:21, PG Bug reporting form
<noreply@postgresql.org> wrote:
> Program terminated with signal 11, Segmentation fault.
> #0 remove_cache_entry (entry=<optimized out>, mstate=<optimized out>) at
> nodeMemoize.c:349

The relevant line in 14.2 is:

MemoizeKey *key = entry->key;

So entry must be NULL here.

cache_reduce_memory() just removes cache entries starting at the head
of the LRU. Given a correctly behaving hash function and equality
function I can't quite see how we could have something in the LRU list
that's not also stored in the hash table. The only three functions that
make changes to the hash table and LRU list are remove_cache_entry(),
cache_lookup() and cache_purge_all(). The last of those three does not
really seem like a candidate for the hash table and list getting out
of sync, given that it just creates an empty table and empty list.

That makes me suspect that either the hash function or the equality
function for the data types in the cache key is misbehaving.

Can you show us the EXPLAIN output for the problem query? Or at the
very least, the relevant "Cache Key" lines.

And can you also show the psql \d output for the tables which are
mentioned in the cache key?

I'm currently thinking that the Assert(entry != NULL) in
cache_reduce_memory() should probably be a runtime check rather than
an Assert. But let's wait to see if we can confirm that something
weird is going on with the cache key data type.

David
Hello,
The Cache Key plan part is:
->  Memoize  (cost=0.43..0.84 rows=20 width=8)
      Cache Key: os1.date, os1.instrument_id
      ->  Index Only Scan using outstanding_shares_instrument_id_date_key on outstanding_shares os2  (cost=0.42..0.83 rows=20 width=8)
            Index Cond: ((instrument_id = os1.instrument_id) AND (date > os1.date))
And the columns are:
Table "outstanding_shares"
    Column     |  Type   | Collation | Nullable | Default
---------------+---------+-----------+----------+---------------------------------------------------------
 instrument_id | integer |           |          |
 date          | date    |           | not null |
I have to check whether I can share the full table structure, the query, and the query plan. In the meantime, the database has been restored on a test machine, so I'll try to reproduce the issue there.
Kind regards Ales
Hello,
I was able to reproduce the issue on the database restored to a test server.
So far, it looks like the query I've been using to trigger the SIGSEGV works fine when I disable JIT by setting
jit=off
The jit package is postgresql14-llvmjit-14.2-1PGDG.rhel7.x86_64
Now that it is reproducible on the test machine, I can test whatever you suggest to help identify the root cause.
Kind regards Ales Zeleny
> So far, it looks that the query I've been using to prove the SIGSEV works
> fine when I disabled jit by setting
> jit=off

Oh! So it's (probably) not a Memoize problem, it's an LLVM problem.

> The jit package is postgresql14-llvmjit-14.2-1PGDG.rhel7.x86_64

Hm. What LLVM version have you got on that machine? (And I wonder
whether it's the same as what Devrim built this PGDG package against.)

regards, tom lane
Hello,
Here are the installed LLVM packages:
llvm-3.4.2-9.el7.x86_64
llvm5.0-libs-5.0.1-7.el7.x86_64
llvm5.0-5.0.1-7.el7.x86_64
llvm-libs-3.4.2-9.el7.x86_64
postgresql14-llvmjit-14.2-1PGDG.rhel7.x86_64
Kind regards Ales Zeleny
On Wed, 8 Jun 2022 at 07:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > So far, it looks that the query I've been using to prove the SIGSEV works
> > fine when I disabled jit by setting
> > jit=off
>
> Oh! So it's (probably) not a Memoize problem, it's an LLVM problem.

I wonder if Memoize should do a better job of handling lookup failures.

Before I knew the problem was with built-in datatypes, I suspected this
might have been caused by some extension with a broken hash or equality
function. I decided to try writing a broken datatype by messing with the
hash function in [1] by patching it with the attached
bogus_fixeddecimal_hashfunc.diff then doing:

# create extension fixeddecimal;
# create table t1 (a fixeddecimal not null);
# insert into t1 select x%100 from generate_Series(1,10000)x;
# create index on t1(a);
# set work_mem=64;
# explain analyze select * from t1 inner join t1 t2 on t1.a=t2.a where t1.a<=2;
server closed the connection unexpectedly

If I patch it with the attached rebustify_memoize_code.patch, I get:

# explain analyze select * from t1 inner join t1 t2 on t1.a=t2.a where t1.a<=2;
ERROR: unable to find memoization table entry

My thoughts are that we likely should make this code more robust,
despite the bug not being related to a broken data type. I'm only just
over the fence on that one though. Any thoughts?

David

[1] https://github.com/2ndQuadrant/fixeddecimal
David Rowley <dgrowleyml@gmail.com> writes:
> server closed the connection unexpectedly

Ugh.

> If I patch it with the attached rebustify_memoize_code.patch, I get:
> # explain analyze select * from t1 inner join t1 t2 on t1.a=t2.a where t1.a<=2;
> ERROR: unable to find memoization table entry

> My thoughts are that we likely should make this code more robust,
> despite the bug not being related to a broken data type. I'm only just
> over the fence on that one though. Any thoughts?

Strong +1, but please s/unable to/could not/, per message style
guidelines.

regards, tom lane
On Wed, 8 Jun 2022 at 11:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > My thoughts are that we likely should make this code more robust,
> > despite the bug not being related to a broken data type. I'm only just
> > over the fence on that one though. Any thoughts?
>
> Strong +1, but please s/unable to/could not/, per message style
> guidelines.

Thanks for the review. I've now pushed the adjusted patch.

David
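For readers following the thread, the failure mode and the kind of runtime check discussed above can be modelled outside PostgreSQL. The following is only a toy sketch, not the pushed patch: every name in it (Cache, Entry, cache_insert, cache_find, cache_evict_oldest, bogus_hash, demo) is hypothetical and does not correspond to the structures in nodeMemoize.c. It assumes a one-slot-per-bucket hash table plus an array acting as the LRU list, and a hash function that can be swapped for a deliberately inconsistent one, loosely in the spirit of the bogus fixeddecimal hash experiment. The only detail taken from the thread is the wording of the error message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NBUCKETS 8
#define MAXKEYS  8

/* Hypothetical stand-ins for illustration; not PostgreSQL's structures. */
typedef struct Entry
{
    int  key;
    bool used;
} Entry;

typedef struct Cache
{
    Entry    buckets[NBUCKETS];  /* toy hash table: one slot per bucket */
    int      lru[MAXKEYS];       /* keys, least recently used first */
    int      nlru;
    unsigned (*hash) (int key);  /* pluggable so the demo can break it */
} Cache;

static unsigned good_hash(int key)
{
    return (unsigned) key;
}

/* Deliberately broken: the same key hashes differently on every call. */
static unsigned bogus_hash(int key)
{
    static unsigned calls = 0;

    return (unsigned) key + calls++;
}

static void cache_insert(Cache *c, int key)
{
    Entry *e = &c->buckets[c->hash(key) % NBUCKETS];

    assert(c->nlru < MAXKEYS);   /* the toy cache has a fixed capacity */
    e->key = key;
    e->used = true;
    c->lru[c->nlru++] = key;     /* newest key goes to the tail */
}

static Entry *cache_find(Cache *c, int key)
{
    Entry *e = &c->buckets[c->hash(key) % NBUCKETS];

    return (e->used && e->key == key) ? e : NULL;
}

/* Evict the LRU head; returns 0 on success, -1 if the table lookup fails. */
static int cache_evict_oldest(Cache *c)
{
    Entry *e;

    if (c->nlru == 0)
        return 0;
    e = cache_find(c, c->lru[0]);
    if (e == NULL)
    {
        /* Runtime check: report the inconsistency, never dereference NULL. */
        fprintf(stderr, "could not find memoization table entry\n");
        return -1;
    }
    e->used = false;
    for (int i = 1; i < c->nlru; i++)
        c->lru[i - 1] = c->lru[i];
    c->nlru--;
    return 0;
}

/* Insert two keys, then evict once, with a sane or a broken hash. */
int demo(bool broken_hash)
{
    Cache c;

    memset(&c, 0, sizeof(c));
    c.hash = broken_hash ? bogus_hash : good_hash;
    cache_insert(&c, 1);
    cache_insert(&c, 2);
    return cache_evict_oldest(&c);
}
```

In this model, calling demo(false) evicts key 1 normally, while demo(true) takes the error path because the LRU list names a key that the table lookup can no longer find. A check that is compiled out of production builds, as an Assert would be, leaves that second case to dereference a NULL pointer, which is the crash signature in the original backtrace.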
Hi all,
Is there anything, beyond the list of LLVM packages, that I should test or provide to help analyze the root cause?
If I understand the patch correctly, it avoids the SIGSEGV, but the query will still fail with jit=on; the difference (a significant improvement!) is that the server will no longer be restarted.
Thanks Ales