Thread: terminate called after throwing an instance of 'std::bad_alloc'

terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
A VM crashed which is now running PG13.0 on centos7:

Sep 30 19:40:08 database7 abrt-hook-ccpp: Process 17905 (postgres) of user 26 killed by SIGABRT - dumping core
Core was generated by `postgres: telsasoft ts 192.168.122.11(34608) SELECT    '.

Unfortunately, the filesystem wasn't large enough and the corefile is
truncated.

The first badness in our logfiles looks like this; this is the very head of
the logfile:

|[pryzbyj@database7 ~]$ sudo gzip -dc /var/log/postgresql/crash-postgresql-2020-09-30_194000.log.gz |head
|[sudo] password for pryzbyj: 
|terminate called after throwing an instance of 'std::bad_alloc'
|  what():  std::bad_alloc
|< 2020-09-30 19:40:09.653 ADT  >LOG:  checkpoint starting: time
|< 2020-09-30 19:40:17.002 ADT  >LOG:  checkpoint complete: wrote 74 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=7.331 s, sync=0.006 s, total=7.348 s; sync files=25, longest=0.002 s, average=0.000 s; distance=295 kB, estimate=4183 kB
|< 2020-09-30 19:40:22.642 ADT  >LOG:  server process (PID 17905) was terminated by signal 6: Aborted
|< 2020-09-30 19:40:22.642 ADT  >DETAIL:  Failed process was running: --BEGIN SQL
|        SELECT * FROM

I was able to grep the filesystem to find what looks like the preceding logfile
(which our script had immediately rotated in its attempt to be helpful).

|< 2020-09-30 19:39:09.096 ADT  >LOG:  checkpoint starting: time
|< 2020-09-30 19:39:12.640 ADT  >LOG:  checkpoint complete: wrote 35 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=3.523 s, sync=0.006 s, total=3.544 s; sync files=14, longest=0.002 s, average=0.000 s; distance=103 kB, estimate=4615 kB

This seemed familiar, and I found it had happened 18 months ago on a different
server:

|terminate called after throwing an instance of 'std::bad_alloc'
|  what():  std::bad_alloc
|< 2019-02-01 15:36:54.434 CST  >LOG:  server process (PID 13557) was terminated by signal 6: Aborted
|< 2019-02-01 15:36:54.434 CST  >DETAIL:  Failed process was running:
|            SELECT *, '' as n
|            FROM a, b WHERE
|            a.id = b.id AND
|            b.c IS NOT NULL
|            ORDER BY b.c LIMIT 9

I have a log of pg_settings, which shows that this server ran v11.1 until
2019-02-14, when it was upgraded to v11.2.  Since v11.2 included a fix for a
bug I reported involving JIT and wide tables, I probably saw this crash and
dismissed it, even though the tables named "a" and "b" here have only ~30
columns combined.  The query that crashed in 2019 is actually processing a
small queue, and runs every 5 seconds.

The query that crashed today is a "day"-level query which runs every 15
minutes, so it ran 70-some times today with no issue.

Our DBs use postgis, and today's crash JOINs to the table with geometry
columns, but does not use them at all.

But the 2019 query doesn't even include the geometry table.  I'm not sure if
these are even the same crash, but if they are, I think it's maybe a JIT issue
and not postgis (??)

I've had JIT disabled since 2019, due to no performance benefit for us, but
I've been re-enabling it during upgrades and transitions, and instead disabling
jit_tuple_deforming (since this performs badly for columns with high attnum).
So maybe this will recur before too long.



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> A VM crashed which is now running PG13.0 on centos7:
> Sep 30 19:40:08 database7 abrt-hook-ccpp: Process 17905 (postgres) of user 26 killed by SIGABRT - dumping core
> Core was generated by `postgres: telsasoft ts 192.168.122.11(34608) SELECT    '.

> Unfortunately, the filesystem wasn't large enough and the corefile is
> truncated.

Hm, can you fix that?  A stack trace would be mighty handy here.

            regards, tom lane



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Andres Freund
Date:
Hi,

On 2020-09-30 21:16:10 -0500, Justin Pryzby wrote:
> Sep 30 19:40:08 database7 abrt-hook-ccpp: Process 17905 (postgres) of user 26 killed by SIGABRT - dumping core
> Core was generated by `postgres: telsasoft ts 192.168.122.11(34608) SELECT    '.
> 
> Unfortunately, the filesystem wasn't large enough and the corefile is
> truncated.

Too bad.


> |[pryzbyj@database7 ~]$ sudo gzip -dc /var/log/postgresql/crash-postgresql-2020-09-30_194000.log.gz |head
> |[sudo] password for pryzbyj: 
> |terminate called after throwing an instance of 'std::bad_alloc'
> |  what():  std::bad_alloc

I think this shouldn't happen while we're inside LLVM. To detect /
handle allocation failures in the middle of LLVM code in a halfway
sensible way, we install an allocation error handler just while in LLVM
code (c.f. llvm_enter_fatal_on_oom/llvm_leave_fatal_on_oom).
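
For illustration, a minimal standalone sketch of that enter/leave pattern,
using standard C++'s std::set_new_handler as a stand-in for LLVM's
error-handler hooks; the guard class and function names are made up for the
example, not the actual llvmjit_error.cpp code:

#include <new>
#include <cstdio>
#include <cstdlib>

/* Hypothetical stand-in for the FATAL-on-OOM handler: report and bail out
 * instead of letting an exception propagate out of library code. */
[[noreturn]] static void fatal_on_oom()
{
    std::fprintf(stderr, "FATAL: out of memory while JITing\n");
    std::abort();               /* PostgreSQL would ereport(FATAL, ...) */
}

/* RAII guard mirroring llvm_enter_fatal_on_oom()/llvm_leave_fatal_on_oom():
 * the handler is active only for the scope wrapping the LLVM calls. */
struct ScopedFatalOnOOM
{
    std::new_handler saved;
    ScopedFatalOnOOM() : saved(std::set_new_handler(fatal_on_oom)) {}
    ~ScopedFatalOnOOM() { std::set_new_handler(saved); }
};

void jit_compile_expression()
{
    ScopedFatalOnOOM guard;     /* enter: allocation failure is now fatal */
    /* ... calls into LLVM would go here ... */
}                               /* leave: previous handler restored */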

It's possible of course that the JIT somehow corrupted data that is passed
somewhere to postgis, which then caused postgis to run out of
memory. But it seems like that'd need a fair bit of bad luck to not
cause another error before that.


> But the 2019 query doesn't even include the geometry table.  I'm not sure if
> these are even the same crash, but if they are, I think it's maybe a JIT issue
> and not postgis (??)

It'd be good to get a backtrace - but this doesn't immediately scream
out loud "I am a JIT bug" to me.

Greetings,

Andres Freund



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
On Wed, Sep 30, 2020 at 09:16:09PM -0500, Justin Pryzby wrote:
> Our DBs use postgis, and today's crash JOINs to the table with geometry
> columns, but does not use them at all.

I'm wrong here - it does return the geometry columns.

We didn't use JIT with v12, but I've been setting jit=on along with v13
upgrades, so I was trying to convince myself that there's a change in v13, and
not just our config..

I've seen repeated OOMs on a small number of reports.  It looks like JIT
optimization/inlining used a good deal of RAM in v12, and even more on v13.
Maybe that's what's expected; it looks like this scales with the number of
partitions and number of functions, so it could be arbitrarily large, and not
respectful of work_mem (below, at the 4MB default).

 CREATE TABLE t(a int) PARTITION BY RANGE(a);
 \pset pager off
 SELECT format('CREATE TABLE t%s PARTITION OF t FOR VALUES FROM (%s) TO (%s) WITH (parallel_workers=2)', i, 100*i, 100*i+100) FROM generate_series(0,666) i;
 \gexec
 INSERT INTO t SELECT i%10000 FROM generate_series(1,99999) i;
 CREATE INDEX ON t(left(a::text,1));
 VACUUM ANALYZE t;
 \o /dev/null \\ SET client_min_messages=debug; SET log_executor_stats=on; SET jit=on; SET jit_expressions=on; SET jit_above_cost=0; SET jit_optimize_above_cost=1; SET jit_inline_above_cost=1; explain(analyze) SELECT * FROM t WHERE substr(a::text,length(a::text)-6,6)='1' OR substr(a::text,length(a::text)-6,6)='2' OR substr(a::text,length(a::text)-6,6)='3' OR substr(a::text,length(a::text)-6,6)='4' OR substr(a::text,length(a::text)-6,6)='5' OR substr(a::text,length(a::text)-6,6)='6';

In v13:
With optimize_above=-1, inline_above=-1: 333472 kB max resident size
With optimize_above=1, inline_above=1: 654036 kB max resident size

In v12:
With optimize_above=-1, inline_above=-1: 330988 kB max resident size
With optimize_above=1, inline_above=1: 556844 kB max resident size

These tests are on the same centos7 server, which has:
llvm5.0-libs-5.0.1-7.el7.x86_64

This appears to be similar under centos8 with LLVM9.

The OOM itself could be due to our JIT settings and maybe not a bug.  My
thinking is that when we hit OOM, that sometimes manifests itself as the C++
exception, which somehow trickles up into a SIGABRT in postgres.

I don't know if there's a way to capture useful debugging info on OOM, but the
live processes do seem to be leaking memory, or otherwise not reusing it.

For example, look at this huge backend, which is connected to the process
that has been experiencing OOM.  The backend is 4.2GB while idle, so I'd have
hoped for it to have released everything it could.  I'm aware that free()
usually cannot return small allocations to the kernel, but that should not
lead to continued growth over time.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
18000 postgres  20   0 4975152   4.2g   4684 S   0.0 55.6  81:29.87 postgres: main main 10.124.252.14(58296) idle

main=# SELECT * FROM pg_stat_activity WHERE pid=18000;
datid            | 151510
datname          | main
pid              | 18000
leader_pid       | 
usesysid         | 16714
usename          | main
application_name | run_thresholds --daemon
client_addr      | 10.124.252.14
client_hostname  | 
client_port      | 58296
backend_start    | 2020-10-02 22:26:16.676936-06
xact_start       | 
query_start      | 2020-10-03 12:37:37.639971-06
state_change     | 2020-10-03 12:37:37.640183-06
wait_event_type  | Client
wait_event       | ClientRead
state            | idle
backend_xid      | 
backend_xmin     | 
query            | ROLLBACK
backend_type     | client backend

This is mostly just in "heap":

$ sudo cat /proc/18000/maps |awk -Wnon-decimal -F '[- ]' '{a="0x"$1; b="0x"$2; print (b-a)/1024**2,$0}' |sort -gr |head
3428.65 010d6000-d757c000 rw-p 00000000 00:00 0                                  [heap]
512.004 7f3ec96e8000-7f3ee96e9000 rw-p 00000000 00:00 0 
384.004 7f3e9e6e1000-7f3eb66e2000 rw-p 00000000 00:00 0 
148 2aaaaac00000-2aaab4000000 rw-s 00000000 00:0e 44184914                   /anon_hugepage (deleted)

The postgres memory contexts don't show anything nearly that large.
  MessageContext: 25236472 total in 20 blocks; 2990656 free (4 chunks); 22245816 used
  CacheMemoryContext: 25892160 total in 42 blocks; 22770984 free (28285 chunks); 3121176 used
Grand total: 70879896 bytes in 1466 blocks; 29518440 free (29097 chunks); 41361456 used

I was able to make a small, apparent leak like so.  This applies to
postgis3.0/postgres12/LLVM5/centos7, and
postgis3.1alpha/postgres13/LLVM9/centos8.

Inlining is essential to see the leak, optimization is not.  Showing every 9th
query:

$ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'

!       40552 kB max resident size
!       41188 kB max resident size
!       41820 kB max resident size
!       42576 kB max resident size
!       43080 kB max resident size
!       43508 kB max resident size
!       44572 kB max resident size
!       45040 kB max resident size
!       45588 kB max resident size
!       46024 kB max resident size
!       46492 kB max resident size

Which would lead to an OOM in our longlived process.

-- 
Justin



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Andres Freund
Date:
Hi,

On 2020-10-03 17:02:36 -0500, Justin Pryzby wrote:
> We didn't use JIT with v12, but I've been setting jit=on along with v13
> upgrades, so I was trying to convince myself that there's a change in v13, and
> not just our config..
>
> I've seen repeated OOMs on a small number of reports.  It looks like JIT
> optimization/inlining used a good deal of RAM in v12, and even more on v13.
> Maybe that's what's expected; it looks like this scales with the number of
> partitions and number of functions, so it could be arbitrarily large, and not
> respectful of work_mem (below, at the 4MB default).

> In v13:
> With optimize_above=-1, inline_above=-1: 333472 kB max resident size
> With optimize_above=1, inline_above=1: 654036 kB max resident size
>
> In v12:
> With optimize_above=-1, inline_above=-1: 330988 kB max resident size
> With optimize_above=1, inline_above=1: 556844 kB max resident size

That's probably to some degree just the shared library being loaded /
faulted in (see below for more detail).  I don't think there's any
difference between 12/13 in what inlining / optimization does, so I
don't quite understand the difference.


> The OOM itself could be due to our JIT settings and maybe not a bug.  My
> thinking is that when we hit OOM, that sometimes manifests itself as the C++
> exception, which somehow trickles up into an SIGABRT in postgres.

C++ code throws an exception when memory allocation fails, unless a
different error handler is set. While in LLVM for JITing we set a
different error handler (which doesn't throw, but FATALs), that
doesn't appear to be done for all the postgis code. Nor does it, as
it really should, catch the exception.  An unhandled exception results
in abort(), which in turn raises SIGABRT.
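
To make that concrete, a minimal standalone C++ program (not postgis code)
that reproduces the exact message from the logs - with libstdc++, an
exception that escapes uncaught reaches std::terminate(), which prints the
message and calls abort():

#include <cstddef>
#include <new>

int main()
{
    /* Ask for ~4 EiB; the allocation fails and operator new throws
     * std::bad_alloc.  Nothing up the stack catches it, so std::terminate()
     * runs, libstdc++ prints
     *   terminate called after throwing an instance of 'std::bad_alloc'
     *     what():  std::bad_alloc
     * and abort() raises SIGABRT - which the postmaster then reports as
     * "terminated by signal 6: Aborted". */
    void *p = ::operator new(static_cast<std::size_t>(1) << 62);
    ::operator delete(p);       /* never reached */
    return 0;
}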


> For example, look at this huge backend, which is connected to the process
> that has been experiencing OOM.  The backend is 4.2GB while idle, so I'd have
> hoped for it to have released everything it could.  I'm aware that free()
> usually cannot return small allocations to the kernel, but that should not
> lead to continued growth over time.

The below turned out to be somewhat obsoleted by what you wrote below,
which I unfortunately hadn't yet read - but I think it's still good
information, so I'm not gonna delete it:

RES includes all the shared buffers that a backend touched, unless you
use huge pages. At least on linux RES is unfortunately pretty much
useless for postgres. You can gain a bit more information by looking at
/proc/$pid/status and /proc/$pid/smaps_rollup

E.g. here is a backend without having run any JIT (pruned all of these):

andres@awork3:~$ grep ^Rss /proc/2810440/status
RssAnon:        2324 kB
RssFile:        6932 kB
RssShmem:        4868 kB

andres@awork3:~$ grep ^Pss /proc/2810440/smaps_rollup
Pss:                3912 kB
Pss_Anon:           1177 kB
Pss_File:           1004 kB
Pss_Shmem:          1730 kB

and after loading the llvmjit library (LOAD 'llvmjit'), but not actually
jitting anything:

andres@awork3:~$ grep ^Rss /proc/2810440/status
RssAnon:        7528 kB
RssFile:      104844 kB
RssShmem:        5204 kB

andres@awork3:~$ grep ^Pss /proc/2810440/smaps_rollup
Pss:               57773 kB
Pss_Anon:           6436 kB
Pss_File:          49606 kB
Pss_Shmem:          1730 kB

Note how RSS has increased by ~100MB - but all of that is
file-backed. It's just the shared library being loaded (not even all of it
required to be in memory). And if you look at the 'Pss' (p stands for
proportional) values you can see how it's lower, because it's shared
between all the processes that have JIT loaded - in my case here that
was just two sessions, hence halved.

The actually allocated memory differs between the two, but it's ~2.3MB
vs 7.5MB.

(you also can see how a postgres backend doesn't actually weigh 2MB,
because a lot of the memory allocations are shared between postmaster
and backends)


And if you additionally utilize buffers in shared buffers, RSS will also
include that. E.g.

postgres[2810440][1]=# SHOW shared_buffers;
┌────────────────┐
│ shared_buffers │
├────────────────┤
│ 2GB            │
└────────────────┘
(1 row)

postgres[2810325][1]=# \dt+ large
                                   List of relations
┌────────┬───────┬───────┬────────┬─────────────┬───────────────┬────────┬─────────────┐
│ Schema │ Name  │ Type  │ Owner  │ Persistence │ Access Method │  Size  │ Description │
├────────┼───────┼───────┼────────┼─────────────┼───────────────┼────────┼─────────────┤
│ public │ large │ table │ andres │ permanent   │ heap          │ 501 MB │             │
└────────┴───────┴───────┴────────┴─────────────┴───────────────┴────────┴─────────────┘
(1 row)

andres@awork3:~$ grep ^Rss /proc/2810440/status
RssAnon:        8284 kB
RssFile:      107236 kB
RssShmem:       19184 kB
andres@awork3:~$ grep ^Pss /proc/2810440/smaps_rollup
Pss:               63282 kB
Pss_Anon:           7413 kB
Pss_File:          50034 kB
Pss_Shmem:          5834 kB

postgres[2810440][1]=# SELECT count(generate_series), count(random) FROM large;
...

andres@awork3:~$ grep ^Rss /proc/2810440/status
RssAnon:        8592 kB
RssFile:      107464 kB
RssShmem:      390328 kB
andres@awork3:~$ grep ^Pss /proc/2810440/smaps_rollup
Pss:              248889 kB
Pss_Anon:           7560 kB
Pss_File:          50069 kB
Pss_Shmem:        191259 kB

So now, despite only ~8.5MB of allocations, ps's RSS will show a memory
usage considerably larger:

andres@awork3:~$ ps -u --pid 2810440
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
andres   2810440  0.2  0.2 2420584 506384 ?      Ss   15:19   0:02 postgres: dev assert: andres postgres [local] idle


> The postgres memory contexts don't show anything nearly that large.
>   MessageContext: 25236472 total in 20 blocks; 2990656 free (4 chunks); 22245816 used
>   CacheMemoryContext: 25892160 total in 42 blocks; 22770984 free (28285 chunks); 3121176 used
> Grand total: 70879896 bytes in 1466 blocks; 29518440 free (29097 chunks); 41361456 used

I wonder if it'd be worth trying to redirect all LLVM allocations to be
done into a memory context. Not to be able to free it, but to be able to
see it more clearly. But it looks like it'd be too fragile to do so.


> I was able to make a small, apparent leak like so.  This applies to
> postgis3.0/postgres12/LLVM5/centos7, and
> postgis3.1alpha/postgres13/LLVM9/centos8.
>
> Inlining is essential to see the leak, optimization is not.  Showing every 9th
> query:
>
> $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> !       46024 kB max resident size
> !       46492 kB max resident size

Huh, that's certainly not good. Looking.

Greetings,

Andres Freund



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> The below turned out to be somewhat obsoleted by what you wrote below,
> which I unfortunately hadn't yet read - but I think it's still good
> information, so I'm not gonna delete it:

Right, I understand that RES RAM might include shared buffers and demand-loaded
file-backed and mmap stuff.  But on our servers, I tentatively wouldn't
expect it to be more than a handful of GB.  It might be true that most of our
report processes run in forked children, specifically to avoid leaks, and the
forked children would have their own, short-lived backend, which would hide any
postgres leak (in addition to having tried to use jit since v11).

> > I was able to make a small, apparent leak like so.  This applies to
> > postgis3.0/postgres12/LLVM5/centos7, and
> > postgis3.1alpha/postgres13/LLVM9/centos8.
> >
> > Inlining is essential to see the leak, optimization is not.  Showing every 9th
> > query:
> >
> > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> > !       46024 kB max resident size
> > !       46492 kB max resident size
> 
> Huh, that's certainly not good. Looking.

Reproduce like:
postgres=# CREATE EXTENSION postgis;
postgres=# CREATE TABLE geo(g geometry(point));
postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET log_executor_stats=on; explain analyze SELECT st_x(g) FROM geo;



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Andres Freund
Date:
Hi,

On 2020-10-03 18:30:46 -0500, Justin Pryzby wrote:
> On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> > > I was able to make a small, apparent leak like so.  This applies to
> > > postgis3.0/postgres12/LLVM5/centos7, and
> > > postgis3.1alpha/postgres13/LLVM9/centos8.
> > >
> > > Inlining is essential to see the leak, optimization is not.  Showing every 9th
> > > query:
> > >
> > > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> > > !       46024 kB max resident size
> > > !       46492 kB max resident size
> > 
> > Huh, that's certainly not good. Looking.
> 
> Reproduce like:
> postgres=# CREATE EXTENSION postgis;
> postgres=# CREATE TABLE geo(g geometry(point));
> postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
> > postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET log_executor_stats=on; explain analyze SELECT st_x(g) FROM geo;

It's an issue inside LLVM. I'm not yet sure how to avoid it. Basically
whenever we load an IR module to inline from, it renames its type names
to make them not conflict with already known type names in the current
LLVMContext. That, over time, leads to a large number of type names in
memory. We can avoid the re-loading of the module by instead cloning it,
but that still leads to new types being created for some internal
reasons.  It's possible to create/drop separate LLVMContext instances,
but we'd have to figure out when to do so - it's too expensive to always
do that.
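
A minimal sketch of that renaming behavior against LLVM's C++ API (the
struct name here is arbitrary): named struct types are uniqued per
LLVMContext by appending a numeric suffix, and all of them live until the
context itself is destroyed:

#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/DerivedTypes.h>
#include <llvm/Support/raw_ostream.h>

int main()
{
    llvm::LLVMContext ctx;
    /* Each (re)load of a bitcode module defining this struct creates a
     * fresh named type; the context renames it to avoid a clash, and none
     * of the accumulated types are freed until the context goes away. */
    for (int i = 0; i < 3; i++) {
        llvm::StructType *t =
            llvm::StructType::create(ctx, "struct.HeapTupleData");
        llvm::outs() << t->getName() << "\n";
        /* prints: struct.HeapTupleData, struct.HeapTupleData.1,
         *         struct.HeapTupleData.2 */
    }
    return 0;
}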

Greetings,

Andres Freund



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Andres Freund
Date:
Hi,

On 2020-10-03 19:01:27 -0700, Andres Freund wrote:
> On 2020-10-03 18:30:46 -0500, Justin Pryzby wrote:
> > On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> > > > I was able to make a small, apparent leak like so.  This applies to
> > > > postgis3.0/postgres12/LLVM5/centos7, and
> > > > postgis3.1alpha/postgres13/LLVM9/centos8.
> > > >
> > > > Inlining is essential to see the leak, optimization is not.  Showing every 9th
> > > > query:
> > > >
> > > > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> > > > !       46024 kB max resident size
> > > > !       46492 kB max resident size
> > >
> > > Huh, that's certainly not good. Looking.
> >
> > Reproduce like:
> > postgres=# CREATE EXTENSION postgis;
> > postgres=# CREATE TABLE geo(g geometry(point));
> > postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
> > postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET log_executor_stats=on; explain analyze SELECT st_x(g) FROM geo;
>
> It's an issue inside LLVM. I'm not yet sure how to avoid it. Basically
> whenever we load an IR module to inline from, it renames its type names
> to make them not conflict with already known type names in the current
> LLVMContext. That, over time, leads to a large number of type names in
> memory. We can avoid the re-loading of the module by instead cloning it,
> but that still leads to new types being created for some internal
> reasons.  It's possible to create/drop separate LLVMContext instances,
> but we'd have to figure out when to do so - it's too expensive to always
> do that.

I spent a lot of time (probably too much time) trying to find a good way
out of that. I think I mostly figured it out, but it's gonna take me a
bit longer to nail down the loose ends. The nice part is that it
actually speeds up inlining a lot. The bad part is that it's not a small
change, so I don't yet know how to deal with existing releases.

Details:

The issue with the leaks is basically that when LLVM loads a bitcode
module it doesn't "unify" named types with types in other bitcode modules
- they could after all be entirely independent. That means that we end
up with a lot of types after a while of running.

That ought to be addressable by not re-loading modules we inline from
from disk every time we inline, but unfortunately that doesn't work well
either: The "cross module import" logic unfortunately modifies the types
used in the "source module", leading to subsequent imports not working
well. That last issue is why I had moved to re-loading modules from
"disk" during import.

The partial patch I've written doesn't use the existing cross module
import code anymore, but moves to the lower level functionality it
uses. For traditional compilers, which is what that linker is mainly
written for, it makes sense to reduce memory usage by "destructively"
moving code from the source into the target - it'll only be used
once. But for the JIT case it's better to do so non-destructively. That
requires us to implement some "type resolution" between the source and
target modules, but then gains not needing to reload / copy the source
module anymore, saving a lot of the costs.
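
To sketch that destructive-vs-non-destructive distinction with LLVM's stock
C++ linker API (illustrative only, not the actual patch):
Linker::linkModules() consumes and mutates its source module, so a cached
module has to be cloned before every import - which is exactly the copy
cost that doing our own type resolution avoids:

#include <llvm/IR/Module.h>
#include <llvm/Linker/Linker.h>
#include <llvm/Transforms/Utils/Cloning.h>
#include <memory>

/* Inline functions from a cached 'src' module into the query's 'dst'
 * module without destroying 'src', so the next query can use it again. */
bool inline_from_cached(llvm::Module &dst, const llvm::Module &src)
{
    /* linkModules() takes ownership of its source and rewrites its types,
     * so a cached module cannot be linked twice.  Sacrifice a clone
     * instead - correct, but it re-copies the source on every import. */
    std::unique_ptr<llvm::Module> copy = llvm::CloneModule(src);
    return !llvm::Linker::linkModules(dst, std::move(copy)); /* true = ok */
}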


It's possible that for existing releases we ought to go for a simpler
approach, even if it's not as good. The only really feasible way I see
is to basically count the number of times inlining has been done, and
then just re-initialize after a certain time. But that's pretty darn
yucky.
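
Roughly, that simpler approach would look like the sketch below; the names
anticipate the counters that show up in the prototype discussed later in
this thread, but the code here is illustrative and the threshold arbitrary:

#include <llvm-c/Core.h>
#include <cstddef>

/* Arbitrary cutoff; ideally recreation would depend on memory usage
 * rather than on an iteration count. */
#define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 100

static LLVMContextRef llvm_context;
static std::size_t llvm_llvm_context_reuse_count;
static int llvm_jit_context_in_use_count;   /* JITed queries still running */

static void llvm_recreate_llvm_context(void)
{
    /* Dropping the context frees every type it accumulated, but is only
     * safe once no in-flight query still references modules it owns. */
    if (llvm_jit_context_in_use_count == 0 &&
        llvm_llvm_context_reuse_count > LLVMJIT_LLVM_CONTEXT_REUSE_MAX)
    {
        llvm_llvm_context_reuse_count = 0;
        LLVMContextDispose(llvm_context);
        llvm_context = LLVMContextCreate();
    }
}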

Greetings,

Andres Freund



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Andres Freund
Date:
Hi,

Sorry for the second send of this email, but somehow I managed to mangle
the headers in the last version I sent.

On 2020-10-03 19:01:27 -0700, Andres Freund wrote:
> On 2020-10-03 18:30:46 -0500, Justin Pryzby wrote:
> > On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> > > > I was able to make a small, apparent leak like so.  This applies to
> > > > postgis3.0/postgres12/LLVM5/centos7, and
> > > > postgis3.1alpha/postgres13/LLVM9/centos8.
> > > >
> > > > Inlining is essential to see the leak, optimization is not.  Showing every 9th
> > > > query:
> > > >
> > > > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> > > > !       46024 kB max resident size
> > > > !       46492 kB max resident size
> > >
> > > Huh, that's certainly not good. Looking.
> >
> > Reproduce like:
> > postgres=# CREATE EXTENSION postgis;
> > postgres=# CREATE TABLE geo(g geometry(point));
> > postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
> > postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET log_executor_stats=on; explain analyze SELECT st_x(g) FROM geo;
>
> It's an issue inside LLVM. I'm not yet sure how to avoid it. Basically
> whenever we load an IR module to inline from, it renames its type names
> to make them not conflict with already known type names in the current
> LLVMContext. That, over time, leads to a large number of type names in
> memory. We can avoid the re-loading of the module by instead cloning it,
> but that still leads to new types being created for some internal
> reasons.  It's possible to create/drop separate LLVMContext instances,
> but we'd have to figure out when to do so - it's too expensive to always
> do that.

I spent a lot of time (probably too much time) trying to find a good way
out of that. I think I mostly figured it out, but it's gonna take me a
bit longer to nail down the loose ends. The nice part is that it
actually speeds up inlining a lot. The bad part is that it's not a small
change, so I don't yet know how to deal with existing releases.

Details:

The issue with the leaks is basically that when LLVM loads a bitcode
module it doesn't "unify" named types with types in other bitcode modules
- they could after all be entirely independent. That means that we end
up with a lot of types after a while of running.

That ought to be addressable by not re-loading modules we inline from
from disk every time we inline, but unfortunately that doesn't work well
either: The "cross module import" logic unfortunately modifies the types
used in the "source module", leading to subsequent imports not working
well. That last issue is why I had moved to re-loading modules from
"disk" during import.

The partial patch I've written doesn't use the existing cross module
import code anymore, but moves to the lower level functionality it
uses. For traditional compilers, which is what that linker is mainly
written for, it makes sense to reduce memory usage by "destructively"
moving code from the source into the target - it'll only be used
once. But for the JIT case it's better to do so non-destructively. That
requires us to implement some "type resolution" between the source and
target modules, but then gains not needing to reload / copy the source
module anymore, saving a lot of the costs.


It's possible that for existing releases we ought to go for a simpler
approach, even if it's not as good. The only really feasible way I see
is to basically count the number of times inlining has been done, and
then just re-initialize after a certain time. But that's pretty darn
yucky.

Greetings,

Andres Freund



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
I'm re-raising this issue (since I needed to track down why I'd reflexively
saved a 512g backup of the filesystem where this first happened).

I'm 99% sure the "bad_alloc" is from LLVM.  It happened multiple times on
different servers (running a similar report) after setting jit=on during the
pg13 upgrade, and has never happened since re-setting jit=off.

I reproduced a memory leak, but I don't think you publicized a fix for it.

I *think* the memory leak probably explains the bad_alloc.  We were seeing
postgres killed by OOM every ~16 hours, although I don't see how that's
possible from such a tiny leak.  Our query also uses postgis (at first I said
it didn't), so it may be that postgis is involved.

I'd be happy to run with a prototype fix for the leak to see if the other issue
does (not) recur.

On Wed, Oct 07, 2020 at 04:02:42PM -0700, Andres Freund wrote:
> Hi,
> 
> Sorry for the second send of this email, but somehow I managed to mangle
> the headers in the last version I sent.
> 
> On 2020-10-03 19:01:27 -0700, Andres Freund wrote:
> > On 2020-10-03 18:30:46 -0500, Justin Pryzby wrote:
> > > On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> > > > > I was able to make a small, apparent leak like so.  This applies to
> > > > > postgis3.0/postgres12/LLVM5/centos7, and
> > > > > postgis3.1alpha/postgres13/LLVM9/centos8.
> > > > >
> > > > > Inlining is essential to see the leak, optimization is not.  Showing every 9th
> > > > > query:
> > > > >
> > > > > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET jit_above_cost=0; SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain analyze SELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n '1~9p'
> > > > > !       46024 kB max resident size
> > > > > !       46492 kB max resident size
> > > >
> > > > Huh, that's certainly not good. Looking.
> > >
> > > Reproduce like:
> > > postgres=# CREATE EXTENSION postgis;
> > > postgres=# CREATE TABLE geo(g geometry(point));
> > > postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
> > > postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET log_executor_stats=on; explain analyze SELECT st_x(g) FROM geo;
> >
> > It's an issue inside LLVM. I'm not yet sure how to avoid it. Basically
> > whenever we load an IR module to inline from, it renames its type names
> > to make them not conflict with already known type names in the current
> > LLVMContext. That, over time, leads to a large number of type names in
> > memory. We can avoid the re-loading of the module by instead cloning it,
> > but that still leads to new types being created for some internal
> > reasons.  It's possible to create/drop separate LLVMContext instances,
> > but we'd have to figure out when to do so - it's too expensive to always
> > do that.
> 
> I spent a lot of time (probably too much time) trying to find a good way
> out of that. I think I mostly figured it out, but it's gonna take me a
> bit longer to nail down the loose ends. The nice part is that it
> actually speeds up inlining a lot. The bad part is that it's not a small
> change, so I don't yet know how to deal with existing releases.
> 
> Details:
> 
> The issue with the leaks is basically that when LLVM loads a bitcode
> module it doesn't "unify" named types with types in other bitcode modules
> - they could after all be entirely independent. That means that we end
> up with a lot of types after a while of running.
> 
> That ought to be addressable by not re-loading modules we inline from
> from disk every time we inline, but unfortunately that doesn't work well
> either: The "cross module import" logic unfortunately modifies the types
> used in the "source module", leading to subsequent imports not working
> well. That last issue is why I had moved to re-loading modules from
> "disk" during import.
> 
> The partial patch I've written doesn't use the existing cross module
> import code anymore, but moves to the lower level functionality it
> uses. For traditional compilers, which is what that linker is mainly
> written for, it makes sense to reduce memory usage by "destructively"
> moving code from the source into the target - it'll only be used
> once. But for the JIT case it's better to do so non-destructively. That
> requires us to implement some "type resolution" between the source and
> target modules, but then gains not needing to reload / copy the source
> module anymore, saving a lot of the costs.
> 
> 
> It's possible that for existing releases we ought to go for a simpler
> approach, even if it's not as good. The only really feasible way I see
> is to basically count the number of times inlining has been done, and
> then just re-initialize after a certain time. But that's pretty darn
> yucky.
> 
> Greetings,
> 
> Andres Freund



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Andres Freund
Date:
Hi,

On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> I'd be happy to run with a prototype fix for the leak to see if the other issue
> does (not) recur.

I just posted a prototype fix to
https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de
(just because that was the first thread I re-found). It'd be cool if you
could have a look!

Greetings,

Andres Freund



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > I'd be happy to run with a prototype fix for the leak to see if the other issue
> > does (not) recur.
> 
> I just posted a prototype fix to
> https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de
> (just because that was the first thread I re-found). It'd be cool if you
> could have a look!

This doesn't seem to address the problem triggered by the reproducer at
https://www.postgresql.org/message-id/20210331040751.GU4431@telsasoft.com
(sorry I didn't CC you)

CREATE OR REPLACE FUNCTION cfn() RETURNS void LANGUAGE PLPGSQL AS $$ declare a record; begin FOR a IN SELECT generate_series(1,99) LOOP PERFORM format('select 1'); END LOOP; END $$;

$ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug; SET log_executor_stats=on; SELECT cfn();' |head -11 |psql 2>&1 |grep 'max resident'

-- 
Justin



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
On Fri, Apr 16, 2021 at 09:48:54PM -0500, Justin Pryzby wrote:
> On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > > I'd be happy to run with a prototype fix for the leak to see if the other issue
> > > does (not) recur.
> > 
> > I just posted a prototype fix to
> > https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de
> > (just because that was the first thread I re-found). It'd be cool if you
> > could have a look!
> 
> This doesn't seem to address the problem triggered by the reproducer at
> https://www.postgresql.org/message-id/20210331040751.GU4431@telsasoft.com
> (sorry I didn't CC you)

I take that back - I forgot that this doesn't release RAM until hitting a
threshold.

Without the patch, it looks like:

$ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug; SET log_executor_stats=on; SELECT cfn();' |head -11 |psql -h /tmp postgres 2>&1 |grep 'max resident'
!       61820 kB max resident size
!       65020 kB max resident size
!       68812 kB max resident size
!       71152 kB max resident size
!       76820 kB max resident size
!       78760 kB max resident size
!       81140 kB max resident size
!       83520 kB max resident size
!       93084 kB max resident size
!       94756 kB max resident size
!       96416 kB max resident size

With the patch and #define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1, it looks like
this:

!       61436 kB max resident size
!       61572 kB max resident size
!       63236 kB max resident size
!       63236 kB max resident size
!       63556 kB max resident size
!       63556 kB max resident size
!       63880 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size
!       65416 kB max resident size

-- 
Justin



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: "Andres Freund"
Date:
Hi,

On Fri, Apr 16, 2021, at 20:18, Justin Pryzby wrote:
> On Fri, Apr 16, 2021 at 09:48:54PM -0500, Justin Pryzby wrote:
> > On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > > > I'd be happy to run with a prototype fix for the leak to see if the other issue
> > > > does (not) recur.
> > > 
> > > I just posted a prototype fix to
> > > https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de
> > > (just because that was the first thread I re-found). It'd be cool if you
> > > could have a look!
> > 
> > This doesn't seem to address the problem triggered by the reproducer at
> > https://www.postgresql.org/message-id/20210331040751.GU4431@telsasoft.com
> > (sorry I didn't CC you)
> 
> I take that back - I forgot that this doesn't release RAM until hitting a
> threshold.
> 

Phew.

> Without the patch, it looks like:
> 
> $ yes 'SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; 
> SET client_min_messages=debug; SET log_executor_stats=on; SELECT 
> cfn();' |head -11 |psql -h /tmp postgres 2>&1 |grep 'max resident'
> !       61820 kB max resident size
> !       65020 kB max resident size
> !       68812 kB max resident size
> !       71152 kB max resident size
> !       76820 kB max resident size
> !       78760 kB max resident size
> !       81140 kB max resident size
> !       83520 kB max resident size
> !       93084 kB max resident size
> !       94756 kB max resident size
> !       96416 kB max resident size
> 
> With the patch and #define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1, it looks like
> this:
> 
> !       61436 kB max resident size
> !       61572 kB max resident size
> !       63236 kB max resident size
> !       63236 kB max resident size
> !       63556 kB max resident size
> !       63556 kB max resident size
> !       63880 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size
> !       65416 kB max resident size

If you set the define to a much lower value, how low does this get? I didn't
see an easy way to make the recreation of the context memory-usage dependent,
but that'd of course be much better than an #iterations-based approach. I'll
play around with your test tomorrow or so, if you don't, but now the next step
is risotto...

Regards,

Andres



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
On Fri, Apr 16, 2021 at 10:18:37PM -0500, Justin Pryzby wrote:
> On Fri, Apr 16, 2021 at 09:48:54PM -0500, Justin Pryzby wrote:
> > On Fri, Apr 16, 2021 at 07:17:55PM -0700, Andres Freund wrote:
> > > Hi,
> > > 
> > > On 2020-12-18 17:56:07 -0600, Justin Pryzby wrote:
> > > > I'd be happy to run with a prototype fix for the leak to see if the other issue
> > > > does (not) recur.
> > > 
> > > I just posted a prototype fix to
> > > https://www.postgresql.org/message-id/20210417021602.7dilihkdc7oblrf7%40alap3.anarazel.de
> > > (just because that was the first thread I re-found). It'd be cool if you
> > > could have a look!
> > 
> > This doesn't seem to address the problem triggered by the reproducer at
> > https://www.postgresql.org/message-id/20210331040751.GU4431@telsasoft.com
> > (sorry I didn't CC you)
> 
> I take that back - I forgot that this doesn't release RAM until hitting a
> threshold.

I tried this on the query that was causing the original C++ exception.

It still grows to 2GB within 5 min.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23084 postgres  20   0 2514364   1.6g  29484 R  99.7 18.2   3:40.87 postgres: telsasoft ts 192.168.122.11(50892) SELECT

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23084 postgres  20   0 3046960   2.1g  29484 R 100.0 24.1   4:30.64 postgres: telsasoft ts 192.168.122.11(50892) SELECT

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
23084 postgres  20   0 4323500   3.3g  29488 R  99.7 38.4   8:20.63 postgres: telsasoft ts 192.168.122.11(50892) SELECT

When I first reported this issue, the affected process was a long-running,
single-threaded python tool.  We have since updated it (partially to avoid
issues like this) to use multiprocessing, and therefore separate postgres
backends.

I'm now realizing that that's RAM use for a single query, not from continuous
leaks across multiple queries.  This is still true with the patch even if I
#define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
28438 postgres  20   0 3854264   2.8g  29428 R  98.7 33.2   8:56.79 postgres: telsasoft ts 192.168.122.11(53614) BIND

python3 ./jitleak.py # runs telsasoft reports
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
INFO:  recreating LLVM context after 2 uses
PID 27742 finished running report; est=None rows=40745; cols=34; ... duration:538
INFO:  recreating LLVM context after 81492 uses

I did:

-               llvm_llvm_context_reuse_count = 0;
                Assert(llvm_context != NULL);
+               elog(INFO, "recreating LLVM context after %zu uses", llvm_llvm_context_reuse_count);
+               llvm_llvm_context_reuse_count = 0;

Maybe we're missing this condition somehow ?
        if (llvm_jit_context_in_use_count == 0 &&

Also, I just hit this assertion by cancelling the query with ^C / sigint.  But
I don't have a reproducer for it.

< 2021-04-17 19:14:23.509 ADT telsasoft >PANIC:  LLVMJitContext in use count not 0 at exit (is 1)

-- 
Justin



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Andres Freund
Date:
Hi,

On 2021-04-17 19:13:24 -0500, Justin Pryzby wrote:
> I'm now realizing that that's RAM use for a single query, not from continuous
> leaks across multiple queries.

What's the memory usage with inlining disabled, and what's the usage
without jit?


> This is still true with the patch even if I
> #define LLVMJIT_LLVM_CONTEXT_REUSE_MAX 1
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 28438 postgres  20   0 3854264   2.8g  29428 R  98.7 33.2   8:56.79 postgres: telsasoft ts 192.168.122.11(53614) BIND
>
> python3 ./jitleak.py # runs telsasoft reports
> INFO:  recreating LLVM context after 2 uses
> INFO:  recreating LLVM context after 2 uses
> INFO:  recreating LLVM context after 2 uses
> INFO:  recreating LLVM context after 2 uses
> INFO:  recreating LLVM context after 2 uses
> PID 27742 finished running report; est=None rows=40745; cols=34; ... duration:538
> INFO:  recreating LLVM context after 81492 uses
>
> I did:
>
> -               llvm_llvm_context_reuse_count = 0;
>                 Assert(llvm_context != NULL);
> +               elog(INFO, "recreating LLVM context after %zu uses", llvm_llvm_context_reuse_count);
> +               llvm_llvm_context_reuse_count = 0;
>
> Maybe we're missing this condition somehow ?
>         if (llvm_jit_context_in_use_count == 0 &&

Do you utilize cursors or such?


> Also, I just hit this assertion by cancelling the query with ^C / sigint.  But
> I don't have a reproducer for it.
>
> < 2021-04-17 19:14:23.509 ADT telsasoft >PANIC:  LLVMJitContext in use count not 0 at exit (is 1)

Hm, I'll try to see how that happens - if not, do you have a backtrace?

Greetings,

Andres Freund



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
On Mon, Apr 19, 2021 at 09:41:30AM -0700, Andres Freund wrote:
> On 2021-04-17 19:13:24 -0500, Justin Pryzby wrote:
> > I'm now realizing that that's RAM use for a single query, not from continuous
> > leaks across multiple queries.
> 
> What's the memory usage with inlining disabled, and whats the usage
> without jit?

PGOPTIONS="-c jit=off -c jit_inline_above_cost=0 -c jit_above_cost=0 -c client_min_messages=debug -c
log_executor_stats=on"python3 `pwd`/jitleak.py
 
!       [5.272604 s user, 0.329912 s system total]
!       120948 kB max resident size

PGOPTIONS="-c jit=on -c jit_inline_above_cost=0 -c jit_above_cost=0 -c client_min_messages=debug -c
log_executor_stats=on"python3 `pwd`/jitleak.py
 
DEBUG:  time to inline: 0.003s, opt: 0.000s, emit: 0.003s
... repeated many times ...
!       [585.090087 s user, 31.835228 s system total]
!       4639556 kB max resident size

PGOPTIONS="-c jit=on -c jit_inline_above_cost=-1 -c jit_above_cost=0 -c client_min_messages=debug -c
log_executor_stats=on"python3 `pwd`/jitleak.py
 
DEBUG:  time to inline: 0.000s, opt: 0.000s, emit: 0.003s
... repeated many times ... which is confusing, since inlining is disabled ...
!       [318.556489 s user, 8.413007 s system total]
!       159436 kB max resident size

> > Maybe we're missing this condition somehow ?
> >         if (llvm_jit_context_in_use_count == 0 &&
> 
> Do you utilize cursors or such?

Not in this backend/process.
I believe postgis/mapserver uses cursors in a separate process, though...

Our report runs like:
SELECT large_object_oid FROM ... WHERE name=$1 AND user=$2 ... FOR UPDATE -- our reports are stored as LOs and read by readlo
begin;
SET enable_nestloop=off; -- I'm sorry to say, but nothing else avoids occasional pathological query plans
SELECT report text...;
rollback;

> > Also, I just hit this assertion by cancelling the query with ^C / sigint.  But
> > I don't have a reprodcer for it.
> >
> > < 2021-04-17 19:14:23.509 ADT telsasoft >PANIC:  LLVMJitContext in use count not 0 at exit (is 1)
> 
> Hm, I'll try to see how that happens - if not, do you have a backtrace?

This may be because I backpatched to v13, which had one conflict (0a2bc5d61).
Maybe this is relevant, too:  bab150045

If you could provide a patch for v13, it's a good bet there's no issue that I
didn't cause.

I'm able to reproduce the crash like this, with my patch to v13, but not your
patch on master.

python3 -c "import pg; db=pg.DB('dbname=postgres'); db.query('SET jit_above_cost=0; SET jit_inline_above_cost=0; SET
jit=on;SET client_min_messages=debug'); db.query('begin'); db.query_formatted('SELECT 1 FROM generate_series(1,99)a
WHEREa=%s', [1], inline=False);"
 

-- 
Justin



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
I don't know if this is related to the other issues, but this seems leaky.

create or replace function fn() returns void language plpgsql as $$ declare rec int; begin SELECT relpages INTO rec FROM pg_class LIMIT 1; end $$;

explain analyze
SELECT fn()
FROM generate_series(1,999);

PGOPTIONS='-cclient_min_messages=debug -clog_executor_stats=on -clog_min_duration_statement=-1 -cjit=on -cjit_above_cost=0 -cjit_inline_above_cost=0' psql ts -f jitleak.sql
!       70172 kB max resident size

PGOPTIONS='-cclient_min_messages=debug -clog_executor_stats=on -clog_min_duration_statement=-1 -cjit=on -cjit_above_cost=0 -cjit_inline_above_cost=-1' psql ts -f jitleak.sql
!       25672 kB max resident size



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
On Tue, Apr 20, 2021 at 12:38:26AM -0500, Justin Pryzby wrote:
> I don't know if this is related to the other issues, but this seems leaky.

And it explains how the context use counter can exceed its threshold.

create or replace function fn() returns void language plpgsql as $$ declare rec int; begin SELECT relpages INTO rec FROM pg_class LIMIT 1; end $$;

explain analyze
SELECT fn()
FROM generate_series(1,99);

SELECT SUM(a) FROM (VALUES(1))a(a);

time PGOPTIONS='-cclient_min_messages=debug -clog_executor_stats=off -clog_min_duration_statement=-1 -cjit=on -cjit_above_cost=0 -cjit_inline_above_cost=0' psql ts -f jitleak.sql
...
psql:jitleak.sql:6: DEBUG:  recreating LLVM context after 100 uses

Question: does the jit context need to be recreated only when inlining is
enabled?  Or maybe it's better if it's not conditionalized like that..



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Andres Freund
Date:
Hi,

On 2021-04-20 00:58:21 -0500, Justin Pryzby wrote:
> On Tue, Apr 20, 2021 at 12:38:26AM -0500, Justin Pryzby wrote:
> > I don't know if this is related to the other issues, but this seems leaky.
> 
> And it explains how the context use counter can exceed its threshold.
> 
> create or replace function fn() returns void language plpgsql as $$ declare rec int; begin SELECT relpages INTO rec FROM pg_class LIMIT 1; end $$;
> explain analyze
> SELECT fn()
> FROM generate_series(1,99);
> 
> SELECT SUM(a) FROM (VALUES(1))a(a);
> 
> time PGOPTIONS='-cclient_min_messages=debug -clog_executor_stats=off -clog_min_duration_statement=-1 -cjit=on -cjit_above_cost=0 -cjit_inline_above_cost=0' psql ts -f jitleak.sql
> ...
> psql:jitleak.sql:6: DEBUG:  recreating LLVM context after 100 uses

Right - at the moment the context can only be recreated when there are no
JITed queries ongoing. That's why I asked whether your "real" query
contains function calls, and whether those functions are expensive
enough to be JITed.


> Question: does the jit context need to be recreated only when inlining is
> enabled?  Or maybe it's better if it's not conditionalized like that..

It'd be sufficient to do it when doing inlining, but currently that's
not tracked separately.

Greetings,

Andres Freund



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
On Tue, Apr 20, 2021 at 05:20:56PM -0700, Andres Freund wrote:
> On 2021-04-20 00:58:21 -0500, Justin Pryzby wrote:
> > On Tue, Apr 20, 2021 at 12:38:26AM -0500, Justin Pryzby wrote:
> > > I don't know if this is related to the other issues, but this seems leaky.
> > 
> > And it explains how the context use counter can exceed its threshold.
> > 
> > create or replace function fn() returns void language plpgsql as $$ declare rec int; begin SELECT relpages INTO rec FROM pg_class LIMIT 1; end $$;
> Right - at the moment the context can only be recreated when there are no
> JITed queries ongoing. That's why I asked whether your "real" query
> contains function calls,

The real query has a bunch of function calls, including aggregates and custom
plpgsql functions.  I reduced the whole thing to the above.

ts=# explain analyze SELECT fn() FROM generate_series(1,999);
!       71692 kB max resident size

ts=# explain analyze SELECT fn() FROM generate_series(1,9999);
!       332852 kB max resident size

Which doesn't seem to leak if I "SELECT 1 INTO rec" instead of "SELECT relpages
INTO rec".

> and whether those functions are expensive enough to be JITed.

The real query has:
| Append  (cost=44.54..192946.35 rows=79 width=741)

That's a query over a 2 day period (midnight to midnight+2), so it's not hard
for me to believe it sometimes exceeds 100k cost units (jit_inline_above_cost),
depending on where we are in that interval, and on planner statistics.  It
might've been different when I first reported the problem, too.

I'm not sure what you mean about expensive enough to be JITed?
I think our custom functions don't have a COST set.

explain analyze is showing:

JIT:
  Functions: 201

-- 
Justin



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Andres Freund
Date:
Hi,

On 2021-04-20 20:03:13 -0500, Justin Pryzby wrote:
> That's a query over a 2 day period (midnight to midnight+2), so it's not hard
> for me to believe it sometimes exceeds 100k cost units (jit_inline_above_cost),
> depending on where we are in that interval, and on planner statistics.  It
> might've been different when I first reported the problem, too.
> 
> I'm not sure what you mean about expensive enough to be jited ?
> I think our custom functions don't have a COST set.

What gets JITed is queries. Unless your functions are fully inlinable
SQL functions, that means that query executions in functions are
separately JITed - and need to separately pass the cost minimums.


> explain analyze is showing:
> 
> JIT:
>   Functions: 201

Note that this isn't about how many plpgsql functions or such are called,
but about how many JITed functions are generated. Typically there's one
function per expression, and one for each table accessed.

Greetings,

Andres Freund



Re: terminate called after throwing an instance of 'std::bad_alloc'

From: Justin Pryzby
Date:
On Tue, Apr 20, 2021 at 06:16:28PM -0700, Andres Freund wrote:
> On 2021-04-20 20:03:13 -0500, Justin Pryzby wrote:
> > That's a query over a 2 day period (midnight to midnight+2), so it's not hard
> > for me to believe it sometimes exceeds 100k cost units (jit_inline_above_cost),
> > depending on where we are in that interval, and on planner statistics.  It
> > might've been different when I first reported the problem, too.
> > 
> > I'm not sure what you mean about expensive enough to be jited ?
> > I think our custom functions don't have a COST set.
> 
> What gets JITed is queries. Unless your functions are fully inlinable
> SQL functions, that means that query executions in functions are
> separately JITed - and need to separately pass the cost minimums.

I think the queries our plpgsql functions make wouldn't be expensive - mostly
lookups by primary key.  And any functions called by the queries within the
plpgsql function would not have COST set.

What do you think of the test case I sent?
| SELECT fn() FROM generate_series

If the JIT contexts can't be reset for this query, I assume the same reason
applies to our more complicated query.

> Note that this isn't about how many plpgsql functions or such are called,
> but about how many JITed functions are generated. Typically there's one
> function per expression, and one for each table accessed.

Thanks for the explanations.  Note that in our config I have set
jit_tuple_deforming=off, so I think the table accesses don't use functions,
right?

Oh...that appears to be relevant somehow:

$ PGOPTIONS='-c client_min_messages=debug -c log_executor_stats=on -c jit=on -c jit_above_cost=0 -c jit_inline_above_cost=0' psql ts -c "explain analyze SELECT fn() FROM generate_series(1,999)"
!       69872 kB max resident size

$ PGOPTIONS='-c client_min_messages=debug -c log_executor_stats=on -c jit=on -c jit_above_cost=0 -c jit_inline_above_cost=0 -c jit_tuple_deforming=on' psql ts -c "explain analyze SELECT fn() FROM generate_series(1,999)"
!       36140 kB max resident size


-- 
Justin



Re: terminate called after throwing an instance of 'std::bad_alloc'

From
Justin Pryzby
Date:
I think you were suggesting that jit_inline_above_cost=0 for queries within
plpgsql functions is pathological and not a useful test case, since the
function can be called an arbitrarily large number of times, while the JIT
context can't be reset mid-query due to its nonzero reference count.

I think you were wondering whether, in our normal environment, the plpgsql
functions run queries that are expensive enough to be JITed.  I think they're
not expensive enough to be JITed - but it makes debugging/diagnosing/testing
much harder if it somehow "doesn't count" to set inline_above_cost=0.

I think that means 1) it's still a memory leak, since it's O(ntuples), which is
not generally okay; but 2) it's certainly better than nothing.  It'd be good
enough for back branches to have a mitigation rather than be killed by OOM.

I'd forgotten doing it, but I had run your patch on one of the servers which
first crashed.  I remembered after upgrading to v13.3 and having the process
killed by OOM twice per day for the last 3 days (on a very consistent
schedule).

Could you provide a patch to apply to v13?  My attempted backpatch introduced
some issues.

On Mon, Apr 19, 2021 at 12:41:36PM -0500, Justin Pryzby wrote:
> > > Also, I just hit this assertion by cancelling the query with ^C / sigint.  But
> > > I don't have a reproducer for it.
> > >
> > > < 2021-04-17 19:14:23.509 ADT telsasoft >PANIC:  LLVMJitContext in use count not 0 at exit (is 1)
> > 
> > Hm, I'll try to see how that happens - if not, do you have a backtrace?
> 
> This may be because I backpatched to v13, which had one conflict (0a2bc5d61).
> Maybe this is relevant, too:  bab150045
> 
> If you could provide a patch for v13, it's a good bet there's no issue that I
> didn't cause.
> 
> I'm able to reproduce the crash like this, with my patch to v13, but not your
> patch on master.
> 
> python3 -c "import pg; db=pg.DB('dbname=postgres'); db.query('SET jit_above_cost=0; SET jit_inline_above_cost=0; SET jit=on; SET client_min_messages=debug'); db.query('begin'); db.query_formatted('SELECT 1 FROM generate_series(1,99) a WHERE a=%s', [1], inline=False);"



Re: terminate called after throwing an instance of 'std::bad_alloc'

From
Justin Pryzby
Date:
There's a memory leak affecting JIT expressions, even when inlining,
optimization, and tuple deforming are disabled.

The server that got OOM after installing PG13.3 (because your patch wasn't
applied) still gets OOM even with inline_above_cost=-1, optimize_above_cost=-1,
and deforming=off.
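
Spelled out, those settings correspond to the following (a sketch; -1
disables the respective JIT stage):

ALTER SYSTEM SET jit_inline_above_cost = -1;
ALTER SYSTEM SET jit_optimize_above_cost = -1;
ALTER SYSTEM SET jit_tuple_deforming = off;
SELECT pg_reload_conf();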

Actually I think this is an even more severe leak, even though it seems to be
infrequent.  The inlining leak is slow enough that I don't think that it,
alone, would cause a crash.  It's possible there's a race condition or
something that makes this hard to notice or reproduce, or the inlining leak is
just easier to find.

I guess your patch avoids this issue, too, since it didn't OOM with your patch.

-- 
Justin



Re: terminate called after throwing an instance of 'std::bad_alloc'

From
Justin Pryzby
Date:
Thread starting here:
https://www.postgresql.org/message-id/20201001021609.GC8476%40telsasoft.com

On Fri, Dec 18, 2020 at 05:56:07PM -0600, Justin Pryzby wrote:
> I'm 99% sure the "bad_alloc" is from LLVM.  It happened multiple times on
> different servers (running a similar report) after setting jit=on during pg13
> upgrade, and never happened since re-setting jit=off.

Since this recurred a few times recently (now running pg14.0), and I finally
managed to get a non-truncated corefile...

Core was generated by `postgres: telsasoft ts 192.168.122.13(40808) SELECT '.
Program terminated with signal 6, Aborted.

#0  0x00007f2497880337 in raise () from /lib64/libc.so.6
#1  0x00007f2497881a28 in abort () from /lib64/libc.so.6
#2  0x00007f2487cbf265 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#3  0x00007f2487c66696 in __cxxabiv1::__terminate(void (*)()) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#4  0x00007f2487c666c3 in std::terminate() () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#5  0x00007f2487c687d3 in __cxa_throw () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#6  0x00007f2487c686cd in operator new(unsigned long) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#7  0x00007f2486477b9c in llvm::DenseMap<llvm::APInt, std::unique_ptr<llvm::ConstantInt, std::default_delete<llvm::ConstantInt> >, llvm::DenseMapAPIntKeyInfo, llvm::detail::DenseMapPair<llvm::APInt, std::unique_ptr<llvm::ConstantInt, std::default_delete<llvm::ConstantInt> > > >::grow(unsigned int) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#8  0x00007f2486477f29 in llvm::ConstantInt::get(llvm::LLVMContext&, llvm::APInt const&) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#9  0x00007f2486478263 in llvm::ConstantInt::get(llvm::IntegerType*, unsigned long, bool) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
#10 0x00007f2488f66c18 in l_ptr_const (type=0x3000650, ptr=<optimized out>) at ../../../../src/include/jit/llvmjit_emit.h:29
#11 llvm_compile_expr (state=<optimized out>) at llvmjit_expr.c:246
#12 0x0000000000635499 in ExecReadyExpr (state=state@entry=0x1a5317178) at execExpr.c:885
#13 0x00000000006388a5 in ExecInitQual (qual=0x1a6d07de8, parent=parent@entry=0x1a4c4ade0) at execExpr.c:280
#14 0x0000000000661d9d in ExecInitHashJoin (node=node@entry=0x1a6c935b0, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeHashjoin.c:733
#15 0x0000000000648f6e in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:307
#16 0x000000000065e8fd in ExecInitHash (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at nodeHash.c:381
#17 0x0000000000648e87 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:365
#18 0x0000000000661cd5 in ExecInitHashJoin (node=node@entry=0x1a6c93ce0, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeHashjoin.c:663
#19 0x0000000000648f6e in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:307
#20 0x00000000006723e4 in ExecInitSort (node=node@entry=0x1a6c94a50, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeSort.c:210
#21 0x0000000000648f2f in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:320
#22 0x0000000000655ef9 in ExecInitAgg (node=node@entry=0x1a6c95100, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeAgg.c:3382
#23 0x0000000000648ef0 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:340
#24 0x0000000000674628 in ExecInitSubqueryScan (node=node@entry=0x1a6c94020, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeSubqueryscan.c:126
#25 0x0000000000649055 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:249
#26 0x0000000000658e61 in ExecInitAppend (node=node@entry=0x1a6c60008, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeAppend.c:232
#27 0x0000000000649166 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:181
#28 0x0000000000670e1a in ExecInitResult (node=node@entry=0x1a6c95c40, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeResult.c:210
#29 0x00000000006491a5 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:166
#30 0x0000000000658e61 in ExecInitAppend (node=node@entry=0x1a6c5d548, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeAppend.c:232
#31 0x0000000000649166 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:181
#32 0x0000000000674628 in ExecInitSubqueryScan (node=node@entry=0x1a6ca5368, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeSubqueryscan.c:126
#33 0x0000000000649055 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:249
#34 0x00000000006723e4 in ExecInitSort (node=node@entry=0x1a6ca5a28, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeSort.c:210
#35 0x0000000000648f2f in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:320
#36 0x00000000006793eb in ExecInitWindowAgg (node=node@entry=0x1a6ca6ab0, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeWindowAgg.c:2314
#37 0x0000000000648edb in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:345
#38 0x0000000000674628 in ExecInitSubqueryScan (node=node@entry=0x1a6ca5628, estate=estate@entry=0x1a4da87b8, eflags=eflags@entry=16) at nodeSubqueryscan.c:126
#39 0x0000000000649055 in ExecInitNode (node=<optimized out>, estate=<optimized out>, eflags=<optimized out>) at execProcnode.c:249
#40 0x0000000000642ab1 in InitPlan (eflags=16, queryDesc=0x344dc88) at execMain.c:936
#41 standard_ExecutorStart (queryDesc=0x344dc88, eflags=16) at execMain.c:263
#42 0x00007f248b45656d in pgss_ExecutorStart (queryDesc=0x344dc88, eflags=<optimized out>) at pg_stat_statements.c:965
#43 0x00007f248b03a861 in explain_ExecutorStart (queryDesc=0x344dc88, eflags=<optimized out>) at auto_explain.c:301
#44 0x00000000007b8d1b in PortalStart (portal=0x1be2168, params=0x0, eflags=0, snapshot=0x0) at pquery.c:512

-- 
Justin



Re: terminate called after throwing an instance of 'std::bad_alloc' (llvmjit)

From
Justin Pryzby
Date:
On Wed, Nov 10, 2021 at 09:56:44AM -0600, Justin Pryzby wrote:
> Thread starting here:
> https://www.postgresql.org/message-id/20201001021609.GC8476%40telsasoft.com
> 
> On Fri, Dec 18, 2020 at 05:56:07PM -0600, Justin Pryzby wrote:
> > I'm 99% sure the "bad_alloc" is from LLVM.  It happened multiple times on
> > different servers (running a similar report) after setting jit=on during pg13
> > upgrade, and never happened since re-setting jit=off.
> 
> Since this recurred a few times recently (now running pg14.0), and I finally
> managed to get a non-truncated corefile...

I think the reason this recurred is that, since upgrading to pg14, I no longer
had your memleak patches applied.  I'd forgotten about it, but before the
upgrade I was probably running a locally compiled postgres with your patches
applied.

I should've mentioned that this crash was associated with the message from the
original problem report:

|terminate called after throwing an instance of 'std::bad_alloc'
|  what():  std::bad_alloc

The leak discussed on other threads seems fixed by your patches - I compiled
v14 and now running with no visible leaks since last week.
https://www.postgresql.org/message-id/flat/20210417021602.7dilihkdc7oblrf7@alap3.anarazel.de

As I understand it, there's still an issue with an allocation failure causing
SIGABRT rather than FATAL.

It took me several tries to get the corefile since the process is huge, caused
by the leak (and abrtd wanted to truncate it, nullifying its utility).

-rw-------. 1 postgres postgres 8.4G Nov 10 08:57 /var/lib/pgsql/14/data/core.31345

I installed more debug packages to get a fuller stacktrace.

#0  0x00007f2497880337 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f2497881a28 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f2487cbf265 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
No symbol table info available.
#3  0x00007f2487c66696 in __cxxabiv1::__terminate(void (*)()) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
No symbol table info available.
#4  0x00007f2487c666c3 in std::terminate() () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
No symbol table info available.
#5  0x00007f2487c687d3 in __cxa_throw () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
No symbol table info available.
#6  0x00007f2487c686cd in operator new(unsigned long) () from /usr/lib64/llvm5.0/lib/libLLVM-5.0.so
No symbol table info available.
#7  0x00007f2486477b9c in allocateBuckets (this=0x2ff7f38, this=0x2ff7f38, Num=<optimized out>) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:753
No locals.
#8  llvm::DenseMap<llvm::APInt, std::unique_ptr<llvm::ConstantInt, std::default_delete<llvm::ConstantInt> >, llvm::DenseMapAPIntKeyInfo, llvm::detail::DenseMapPair<llvm::APInt, std::unique_ptr<llvm::ConstantInt, std::default_delete<llvm::ConstantInt> > > >::grow (this=this@entry=0x2ff7f38, AtLeast=<optimized out>) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:691
        OldNumBuckets = 33554432
        OldBuckets = 0x7f23f3e42010
#9  0x00007f2486477f29 in grow (AtLeast=<optimized out>, this=0x2ff7f38) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:461
No locals.
#10 InsertIntoBucketImpl<llvm::APInt> (TheBucket=<optimized out>, Lookup=..., Key=..., this=0x2ff7f38) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:510
        NewNumEntries = <optimized out>
        EmptyKey = <optimized out>
#11 InsertIntoBucket<llvm::APInt const&> (Key=..., TheBucket=<optimized out>, this=0x2ff7f38) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:471
No locals.
#12 FindAndConstruct (Key=..., this=0x2ff7f38) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:271
        TheBucket = <optimized out>
#13 operator[] (Key=..., this=0x2ff7f38) at /usr/src/debug/llvm-5.0.1.src/include/llvm/ADT/DenseMap.h:275
No locals.
#14 llvm::ConstantInt::get (Context=..., V=...) at /usr/src/debug/llvm-5.0.1.src/lib/IR/Constants.cpp:550
        pImpl = 0x2ff7eb0
#15 0x00007f2486478263 in llvm::ConstantInt::get (Ty=0x2ff85a8, V=<optimized out>, isSigned=isSigned@entry=false) at /usr/src/debug/llvm-5.0.1.src/lib/IR/Constants.cpp:571
No locals.
#16 0x00007f248648673d in LLVMConstInt (IntTy=<optimized out>, N=<optimized out>, SignExtend=SignExtend@entry=0) at /usr/src/debug/llvm-5.0.1.src/lib/IR/Core.cpp:952
No locals.
#17 0x00007f2488f66c18 in l_ptr_const (type=0x3000650, ptr=<optimized out>) at ../../../../src/include/jit/llvmjit_emit.h:29
        c = <optimized out>
#18 llvm_compile_expr (state=<optimized out>) at llvmjit_expr.c:246
        op = 0x1a5317690
        opcode = EEOP_OUTER_VAR
        opno = 5
        parent = <optimized out>
        funcname = 0x1a53184e8 "evalexpr_4827_151"
        context = 0x1ba79b8
        b = <optimized out>
        mod = 0x1a5513d30
        eval_fn = <optimized out>
        entry = <optimized out>
        v_state = 0x1a5ce09e0
        v_econtext = 0x1a5ce0a08
        v_isnullp = 0x1a5ce0a30
        v_tmpvaluep = 0x1a5ce0aa8
        v_tmpisnullp = 0x1a5ce0b48
        starttime = {tv_sec = 10799172, tv_nsec = 781670770}
        endtime = {tv_sec = 7077194792, tv_nsec = 0}
        __func__ = "llvm_compile_expr"
[...]



Re: terminate called after throwing an instance of 'std::bad_alloc' (llvmjit)

From
Justin Pryzby
Date:
There's no leak after running for ~5 weeks.

$ ps -O lstart,vsize,rss 17930
  PID                  STARTED    VSZ   RSS S TTY          TIME COMMAND
17930 Tue Nov 30 15:35:26 2021 1019464 117424 S ?      7-04:54:03 postgres: telsasoft ts 192.168.122.13(57640) idle

Unless you suggest otherwise, I'm planning to restart the DB soon and go back
to running the pgdg rpm binaries with jit=off rather than what I compiled and
patched locally.

On Thu, Nov 18, 2021 at 03:20:39PM -0600, Justin Pryzby wrote:
> On Wed, Nov 10, 2021 at 09:56:44AM -0600, Justin Pryzby wrote:
> > Thread starting here:
> > https://www.postgresql.org/message-id/20201001021609.GC8476%40telsasoft.com
> > 
> > On Fri, Dec 18, 2020 at 05:56:07PM -0600, Justin Pryzby wrote:
> > > I'm 99% sure the "bad_alloc" is from LLVM.  It happened multiple times on
> > > different servers (running a similar report) after setting jit=on during pg13
> > > upgrade, and never happened since re-setting jit=off.
> > 
> > Since this recurred a few times recently (now running pg14.0), and I finally
> > managed to get a non-truncated corefile...
> 
> I think the reason this recurred is that, since upgrading to pg14, I no longer
> had your memleak patches applied.  I'd forgotten about it, but before the
> upgrade I was probably running a locally compiled postgres with your patches
> applied.
> 
> I should've mentioned that this crash was associated with the message from the
> original problem report:
> 
> |terminate called after throwing an instance of 'std::bad_alloc'
> |  what():  std::bad_alloc
> 
> The leak discussed on other threads seems fixed by your patches - I compiled
> v14 and now running with no visible leaks since last week.
> https://www.postgresql.org/message-id/flat/20210417021602.7dilihkdc7oblrf7@alap3.anarazel.de
> 
> As I understand it, there's still an issue with an allocation failure causing
> SIGABRT rather than FATAL.
> 
> It took me several tries to get the corefile since the process is huge, caused
> by the leak (and abrtd wanted to truncate it, nullifying its utility).
> 
> -rw-------. 1 postgres postgres 8.4G Nov 10 08:57 /var/lib/pgsql/14/data/core.31345
> 
> I installed more debug packages to get a fuller stacktrace.
> 
> [...]