Re: terminate called after throwing an instance of 'std::bad_alloc'

Поиск
Список
Период
Сортировка
От Justin Pryzby
Тема Re: terminate called after throwing an instance of 'std::bad_alloc'
Дата
Msg-id 20201218235607.GC30237@telsasoft.com
обсуждение исходный текст
Ответ на Re: terminate called after throwing an instance of 'std::bad_alloc'  (Andres Freund <andres@anarazel.de>)
Ответы Re: terminate called after throwing an instance of 'std::bad_alloc'  (Andres Freund <andres@anarazel.de>)
Re: terminate called after throwing an instance of 'std::bad_alloc'  (Justin Pryzby <pryzby@telsasoft.com>)
Список pgsql-hackers
I'm re-raising this issue (since I needed to track down why I'd reflexively
saved a 512g backup of the filesystem where this first happened).

I'm 99% sure the "bad_alloc" is from LLVM.  It happened multiple times on
different servers (running a similar report) after setting jit=on during pg13
upgrade, and never happened since re-setting jit=off.

I reproduced a memory leak, but I don't think you publicized a fix for it.

I *think* the memory leak probably explains the bad_alloc.  We were seeing
postgres killed by OOM every ~16hours, although I don't see how that's possible
from such a tiny leak.  Our query also uses postgis (at first I said it
didn't), so it may be that they're involved.

I'd be happy to run with a prototype fix for the leak to see if the other issue
does (not) recur.

On Wed, Oct 07, 2020 at 04:02:42PM -0700, Andres Freund wrote:
> Hi,
> 
> Sorry for the second send of this email, but somehow I managed to mangle
> the headers in the last version I sent.
> 
> On 2020-10-03 19:01:27 -0700, Andres Freund wrote:
> > On 2020-10-03 18:30:46 -0500, Justin Pryzby wrote:
> > > On Sat, Oct 03, 2020 at 04:01:49PM -0700, Andres Freund wrote:
> > > > > I was able to make a small, apparent leak like so.  This applies to
> > > > > postgis3.0/postgres12/LLVM5/centos7, and
> > > > > postgis3.1alpha/postgres13/LLVM9/centos8.
> > > > >
> > > > > Inlining is essential to see the leak, optimization is not.  Showing every 9th
> > > > > query:
> > > > >
> > > > > $ for a in `seq 1 99`; do echo "SET jit_inline_above_cost=1; SET jit_optimize_above_cost=-1; SET
jit_above_cost=0;SET jit_expressions=on; SET log_executor_stats=on; SET client_min_messages=debug; SET jit=on; explain
analyzeSELECT st_x(site_geo) FROM sites WHERE site_geo IS NOT NULL;"; done |psql ts 2>&1 |grep 'resident' |sed -n
'1~9p'
> > > > > !       46024 kB max resident size
> > > > > !       46492 kB max resident size
> > > >
> > > > Huh, that's certainly not good. Looking.
> > >
> > > Reproduce like:
> > > postgres=# CREATE EXTENSION postgis;
> > > postgres=# CREATE TABLE geo(g geometry(point));
> > > postgres=# INSERT INTO geo SELECT st_GeometryFromText('POINT(11 12)',0) FROM generate_series(1,99999);
> > > postgres=# SET jit_above_cost=0; SET jit_inline_above_cost=0; SET client_min_messages=debug; SET
log_executor_stats=on;explain analyze SELECT st_x(g) FROM geo;
 
> >
> > It's an issue inside LLVM. I'm not yet sure how to avoid it. Basically
> > whenever we load an IR module to inline from, it renames its type names
> > to make them not conflict with already known type names in the current
> > LLVMContext. That, over time, leads to a large number of type names in
> > memory. We can avoid the re-loading of the module by instead cloning it,
> > but that still leads to new types being created for some internal
> > reasons.  It's possible to create/drop separate LLVMContext instances,
> > but we'd have to figure out when to do so - it's too expensive to always
> > do that.
> 
> I spent a lot of time (probably too much time) trying to find a good way
> out of that. I think I mostly figured it out, but it's gonna take me a
> bit longer to nail down the loose ends. The nice part is that it
> actually speeds up inlining a lot. The bad part is that it's not a small
> change, so I don't yet know how to deal with existing releases.
> 
> Details:
> 
> The issue with the leaks is basically that when LLVM loads a bitcode
> module it doesn't "unify" named types with types in other bitcode module
> - they could after all be entirely independent. That means that we end
> up with a lot of types after a while of running.
> 
> That ought to be addressable by not re-loading modules we inline from
> from disk every time we inline, but unfortunately that doesn't work well
> either: The "cross module import" logic unfortunately modifies the types
> used in the "source module", leading to subsequent imports not working
> well. That last issue is why I had moved to re-loading modules from
> "disk" during import.
> 
> The partial patch I've written doesn't use the existing cross module
> import code anymore, but moves to the lower level functionality it
> uses. For traditional compilers, which is what that linker is mainly
> written for, it makes sense to reduce memory usage by "destructively"
> moving code from the source into the target - it'll only be used
> once. But for the JIT case it's better to do so non-destructively. That
> requires us to implement some "type resolution" between the source and
> target modules, but then gains not needing to reload / copy the source
> module anymore, saving a lot of the costs.
> 
> 
> It's possible that for existing releases we ought to go for a simpler
> approach, even if it's not as good. The only really feasible way I see
> is to basically count the number of times inlining has been done, and
> then just re-initialize after a certain time. But that's pretty darn
> yucky.
> 
> Greetings,
> 
> Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: [PATCH] Implements SPIN_LOCK on ARM
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: Refactoring HMAC in the core code