Обсуждение: Need help debugging SIGBUS crashes
Hello,

please excuse me writing here; I wrote earlier to the users list but
got no answer.

I am observing repeated SIGBUS crashes of the postgres backend binary
on FreeBSD, starting at Feb 2, every couple of weeks. The postgres is
15.15, the FreeBSD release was 14.3, and the crashes happen in malloc().

The crashes happened on different PG clusters (running off the same
binaries), so they cannot be pinpointed to a specific application.

After following a few red herrings, I remembered that I had patched
the NUMA allocation policy in the kernel on Dec 18, so I obviously
thought that was the actual cause of the crashes. But apparently it
isn't. I removed the patches that relate to malloc() (and left only
those relating to ZFS) - and after some days got another crash.

So, yesterday I upgraded to FreeBSD 14.4, removed all my patches for
NUMA, additionally disabled NUMA entirely with vm.numa.disabled=1,
and added debugging info for libc. I intended to also add debugging
to postgres - but tonight I already got another crash: the problem is
apparently not related to NUMA.

The backtrace is disgusting - all the interesting things are optimized
away. :(

So I am now quite clueless on how to proceed further, and could really
use some educated inspiration. I cannot even say whether this is a
postgres issue or a FreeBSD issue (but it doesn't happen to any other
program). I could probably rebuild the OS with -O0 - but is this the
best way to proceed? (Filing a bug report against FreeBSD without
specific reproduction info would most likely just mean that it stays
on my desk anyway.)

I'm now following your paper on what info you like to have in bug
reports. I'm still thinking there might be something I have changed on
the system (I'm always thinking that, because I tend to change things
to my liking, albeit in most cases it then actually is a bug ;) ) and
that I shouldn't yet report a bug - but I can't think of anything now.
## My configuration file:

hba_file = '/usr/local/etc/pg-col/pg_hba.conf'
ident_file = '/usr/local/etc/pg-col/pg_ident.conf'
listen_addresses = 'admn-e.intra.daemon.contact'
port = 5434				# (change requires restart)
max_connections = 60			# (change requires restart)
unix_socket_permissions = 0777		# begin with 0 to use octal notation
krb_server_keyfile = '/usr/local/etc/pg-col/krb5.keytab'
shared_buffers = 40MB			# min 128kB
temp_buffers = 20MB			# min 800kB
work_mem = 50MB				# min 64kB
maintenance_work_mem = 50MB		# min 1MB
max_stack_depth = 40MB			# min 100kB
dynamic_shared_memory_type = posix	# the default is the first option
max_files_per_process = 200		# min 25
effective_io_concurrency = 12		# 1-1000; 0 disables prefetching
max_parallel_workers_per_gather = 3	# taken from max_parallel_workers
synchronous_commit = off		# synchronization level;
wal_sync_method = fsync			# the default is the first option
full_page_writes = off			# recover from partial page writes
wal_compression = on			# enable compression of full-page writes
wal_init_zero = off			# zero-fill new WAL files
wal_writer_delay = 2000ms		# 1-10000 milliseconds
checkpoint_timeout = 240min		# range 30s-1d
checkpoint_completion_target = 0.0	# checkpoint target duration, 0.0 - 1.0
max_wal_size = 4GB
archive_mode = on			# enables archiving; off, on, or always
archive_command = '/ext/libexec/RedoLog.copy "%f" "%p"'
archive_timeout = 86400			# force a logfile segment switch after this
seq_page_cost = 0.5			# measured on an arbitrary scale
random_page_cost = 1.2			# same scale as above / PMc: SSD
effective_cache_size = 1GB
default_statistics_target = 1000	# range 1-10000
log_destination = 'syslog'
syslog_ident = 'pg-col'
log_min_messages = info			# values in order of decreasing detail:
log_min_duration_statement = 10000	# -1 is disabled, 0 logs all statements
log_checkpoints = on
log_connections = on
log_disconnections = on
log_error_verbosity = terse		# terse, default, or verbose messages
log_line_prefix = '%u:%d[%r] '		# special values:
log_lock_waits = on			# log lock waits >= deadlock_timeout
log_temp_files = 10000			# log temporary files equal or larger
log_timezone = 'Europe/Berlin'
update_process_title = on
track_io_timing = on
track_wal_io_timing = on
autovacuum = on				# Enable autovacuum subprocess?  'on'
autovacuum_naptime = 5min		# time between autovacuum runs
autovacuum_vacuum_scale_factor = 0.01	# fraction of table size before vacuum
autovacuum_analyze_scale_factor = 0.05	# fraction of table size before analyze
temp_tablespaces = 'l1only'		# a list of tablespace names, '' uses
datestyle = 'german, dmy'
timezone = 'Europe/Berlin'
lc_messages = 'en_US.UTF-8'		# locale for system error message
lc_monetary = 'en_US.UTF-8'		# locale for monetary formatting
lc_numeric = 'en_US.UTF-8'		# locale for number formatting
lc_time = 'de_DE.UTF-8'			# locale for time formatting
default_text_search_config = 'pg_catalog.german'

## settings for startup:

postgresql_data="/var/db/pg-col/data15"
postgresql_flags="-w -m fast -o --config_file=/usr/local/etc/pg-col/postgresql.conf"
postgresql_initdb_flags="--encoding=utf-8 --lc-collate=de_DE.UTF-8 --lc-ctype=de_DE.UTF-8 --lc-messages=en_US.UTF-8 --lc-monetary=en_US.UTF-8 --lc-numeric=en_US.UTF-8 --lc-time=en_US.UTF-8"

# postgres --version
postgres (PostgreSQL) 15.15

# uname -a
FreeBSD 14.4-RELEASE FreeBSD 14.4-RELEASE (commit a456f852d145)

And this is the backtrace - a couple of other partial backtraces are here:
https://forums.freebsd.org/threads/trouble-bus-errors-and-segfaults-in-libc-so-from-postgres.101876/
(this apparently comes from differing locations, but seems to crash at the same place)

* thread #1, name = 'postgres', stop reason = signal SIGBUS
  * frame #0: 0x00000008296ad173 libc.so.7`extent_try_coalesce_impl(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, rtree_ctx=0x00003e616a8890c0, extents=0x00003e616aa058d8, extent=0x00003e616ab0ec00, coalesced=0x0000000000000000, growing_retained=<unavailable>, inactive_only=<unavailable>)
at jemalloc_extent.c:0
    frame #1: 0x00000008296aa0d8 libc.so.7`extent_record [inlined] extent_try_coalesce(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, rtree_ctx=0x00003e616a8890c0, extents=0x00003e616aa058d8, extent=0x00003e616ab0ec00, coalesced=0x0000000000000000, growing_retained=<unavailable>) at jemalloc_extent.c:1678:9
    frame #2: 0x00000008296aa0c3 libc.so.7`extent_record(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, extents=0x00003e616aa058d8, extent=0x00003e616ab0ec00, growing_retained=<unavailable>) at jemalloc_extent.c:1717:12
    frame #3: 0x00000008296aaf39 libc.so.7`__je_extent_alloc_wrapper [inlined] extent_grow_retained(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, size=32768, pad=4096, alignment=<unavailable>, slab=<unavailable>, szind=<unavailable>, zero=<unavailable>, commit=<unavailable>) at jemalloc_extent.c:1383:4
    frame #4: 0x00000008296aaef7 libc.so.7`__je_extent_alloc_wrapper [inlined] extent_alloc_retained(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, new_addr=0x0000000000000000, size=32768, pad=4096, alignment=<unavailable>, slab=<unavailable>, szind=<unavailable>, zero=<unavailable>, commit=<unavailable>) at jemalloc_extent.c:1480:12
    frame #5: 0x00000008296aaef7 libc.so.7`__je_extent_alloc_wrapper(tsdn=0x00003e616a889090, arena=0x00003e616aa00980, r_extent_hooks=0x0000000820c5be88, new_addr=0x0000000000000000, size=32768, pad=4096, alignment=64, slab=<unavailable>, szind=40, zero=0x0000000820c5bedf, commit=0x0000000820c5be87) at jemalloc_extent.c:1539:21
    frame #6: 0x0000000829687afd libc.so.7`__je_arena_extent_alloc_large(tsdn=<unavailable>, arena=0x00003e616aa00980, usize=32768, alignment=<unavailable>, zero=0x0000000820c5bedf) at jemalloc_arena.c:448:12
    frame #7: 0x00000008296afca0 libc.so.7`__je_large_palloc(tsdn=0x00003e616a889090, arena=<unavailable>, usize=<unavailable>, alignment=64, zero=<unavailable>) at jemalloc_large.c:47:43
    frame #8: 0x00000008296afb02 libc.so.7`__je_large_malloc(tsdn=<unavailable>, arena=<unavailable>, usize=<unavailable>, zero=<unavailable>) at jemalloc_large.c:17:9 [artificial]
    frame #9: 0x000000082967c8a7 libc.so.7`__je_malloc_default [inlined] tcache_alloc_large(tsd=0x00003e616a889090, arena=<unavailable>, tcache=0x00003e616a889280, size=<unavailable>, binind=<unavailable>, zero=false, slow_path=false) at tcache_inlines.h:124:9
    frame #10: 0x000000082967c82c libc.so.7`__je_malloc_default [inlined] arena_malloc(tsdn=0x00003e616a889090, arena=0x0000000000000000, size=<unavailable>, ind=<unavailable>, zero=false, tcache=0x00003e616a889280, slow_path=false) at arena_inlines_b.h:169:11
    frame #11: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] iallocztm(tsdn=0x00003e616a889090, size=<unavailable>, ind=<unavailable>, zero=false, tcache=0x00003e616a889280, is_internal=false, arena=0x0000000000000000, slow_path=false) at jemalloc_internal_inlines_c.h:53:8
    frame #12: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] imalloc_no_sample(sopts=<unavailable>, dopts=<unavailable>, tsd=0x00003e616a889090, size=<unavailable>, usize=32768, ind=<unavailable>) at jemalloc_jemalloc.c:1953:9
    frame #13: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] imalloc_body(sopts=<unavailable>, dopts=<unavailable>, tsd=0x00003e616a889090) at jemalloc_jemalloc.c:2153:16
    frame #14: 0x000000082967c818 libc.so.7`__je_malloc_default [inlined] imalloc(sopts=<unavailable>, dopts=<unavailable>) at jemalloc_jemalloc.c:2262:10
    frame #15: 0x000000082967c74a libc.so.7`__je_malloc_default(size=<unavailable>) at jemalloc_jemalloc.c:2293:2
    frame #16: 0x00000000009c0577 postgres`AllocSetContextCreateInternal + 199
    frame #17: 0x000000000058e56c postgres`StartTransaction + 332
    frame #18: 0x000000000058e3be postgres`StartTransactionCommand + 30
    frame #19: 0x00000000009a82fd postgres`InitPostgres + 461
    frame #20: 0x00000000007b8e0d postgres`AutoVacWorkerMain + 765
    frame #21: 0x00000000007b8ac7 postgres`StartAutoVacWorker + 39
    frame #22: 0x00000000007c0c21 postgres`sigusr1_handler + 785
    frame #23: 0x0000000822ae79b6 libthr.so.3`handle_signal(actp=0x0000000820c5c600, sig=30, info=0x0000000820c5c9f0, ucp=0x0000000820c5c680) at thr_sig.c:318:3
    frame #24: 0x0000000822ae6eba libthr.so.3`thr_sighandler(sig=30, info=0x0000000820c5c9f0, _ucp=0x0000000820c5c680) at thr_sig.c:261:2
    frame #25: 0x00000008210662d3
    frame #26: 0x00000000007c2545 postgres`ServerLoop + 1605
    frame #27: 0x00000000007bffa3 postgres`PostmasterMain + 3251
    frame #28: 0x0000000000720601 postgres`main + 801
    frame #29: 0x000000082958015c libc.so.7`__libc_start1(argc=4, argv=0x0000000820c5d6b0, env=0x0000000820c5d6d8, cleanup=<unavailable>, mainX=(postgres`main)) at libc_start1.c:180:7
    frame #30: 0x00000000004ff4e4 postgres`_start + 36
On 3/17/26 13:17, Peter 'PMc' Much wrote:
> ...
>
> The backtrace is disgusting - all interesting things are optimized
> away. :(
> So I am now quite clueless on how to proceed further, and could
> really use some educated inspiration. I can not even say if this is
> a postgres issue or a FreeBSD issue (but it doesn't happen to any
> other program). I could probably rebuild the OS with -O0 - but is
> this the best way to proceed?
> (Filing a bug report against FreeBSD without specific reproduction
> info would most likely just mean that it stays on my desk anyway.)
>
I agree it's hard to deduce anything from the backtraces with the
interesting bits optimized out. Rebuilding the OS with -O0 might be
overkill; I'd probably start by building just Postgres. That'd at least
give us some idea what happens there, you could inspect the memory
context etc.
I'm not a FreeBSD expert, but this seems a bit suspicious:
frame #23: 0x0000000822ae79b6
libthr.so.3`handle_signal(actp=0x0000000820c5c600, sig=30,
info=0x0000000820c5c9f0, ucp=0x0000000820c5c680) at thr_sig.c:318:3
frame #24: 0x0000000822ae6eba libthr.so.3`thr_sighandler(sig=30,
info=0x0000000820c5c9f0, _ucp=0x0000000820c5c680) at thr_sig.c:261:2
I mean, libthr seems to be a 1:1 with pthreads. Are you using threads in
some way? Perhaps an extension using threads? That could cause weird
failures, including weird SIGBUS ones.
regards
--
Tomas Vondra
Hi,

On Tue, Mar 17, 2026 at 1:27 PM Peter 'PMc' Much
<pmc@citylink.dinoex.sub.org> wrote:
>
> Hello,
> please excuse I am writing here, I wrote earlier to the users list
> but got no answer.
>
> I am observing repeated SIGBUS crashes of the postgres backend binary
> on FreeBSD, starting at Feb 2, every couple of weeks.
> The postgres is 15.15, the FreeBSD Release was 14.3, the crashes
> happen in malloc().
>
> The crashes happened on different PG clusters (running off the same
> binaries), so they cannot be pinpointed to a specific application.
[...]
> frame #6: 0x0000000829687afd libc.so.7`__je_arena_extent_alloc_large(tsdn=<unavailable>, arena=0x00003e616aa00980, usize=32768, alignment=<unavailable>, zero=0x0000000820c5bedf) at jemalloc_arena.c:448:12
> frame #7: 0x00000008296afca0 libc.so.7`__je_large_palloc(tsdn=0x00003e616a889090, arena=<unavailable>, usize=<unavailable>, alignment=64, zero=<unavailable>) at jemalloc_large.c:47:43
> frame #8: 0x00000008296afb02 libc.so.7`__je_large_malloc(tsdn=<unavailable>, arena=<unavailable>, usize=<unavailable>, zero=<unavailable>) at jemalloc_large.c:17:9 [artificial]
[..]

Not an answer from a regular FreeBSD guy, but more questions:

So have you removed those ZFS patches or not? (You said you reverted
only the NUMA ones.)
Maybe those ZFS patches corrupt some memory and jemalloc just hits
those regions? I would revert the kernel to stock, as otherwise nobody
would be able to tell what's happening there :)

Are you using hugepages? The jemalloc stack also contains "_large_",
so can we assume jemalloc is using hugepages?

I don't know if this might help, but the last time I hunted down a
SIGBUS [0] it was due to our incorrect patches (causing NUMA hugepage
imbalances across nodes; what I did to track it down was to stack-trace
into Linux's kernel do_sigbus() routine via eBPF). Possibly you could
detect some traps and/or hijack some routines using the DTrace that's
in FreeBSD, and that would get some hints?

-J.

[0] - https://www.postgresql.org/message-id/CAKZiRmww2P6QAzu6W%2BvxB89i5Ha-YRSHMeyr6ax2Lymcu3LUcw%40mail.gmail.com
On 3/17/26 14:40, Tomas Vondra wrote:
> On 3/17/26 13:17, Peter 'PMc' Much wrote:
>> ...
> [...]
> I mean, libthr seems to be a 1:1 with pthreads. Are you using threads in
> some way? Perhaps an extension using threads? That could cause weird
> failures, including weird SIGBUS ones.
>

BTW the first thing I'd try is testing memory with memtest86+ or a
similar tool. I don't know what hardware you're using, but I recently
dealt with weird failures on older machines, and it turned out to be
faulty RAM modules.

regards

--
Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes:
> On 3/17/26 13:17, Peter 'PMc' Much wrote:
>> So I am now quite clueless on how to proceed further, and could
>> really use some educated inspiration. I can not even say if this is
>> a postgres issue or a FreeBSD issue (but it doesn't happen to any
>> other program).
> I agree it's hard to deduce anything from the backtraces with the
> interesting bits optimized out. Rebuilding the OS with -O0 might be
> overkill; I'd probably start by building just Postgres. That'd at least
> give us some idea what happens there, you could inspect the memory
> context etc.
What I'm seeing is that malloc's internal data structures are already
corrupt during startup of an autovacuum worker. I think the most
likely theory is that this somehow traces to our old habit of
launching postmaster child processes from a signal handler, something
that violates the spirit and probably the letter of POSIX, and which
we can clearly see was being done here. But we got rid of that in PG
v16, so if I were Peter my first move would be to upgrade to something
later than 15.x.
Why it was okay in older FreeBSD and not so much in v14, who knows?
But the FreeBSD guys will almost certainly wash their hands of the
matter the moment they see this stack trace. I don't think there's
a lot of point in digging deeper unless it still reproduces with
a newer Postgres.
regards, tom lane
On Tue, Mar 17, 2026 at 02:40:24PM +0100, Tomas Vondra wrote:
! [...]
! I'm not a FreeBSD expert, but this seems a bit suspicious:
!
! frame #23: 0x0000000822ae79b6
! libthr.so.3`handle_signal(actp=0x0000000820c5c600, sig=30,
! info=0x0000000820c5c9f0, ucp=0x0000000820c5c680) at thr_sig.c:318:3
! frame #24: 0x0000000822ae6eba libthr.so.3`thr_sighandler(sig=30,
! info=0x0000000820c5c9f0, _ucp=0x0000000820c5c680) at thr_sig.c:261:2
!
! I mean, libthr seems to be a 1:1 with pthreads. Are you using threads in
! some way? Perhaps an extension using threads? That could cause weird
! failures, including weird SIGBUS ones.

Hi,

thanks for noting this - I didn't look that far up the stack.

The only extensions in this specific cluster are pg_freespacemap and
plpgsql. Others may have hstore and plpython3u.
I have currently no clue where this could come from, but will keep it
in mind. Thank You for noticing it.

PMc
On Tue, Mar 17, 2026 at 02:50:25PM +0100, Jakub Wartak wrote:
!
! Not an answer from a regular FreeBSD guy, but more questions:
!
! So have you removed those ZFS patches or not? (You said You reverted only
! NUMA ones)?
They are completely removed now.
! Maybe those ZFS patches they corrupt some memory and jemalloc just
! hits those regions? I would revert the kernel to stock thing
Yes, I would, too, but I can't. There are patches for kerberos
(FreeBSD 14 still uses that very old Heimdal implementation, which
is why I am kind of stuck with PG 15, and upgrading that one will
be a bit of work), there are patches to make IPv6 fragmentation work
with the firewalls - in short, removing all of the patches will make
the SSO and networking fall apart entirely, and make the site
nonfunctional.
OTOH this crash seems to prefer happening in production. Last night
when it happened, the machine was busy rebuilding the OS etc. for
other nodes to upgrade to 14.4, and then I got bored and additionally
did run an LLM for entertainment. So the server had some 25 GB paged
out, when the nightly housekeeping started to push daily log data
into the databases - which then led to the crash.
That means,
A) I have no good idea how to properly reproduce such conditions
in a test scenario, and
B) it is not impossible that there is a bug (somewhere), that just
doesn't usually happen to orderly people who run their databases
in rather overprovisioned conditions.
! Are You using hugepages? The jemalloc stack also contains "_large_" so can we
! assume jemalloc is using hugepages ?
I think I remember I once tried to, but hugepages with postgres do not
work on FreeBSD. The docs also say:
"this setting is supported only on Linux and Windows."
! I don't know if that might help, but last time I hunted down SIGBUS [0] it was
! due to our incorrect patches (causing NUMA hugepages imbalances across nodes;
! our patch has some pause there, but what I did to track it down was to
! stack trace
! to Linux's kernel do_sigbus() routine via eBPF). Possibly You could hijack/
! detect some traps and/or hijack some routines using DTrace that's in FreeBSD and
! that would get some hints?
Thank You, currently everything helps. :)
DTrace is super cool, but one also needs to understand the code
first before getting useful insight from it.
So any approach will imply a bunch of work, and I am currently looking
for the shortest path to an unknown target. ;)
PMc
On Tue, Mar 17, 2026 at 02:54:29PM +0100, Tomas Vondra wrote:
!
! BTW the first thing I'd try is testing memory with memtest86+ or a
! similar tool. I don't know what hardware you're using, but I recently
! dealt with weird failures on older machines, and it turned out to be
! faulty RAM modules.
Yes, I considered this as one of the first things, and then ruled
it out, because
A) this is a Haswell-Xeon EP, and not only does it have ECC, but I
have already seen it identify a bad stick on this very hardware,
B) given that we have address space randomization, and that the
postgres memory footprint is kept rather small here (it should
then use ZFS ARC), I consider it highly unlikely that a memory
defect would always hit at the very same code-line, and never
anywhere else or in a different program.
PMc
On Tue, Mar 17, 2026 at 10:12:07AM -0400, Tom Lane wrote:
! Tomas Vondra <tomas@vondra.me> writes:
! > On 3/17/26 13:17, Peter 'PMc' Much wrote:
! >> So I am now quite clueless on how to proceed further, and could
! >> really use some educated inspiration. I can not even say if this is
! >> a postgres issue or a FreeBSD issue (but it doesn't happen to any
! >> other program).
!
! > I agree it's hard to deduce anything from the backtraces with the
! > interesting bits optimized out. Rebuilding the OS with -O0 might be
! > overkill; I'd probably start by building just Postgres. That'd at least
! > give us some idea what happens there, you could inspect the memory
! > context etc.
!
! What I'm seeing is that malloc's internal data structures are already
! corrupt during startup of an autovacuum worker. I think the most
! likely theory is that this somehow traces to our old habit of
! launching postmaster child processes from a signal handler, something
! that violates the spirit and probably the letter of POSIX, and which
! we can clearly see was being done here. But we got rid of that in PG
! v16, so if I were Peter my first move would be to upgrade to something
! later than 15.x.
My thinking was: if there is an issue inside FreeBSD (which is what
it somehow looks like), then I want it hunted down as such, rather
than have it possibly covered up by using a newer version that might
do things differently.
Now, what I understand here is:
A) I can stop searching for whatever is creating the SIGUSR1 signals,
because these are created inside PG release 15 itself.
B) There is a potential issue in doing fork() within a signal handler
and then continuing to do malloc() in the new process, which is why
this practice was abandoned from PG release 16 onwards.
In that case there is indeed good reason to upgrade.
The one thing I don't get is this: if it has nothing to do with any
special configuration at my site, but is a genuine issue, then why
does it happen to me now (and didn't blow up elsewhere already some
ten years ago)?
! Why it was okay in older FreeBSD and not so much in v14, who knows?
Maybe it wasn't. Here it appeared out of thin air in February, while
the system was upgraded from 13.5 to 14.3 in July '25 and did run
without problems for these eight months.
So this is not directly or solely related to FBSD R.14; and while it
happens more likely during massive memory use, this also is not
stringent. Neither did I find any other solid determining condition.
So yes, if there is reason to believe the annoyance might just
disappear in PG-16, then that is likely the most viable strategy.
Thanks a lot for all inspiration! :)
PMc
"Peter 'PMc' Much" <pmc@citylink.dinoex.sub.org> writes:
> On Tue, Mar 17, 2026 at 10:12:07AM -0400, Tom Lane wrote:
> ! Why it was okay in older FreeBSD and not so much in v14, who knows?
> Maybe it wasn't. Here it appeared out of thin air in February, while
> the system was upgraded from 13.5 to 14.3 in July'25, and did run
> without problems for these eight months.
> So this is not directly or solely related to FBSD R.14; and while it
> happens more likely during massive memory use, this also is not
> stringent. Neither did I find any other solid determining condition.
Yeah, it seems likely that there is some additional triggering
condition that we don't understand; otherwise there would be more
people complaining than just you. But if updating to PG16 gets
rid of the problem, I'm not sure it is worth the time to try to
narrow down what that additional trigger is.
Of course, if you still see the issue after upgrading, we'll have
to dig harder.
regards, tom lane
Hi,

On 2026-03-17 16:56:48 -0400, Tom Lane wrote:
> "Peter 'PMc' Much" <pmc@citylink.dinoex.sub.org> writes:
> > On Tue, Mar 17, 2026 at 10:12:07AM -0400, Tom Lane wrote:
> > ! Why it was okay in older FreeBSD and not so much in v14, who knows?
> [...]
> Yeah, it seems likely that there is some additional triggering
> condition that we don't understand; otherwise there would be more
> people complaining than just you.

One issue we've seen in the past (on some other BSD, I think NetBSD?)
is that signal handlers used a C function in a shared library, the
function was never used before the signal handler ran, and the dynamic
symbol resolution allocated memory. Which then contributed to deadlocks
and/or corruption of allocator metadata. You could check whether that's
a factor by exporting LD_BIND_NOW.

The way the signal handling worked before 16 should not really lead to
corrupt allocator data structures, as the signal handler is only
allowed to run in a period in which the normal execution is suspended
(or only calls async-signal-safe code, e.g. after waking up, until
reaching the sigmask calls to block the signal again). ISTM there
either needed to be another signal handler that allocated memory and
was interrupted by SIGUSR1, or postmaster allocated memory while the
signal was unmasked. The dynamic linker doing function resolution could
be an explanation.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2026-03-17 16:56:48 -0400, Tom Lane wrote:
>> Yeah, it seems likely that there is some additional triggering
>> condition that we don't understand; otherwise there would be more
>> people complaining than just you.
> One issue we've seen in the past (on some other BSD, I think NetBSD?) is
> signal handlers used a C function in a shared library, the function was never
> used before the signal handler, and that dynamic symbol resolution allocated
> memory. Which then contributed to deadlocks and/or corruption of allocator
> metadata.
Yeah, that sounds familiar. I think I found that while trying to get
PG to run reliably on NetBSD/hppa, but I don't remember any details.
regards, tom lane
On Tue, Mar 17, 2026 at 04:56:48PM -0400, Tom Lane wrote:
! "Peter 'PMc' Much" <pmc@citylink.dinoex.sub.org> writes:
! > On Tue, Mar 17, 2026 at 10:12:07AM -0400, Tom Lane wrote:
! > ! Why it was okay in older FreeBSD and not so much in v14, who knows?
!
! > Maybe it wasn't. Here it appeared out of thin air in February, while
! > the system was upgraded from 13.5 to 14.3 in July'25, and did run
! > without problems for these eight months.
! > So this is not directly or solely related to FBSD R.14; and while it
! > happens more likely during massive memory use, this also is not
! > stringent. Neither did I find any other solid determining condition.
!
! Yeah, it seems likely that there is some additional triggering
! condition that we don't understand; otherwise there would be more
! people complaining than just you. But if updating to PG16 gets
! rid of the problem, I'm not sure it is worth the time to try to
! narrow down what that additional trigger is.
!
! Of course, if you still see the issue after upgrading, we'll have
! to dig harder.
Sadly, here it is again with PG 16.13, at the same place as before.
* thread #1, name = 'postgres', stop reason = signal SIGBUS
  * frame #0: 0x000000082bba3159 libc.so.7`extent_arena_get [inlined] extent_arena_ind_get(extent=0x79f696918ed45a56) at extent_inlines.h:40:23
    frame #1: 0x000000082bba3159 libc.so.7`extent_arena_get(extent=0x79f696918ed45a56) at extent_inlines.h:49:23
    frame #2: 0x000000082bba3a14 libc.so.7`extent_can_coalesce(arena=0x00003d43fd800980, extents=0x00003d43fd8058d8, inner=0x00003d43fd90f080, outer=0x79f696918ed45a56) at jemalloc_extent.c:1565:6
    frame #3: 0x000000082bba363b libc.so.7`extent_try_coalesce_impl(tsdn=0x00003d43fd67a090, arena=0x00003d43fd800980, r_extent_hooks=0x0000000820af5198, rtree_ctx=0x00003d43fd67a0c0, extents=0x00003d43fd8058d8, extent=0x00003d43fd90f080, coalesced=0x0000000000000000, growing_retained=true, inactive_only=false) at jemalloc_extent.c:1628:24
    frame #4: 0x000000082bba3448 libc.so.7`extent_try_coalesce(tsdn=0x00003d43fd67a090, arena=0x00003d43fd800980, r_extent_hooks=0x0000000820af5198, rtree_ctx=0x00003d43fd67a0c0, extents=0x00003d43fd8058d8, extent=0x00003d43fd90f080, coalesced=0x0000000000000000, growing_retained=true) at jemalloc_extent.c:1680:9
    frame #5: 0x000000082bba055f libc.so.7`extent_record(tsdn=0x00003d43fd67a090, arena=0x00003d43fd800980, r_extent_hooks=0x0000000820af5198, extents=0x00003d43fd8058d8, extent=0x00003d43fd90f080, growing_retained=true) at jemalloc_extent.c:1719:12
    frame #6: 0x000000082bba6043 libc.so.7`extent_grow_retained(tsdn=0x00003d43fd67a090, arena=0x00003d43fd800980, r_extent_hooks=0x0000000820af5198, size=65536, pad=4096, alignment=64, slab=false, szind=44, zero=0x0000000820af51ef, commit=0x0000000820af5197) at jemalloc_extent.c:1385:4
    frame #7: 0x000000082bba0f3f libc.so.7`extent_alloc_retained(tsdn=0x00003d43fd67a090, arena=0x00003d43fd800980, r_extent_hooks=0x0000000820af5198, new_addr=0x0000000000000000, size=65536, pad=4096, alignment=64, slab=false, szind=44, zero=0x0000000820af51ef, commit=0x0000000820af5197) at jemalloc_extent.c:1482:12
    frame #8: 0x000000082bba0d39 libc.so.7`__je_extent_alloc_wrapper(tsdn=0x00003d43fd67a090, arena=0x00003d43fd800980, r_extent_hooks=0x0000000820af5198, new_addr=0x0000000000000000, size=65536, pad=4096, alignment=64, slab=false, szind=44, zero=0x0000000820af51ef, commit=0x0000000820af5197) at jemalloc_extent.c:1541:21
    frame #9: 0x000000082bb7a87d libc.so.7`__je_arena_extent_alloc_large(tsdn=<unavailable>, arena=0x00003d43fd800980, usize=65536, alignment=<unavailable>, zero=0x0000000820af51ef) at jemalloc_arena.c:448:12
    frame #10: 0x000000082bba77b0 libc.so.7`__je_large_palloc(tsdn=0x00003d43fd67a090, arena=<unavailable>, usize=<unavailable>, alignment=64, zero=<unavailable>) at jemalloc_large.c:47:43
    frame #11: 0x000000082bba7612 libc.so.7`__je_large_malloc(tsdn=<unavailable>, arena=<unavailable>, usize=<unavailable>, zero=<unavailable>) at jemalloc_large.c:17:9 [artificial]
    frame #12: 0x000000082bb7c477 libc.so.7`__je_arena_malloc_hard(tsdn=<unavailable>, arena=<unavailable>, size=<unavailable>, ind=<unavailable>, zero=<unavailable>) at jemalloc_arena.c:1528:9 [artificial]
    frame #13: 0x000000082bb6f5a7 libc.so.7`__je_malloc_default [inlined] arena_malloc(tsdn=0x00003d43fd67a090, arena=0x0000000000000000, size=<unavailable>, ind=<unavailable>, zero=false, tcache=0x00003d43fd67a280, slow_path=false) at arena_inlines_b.h:176:9
    frame #14: 0x000000082bb6f598 libc.so.7`__je_malloc_default [inlined] iallocztm(tsdn=0x00003d43fd67a090, size=<unavailable>, ind=<unavailable>, zero=false, tcache=0x00003d43fd67a280, is_internal=false, arena=0x0000000000000000, slow_path=false) at jemalloc_internal_inlines_c.h:53:8
    frame #15: 0x000000082bb6f598 libc.so.7`__je_malloc_default [inlined] imalloc_no_sample(sopts=<unavailable>, dopts=<unavailable>, tsd=0x00003d43fd67a090, size=<unavailable>, usize=65536, ind=<unavailable>) at jemalloc_jemalloc.c:1953:9
    frame #16: 0x000000082bb6f598 libc.so.7`__je_malloc_default [inlined] imalloc_body(sopts=<unavailable>, dopts=<unavailable>, tsd=0x00003d43fd67a090) at jemalloc_jemalloc.c:2153:16
    frame #17: 0x000000082bb6f598 libc.so.7`__je_malloc_default [inlined] imalloc(sopts=<unavailable>,
dopts=<unavailable>)at jemalloc_jemalloc.c:2262:10
frame #18: 0x000000082bb6f4ca libc.so.7`__je_malloc_default(size=<unavailable>) at jemalloc_jemalloc.c:2293:2
frame #19: 0x000000082bb6fa2d libc.so.7`__malloc(size=<unavailable>) at jemalloc_jemalloc.c:0 [artificial]
frame #20: 0x000000082bad08a4 libc.so.7`_dns_gethostbyaddr(rval=0x0000000820af5a90, cb_data=<unavailable>,
ap=<unavailable>)at gethostbydns.c:619:13
frame #21: 0x000000082badeab2 libc.so.7`_nsdispatch(retval=0x0000000820af5a90, disp_tab=0x000000082bbd8800,
database="",method_name="", defaults=<unavailable>) at nsdispatch.c:726:14
frame #22: 0x000000082bad2be8 libc.so.7`gethostbyaddr_r(addr=0x0000000820af5ae0, len=<unavailable>,
af=<unavailable>,hp=0x000000082bbebda0, buf="", buflen=8800, result=0x0000000820af5a90, h_errnop=0x0000000820af5a8c) at
gethostnamadr.c:650:9
frame #23: 0x000000082bad34f9 libc.so.7`gethostbyaddr(addr=0x0000000820af5ae0, len=16, af=28) at
gethostnamadr.c:700:6
frame #24: 0x000000082baddcd8 libc.so.7`getipnodebyaddr(src=0x0000000820af5ae0, len=<unavailable>, af=28,
errp=0x0000000820af5b50)at name6.c:378:7
frame #25: 0x000000082bad4242 libc.so.7`getnameinfo_inet(afd=0x000000082bbd8980, sa=0x00003d43fda5e098,
salen=<unavailable>,host=<unavailable>, hostlen=<unavailable>, serv=<unavailable>, servlen=0, flags=4) at
getnameinfo.c:311:8
frame #26: 0x000000082bad405d libc.so.7`getnameinfo(sa=<unavailable>, salen=<unavailable>, host=<unavailable>,
hostlen=<unavailable>,serv=<unavailable>, servlen=<unavailable>, flags=4) at getnameinfo.c:157:10
frame #27: 0x0000000000a85081 postgres`pg_getnameinfo_all + 177
frame #28: 0x0000000000774262 postgres`hba_getauthmethod + 1202
frame #29: 0x000000000076a412 postgres`ClientAuthentication + 50
frame #30: 0x0000000000a49fd1 postgres`InitPostgres + 2273
frame #31: 0x00000000008eac4d postgres`PostgresMain + 285
frame #32: 0x0000000000857108 postgres`BackendRun + 40
frame #33: 0x0000000000855a1a postgres`ServerLoop + 7866
frame #34: 0x000000000085300e postgres`PostmasterMain + 3278
frame #35: 0x000000000077bac3 postgres`main + 803
frame #36: 0x000000082ba72edc libc.so.7`__libc_start1(argc=4, argv=0x0000000820af8700, env=0x0000000820af8728,
cleanup=<unavailable>,mainX=(postgres`main)) at libc_start1.c:180:7
frame #37: 0x0000000000556de4 postgres`_start + 36
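For context, frames #20-#27 show that the faulting malloc() is reached from the reverse DNS lookup postgres does during client authentication (pg_getnameinfo_all via hba_getauthmethod). Not postgres code, just a sketch I put together to drive the same libc path in isolation; the loopback address is a placeholder, and NI_NAMEREQD corresponds to the flags=4 seen in frame #25:

```c
/* Sketch (not postgres code): exercise the same getnameinfo() ->
 * gethostbyaddr() path as frames #20-#26, outside of postgres.
 * The loopback address is a placeholder; NI_NAMEREQD (flags=4 in
 * frame #25) forces a real reverse lookup instead of a numeric
 * string. */
#include <netdb.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

static int reverse_lookup(char *host, size_t hostlen) {
    struct sockaddr_in6 sa;

    memset(&sa, 0, sizeof(sa));
    sa.sin6_family = AF_INET6;
    sa.sin6_addr = in6addr_loopback;      /* placeholder address */

    /* getnameinfo() ends up in gethostbyaddr()/_dns_gethostbyaddr(),
     * which is where the faulting malloc() in the backtrace sits. */
    return getnameinfo((struct sockaddr *)&sa, sizeof(sa),
        host, hostlen, NULL, 0, NI_NAMEREQD);
}

int main(void) {
    char host[NI_MAXHOST];
    int rc = reverse_lookup(host, sizeof(host));

    printf("rc=%d (%s)\n", rc, rc == 0 ? host : gai_strerror(rc));
    return 0;
}
```

Running something like this in a loop under lldb might narrow down whether the corruption needs the full postgres workload or just this libc path.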
This is frame #3, and 'extent_t *next' does not seem to point to an
extent_t:
1601	static extent_t *
1602	extent_try_coalesce_impl(tsdn_t *tsdn, arena_t *arena,
1603	    extent_hooks_t **r_extent_hooks, rtree_ctx_t *rtree_ctx, extents_t *extents,
1604	    extent_t *extent, bool *coalesced, bool growing_retained,
1605	    bool inactive_only) {
1606		/*
1607		 * We avoid checking / locking inactive neighbors for large size
1608		 * classes, since they are eagerly coalesced on deallocation which can
1609		 * cause lock contention.
1610		 */
1611		/*
1612		 * Continue attempting to coalesce until failure, to protect against
1613		 * races with other threads that are thwarted by this one.
1614		 */
1615		bool again;
1616		do {
1617			again = false;
1618	
1619			/* Try to coalesce forward. */
1620			extent_t *next = extent_lock_from_addr(tsdn, rtree_ctx,
1621			    extent_past_get(extent), inactive_only);
1622			if (next != NULL) {
1623				/*
1624				 * extents->mtx only protects against races for
1625				 * like-state extents, so call extent_can_coalesce()
1626				 * before releasing next's pool lock.
1627				 */
1628				bool can_coalesce = extent_can_coalesce(arena, extents,
1629				    extent, next);
(lldb) p next
(extent_t *) 0x79f696918ed45a56
(lldb) p *next
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory
(lldb) p extent
(extent_t *) 0x00003d43fd90f080
(lldb) p *extent
(extent_t) {
  e_bits = 8796153896960
  e_addr = 0x00003d43fe211000
   = (e_size_esn = 2551808, e_bsize = 2551808)
  ql_link = {
    qre_next = 0x00003d43fd90f080
    qre_prev = 0x00003d43fd90f080
  }
  ph_link = {
    phn_prev = NULL
    phn_next = NULL
    phn_lchild = NULL
  }
   = {
    e_slab_data = {
      bitmap = ([0] = 0, [1] = 0, [2] = 0, [3] = 0, [4] = 0, [5] = 0, [6] = 0, [7] = 0)
    }
     = {
      e_alloc_time = (ns = 0)
      e_prof_tctx = (repr = 0x0000000000000000)
    }
  }
}