Обсуждение: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

Поиск
Список
Период
Сортировка

Valgrind failures in Apply Launcher's bgworker_quickdie() exit

От
Thomas Munro
Дата:
Hello,

Since libcrypto.so is implicated, Andres asked me off-list if my
changes to random number state initialisation might be linked to
skink's failures beginning 12 or 15 days ago.  It appears not, as it
was green for several runs after that commit.  Looking at the report:

==2802== VALGRINDERROR-BEGIN
==2802== Invalid read of size 8
==2802==    at 0x4DED5A5: check_free (dlerror.c:188)
==2802==    by 0x4DEDAB1: free_key_mem (dlerror.c:221)
==2802==    by 0x4DEDAB1: __dlerror_main_freeres (dlerror.c:239)
==2802==    by 0x55DCF81: __libc_freeres (in /lib/x86_64-linux-gnu/libc-2.28.so)
==2802==    by 0x482D19E: _vgnU_freeres (vg_preloaded.c:77)
==2802==    by 0x478AD3: bgworker_quickdie (bgworker.c:661)
==2802==    by 0x48626AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==2802==    by 0x556DB76: epoll_wait (epoll_wait.c:30)
==2802==    by 0x4E25B9: WaitEventSetWaitBlock (latch.c:1078)
==2802==    by 0x4E25B9: WaitEventSetWait (latch.c:1030)
==2802==    by 0x4E28C1: WaitLatchOrSocket (latch.c:407)
==2802==    by 0x4E29A6: WaitLatch (latch.c:347)
==2802==    by 0x49E03E: ApplyLauncherMain (launcher.c:1062)
==2802==    by 0x479831: StartBackgroundWorker (bgworker.c:834)
==2802==  Address 0x7fd7e28 is 12 bytes after a block of size 12 alloc'd
==2802==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2802==    by 0x4C3BD38: CRYPTO_zalloc (in
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2802==    by 0x4C37F8D: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2802==    by 0x4C615B9: RAND_DRBG_get0_public (in
/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2802==    by 0x4C615EF: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2802==    by 0x675D75: pg_strong_random (pg_strong_random.c:135)
==2802==    by 0x4848EB: RandomCancelKey (postmaster.c:5251)
==2802==    by 0x484909: assign_backendlist_entry (postmaster.c:5822)
==2802==    by 0x4873BA: do_start_bgworker (postmaster.c:5692)
==2802==    by 0x487701: maybe_start_bgworkers (postmaster.c:5955)
==2802==    by 0x4878C2: reaper (postmaster.c:2940)
==2802==    by 0x48626AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==2802==
==2802== VALGRINDERROR-END

The function __libc_freeres is a special glibc entry point provided
for leak checkers to call explicitly if they want glibc to clean up
after itself (normally it doesn't bother).  The specific thing being
cleaned up here is a piece of thread local storage that belongs to the
dynamic linker support code:

https://github.com/lattera/glibc/blob/master/dlfcn/dlerror.c#L228

Since we don't see strcmp() or free() at the top of the stack (and
assuming they aren't inlined), I think the line numbers must line up
with current glibc HEAD as of today, and it must be failing on
accessing rec->errstring at line 188, meaning that rec (the value
stored as a thread specific key) is a bad pointer.  That's quite
strange and I don't have an explanation; if libcrypto overran its
buffer, for example, that would perhaps trash rec->errstring but we'd
still be able to read the pointer itself.  So I wonder if libcrypto.so
is a red herring here.

It's Debian unstable, which could be a factor.  Bugs in glibc?
That's 2.28, out for 3 months now, but then why only in Apply
Launcher? Did we trash 'key', or the thread specific pointer table, or
is my assessment above wrong, and somehow it's really errstring that
is a bad pointer (which would allow for a more mundane explanation,
like someone trashed a bit of heap memory by overrunning a buffer)?

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

От
Tom Lane
Дата:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Since libcrypto.so is implicated, Andres asked me off-list if my
> changes to random number state initialisation might be linked to
> skink's failures beginning 12 or 15 days ago.  It appears not, as it
> was green for several runs after that commit.
> ...
> It's Debian unstable, which could be a factor.  Bugs in glibc?

Has anyone tried to reproduce this on other platforms?

It'd be interesting to know which updates were applied to skink's
host before the first failing run.

            regards, tom lane


Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

От
Andres Freund
Дата:

On December 13, 2018 6:01:04 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Since libcrypto.so is implicated, Andres asked me off-list if my
>> changes to random number state initialisation might be linked to
>> skink's failures beginning 12 or 15 days ago.  It appears not, as it
>> was green for several runs after that commit.
>> ...
>> It's Debian unstable, which could be a factor.  Bugs in glibc?

Note it's only failing on master...


>Has anyone tried to reproduce this on other platforms?

I recently also hit this locally, but since that's also Debian unstable... Note that removing openssl "fixed" the issue
forme. 


>It'd be interesting to know which updates were applied to skink's
>host before the first failing run.

I'll check the logs when I'm back at a real computer.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

От
Thomas Munro
Дата:
On Fri, Dec 14, 2018 at 1:15 PM Andres Freund <andres@anarazel.de> wrote:
> On December 13, 2018 6:01:04 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >Thomas Munro <thomas.munro@enterprisedb.com> writes:
> >> Since libcrypto.so is implicated, Andres asked me off-list if my
> >> changes to random number state initialisation might be linked to
> >> skink's failures beginning 12 or 15 days ago.  It appears not, as it
> >> was green for several runs after that commit.
> >> ...
> >> It's Debian unstable, which could be a factor.  Bugs in glibc?
>
> Note it's only failing on master...

Here are the commits between the last green run and the first Valgrind failure:

7d4524aed3 Fri Nov 30 22:53:18 2018 UTC  Fix tablespace path TAP test
of pg_verify_checksums for msys
9dc1225855 Fri Nov 30 13:20:49 2018 UTC  Silence compiler warning
dcfdf56e89 Fri Nov 30 06:20:43 2018 UTC  Fix typo.
5c99513975 Fri Nov 30 01:34:45 2018 UTC  Fix various checksum check
problems for pg_verify_checksums and base backups
a1c91dd110 Fri Nov 30 01:14:58 2018 UTC  Switch pg_verify_checksums
back to a blacklist
d328991578 Thu Nov 29 23:28:10 2018 UTC  Document handling of
invalid/ambiguous timestamp input near DST boundaries.
88bdbd3f74 Thu Nov 29 21:42:53 2018 UTC  Add log_statement_sample_rate parameter
826eff57c4 Thu Nov 29 20:53:44 2018 UTC  Ensure static libraries have
correct mod time even if ranlib messes it up.
68120427f4 Thu Nov 29 13:01:11 2018 UTC  doc: Add appendix detailing
some limits of PostgreSQL
44e22647f8 Thu Nov 29 09:00:08 2018 UTC  Add pg_partition_tree to
documentation index
431f1599a2 Thu Nov 29 01:31:12 2018 UTC  Add support for
NO_INSTALLCHECK in MSVC scripts
2ac180c286 Thu Nov 29 01:14:26 2018 UTC  Fix minor typo in dsa.c.
d79fb5d237 Thu Nov 29 00:39:07 2018 UTC  Add missing NO_INSTALLCHECK
in commit_ts and test_rls_hooks
4c703369af Thu Nov 29 00:12:19 2018 UTC  Fix handling of synchronous
replication for stopping WAL senders
1a990b207b Wed Nov 28 22:42:54 2018 UTC  Have BufFileSize() ereport()
on FileSize() failure.

f2cbffc7a6 Wed Nov 28 12:55:54 2018 UTC  Only allow one recovery target setting
eae9143d9a Wed Nov 28 12:34:10 2018 UTC  C comment:  remove extra '*'
0f9cdd7dca Wed Nov 28 01:12:30 2018 UTC  Don't set PAM_RHOST for Unix sockets.
f69c959df0 Wed Nov 28 00:43:08 2018 UTC  Do not decode TOAST data for
table rewrites
d1ce4ed2d5 Tue Nov 27 23:48:51 2018 UTC  Use wildcard to match parens
after CREATE STATISTICS
d67dae036b Tue Nov 27 22:58:10 2018 UTC  Don't count zero-filled
buffers as 'read' in EXPLAIN.
471a7af585 Tue Nov 27 20:16:55 2018 UTC  Ensure consistent sort order
of large objects in pg_dump.
b238527664 Tue Nov 27 18:07:03 2018 UTC  Fix jit compilation bug on wide tables.
f17889b221 Tue Nov 27 14:16:14 2018 UTC  Update ssl test certificates and keys
4c8750a9cc Tue Nov 27 07:26:05 2018 UTC  Fix ac218aa4f6 to work on
versions before 9.5.
ac218aa4f6 Tue Nov 27 01:00:43 2018 UTC  Update pg_upgrade test for
reg* to include regrole and regnamespace.
7a9d6779d9 Tue Nov 27 00:41:29 2018 UTC  doc:  fix wording for
plpgsql, add "and"
54bb22f66a Mon Nov 26 23:27:34 2018 UTC  Fix typo introduced in 578b229718.
12a53c732c Mon Nov 26 22:37:08 2018 UTC  Fix pg_upgrade for oid removal.
70d7e507ef Mon Nov 26 22:32:51 2018 UTC  Fix translation of special
characters in psql's LaTeX output modes.
95dcb8fc05 Mon Nov 26 20:30:24 2018 UTC  Avoid locale-dependent output
in numericlocale check.
67ed3b9d73 Mon Nov 26 20:24:14 2018 UTC  Fix sample output for
hash_metapage_info query
aa2ba50c2c Mon Nov 26 20:18:55 2018 UTC  Add CSV table output mode in psql.
9a98984f49 Mon Nov 26 17:41:42 2018 UTC  Improve regression test
coverage for psql output formats.
a7eece4fc9 Mon Nov 26 17:31:20 2018 UTC  Fix breakage of "\pset format latex".
36d442a25a Mon Nov 26 15:38:19 2018 UTC  Clarify that cross-row
constraints are unsupported
664f01b613 Mon Nov 26 07:43:19 2018 UTC  Revert "Fix typo in
documentation of toast storage"
058ef3a1a8 Mon Nov 26 06:49:23 2018 UTC  Fix typo in documentation of
toast storage
1d7dd18686 Mon Nov 26 02:12:11 2018 UTC  Revert all new recent changes
to add PGXS options for TAP and isolation
3955cae0c5 Mon Nov 26 01:49:49 2018 UTC  Fix regression test handling
of test_decoding with MSVC
b0b1f4183a Mon Nov 26 00:42:21 2018 UTC  Disable temporarily TAP tests
for contrib/bloom/
03faa4a8dd Sun Nov 25 23:39:19 2018 UTC  Add PGXS options to control
TAP and isolation tests

Maybe you could try reverting f17889b221 (change of key size) on the
superstition that it has something to do with libcrypto.so, but I have
no clue why Apply Launcher in particular would be affected by that.
Otherwise... assuming 2dedf4d9 doesn't show the problem, you could do
the world's slowest bisect 2dedf4d9 -> 7d4524aed3 in ~5.4 steps...

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On December 13, 2018 6:01:04 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Has anyone tried to reproduce this on other platforms?

> I recently also hit this locally, but since that's also Debian unstable... Note that removing openssl "fixed" the
issuefor me. 

FWIW, I tried to reproduce this on Fedora 28 and RHEL6, without success.
It's possible that there's some significant detail of your configuration
that I didn't match, but on the whole "bug in Debian unstable" seems
like the most probable theory right now.

            regards, tom lane


Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

От
Thomas Munro
Дата:
On Fri, Dec 14, 2018 at 4:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On December 13, 2018 6:01:04 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Has anyone tried to reproduce this on other platforms?
>
> > I recently also hit this locally, but since that's also Debian unstable... Note that removing openssl "fixed" the
issuefor me.
 
>
> FWIW, I tried to reproduce this on Fedora 28 and RHEL6, without success.
> It's possible that there's some significant detail of your configuration
> that I didn't match, but on the whole "bug in Debian unstable" seems
> like the most probable theory right now.

I was keen to try to bisect this, but I couldn't reproduce it on a
freshly upgraded Debian unstable VM, with --with-openssl, using "make
installcheck" under src/test/authentication.  I even tried using the
gold linker as skink does.  Maybe I'm using the wrong checker
options... Andres, can we see your exact valgrind invocation?

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

От
Andres Freund
Дата:
Hi,

On 2018-12-16 22:33:00 +1100, Thomas Munro wrote:
> On Fri, Dec 14, 2018 at 4:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On December 13, 2018 6:01:04 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> Has anyone tried to reproduce this on other platforms?
> >
> > > I recently also hit this locally, but since that's also Debian unstable... Note that removing openssl "fixed" the
issuefor me.
 
> >
> > FWIW, I tried to reproduce this on Fedora 28 and RHEL6, without success.
> > It's possible that there's some significant detail of your configuration
> > that I didn't match, but on the whole "bug in Debian unstable" seems
> > like the most probable theory right now.
> 
> I was keen to try to bisect this, but I couldn't reproduce it on a
> freshly upgraded Debian unstable VM, with --with-openssl, using "make
> installcheck" under src/test/authentication.  I even tried using the
> gold linker as skink does.  Maybe I'm using the wrong checker
> options... Andres, can we see your exact valgrind invocation?

Ok, I think I've narrowed this down a bit further. But far from
completely.  I don't think you need particularly special options, but
it's easy to miss the error, because it doesn't cause postmaster to exit
with an error.

It only happens when a bgworker is shutdown with SIGQUIT (be it
directly, or via postmaster immediate shutdown):

$ valgrind --quiet --error-exitcode=55 --suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp
--suppressions=/home/andres/tmp/valgrind-global.supp--trace-children=yes --track-origins=yes --read-var-info=no
--num-callers=20--leak-check=no --gen-suppressions=all
/home/andres/build/postgres/dev-assert/vpath/src/backend/postgres-D /srv/dev/pgdev-dev
 
2018-12-16 12:53:26.274 PST [1187] LOG:  listening on IPv4 address "127.0.0.1", port 5433

$ kill -QUIT 1187

==1194== Invalid read of size 8
==1194==    at 0x4C3B5A5: check_free (dlerror.c:188)
==1194==    by 0x4C3BAB1: free_key_mem (dlerror.c:221)
==1194==    by 0x4C3BAB1: __dlerror_main_freeres (dlerror.c:239)
==1194==    by 0x53D6F81: __libc_freeres (in /lib/x86_64-linux-gnu/libc-2.28.so)
==1194==    by 0x482D19E: _vgnU_freeres (vg_preloaded.c:77)
==1194==    by 0x567F54: bgworker_quickdie (bgworker.c:662)
==1194==    by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194==    by 0x5367B76: epoll_wait (epoll_wait.c:30)
==1194==    by 0x5EE7CC: WaitEventSetWaitBlock (latch.c:1078)
==1194==    by 0x5EE6A5: WaitEventSetWait (latch.c:1030)
==1194==    by 0x5EDDBC: WaitLatchOrSocket (latch.c:407)
==1194==    by 0x5EDC23: WaitLatch (latch.c:347)
==1194==    by 0x5992D7: ApplyLauncherMain (launcher.c:1062)
==1194==    by 0x568245: StartBackgroundWorker (bgworker.c:835)
==1194==    by 0x57C295: do_start_bgworker (postmaster.c:5742)
==1194==    by 0x57C631: maybe_start_bgworkers (postmaster.c:5955)
==1194==    by 0x578C3C: reaper (postmaster.c:2940)
==1194==    by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194==    by 0x535F3B6: select (select.c:41)
==1194==    by 0x576A9F: ServerLoop (postmaster.c:1677)
==1194==    by 0x57642A: PostmasterMain (postmaster.c:1386)
==1194==  Address 0x708d488 is 12 bytes after a block of size 12 alloc'd
==1194==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==1194==    by 0x4AD8D38: CRYPTO_zalloc (mem.c:230)
==1194==    by 0x4AD4F8D: ossl_init_get_thread_local (init.c:66)
==1194==    by 0x4AD4F8D: ossl_init_get_thread_local (init.c:59)
==1194==    by 0x4AD4F8D: ossl_init_thread_start (init.c:426)
==1194==    by 0x4AFE5B9: RAND_DRBG_get0_public (drbg_lib.c:1118)
==1194==    by 0x4AFE5EF: drbg_bytes (drbg_lib.c:963)
==1194==    by 0x7F6DD9: pg_strong_random (pg_strong_random.c:135)
==1194==    by 0x57B70F: RandomCancelKey (postmaster.c:5251)
==1194==    by 0x57C367: assign_backendlist_entry (postmaster.c:5822)
==1194==    by 0x57C0F2: do_start_bgworker (postmaster.c:5692)
==1194==    by 0x57C631: maybe_start_bgworkers (postmaster.c:5955)
==1194==    by 0x578C3C: reaper (postmaster.c:2940)
==1194==    by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194==    by 0x535F3B6: select (select.c:41)
==1194==    by 0x576A9F: ServerLoop (postmaster.c:1677)
==1194==    by 0x57642A: PostmasterMain (postmaster.c:1386)
==1194==    by 0x4997E0: main (main.c:228)


I now suspect this is a more longrunning issue than I thought. Not all
my valgrind buildfarm branches have ssl enabled (due to an ssl issue a
while back). And previously this wouldn't have been caught, because it
doesn't cause postmaster to fail, it's just that Andrew added a script
that checks logs for valgrind bleats.


The interesting bit is that if I replace the _exit(2) in
bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers),
or manully add an OPENSSL_cleanup() before the _exit(2), valgrind
doesn't find errors.

The fact that one needs an immediate shutdown in a bgworker, with
openssl enabled, explains why this is hard to hit...

Greetings,

Andres Freund


Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

От
Thomas Munro
Дата:
On Mon, Dec 17, 2018 at 7:57 AM Andres Freund <andres@anarazel.de> wrote:
> The interesting bit is that if I replace the _exit(2) in
> bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers),
> or manully add an OPENSSL_cleanup() before the _exit(2), valgrind
> doesn't find errors.

Weird.  Well I can see that there were bugs last year where OpenSSL
failed to clean up its thread locals[1], and after they fixed that,
cases where it bogusly cleaned up someone else's thread locals[2].
Maybe there is some interference between pthread keys or something
like that.

[1] https://github.com/openssl/openssl/issues/3033
[2] https://github.com/openssl/openssl/issues/3584


-- 
Thomas Munro
http://www.enterprisedb.com


Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

От
Andres Freund
Дата:
Hi,

On 2018-12-17 08:25:38 +1100, Thomas Munro wrote:
> On Mon, Dec 17, 2018 at 7:57 AM Andres Freund <andres@anarazel.de> wrote:
> > The interesting bit is that if I replace the _exit(2) in
> > bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers),
> > or manully add an OPENSSL_cleanup() before the _exit(2), valgrind
> > doesn't find errors.
> 
> Weird.  Well I can see that there were bugs last year where OpenSSL
> failed to clean up its thread locals[1], and after they fixed that,
> cases where it bogusly cleaned up someone else's thread locals[2].
> Maybe there is some interference between pthread keys or something
> like that.
> 
> [1] https://github.com/openssl/openssl/issues/3033
> [2] https://github.com/openssl/openssl/issues/3584

What confuses the heck out of me is that it happens on _exit(). Those
issues ought to be only visible when doing exit(), no?

I guess there's also a good argument to make that valgrind running it's
intercept in the _exit() case is a bit dubious (given that's going to be
used in cases where e.g. a signal handler might have interrupted a
malloc), but given the stacktraces here I don't think that can be the
cause.

Greetings,

Andres Freund


Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

От
Andres Freund
Дата:
Hi,

On 2018-12-16 13:48:00 -0800, Andres Freund wrote:
> Hi,
> 
> On 2018-12-17 08:25:38 +1100, Thomas Munro wrote:
> > On Mon, Dec 17, 2018 at 7:57 AM Andres Freund <andres@anarazel.de> wrote:
> > > The interesting bit is that if I replace the _exit(2) in
> > > bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers),
> > > or manully add an OPENSSL_cleanup() before the _exit(2), valgrind
> > > doesn't find errors.
> > 
> > Weird.  Well I can see that there were bugs last year where OpenSSL
> > failed to clean up its thread locals[1], and after they fixed that,
> > cases where it bogusly cleaned up someone else's thread locals[2].
> > Maybe there is some interference between pthread keys or something
> > like that.
> > 
> > [1] https://github.com/openssl/openssl/issues/3033
> > [2] https://github.com/openssl/openssl/issues/3584
> 
> What confuses the heck out of me is that it happens on _exit(). Those
> issues ought to be only visible when doing exit(), no?
> 
> I guess there's also a good argument to make that valgrind running it's
> intercept in the _exit() case is a bit dubious (given that's going to be
> used in cases where e.g. a signal handler might have interrupted a
> malloc), but given the stacktraces here I don't think that can be the
> cause.

I've for now put --run-libc-freeres=no into skink's config. Locally that
"fixes" the issue for me, but of course is not a proper solution. But I
want to see whether that allows running all tests under valgrind.

Also fixed things so 10 and 11 also run with openssl enabled, previoulsy
only master, 9.5 and 9.6 did so.

Greetings,

Andres Freund


Re: Valgrind failures in Apply Launcher's bgworker_quickdie() exit

От
Andres Freund
Дата:
Hi,

On 2018-12-17 15:35:01 -0800, Andres Freund wrote:
> On 2018-12-16 13:48:00 -0800, Andres Freund wrote:
> > On 2018-12-17 08:25:38 +1100, Thomas Munro wrote:
> > > On Mon, Dec 17, 2018 at 7:57 AM Andres Freund <andres@anarazel.de> wrote:
> > > > The interesting bit is that if I replace the _exit(2) in
> > > > bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers),
> > > > or manully add an OPENSSL_cleanup() before the _exit(2), valgrind
> > > > doesn't find errors.
> > > 
> > > Weird.  Well I can see that there were bugs last year where OpenSSL
> > > failed to clean up its thread locals[1], and after they fixed that,
> > > cases where it bogusly cleaned up someone else's thread locals[2].
> > > Maybe there is some interference between pthread keys or something
> > > like that.
> > > 
> > > [1] https://github.com/openssl/openssl/issues/3033
> > > [2] https://github.com/openssl/openssl/issues/3584
> > 
> > What confuses the heck out of me is that it happens on _exit(). Those
> > issues ought to be only visible when doing exit(), no?
> > 
> > I guess there's also a good argument to make that valgrind running it's
> > intercept in the _exit() case is a bit dubious (given that's going to be
> > used in cases where e.g. a signal handler might have interrupted a
> > malloc), but given the stacktraces here I don't think that can be the
> > cause.
> 
> I've for now put --run-libc-freeres=no into skink's config. Locally that
> "fixes" the issue for me, but of course is not a proper solution. But I
> want to see whether that allows running all tests under valgrind.

Turns out to be caused by a glibc bug:
https://sourceware.org/bugzilla/show_bug.cgi?id=24476

The reason it only fails if ssl is enabled, and only after the openssl
randomness was integrated, is that openssl randomness initialization
creates a TLS variable, which glibc then frees accidentally (as it tries
to free something not initialized).

Thus this can be "worked around" by doing something like
shared_preload_libraries=pg_stat_statements, as dlopening a library
initializes the relevant state.

Greetings,

Andres Freund