Discussion: Valgrind failures in Apply Launcher's bgworker_quickdie() exit
Hello,

Since libcrypto.so is implicated, Andres asked me off-list if my changes to
random number state initialisation might be linked to skink's failures
beginning 12 or 15 days ago. It appears not, as it was green for several
runs after that commit.

Looking at the report:

==2802== VALGRINDERROR-BEGIN
==2802== Invalid read of size 8
==2802==    at 0x4DED5A5: check_free (dlerror.c:188)
==2802==    by 0x4DEDAB1: free_key_mem (dlerror.c:221)
==2802==    by 0x4DEDAB1: __dlerror_main_freeres (dlerror.c:239)
==2802==    by 0x55DCF81: __libc_freeres (in /lib/x86_64-linux-gnu/libc-2.28.so)
==2802==    by 0x482D19E: _vgnU_freeres (vg_preloaded.c:77)
==2802==    by 0x478AD3: bgworker_quickdie (bgworker.c:661)
==2802==    by 0x48626AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==2802==    by 0x556DB76: epoll_wait (epoll_wait.c:30)
==2802==    by 0x4E25B9: WaitEventSetWaitBlock (latch.c:1078)
==2802==    by 0x4E25B9: WaitEventSetWait (latch.c:1030)
==2802==    by 0x4E28C1: WaitLatchOrSocket (latch.c:407)
==2802==    by 0x4E29A6: WaitLatch (latch.c:347)
==2802==    by 0x49E03E: ApplyLauncherMain (launcher.c:1062)
==2802==    by 0x479831: StartBackgroundWorker (bgworker.c:834)
==2802==  Address 0x7fd7e28 is 12 bytes after a block of size 12 alloc'd
==2802==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==2802==    by 0x4C3BD38: CRYPTO_zalloc (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2802==    by 0x4C37F8D: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2802==    by 0x4C615B9: RAND_DRBG_get0_public (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2802==    by 0x4C615EF: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==2802==    by 0x675D75: pg_strong_random (pg_strong_random.c:135)
==2802==    by 0x4848EB: RandomCancelKey (postmaster.c:5251)
==2802==    by 0x484909: assign_backendlist_entry (postmaster.c:5822)
==2802==    by 0x4873BA: do_start_bgworker (postmaster.c:5692)
==2802==    by 0x487701: maybe_start_bgworkers (postmaster.c:5955)
==2802==    by 0x4878C2: reaper (postmaster.c:2940)
==2802==    by 0x48626AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==2802==
==2802== VALGRINDERROR-END

The function __libc_freeres is a special glibc entry point provided for leak
checkers to call explicitly if they want glibc to clean up after itself
(normally it doesn't bother). The specific thing being cleaned up here is a
piece of thread-local storage that belongs to the dynamic linker support
code:

https://github.com/lattera/glibc/blob/master/dlfcn/dlerror.c#L228

Since we don't see strcmp() or free() at the top of the stack (and assuming
they aren't inlined), I think the line numbers must line up with current
glibc HEAD as of today, and it must be failing on accessing rec->errstring
at line 188, meaning that rec (the value stored under a thread-specific key)
is a bad pointer. That's quite strange and I don't have an explanation; if
libcrypto overran its buffer, for example, that would perhaps trash
rec->errstring, but we'd still be able to read the pointer itself. So I
wonder if libcrypto.so is a red herring here.

It's Debian unstable, which could be a factor. Bugs in glibc? That's 2.28,
out for 3 months now, but then why only in the Apply Launcher? Did we trash
'key', or the thread-specific pointer table? Or is my assessment above
wrong, and is it somehow really errstring that is a bad pointer (which would
allow for a more mundane explanation, like someone trashing a bit of heap
memory by overrunning a buffer)?

--
Thomas Munro
http://www.enterprisedb.com
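[For orientation, the dlerror.c machinery under discussion looks roughly
like this. This is a condensed paraphrase of the linked glibc source, not a
verbatim copy; glibc's real check_free() also verifies which allocator the
string came from before freeing it.]

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* The per-thread record stored under dlerror's pthread key. */
    struct dl_action_result
    {
        int errcode;
        int returned;
        bool malloced;
        const char *objname;
        const char *errstring;
    };

    static void
    check_free (struct dl_action_result *rec)
    {
        /* The first dereference of the record fetched from the pthread
           key (dlerror.c:188 in the trace).  If 'rec' itself is a stray
           pointer, this read is exactly where valgrind complains, before
           strcmp() or free() ever appear on the stack. */
        if (rec->errstring != NULL
            && strcmp (rec->errstring, "out of memory") != 0
            && rec->malloced)
            free ((char *) rec->errstring);
    }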
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Since libcrypto.so is implicated, Andres asked me off-list if my
> changes to random number state initialisation might be linked to
> skink's failures beginning 12 or 15 days ago. It appears not, as it
> was green for several runs after that commit.
> ...
> It's Debian unstable, which could be a factor. Bugs in glibc?

Has anyone tried to reproduce this on other platforms?

It'd be interesting to know which updates were applied to skink's host
before the first failing run.

			regards, tom lane
On December 13, 2018 6:01:04 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> Since libcrypto.so is implicated, Andres asked me off-list if my
>> changes to random number state initialisation might be linked to
>> skink's failures beginning 12 or 15 days ago. It appears not, as it
>> was green for several runs after that commit.
>> ...
>> It's Debian unstable, which could be a factor. Bugs in glibc?

Note it's only failing on master...

>Has anyone tried to reproduce this on other platforms?

I recently also hit this locally, but since that's also Debian unstable...
Note that removing openssl "fixed" the issue for me.

>It'd be interesting to know which updates were applied to skink's
>host before the first failing run.

I'll check the logs when I'm back at a real computer.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Dec 14, 2018 at 1:15 PM Andres Freund <andres@anarazel.de> wrote:
> On December 13, 2018 6:01:04 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >Thomas Munro <thomas.munro@enterprisedb.com> writes:
> >> Since libcrypto.so is implicated, Andres asked me off-list if my
> >> changes to random number state initialisation might be linked to
> >> skink's failures beginning 12 or 15 days ago. It appears not, as it
> >> was green for several runs after that commit.
> >> ...
> >> It's Debian unstable, which could be a factor. Bugs in glibc?
>
> Note it's only failing on master...

Here are the commits between the last green run and the first Valgrind
failure:

7d4524aed3 Fri Nov 30 22:53:18 2018 UTC Fix tablespace path TAP test of pg_verify_checksums for msys
9dc1225855 Fri Nov 30 13:20:49 2018 UTC Silence compiler warning
dcfdf56e89 Fri Nov 30 06:20:43 2018 UTC Fix typo.
5c99513975 Fri Nov 30 01:34:45 2018 UTC Fix various checksum check problems for pg_verify_checksums and base backups
a1c91dd110 Fri Nov 30 01:14:58 2018 UTC Switch pg_verify_checksums back to a blacklist
d328991578 Thu Nov 29 23:28:10 2018 UTC Document handling of invalid/ambiguous timestamp input near DST boundaries.
88bdbd3f74 Thu Nov 29 21:42:53 2018 UTC Add log_statement_sample_rate parameter
826eff57c4 Thu Nov 29 20:53:44 2018 UTC Ensure static libraries have correct mod time even if ranlib messes it up.
68120427f4 Thu Nov 29 13:01:11 2018 UTC doc: Add appendix detailing some limits of PostgreSQL
44e22647f8 Thu Nov 29 09:00:08 2018 UTC Add pg_partition_tree to documentation index
431f1599a2 Thu Nov 29 01:31:12 2018 UTC Add support for NO_INSTALLCHECK in MSVC scripts
2ac180c286 Thu Nov 29 01:14:26 2018 UTC Fix minor typo in dsa.c.
d79fb5d237 Thu Nov 29 00:39:07 2018 UTC Add missing NO_INSTALLCHECK in commit_ts and test_rls_hooks
4c703369af Thu Nov 29 00:12:19 2018 UTC Fix handling of synchronous replication for stopping WAL senders
1a990b207b Wed Nov 28 22:42:54 2018 UTC Have BufFileSize() ereport() on FileSize() failure.
f2cbffc7a6 Wed Nov 28 12:55:54 2018 UTC Only allow one recovery target setting
eae9143d9a Wed Nov 28 12:34:10 2018 UTC C comment: remove extra '*'
0f9cdd7dca Wed Nov 28 01:12:30 2018 UTC Don't set PAM_RHOST for Unix sockets.
f69c959df0 Wed Nov 28 00:43:08 2018 UTC Do not decode TOAST data for table rewrites
d1ce4ed2d5 Tue Nov 27 23:48:51 2018 UTC Use wildcard to match parens after CREATE STATISTICS
d67dae036b Tue Nov 27 22:58:10 2018 UTC Don't count zero-filled buffers as 'read' in EXPLAIN.
471a7af585 Tue Nov 27 20:16:55 2018 UTC Ensure consistent sort order of large objects in pg_dump.
b238527664 Tue Nov 27 18:07:03 2018 UTC Fix jit compilation bug on wide tables.
f17889b221 Tue Nov 27 14:16:14 2018 UTC Update ssl test certificates and keys
4c8750a9cc Tue Nov 27 07:26:05 2018 UTC Fix ac218aa4f6 to work on versions before 9.5.
ac218aa4f6 Tue Nov 27 01:00:43 2018 UTC Update pg_upgrade test for reg* to include regrole and regnamespace.
7a9d6779d9 Tue Nov 27 00:41:29 2018 UTC doc: fix wording for plpgsql, add "and"
54bb22f66a Mon Nov 26 23:27:34 2018 UTC Fix typo introduced in 578b229718.
12a53c732c Mon Nov 26 22:37:08 2018 UTC Fix pg_upgrade for oid removal.
70d7e507ef Mon Nov 26 22:32:51 2018 UTC Fix translation of special characters in psql's LaTeX output modes.
95dcb8fc05 Mon Nov 26 20:30:24 2018 UTC Avoid locale-dependent output in numericlocale check.
67ed3b9d73 Mon Nov 26 20:24:14 2018 UTC Fix sample output for hash_metapage_info query
aa2ba50c2c Mon Nov 26 20:18:55 2018 UTC Add CSV table output mode in psql.
9a98984f49 Mon Nov 26 17:41:42 2018 UTC Improve regression test coverage for psql output formats.
a7eece4fc9 Mon Nov 26 17:31:20 2018 UTC Fix breakage of "\pset format latex".
36d442a25a Mon Nov 26 15:38:19 2018 UTC Clarify that cross-row constraints are unsupported
664f01b613 Mon Nov 26 07:43:19 2018 UTC Revert "Fix typo in documentation of toast storage"
058ef3a1a8 Mon Nov 26 06:49:23 2018 UTC Fix typo in documentation of toast storage
1d7dd18686 Mon Nov 26 02:12:11 2018 UTC Revert all new recent changes to add PGXS options for TAP and isolation
3955cae0c5 Mon Nov 26 01:49:49 2018 UTC Fix regression test handling of test_decoding with MSVC
b0b1f4183a Mon Nov 26 00:42:21 2018 UTC Disable temporarily TAP tests for contrib/bloom/
03faa4a8dd Sun Nov 25 23:39:19 2018 UTC Add PGXS options to control TAP and isolation tests

Maybe you could try reverting f17889b221 (change of key size) on the
superstition that it has something to do with libcrypto.so, but I have no
clue why the Apply Launcher in particular would be affected by that.
Otherwise... assuming 2dedf4d9 doesn't show the problem, you could do the
world's slowest bisect 2dedf4d9 -> 7d4524aed3 in ~5.4 steps...

--
Thomas Munro
http://www.enterprisedb.com
Andres Freund <andres@anarazel.de> writes:
> On December 13, 2018 6:01:04 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Has anyone tried to reproduce this on other platforms?

> I recently also hit this locally, but since that's also Debian
> unstable... Note that removing openssl "fixed" the issue for me.

FWIW, I tried to reproduce this on Fedora 28 and RHEL6, without success.
It's possible that there's some significant detail of your configuration
that I didn't match, but on the whole "bug in Debian unstable" seems like
the most probable theory right now.

			regards, tom lane
On Fri, Dec 14, 2018 at 4:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On December 13, 2018 6:01:04 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Has anyone tried to reproduce this on other platforms?
>
> > I recently also hit this locally, but since that's also Debian
> > unstable... Note that removing openssl "fixed" the issue for me.
>
> FWIW, I tried to reproduce this on Fedora 28 and RHEL6, without success.
> It's possible that there's some significant detail of your configuration
> that I didn't match, but on the whole "bug in Debian unstable" seems
> like the most probable theory right now.

I was keen to try to bisect this, but I couldn't reproduce it on a freshly
upgraded Debian unstable VM, with --with-openssl, using "make installcheck"
under src/test/authentication. I even tried using the gold linker as skink
does. Maybe I'm using the wrong checker options... Andres, can we see your
exact valgrind invocation?

--
Thomas Munro
http://www.enterprisedb.com
Hi,

On 2018-12-16 22:33:00 +1100, Thomas Munro wrote:
> On Fri, Dec 14, 2018 at 4:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On December 13, 2018 6:01:04 PM PST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> Has anyone tried to reproduce this on other platforms?
> >
> > > I recently also hit this locally, but since that's also Debian
> > > unstable... Note that removing openssl "fixed" the issue for me.
> >
> > FWIW, I tried to reproduce this on Fedora 28 and RHEL6, without success.
> > It's possible that there's some significant detail of your configuration
> > that I didn't match, but on the whole "bug in Debian unstable" seems
> > like the most probable theory right now.
>
> I was keen to try to bisect this, but I couldn't reproduce it on a
> freshly upgraded Debian unstable VM, with --with-openssl, using "make
> installcheck" under src/test/authentication. I even tried using the
> gold linker as skink does. Maybe I'm using the wrong checker
> options... Andres, can we see your exact valgrind invocation?

Ok, I think I've narrowed this down a bit further. But far from completely.

I don't think you need particularly special options, but it's easy to miss
the error, because it doesn't cause postmaster to exit with an error. It
only happens when a bgworker is shut down with SIGQUIT (be it directly, or
via postmaster immediate shutdown):

$ valgrind --quiet --error-exitcode=55 \
    --suppressions=/home/andres/src/postgresql/src/tools/valgrind.supp \
    --suppressions=/home/andres/tmp/valgrind-global.supp \
    --trace-children=yes --track-origins=yes --read-var-info=no \
    --num-callers=20 --leak-check=no --gen-suppressions=all \
    /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres \
    -D /srv/dev/pgdev-dev
2018-12-16 12:53:26.274 PST [1187] LOG:  listening on IPv4 address "127.0.0.1", port 5433

$ kill -QUIT 1187

==1194== Invalid read of size 8
==1194==    at 0x4C3B5A5: check_free (dlerror.c:188)
==1194==    by 0x4C3BAB1: free_key_mem (dlerror.c:221)
==1194==    by 0x4C3BAB1: __dlerror_main_freeres (dlerror.c:239)
==1194==    by 0x53D6F81: __libc_freeres (in /lib/x86_64-linux-gnu/libc-2.28.so)
==1194==    by 0x482D19E: _vgnU_freeres (vg_preloaded.c:77)
==1194==    by 0x567F54: bgworker_quickdie (bgworker.c:662)
==1194==    by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194==    by 0x5367B76: epoll_wait (epoll_wait.c:30)
==1194==    by 0x5EE7CC: WaitEventSetWaitBlock (latch.c:1078)
==1194==    by 0x5EE6A5: WaitEventSetWait (latch.c:1030)
==1194==    by 0x5EDDBC: WaitLatchOrSocket (latch.c:407)
==1194==    by 0x5EDC23: WaitLatch (latch.c:347)
==1194==    by 0x5992D7: ApplyLauncherMain (launcher.c:1062)
==1194==    by 0x568245: StartBackgroundWorker (bgworker.c:835)
==1194==    by 0x57C295: do_start_bgworker (postmaster.c:5742)
==1194==    by 0x57C631: maybe_start_bgworkers (postmaster.c:5955)
==1194==    by 0x578C3C: reaper (postmaster.c:2940)
==1194==    by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194==    by 0x535F3B6: select (select.c:41)
==1194==    by 0x576A9F: ServerLoop (postmaster.c:1677)
==1194==    by 0x57642A: PostmasterMain (postmaster.c:1386)
==1194==  Address 0x708d488 is 12 bytes after a block of size 12 alloc'd
==1194==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==1194==    by 0x4AD8D38: CRYPTO_zalloc (mem.c:230)
==1194==    by 0x4AD4F8D: ossl_init_get_thread_local (init.c:66)
==1194==    by 0x4AD4F8D: ossl_init_get_thread_local (init.c:59)
==1194==    by 0x4AD4F8D: ossl_init_thread_start (init.c:426)
==1194==    by 0x4AFE5B9: RAND_DRBG_get0_public (drbg_lib.c:1118)
==1194==    by 0x4AFE5EF: drbg_bytes (drbg_lib.c:963)
==1194==    by 0x7F6DD9: pg_strong_random (pg_strong_random.c:135)
==1194==    by 0x57B70F: RandomCancelKey (postmaster.c:5251)
==1194==    by 0x57C367: assign_backendlist_entry (postmaster.c:5822)
==1194==    by 0x57C0F2: do_start_bgworker (postmaster.c:5692)
==1194==    by 0x57C631: maybe_start_bgworkers (postmaster.c:5955)
==1194==    by 0x578C3C: reaper (postmaster.c:2940)
==1194==    by 0x48A86AF: ??? (in /lib/x86_64-linux-gnu/libpthread-2.28.so)
==1194==    by 0x535F3B6: select (select.c:41)
==1194==    by 0x576A9F: ServerLoop (postmaster.c:1677)
==1194==    by 0x57642A: PostmasterMain (postmaster.c:1386)
==1194==    by 0x4997E0: main (main.c:228)

I now suspect this is a longer-running issue than I thought. Not all my
valgrind buildfarm branches have ssl enabled (due to an ssl issue a while
back). And previously this wouldn't have been caught, because it doesn't
cause postmaster to fail; it's just that Andrew added a script that checks
the logs for valgrind bleats.

The interesting bit is that if I replace the _exit(2) in
bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers), or
manually add an OPENSSL_cleanup() before the _exit(2), valgrind doesn't
find errors.

The fact that one needs an immediate shutdown in a bgworker, with openssl
enabled, explains why this is hard to hit...

Greetings,

Andres Freund
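[A condensed sketch of the experiment described above; this is not the
actual diff, and the real bgworker_quickdie() in
src/backend/postmaster/bgworker.c does more work before exiting.
OPENSSL_cleanup() is declared in <openssl/crypto.h> (OpenSSL >= 1.1.0), and
SIGNAL_ARGS is the usual PostgreSQL signal-handler macro.]

    #include <openssl/crypto.h>    /* OPENSSL_cleanup() */

    static void
    bgworker_quickdie(SIGNAL_ARGS)
    {
        /* ... signal-handler preamble elided ... */

        /* Original: skip atexit handlers entirely. */
        /* _exit(2); */

        /* Variant 1: exit(2) runs atexit handlers, including the
           cleanup OpenSSL registers at init time -- no valgrind error. */
        /* exit(2); */

        /* Variant 2: free OpenSSL's state explicitly, then still skip
           the atexit handlers -- also no valgrind error. */
        OPENSSL_cleanup();
        _exit(2);
    }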
On Mon, Dec 17, 2018 at 7:57 AM Andres Freund <andres@anarazel.de> wrote:
> The interesting bit is that if I replace the _exit(2) in
> bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers),
> or manually add an OPENSSL_cleanup() before the _exit(2), valgrind
> doesn't find errors.

Weird. Well, I can see that there were bugs last year where OpenSSL failed
to clean up its thread locals[1], and after they fixed that, cases where it
bogusly cleaned up someone else's thread locals[2]. Maybe there is some
interference between pthread keys or something like that.

[1] https://github.com/openssl/openssl/issues/3033
[2] https://github.com/openssl/openssl/issues/3584

--
Thomas Munro
http://www.enterprisedb.com
Hi,

On 2018-12-17 08:25:38 +1100, Thomas Munro wrote:
> On Mon, Dec 17, 2018 at 7:57 AM Andres Freund <andres@anarazel.de> wrote:
> > The interesting bit is that if I replace the _exit(2) in
> > bgworker_quickdie() with an exit(2) (i.e. processing atexit handlers),
> > or manually add an OPENSSL_cleanup() before the _exit(2), valgrind
> > doesn't find errors.
>
> Weird. Well, I can see that there were bugs last year where OpenSSL
> failed to clean up its thread locals[1], and after they fixed that,
> cases where it bogusly cleaned up someone else's thread locals[2].
> Maybe there is some interference between pthread keys or something
> like that.
>
> [1] https://github.com/openssl/openssl/issues/3033
> [2] https://github.com/openssl/openssl/issues/3584

What confuses the heck out of me is that it happens on _exit(). Those
issues ought to be only visible when doing exit(), no?

I guess there's also a good argument to make that valgrind running its
intercept in the _exit() case is a bit dubious (given that's going to be
used in cases where e.g. a signal handler might have interrupted a malloc),
but given the stack traces here I don't think that can be the cause.

Greetings,

Andres Freund
Hi,

On 2018-12-16 13:48:00 -0800, Andres Freund wrote:
> On 2018-12-17 08:25:38 +1100, Thomas Munro wrote:
> > Weird. Well, I can see that there were bugs last year where OpenSSL
> > failed to clean up its thread locals[1], and after they fixed that,
> > cases where it bogusly cleaned up someone else's thread locals[2].
> > Maybe there is some interference between pthread keys or something
> > like that.
> >
> > [1] https://github.com/openssl/openssl/issues/3033
> > [2] https://github.com/openssl/openssl/issues/3584
>
> What confuses the heck out of me is that it happens on _exit(). Those
> issues ought to be only visible when doing exit(), no?
>
> I guess there's also a good argument to make that valgrind running its
> intercept in the _exit() case is a bit dubious (given that's going to be
> used in cases where e.g. a signal handler might have interrupted a
> malloc), but given the stack traces here I don't think that can be the
> cause.

I've for now put --run-libc-freeres=no into skink's config. Locally that
"fixes" the issue for me, but of course it's not a proper solution. But I
want to see whether that allows running all tests under valgrind.

Also fixed things so that 10 and 11 run with openssl enabled; previously
only master, 9.5 and 9.6 did so.

Greetings,

Andres Freund
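[For anyone reproducing this locally: the workaround amounts to one extra
flag on an invocation like the one shown upthread, e.g.

    $ valgrind --quiet --run-libc-freeres=no --trace-children=yes ... \
        /home/andres/build/postgres/dev-assert/vpath/src/backend/postgres \
        -D /srv/dev/pgdev-dev

--run-libc-freeres=no tells valgrind's _vgnU_freeres hook not to call
__libc_freeres at process exit, so the cleanup path in the stack traces
above never runs at all.]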
Hi,

On 2018-12-17 15:35:01 -0800, Andres Freund wrote:
> On 2018-12-16 13:48:00 -0800, Andres Freund wrote:
> > What confuses the heck out of me is that it happens on _exit(). Those
> > issues ought to be only visible when doing exit(), no?
> >
> > I guess there's also a good argument to make that valgrind running its
> > intercept in the _exit() case is a bit dubious (given that's going to be
> > used in cases where e.g. a signal handler might have interrupted a
> > malloc), but given the stack traces here I don't think that can be the
> > cause.
>
> I've for now put --run-libc-freeres=no into skink's config. Locally that
> "fixes" the issue for me, but of course it's not a proper solution. But I
> want to see whether that allows running all tests under valgrind.

Turns out to be caused by a glibc bug:

https://sourceware.org/bugzilla/show_bug.cgi?id=24476

The reason it only fails if ssl is enabled, and only after the openssl
randomness changes were integrated, is that openssl's randomness
initialization creates a TLS variable, which glibc then frees accidentally
(as it tries to free state it never initialized).

Thus this can be "worked around" by doing something like
shared_preload_libraries=pg_stat_statements, as dlopening a library
initializes the relevant state.

Greetings,

Andres Freund
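[One reading of that bug report, as a paraphrased sketch rather than
verbatim glibc source; the missing guard shown here is roughly what the
eventual glibc fix adds, by checking that dlerror's key was actually
created before touching it.]

    #include <pthread.h>

    extern void free_key_mem (void *mem);   /* glibc-internal; sketched */

    /* dlerror.c keeps its per-thread state under a pthread key that is
       created lazily, on first use of dlerror().  'key' is static, so if
       that init never ran it is still zero -- which aliases whichever
       pthread key some other library created first.  In these traces,
       key 0 belongs to OpenSSL's DRBG thread-local: the 12-byte
       CRYPTO_zalloc'd block. */
    static pthread_key_t key;

    void
    __dlerror_main_freeres (void)
    {
        /* Bug: this runs at __libc_freeres time even when the key was
           never created, so it fetches OpenSSL's pointer and treats it
           as dlerror's struct dl_action_result. */
        void *mem = pthread_getspecific (key);
        if (mem != NULL)
            free_key_mem (mem);   /* -> check_free() -> the invalid read */
    }

That also explains the workaround: once dlopening a library (e.g. via
shared_preload_libraries = 'pg_stat_statements' in postgresql.conf) has
initialized the dynamic linker's own state, glibc's cleanup looks at its
own key rather than OpenSSL's.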