Обсуждение: Reduce function call costs on ELF platforms
Hi, There's two related, but somewhat different aspects to $subject. TL;DR: We use use -fvisibility=hidden + explicit symbol visiblity, -Wl,-Bdynamic, -fno-plt 1) Cross-translation-unit calls in extension library A while ago I was looking at a profile of a workload that spent a good chunk of time in an extension. Looking at the instruction level profile it showed that some of that time was spent doing more-complicated-than-necessary function calls to other functions within the extension. Basically they way we currently build our extensions, the compiler & linker assume every symbol inside the extension libraries needs to be interceptable by the main binary. Which means that all function calls to symbols visible outside the current translation unit need to be made indirectly via the PLT. An example of this (picked from plpgsql, for simplicity) 0000000000024a40 <plpgsql_inline_handler>: { ... func = plpgsql_compile_inline(codeblock->source_text); 24a80: 48 8b 85 a8 fe ff ff mov -0x158(%rbp),%rax 24a87: 48 8b 78 08 mov 0x8(%rax),%rdi 24a8b: e8 20 41 fe ff call 8bb0 <plpgsql_compile_inline@plt> ... 0000000000008bb0 <plpgsql_compile_inline@plt>: 8bb0: ff 25 da ac 02 00 jmp *0x2acda(%rip) # 33890 <plpgsql_compile_inline@@Base+0x24de0> 8bb6: 68 12 01 00 00 push $0x112 8bbb: e9 c0 ee ff ff jmp 7a80 <_init+0x18> I.e. plpgsql_inline_handler doesn't call plpgsql_compile_inline() directly, it calls plpgsql_compile_inline@plt(), which then loads the target address for plpgsql_compile_inline() from the global offset table. Depending on the linker settings / flags passed to dlopen() that'll point to yet another wrapper function (doing a dynamic symbol lookup on the first call, putting the real address in the GOT). This can be addressed to some degree by using explicit symbol visibility markers, as I propose in [1]. With that patch applied compiler / linker know that plpgsql_compile_inline() is not an external symbol, and therefore doesn't need to go through the PLT/GOT. That changes the above to: func = plpgsql_compile_inline(codeblock->source_text); 23000: 48 8b 85 a8 fe ff ff mov -0x158(%rbp),%rax 23007: 48 8b 78 08 mov 0x8(%rax),%rdi 2300b: e8 00 a1 fe ff call d110 <plpgsql_compile_inline> which unsurprisingly is cheaper. 2) Calls to exported functions in extension library However, this does *not* address the issue fully. When an extension calls a function that has to be exported, the symbol with continue to be loaded from the PLT. E.g. hstorePairs() has to be exported, because it's called from transform modules. That results in calls to hstorePairs() from within hstore.so to go through the PLT. e.g. 000000000000e380 <hstore_subscript_assign>: { ... e427: e8 e4 59 ff ff call 3e10 <hstorePairs@plt> In theory we could mark such symbols as "protected" while compiling hstore.so and as "default" otherwise, but that's pretty complicated. And there are some toolchain issues with protected visibility. The easier approach for this class of issues is to use the linker option -Bsymbolic. That turns the above into a plain function call 000000000000e250 <hstore_subscript_assign>: { ... e2f7: e8 f4 a2 ff ff call 85f0 <hstorePairs> As it turns out we already use -Bsymbolic on some platforms (solaris, hpux). But not elsehwere. 3) Function calls from extension library to main binary 4) C library function calls However, even with the above done, calls into shared libraries still go through the PLT. This is particularly annoying for functions like palloc() that are quite performance sensitive and where there's no potential use of intercepting the function call with a different shared library. E.g. the optimized disassembly add_dummy_return() looks like 000000000000bc30 <add_dummy_return>: { ... new = palloc0(sizeof(PLpgSQL_stmt_block)); bc4d: bf 38 00 00 00 mov $0x38,%edi bc52: e8 d9 a7 ff ff call 6430 <palloc0@plt> ... 0000000000006430 <palloc0@plt>: 6430: ff 25 d2 bb 02 00 jmp *0x2bbd2(%rip) # 32008 <palloc0> 6436: 68 01 00 00 00 push $0x1 643b: e9 d0 ff ff ff jmp 6410 <_init+0x20> Obviously we cannot easily avoid indirection entirely in this case. The offset to call palloc0 is not known when plpgsql.so is built. But we don't actually need a two-level indirection. By compiling with -fno-plt, the above becomes: 000000000000b130 <add_dummy_return>: { ... new = palloc0(sizeof(PLpgSQL_stmt_block)); b14d: bf 38 00 00 00 mov $0x38,%edi b152: ff 15 80 66 02 00 call *0x26680(%rip) # 317d8 <palloc0> I.e. a single level of indirection. This has more benefits than just removing one layer of indirection. Here's what gcc's man page says: -fno-plt Do not use the PLT for external function calls in position-independent code. Instead, load the callee address at call sites from the GOT and branch to it. This leads to more efficient code by eliminating PLT stubs and exposing GOT loads to optimizations. In some cases this allows functions to use the sibling-call optimization where that previously was not possible (i.e. for x86 use "jmp" instead of "call" to call another function when that function call is the last thing done in a function, thereby reusing the call frame and reducing the cost of returns). This doesn't just matter for extension libraries. It's also relevant for the main binary (i.e. the upsides are bigger / more widely applicable) - every function call to libc goes through PLT+GOT (well, with a dynamically linked libc). This includes things that are often called in performance critical bits, like strlen. E.g. without -fno-plt raw_parser() calls strlen via the plt: cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp); 2775a6: 49 63 55 00 movslq 0x0(%r13),%rdx 2775aa: 4c 8b 3b mov (%rbx),%r15 2775ad: 48 89 4d c0 mov %rcx,-0x40(%rbp) 2775b1: 49 8d 3c 17 lea (%r15,%rdx,1),%rdi 2775b5: 48 89 55 c8 mov %rdx,-0x38(%rbp) 2775b9: e8 82 03 e5 ff call c7940 <strlen@plt> but not with -fno-plt: cur_token_length = strlen(yyextra->core_yy_extra.scanbuf + *llocp); 2838e6: 49 63 55 00 movslq 0x0(%r13),%rdx 2838ea: 4c 8b 3b mov (%rbx),%r15 2838ed: 48 89 4d c0 mov %rcx,-0x40(%rbp) 2838f1: 49 8d 3c 17 lea (%r15,%rdx,1),%rdi 2838f5: 48 89 55 c8 mov %rdx,-0x38(%rbp) 2838f9: ff 15 09 45 66 00 call *0x664509(%rip) # 8e7e08 <strlen@GLIBC_2.2.5> I haven't run detailed benchmarks in isolation, but have seen some good results. It obviously is heavily workload dependent. Greetings, Andres Freund [1] https://postgr.es/m/20211101020311.av6hphdl6xbjbuif%40alap3.anarazel.de
Andres Freund <andres@anarazel.de> writes: > Basically they way we currently build our extensions, the compiler & linker > assume every symbol inside the extension libraries needs to be interceptable > by the main binary. Which means that all function calls to symbols visible > outside the current translation unit need to be made indirectly via the PLT. Yeah, that would be nice to improve. > The easier approach for this class of issues is to use the linker option > -Bsymbolic. I don't recall details, but we've previously rejected the idea of trying to use -Bsymbolic widely; apparently it has undesirable side-effects on some platforms. See commit message for e3d77ea6b (hopefully there's some detail in the email thread [1]). It sounds like you're not actually proposing that, but I thought it would be a good idea to note the hazard here. > By compiling with -fno-plt, the above becomes: Does -fno-plt amount to an ABI change? If so, I'm worried that it'd break the ability to compile extensions with a different compiler. Also, we have at least some places where there actually are cross-calls between extensions, eg hstore_perl -> plperl. Do we need to worry about breaking those? regards, tom lane [1] https://www.postgresql.org/message-id/flat/153626613985.23143.4743626885618266803%40wrigleys.postgresql.org
Hi, On 2021-11-22 17:32:21 -0500, Tom Lane wrote: > > The easier approach for this class of issues is to use the linker option > > -Bsymbolic. > > I don't recall details, but we've previously rejected the idea of > trying to use -Bsymbolic widely; apparently it has undesirable > side-effects on some platforms. See commit message for e3d77ea6b > (hopefully there's some detail in the email thread [1]). Hm. I didn't really see reasons to not use -Bsymbolic in that thread. > It sounds like you're not actually proposing that, but I thought it would be > a good idea to note the hazard here. I think it'd be good to use, but it'll be much less important once we use symbol visibility. > > By compiling with -fno-plt, the above becomes: > > Does -fno-plt amount to an ABI change? If so, I'm worried that it'd > break the ability to compile extensions with a different compiler. No, it is a change at the function callsite thats transparent to the function itself (unless the called function goes out of its way to do stuff like inspecting frame pointers or such, but they're not available by default on most platforms anyway). It does however change symbol binding, basically making all symbols bound eagerly. Which I guess theoretically could be considered an ABI change, because it removes the ability to intercept symbols referenced in a previously loaded shared library, with a subsequently loaded library (e.g. loaded with RTLD_DEEPBIND) function before the symbol is used. But that seems like a stretch. And I think most ELF platforms/linux distributions have/are moving towards using -Wl,-z,now -Wl,-z,relro also makes symbols bound eagerly. > Also, we have at least some places where there actually are cross-calls > between extensions, eg hstore_perl -> plperl. Do we need to worry about > breaking those? It does require a bit of care in the symbol visibility patch, basically marking all such symbols as exported (which may happen implicitly via PG_FUNCTION_INFO_V1). For extensions that are referenced that way that actually seems like a good thing, because it makes it clearer what could be referenced that way. Greetings, Andres Freund
On Mon, Nov 22, 2021 at 03:57:45PM -0800, Andres Freund wrote: > It does however change symbol binding, basically making all symbols bound > eagerly. Which I guess theoretically could be considered an ABI change, > because it removes the ability to intercept symbols referenced in a previously > loaded shared library, with a subsequently loaded library (e.g. loaded with > RTLD_DEEPBIND) function before the symbol is used. But that seems like a > stretch. And I think most ELF platforms/linux distributions have/are moving > towards using -Wl,-z,now -Wl,-z,relro also makes symbols bound eagerly. I found this really interesting, and I am surprised how things got so suboptimal. Has it always been this way? Is it the use of C++ that is causing this by default? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Hi, On 2021-11-22 20:13:28 -0500, Bruce Momjian wrote: > On Mon, Nov 22, 2021 at 03:57:45PM -0800, Andres Freund wrote: > > It does however change symbol binding, basically making all symbols bound > > eagerly. Which I guess theoretically could be considered an ABI change, > > because it removes the ability to intercept symbols referenced in a previously > > loaded shared library, with a subsequently loaded library (e.g. loaded with > > RTLD_DEEPBIND) function before the symbol is used. But that seems like a > > stretch. And I think most ELF platforms/linux distributions have/are moving > > towards using -Wl,-z,now -Wl,-z,relro also makes symbols bound eagerly. > > I found this really interesting, and I am surprised how things got so > suboptimal. It's always been this way on ELF afaict (*). I don't think anybody would design ELF the same way today. I think it's pretty clear that defaulting to making symbols interceptable was a bad idea. But it's where we are... > Has it always been this way? Is it the use of C++ that is causing this by > default? Nope, this is with plain C. Greetings, Andres Freund (*) I guess you can argue it got a tad worse with the increasing use of position independent executables, but that's a relatively small difference in the scope of the issue.
On 22.11.21 23:32, Tom Lane wrote: >> The easier approach for this class of issues is to use the linker option >> -Bsymbolic. > I don't recall details, but we've previously rejected the idea of > trying to use -Bsymbolic widely; apparently it has undesirable > side-effects on some platforms. See commit message for e3d77ea6b > (hopefully there's some detail in the email thread [1]). It sounds > like you're not actually proposing that, but I thought it would be > a good idea to note the hazard here. Also, IIRC, -Bsymbolic was once frowned upon by packaging policies, since it prevents use of LD_PRELOAD. I'm not sure what the current thinking there is, however.
Hi, On 2021-11-23 17:28:08 +0100, Peter Eisentraut wrote: > On 22.11.21 23:32, Tom Lane wrote: > > > The easier approach for this class of issues is to use the linker option > > > -Bsymbolic. > > I don't recall details, but we've previously rejected the idea of > > trying to use -Bsymbolic widely; apparently it has undesirable > > side-effects on some platforms. See commit message for e3d77ea6b > > (hopefully there's some detail in the email thread [1]). It sounds > > like you're not actually proposing that, but I thought it would be > > a good idea to note the hazard here. > > Also, IIRC, -Bsymbolic was once frowned upon by packaging policies, since it > prevents use of LD_PRELOAD. I'm not sure what the current thinking there > is, however. It doesn't break some (most?) of the uses of LD_PRELOAD. In particular, it doesn't break things like replacing the malloc implementation. When do you have a symbol that you want to override *inside* your library (executables already bind to their own symbols at compile time)? I've seen that for replacing buggy functions in closed source things, but that's about it? Greetings, Andres Freund
On 11/24/21 13:55, Andres Freund wrote: > Hi, > > On 2021-11-23 17:28:08 +0100, Peter Eisentraut wrote: >> On 22.11.21 23:32, Tom Lane wrote: >>>> The easier approach for this class of issues is to use the linker option >>>> -Bsymbolic. >>> I don't recall details, but we've previously rejected the idea of >>> trying to use -Bsymbolic widely; apparently it has undesirable >>> side-effects on some platforms. See commit message for e3d77ea6b >>> (hopefully there's some detail in the email thread [1]). It sounds >>> like you're not actually proposing that, but I thought it would be >>> a good idea to note the hazard here. >> Also, IIRC, -Bsymbolic was once frowned upon by packaging policies, since it >> prevents use of LD_PRELOAD. I'm not sure what the current thinking there >> is, however. > It doesn't break some (most?) of the uses of LD_PRELOAD. In particular, it > doesn't break things like replacing the malloc implementation. When do you > have a symbol that you want to override *inside* your library (executables > already bind to their own symbols at compile time)? I've seen that for > replacing buggy functions in closed source things, but that's about it? > Which things does it break exactly? I have a case where a library that is LD_PRELOADed calls PQsetSSLKeyPassHook_OpenSSL() in its constructor function. I'd be very unhappy if that stopped working (and so would our client). cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com
Hi, On 2021-11-24 17:55:03 -0500, Andrew Dunstan wrote: > On 11/24/21 13:55, Andres Freund wrote: > > On 2021-11-23 17:28:08 +0100, Peter Eisentraut wrote: > >> On 22.11.21 23:32, Tom Lane wrote: > >>>> The easier approach for this class of issues is to use the linker option > >>>> -Bsymbolic. > >>> I don't recall details, but we've previously rejected the idea of > >>> trying to use -Bsymbolic widely; apparently it has undesirable > >>> side-effects on some platforms. See commit message for e3d77ea6b > >>> (hopefully there's some detail in the email thread [1]). It sounds > >>> like you're not actually proposing that, but I thought it would be > >>> a good idea to note the hazard here. > >> Also, IIRC, -Bsymbolic was once frowned upon by packaging policies, since it > >> prevents use of LD_PRELOAD. I'm not sure what the current thinking there > >> is, however. > > It doesn't break some (most?) of the uses of LD_PRELOAD. In particular, it > > doesn't break things like replacing the malloc implementation. When do you > > have a symbol that you want to override *inside* your library (executables > > already bind to their own symbols at compile time)? I've seen that for > > replacing buggy functions in closed source things, but that's about it? > > > > Which things does it break exactly? -Bsymbolic causes symbols that are defined and referenced within one shared library to use that definition. E.g. if a shared lib has a function "do_something()" and some of its code calls do_something(), you cannot use LD_PRELOAD (or a definition in the main binary) to redirect the call to do_something() inside the shared library to something else. I.e. if a shared library calls a function that's *not* defined within that shared library, -Bsymbolic doesn't have an effect for that symbol. > I have a case where a library that > is LD_PRELOADed calls PQsetSSLKeyPassHook_OpenSSL() in its constructor > function. I'd be very unhappy if that stopped working (and so would our > client). Bsymbolic shouldn't affect that at all. Greetings, Andres Freund
On 11/24/21 22:57, Andres Freund wrote: > >> Which things does it break exactly? > -Bsymbolic causes symbols that are defined and referenced within one shared > library to use that definition. E.g. if a shared lib has a function > "do_something()" and some of its code calls do_something(), you cannot use > LD_PRELOAD (or a definition in the main binary) to redirect the call to > do_something() inside the shared library to something else. > > I.e. if a shared library calls a function that's *not* defined within that > shared library, -Bsymbolic doesn't have an effect for that symbol. > > >> I have a case where a library that >> is LD_PRELOADed calls PQsetSSLKeyPassHook_OpenSSL() in its constructor >> function. I'd be very unhappy if that stopped working (and so would our >> client). > Bsymbolic shouldn't affect that at all. > Thanks for the explanation. cheers andrew -- Andrew Dunstan EDB: https://www.enterprisedb.com