Обсуждение: v13: Performance regression related to FORTIFY_SOURCE
There was a previous thread[1], but I think it needs some wider discussion. I brought up an issue where GCC in combination with FORTIFY_SOURCE[2] causes a perf regression for logical tapes after introducing LogicalTapeSetExtend()[3]. Unfortunately, FORTIFY_SOURCE is used by default on ubuntu. I have not observed the problem with clang. There is no reason why the change should trigger the regression, but it does. The slowdown is due to GCC switching to an inlined version of memcpy() for LogicalTapeWrite() at logtape.c:768. The change[3] seems to have little if anything to do with that. GCC's Object Size Checking[4] doc says: "There are built-in functions added for many common string operation functions, e.g., for memcpy __builtin___memcpy_chk built-in is provided. This built-in has an additional last argument, which is the number of bytes remaining in the object the dest argument points to or (size_t) -1 if the size is not known. The built-in functions are optimized into the normal string functions like memcpy if the last argument is (size_t) -1 or if it is known at compile time that the destination object will not be overflowed..." In other words, if GCC knows the size of the object it tries to either verify at compile time that it will never overflow, or it inserts a runtime check. But if it doesn't know the size of the object, there's nothing it can do so it just uses memcpy() like normal. Knowing the destination buffer size at compile time would be impossible (before or after my change) because palloc() doesn't have the alloc_size attribute[5] specified. Even if it is specified (which I tried), and if the compiler was smart enough (which it's not), it could still only come up with a maximum size because the offset changes at runtime. Regardless, I tried printing out the results of: __builtin_object_size (lt->buffer + lt->pos, 0) and the result is always -1 (unknown). I have attached a workaround patch which restores the performance, and it's isolatted to logtape.c, but it's ugly (and not a little bit). The questions are: 1. Is my analysis correct? 2. What is the scale of this problem? What about other platforms or compilers? Are there other cases in PostgreSQL that might suffer from the use of FORTIFY_SOURCE? 3. Even if this is the compiler's fault, should we still fix it? 4. Does the attached fix have any dangers of regressing on other compilers/platforms? 5. Does anyone have a suggestion for a better fix? Regards, Jeff Davis [1] https://postgr.es/m/91ca648cfd1f99bf07981487a7d81a1ec926caad.camel@j-davis.com [2] https://fedoraproject.org/wiki/Security_Features?rd=Security/Features#Compile_Time_Buffer_Checks_.28FORTIFY_SOURCE.29 [3] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=24d85952 [4] https://gcc.gnu.org/onlinedocs/gcc/Object-Size-Checking.html [5] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-alloc_005fsize-function-attribute
Вложения
On Sun, Apr 19, 2020 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote: > 1. Is my analysis correct? > 2. What is the scale of this problem? What about other platforms or > compilers? Are there other cases in PostgreSQL that might suffer from > the use of FORTIFY_SOURCE? > 3. Even if this is the compiler's fault, should we still fix it? The precedent set by MemSetAligned() is that sometimes we see the code generated by very common standard library functions as a problem for us to fix, or to paper over. Is it possible that the issue has something to do with what the compiler knows about the alignment of the tapes back when they were a flexible array vs. now, where it's a separate allocation? Perhaps I'm over reaching, but it occurs to me that MemSetAligned() is itself concerned about the alignment of data returned from palloc(). Could be a similar issue here, too. Some guy on the internet says that microarchitectural issues can make __memcpy_avx_unaligned() a lot faster that the "rep movsq" instruction (which you mentioned was a factor on the other thread) in some cases [1]. This explanation sounds kind of plausible. [1] https://news.ycombinator.com/item?id=12050579 -- Peter Geoghegan
On Sun, 2020-04-19 at 16:19 -0700, Peter Geoghegan wrote: > Is it possible that the issue has something to do with what the > compiler knows about the alignment of the tapes back when they were a > flexible array vs. now, where it's a separate allocation? Perhaps I'm > over reaching, but it occurs to me that MemSetAligned() is itself > concerned about the alignment of data returned from palloc(). Could > be > a similar issue here, too. The memcpy() is for the buffer, not the array of LogicalTapes, so I don't really see how that would happen. > Some guy on the internet says that microarchitectural issues can make > __memcpy_avx_unaligned() a lot faster that the "rep movsq" > instruction > (which you mentioned was a factor on the other thread) in some cases > [1]. This explanation sounds kind of plausible. > > [1] https://news.ycombinator.com/item?id=12050579 That raises another consideration: perhaps this is not uniformly a regression, but actually faster in some situations? If so, what situations? Regards, Jeff Davis
Speaking with my RMT hat on, I'm concerned that this item is not moving forward at all. ISTM we first and foremost need to decide whether this is a problem worth worrying about, or not. If it is something worth worrying about, let's discuss what's a good fix for it. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, 2020-04-19 at 16:19 -0700, Peter Geoghegan wrote: > Is it possible that the issue has something to do with what the > compiler knows about the alignment of the tapes back when they were a > flexible array vs. now, where it's a separate allocation? Perhaps I'm > over reaching, but it occurs to me that MemSetAligned() is itself > concerned about the alignment of data returned from palloc(). Could > be > a similar issue here, too. Perhaps, but if so, what remedy would that suggest? Regards, Jeff Davis
On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote: > If it is something worth worrying about, let's discuss what's a good > fix for it. I did post a fix for it, but it's not a very clean fix. I'm slightly inclined to proceed with that fix, but I was hoping someone else would have a better suggestion. How about if I wait another week, and if we still don't have a better fix, I will commit this one. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote: >> If it is something worth worrying about, let's discuss what's a good >> fix for it. > I did post a fix for it, but it's not a very clean fix. I'm slightly > inclined to proceed with that fix, but I was hoping someone else would > have a better suggestion. > How about if I wait another week, and if we still don't have a better > fix, I will commit this one. TBH, I don't think we should do this, at least not on the strength of the evidence you posted so far. It looks to me like you are micro-optimizing for one compiler on one platform. Moreover, you're basically trying to work around a compiler codegen bug that might not be there next year. I think what'd make more sense is to file this as a gcc bug ("why doesn't it remove the useless object size check?") and see what they say about that. If the answer is that this isn't a gcc bug for whatever reason, then we could think about whether we should work around it on the source-code level. Even then, I'd want more evidence than has been presented about this not causing a regression on other toolchains/CPUs. regards, tom lane
Hi, On 2020-04-19 15:07:22 -0700, Jeff Davis wrote: > I brought up an issue where GCC in combination with FORTIFY_SOURCE[2] > causes a perf regression for logical tapes after introducing > LogicalTapeSetExtend()[3]. Unfortunately, FORTIFY_SOURCE is used by > default on ubuntu. I have not observed the problem with clang. > > There is no reason why the change should trigger the regression, but it > does. The slowdown is due to GCC switching to an inlined version of > memcpy() for LogicalTapeWrite() at logtape.c:768. The change[3] seems > to have little if anything to do with that. FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a profile shows me: │ nthistime = TapeBlockPayloadSize - lt->pos; │ if (nthistime > size) 3.01 │1 b0: cmp %rdx,%r12 1.09 │ cmovbe %r12,%rdx │ memcpy(): │ │ __fortify_function void * │ __NTH (memcpy (void *__restrict __dest, const void *__restrict __src, │ size_t __len)) │ { │ return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest)); 2.44 │ mov %r13,%rsi │ LogicalTapeWrite(): │ nthistime = size; │ Assert(nthistime > 0); │ │ memcpy(lt->buffer + lt->pos, ptr, nthistime); 2.49 │ add 0x28(%rbx),%rdi 0.28 │ mov %rdx,%r15 │ memcpy(): 4.65 │ → callq memcpy@plt │ LogicalTapeWrite(): I.e. normal memcpy is getting called. That's with -D_FORTIFY_SOURCE=2 With which compiler / libc versions did you encounter this? Greetings, Andres Freund
On Thu, 2020-06-04 at 21:41 -0400, Tom Lane wrote: > I think what'd make more sense is to file this as a gcc bug ("why > doesn't > it remove the useless object size check?") Filed: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556 Regards, Jeff Davis
On Fri, 2020-06-05 at 14:49 -0700, Andres Freund wrote: > FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a > profile shows me: ... > 4.65 │ → callq memcpy@plt > │ LogicalTapeWrite(): > > I.e. normal memcpy is getting called. > > That's with -D_FORTIFY_SOURCE=2 That's good news, although people will be using ubuntu 18.04 for a while. Just to confirm, would you mind trying the example programs in the GCC bug report? https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556 > With which compiler / libc versions did you encounter this? gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 gcc-9 (Ubuntu 9.2.1-19ubuntu1~18.04.york0) 9.2.1 20191109 libc-dev-bin/bionic,now 2.27-3ubuntu1 amd64 [installed] Regards, Jeff Davis
On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote: > If it is something worth worrying about, let's discuss what's a good > fix for it. While making a minimal test case for the GCC bug report, I found another surprisingly-small workaround. Patch attached. Regards, Jeff Davis
Вложения
Jeff Davis <pgsql@j-davis.com> writes: > On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote: >> If it is something worth worrying about, let's discuss what's a good >> fix for it. > While making a minimal test case for the GCC bug report, I found > another surprisingly-small workaround. Patch attached. Ugh :-( ... but perhaps you could get the same result like this: -#define TapeBlockPayloadSize (BLCKSZ - sizeof(TapeBlockTrailer)) +#define TapeBlockPayloadSize (BLCKSZ - (int) sizeof(TapeBlockTrailer)) Or possibly casting the whole thing to int or unsigned int would be better. Point being that I bet it's int vs long that is making the difference. regards, tom lane
On Fri, 2020-06-05 at 21:50 -0400, Tom Lane wrote: > Or possibly casting the whole thing to int or unsigned int would be > better. Point being that I bet it's int vs long that is making the > difference. That did it, and it's much more tolerable as a workaround. Thank you. I haven't tested end-to-end that it solves the problem, but I'm pretty sure it will. Regards, Jeff Davis
Hi, On 2020-06-05 18:39:28 -0700, Jeff Davis wrote: > On Fri, 2020-06-05 at 14:49 -0700, Andres Freund wrote: > > FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a > > profile shows me: > > ... > > > 4.65 │ → callq memcpy@plt > > │ LogicalTapeWrite(): > > > > I.e. normal memcpy is getting called. > > > > That's with -D_FORTIFY_SOURCE=2 > > That's good news, although people will be using ubuntu 18.04 for a > while. > > Just to confirm, would you mind trying the example programs in the GCC > bug report? > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556 I get "call memcpy@PLT" for both files. With various debian versions of gcc (7,8,9,10). But, very curiously, I do see the difference when compiling with gcc-snapshot (which is a debian package wrapping a recent snapshot from upstream gcc). Greetings, Andres Freund