Обсуждение: v13: Performance regression related to FORTIFY_SOURCE

Поиск
Список
Период
Сортировка

v13: Performance regression related to FORTIFY_SOURCE

От
Jeff Davis
Дата:
There was a previous thread[1], but I think it needs some wider
discussion.

I brought up an issue where GCC in combination with FORTIFY_SOURCE[2]
causes a perf regression for logical tapes after introducing
LogicalTapeSetExtend()[3]. Unfortunately, FORTIFY_SOURCE is used by
default on ubuntu. I have not observed the problem with clang.

There is no reason why the change should trigger the regression, but it
does. The slowdown is due to GCC switching to an inlined version of
memcpy() for LogicalTapeWrite() at logtape.c:768. The change[3] seems
to have little if anything to do with that.

GCC's Object Size Checking[4] doc says:

  "There are built-in functions added for many common
   string operation functions, e.g., for memcpy 
   __builtin___memcpy_chk built-in is provided. This
   built-in has an additional last argument, which is
   the number of bytes remaining in the object the dest
   argument points to or (size_t) -1 if the size is not
   known. The built-in functions are optimized into the
   normal string functions like memcpy if the last
   argument is (size_t) -1 or if it is known at compile
   time that the destination object will not be
   overflowed..."

In other words, if GCC knows the size of the object it tries to either
verify at compile time that it will never overflow, or it inserts a
runtime check. But if it doesn't know the size of the object, there's
nothing it can do so it just uses memcpy() like normal.

Knowing the destination buffer size at compile time would be impossible
(before or after my change) because palloc() doesn't have the
alloc_size attribute[5] specified. Even if it is specified (which I
tried), and if the compiler was smart enough (which it's not), it could
still only come up with a maximum size because the offset changes at
runtime. Regardless, I tried printing out the results of:
        __builtin_object_size (lt->buffer + lt->pos, 0)
and the result is always -1 (unknown).

I have attached a workaround patch which restores the performance, and
it's isolatted to logtape.c, but it's ugly (and not a little bit).

The questions are:

1. Is my analysis correct?
2. What is the scale of this problem? What about other platforms or
compilers? Are there other cases in PostgreSQL that might suffer from
the use of FORTIFY_SOURCE?
3. Even if this is the compiler's fault, should we still fix it?
4. Does the attached fix have any dangers of regressing on other
compilers/platforms?
5. Does anyone have a suggestion for a better fix?

Regards,
    Jeff Davis

[1] 
https://postgr.es/m/91ca648cfd1f99bf07981487a7d81a1ec926caad.camel@j-davis.com
[2] 
https://fedoraproject.org/wiki/Security_Features?rd=Security/Features#Compile_Time_Buffer_Checks_.28FORTIFY_SOURCE.29
[3] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=24d85952
[4] https://gcc.gnu.org/onlinedocs/gcc/Object-Size-Checking.html
[5] 
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-alloc_005fsize-function-attribute

Вложения

Re: v13: Performance regression related to FORTIFY_SOURCE

От
Peter Geoghegan
Дата:
On Sun, Apr 19, 2020 at 3:07 PM Jeff Davis <pgsql@j-davis.com> wrote:
> 1. Is my analysis correct?
> 2. What is the scale of this problem? What about other platforms or
> compilers? Are there other cases in PostgreSQL that might suffer from
> the use of FORTIFY_SOURCE?
> 3. Even if this is the compiler's fault, should we still fix it?

The precedent set by MemSetAligned() is that sometimes we see the code
generated by very common standard library functions as a problem for
us to fix, or to paper over.

Is it possible that the issue has something to do with what the
compiler knows about the alignment of the tapes back when they were a
flexible array vs. now, where it's a separate allocation? Perhaps I'm
over reaching, but it occurs to me that MemSetAligned() is itself
concerned about the alignment of data returned from palloc(). Could be
a similar issue here, too.

Some guy on the internet says that microarchitectural issues can make
__memcpy_avx_unaligned() a lot faster that the "rep movsq" instruction
(which you mentioned was a factor on the other thread) in some cases
[1]. This explanation sounds kind of plausible.

[1] https://news.ycombinator.com/item?id=12050579
-- 
Peter Geoghegan



Re: v13: Performance regression related to FORTIFY_SOURCE

От
Jeff Davis
Дата:
On Sun, 2020-04-19 at 16:19 -0700, Peter Geoghegan wrote:
> Is it possible that the issue has something to do with what the
> compiler knows about the alignment of the tapes back when they were a
> flexible array vs. now, where it's a separate allocation? Perhaps I'm
> over reaching, but it occurs to me that MemSetAligned() is itself
> concerned about the alignment of data returned from palloc(). Could
> be
> a similar issue here, too.

The memcpy() is for the buffer, not the array of LogicalTapes, so I
don't really see how that would happen.

> Some guy on the internet says that microarchitectural issues can make
> __memcpy_avx_unaligned() a lot faster that the "rep movsq"
> instruction
> (which you mentioned was a factor on the other thread) in some cases
> [1]. This explanation sounds kind of plausible.
> 
> [1] https://news.ycombinator.com/item?id=12050579

That raises another consideration: perhaps this is not uniformly a
regression, but actually faster in some situations? If so, what
situations?

Regards,
    Jeff Davis





Re: v13: Performance regression related to FORTIFY_SOURCE

От
Alvaro Herrera
Дата:
Speaking with my RMT hat on, I'm concerned that this item is not moving
forward at all.  ISTM we first and foremost need to decide whether this
is a problem worth worrying about, or not.

If it is something worth worrying about, let's discuss what's a good
fix for it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: v13: Performance regression related to FORTIFY_SOURCE

От
Jeff Davis
Дата:
On Sun, 2020-04-19 at 16:19 -0700, Peter Geoghegan wrote:
> Is it possible that the issue has something to do with what the
> compiler knows about the alignment of the tapes back when they were a
> flexible array vs. now, where it's a separate allocation? Perhaps I'm
> over reaching, but it occurs to me that MemSetAligned() is itself
> concerned about the alignment of data returned from palloc(). Could
> be
> a similar issue here, too.

Perhaps, but if so, what remedy would that suggest?

Regards,
    Jeff Davis





Re: v13: Performance regression related to FORTIFY_SOURCE

От
Jeff Davis
Дата:
On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote:
> If it is something worth worrying about, let's discuss what's a good
> fix for it.

I did post a fix for it, but it's not a very clean fix. I'm slightly
inclined to proceed with that fix, but I was hoping someone else would
have a better suggestion.

How about if I wait another week, and if we still don't have a better
fix, I will commit this one.

Regards,
    Jeff Davis





Re: v13: Performance regression related to FORTIFY_SOURCE

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote:
>> If it is something worth worrying about, let's discuss what's a good
>> fix for it.

> I did post a fix for it, but it's not a very clean fix. I'm slightly
> inclined to proceed with that fix, but I was hoping someone else would
> have a better suggestion.
> How about if I wait another week, and if we still don't have a better
> fix, I will commit this one.

TBH, I don't think we should do this, at least not on the strength of the
evidence you posted so far.  It looks to me like you are micro-optimizing
for one compiler on one platform.  Moreover, you're basically trying to
work around a compiler codegen bug that might not be there next year.

I think what'd make more sense is to file this as a gcc bug ("why doesn't
it remove the useless object size check?") and see what they say about
that.  If the answer is that this isn't a gcc bug for whatever reason,
then we could think about whether we should work around it on the
source-code level.  Even then, I'd want more evidence than has been
presented about this not causing a regression on other toolchains/CPUs.

            regards, tom lane



Re: v13: Performance regression related to FORTIFY_SOURCE

От
Andres Freund
Дата:
Hi,

On 2020-04-19 15:07:22 -0700, Jeff Davis wrote:
> I brought up an issue where GCC in combination with FORTIFY_SOURCE[2]
> causes a perf regression for logical tapes after introducing
> LogicalTapeSetExtend()[3]. Unfortunately, FORTIFY_SOURCE is used by
> default on ubuntu. I have not observed the problem with clang.
>
> There is no reason why the change should trigger the regression, but it
> does. The slowdown is due to GCC switching to an inlined version of
> memcpy() for LogicalTapeWrite() at logtape.c:768. The change[3] seems
> to have little if anything to do with that.

FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a
profile shows me:

       │       nthistime = TapeBlockPayloadSize - lt->pos;
       │       if (nthistime > size)
  3.01 │1  b0:   cmp          %rdx,%r12
  1.09 │         cmovbe       %r12,%rdx
       │       memcpy():
       │
       │       __fortify_function void *
       │       __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
       │       size_t __len))
       │       {
       │       return __builtin___memcpy_chk (__dest, __src, __len, __bos0 (__dest));
  2.44 │         mov          %r13,%rsi
       │       LogicalTapeWrite():
       │       nthistime = size;
       │       Assert(nthistime > 0);
       │
       │       memcpy(lt->buffer + lt->pos, ptr, nthistime);
  2.49 │         add          0x28(%rbx),%rdi
  0.28 │         mov          %rdx,%r15
       │       memcpy():
  4.65 │       → callq        memcpy@plt
       │       LogicalTapeWrite():

I.e. normal memcpy is getting called.

That's with -D_FORTIFY_SOURCE=2

With which compiler / libc versions did you encounter this?

Greetings,

Andres Freund



Re: v13: Performance regression related to FORTIFY_SOURCE

От
Jeff Davis
Дата:
On Thu, 2020-06-04 at 21:41 -0400, Tom Lane wrote:
> I think what'd make more sense is to file this as a gcc bug ("why
> doesn't
> it remove the useless object size check?") 

Filed:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556

Regards,
    Jeff Davis





Re: v13: Performance regression related to FORTIFY_SOURCE

От
Jeff Davis
Дата:
On Fri, 2020-06-05 at 14:49 -0700, Andres Freund wrote:
> FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a
> profile shows me:

...

>   4.65 │       → callq        memcpy@plt
>        │       LogicalTapeWrite():
> 
> I.e. normal memcpy is getting called.
> 
> That's with -D_FORTIFY_SOURCE=2

That's good news, although people will be using ubuntu 18.04 for a
while.

Just to confirm, would you mind trying the example programs in the GCC
bug report?

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556

> With which compiler / libc versions did you encounter this?

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
gcc-9 (Ubuntu 9.2.1-19ubuntu1~18.04.york0) 9.2.1 20191109
libc-dev-bin/bionic,now 2.27-3ubuntu1 amd64 [installed]

Regards,
    Jeff Davis





Re: v13: Performance regression related to FORTIFY_SOURCE

От
Jeff Davis
Дата:
On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote:
> If it is something worth worrying about, let's discuss what's a good
> fix for it.

While making a minimal test case for the GCC bug report, I found
another surprisingly-small workaround. Patch attached.

Regards,
    Jeff Davis


Вложения

Re: v13: Performance regression related to FORTIFY_SOURCE

От
Tom Lane
Дата:
Jeff Davis <pgsql@j-davis.com> writes:
> On Thu, 2020-06-04 at 16:35 -0400, Alvaro Herrera wrote:
>> If it is something worth worrying about, let's discuss what's a good
>> fix for it.

> While making a minimal test case for the GCC bug report, I found
> another surprisingly-small workaround. Patch attached.

Ugh :-( ... but perhaps you could get the same result like this:

-#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
+#define TapeBlockPayloadSize  (BLCKSZ - (int) sizeof(TapeBlockTrailer))

Or possibly casting the whole thing to int or unsigned int would be
better.  Point being that I bet it's int vs long that is making the
difference.

            regards, tom lane



Re: v13: Performance regression related to FORTIFY_SOURCE

От
Jeff Davis
Дата:
On Fri, 2020-06-05 at 21:50 -0400, Tom Lane wrote:
> Or possibly casting the whole thing to int or unsigned int would be
> better.  Point being that I bet it's int vs long that is making the
> difference.

That did it, and it's much more tolerable as a workaround. Thank you.

I haven't tested end-to-end that it solves the problem, but I'm pretty
sure it will.

Regards,
    Jeff Davis





Re: v13: Performance regression related to FORTIFY_SOURCE

От
Andres Freund
Дата:
Hi,

On 2020-06-05 18:39:28 -0700, Jeff Davis wrote:
> On Fri, 2020-06-05 at 14:49 -0700, Andres Freund wrote:
> > FWIW, with gcc 10 and glibc 2.30 I don't see such a switch. Taking a
> > profile shows me:
> 
> ...
> 
> >   4.65 │       → callq        memcpy@plt
> >        │       LogicalTapeWrite():
> > 
> > I.e. normal memcpy is getting called.
> > 
> > That's with -D_FORTIFY_SOURCE=2
> 
> That's good news, although people will be using ubuntu 18.04 for a
> while.
> 
> Just to confirm, would you mind trying the example programs in the GCC
> bug report?
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95556

I get "call memcpy@PLT" for both files. With various debian versions of
gcc (7,8,9,10). But, very curiously, I do see the difference when
compiling with gcc-snapshot (which is a debian package wrapping a recent
snapshot from upstream gcc).

Greetings,

Andres Freund