aarch64 build uses very slow assembly, it is fixable
From | Daniel Farina |
---|---|
Subject | aarch64 build uses very slow assembly, it is fixable |
Date | |
Msg-id | CACN56+P1astF5zvocrT7--Mu2dQWFS0eQ31xNmX=b=98y9fMSw@mail.gmail.com |
List | pgsql-pkg-yum |
So, I was microbenchmarking on AWS Graviton 2, a.k.a. Neoverse N1-ish processors (on c6g and m6g instances), and noticed that TPS was sensitive to the number of clients, dropping to low throughputs particularly around 32 clients:

m6g.16xlarge, scale factor 1, select-only, 386.7K TPS, pgbench -S -j 4 --time=60 --client=34
m6g.16xlarge, scale factor 1, select-only, 596.3K TPS, pgbench -S -j 4 --time=60 --client=33
m6g.16xlarge, scale factor 1, select-only, 670.5K TPS, pgbench -S -j 4 --time=60 --client=32
m6g.16xlarge, scale factor 1, select-only, 641.4K TPS, pgbench -S -j 4 --time=60 --client=30

If you increase clients further, this can decrease to 145K TPS or worse, and the bulk of the time is spent in LWLockAcquire.

This email https://www.postgresql.org/message-id/099F69EE-51D3-4214-934A-1F28C0A1A7A7@amazon.com reports some weaknesses in the instructions generated for aarch64, but it does not suggest an improvement of this magnitude... yet there is one: once "casal" is used (obtainable by a number of means), I can get 591K TPS even with 100 clients, instead of 145K. It doesn't improve the best case by much, but it degrades far more gracefully while offering more throughput.

In the profiler, the difference between using "casal" and "ldaxr"/"stlxr" is whether postgres spends a majority of its time in snapshot acquisition or barely any, and how steep the degradation ramp is as connections are added. Once the atomic operations are out of the way, heavier memory loads and stores in the planner become the new bottleneck... a pretty big improvement.

Okay, so the gains are substantial. How do we get these instructions emitted?

One option is to compile with -march=armv8.2-a. This works on older compilers, but the resulting binary will break on an ARM chip without the right features. This is how I started experimenting.

Another is to use -moutline-atomics, as the previous mailing list post mentions. This is available only in newer GCCs than I can easily get on CentOS 8.
CentOS 8 comes with GCC 8.3.1, and gcc-toolset-9 loads 9.2.1, which also doesn't include it; per https://gcc.gnu.org/gcc-9/changes.html, GCC 9.4 is required. Here's the patch introducing outline-atomics:

commit 3950b229a5ed6710f30241c2ddc3c74909bf4740
Author: Richard Henderson <richard.henderson@linaro.org>
Date:   Thu Sep 19 14:36:43 2019 +0000

    aarch64: Implement -moutline-atomics

More recently, -moutline-atomics became the default:

commit cd4b68527988f42c10c0d6c10e812d299887e0c2
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Thu Apr 30 13:12:13 2020 +0100

    [AArch64] Make -moutline-atomics on by default

Given that I did not identify an easy way to obtain an RPM with any compiler new enough to have -moutline-atomics at all, on or off by default, I compiled a new version of GCC, 10.2, and ran ./configure without additional flags (save symbols, for disassembly). It works, and emits hybrid assembly that selects between "casal" and "ldaxr"/"stlxr" at run time.

I also attempted to measure any gains from using -mtune=neoverse-n1, available in such a new GCC, which not only avoids emitting the generic atomics code, but changes various cost metrics and so on as well. This was worth maybe a 1% gain or less, and I hardly think the small improvement comes from avoiding a couple of branches around the CAS -- there's just too little time spent in that part of the program either way.

So, unfortunately, this does not leave fantastic options for generating good code on CentOS, for lack of handy GCC versions. But I wanted to let you know of these limitations in what's commonly available, and that the problem is solvable... and worth solving.

Included are disassemblies for reference.
With gcc 8, per normal CentOS. Bad, very slow, no indirection, "ldaxr"/"stlxr":

Disassembly of section .text:

0000000000897084 <pg_atomic_compare_exchange_u32_impl>:
pg_atomic_compare_exchange_u32_impl():
             #if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__ATOMIC_INT32_CAS)
             #define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
             static inline bool
             pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
                                                 uint32 *expected, uint32 newval)
             {
  0.00         sub    sp, sp, #0x20
  0.00         str    x0, [sp, #24]
  0.01         str    x1, [sp, #16]
  0.00         str    w2, [sp, #12]
             /* FIXME: we can probably use a lower consistency model */
             return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
  0.00         ldr    x0, [sp, #24]
  0.02         ldr    x1, [sp, #16]
  0.01         ldr    w1, [x1]
  0.01         ldr    w3, [sp, #12]
 26.86   20:   ldaxr  w2, [x0]
 72.04         cmp    w2, w1
             ↓ b.ne   34
  0.04         stlxr  w4, w3, [x0]
  0.04       ↑ cbnz   w4, 20
  0.89   34:   cset   w0, eq  // eq = none
  0.01         cmp    w0, #0x0
  0.00       ↓ b.ne   48
  0.03         ldr    x1, [sp, #16]
  0.01         str    w2, [x1]
             __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
             }
  0.01   48:   add    sp, sp, #0x20
             ← ret

Good, fast, no indirection, "casal", -march=armv8.2-a on an older compiler:

Disassembly of section .text:

0000000000896d58 <pg_atomic_compare_exchange_u32_impl>:
pg_atomic_compare_exchange_u32_impl():
             #if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__ATOMIC_INT32_CAS)
             #define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
             static inline bool
             pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
                                                 uint32 *expected, uint32 newval)
             {
  0.16         sub    sp, sp, #0x20
  0.17         str    x0, [sp, #24]
  0.60         str    x1, [sp, #16]
               str    w2, [sp, #12]
             /* FIXME: we can probably use a lower consistency model */
             return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
  0.33         ldr    x0, [sp, #24]
  1.27         ldr    x1, [sp, #16]
  0.61         ldr    w1, [x1]
  1.82         ldr    w3, [sp, #12]
  1.98         mov    w2, w1
               casal  w2, w3, [x0]
 87.18         cmp    w2, w1
               cset   w0, eq  // eq = none
               cmp    w0, #0x0
             ↓ b.ne   40
  0.22         ldr    x1, [sp, #16]
  0.33         str    w2, [x1]
             __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
             }
  5.23   40:   add    sp, sp, #0x20
  0.11       ← ret

Okay, moving on to the gcc 10.2 disassembly. This is what outline-atomics looks like:

Disassembly of section .text:

00000000008ab5a4 <pg_atomic_compare_exchange_u32_impl>:
pg_atomic_compare_exchange_u32_impl():
             #if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__ATOMIC_INT32_CAS)
             #define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
             static inline bool
             pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
                                                 uint32 *expected, uint32 newval)
             {
  0.73         stp    x29, x30, [sp, #-64]!
  5.49         mov    x29, sp
               str    x19, [sp, #16]
  1.81         str    x0, [sp, #56]
  5.49         str    x1, [sp, #48]
  1.10         str    w2, [sp, #44]
             /* FIXME: we can probably use a lower consistency model */
             return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
               ldr    x1, [sp, #56]
  6.58         ldr    x0, [sp, #48]
  3.29         ldr    w19, [x0]
 15.00         ldr    w0, [sp, #44]
 11.33         mov    x2, x1
               mov    w1, w0
               mov    w0, w19
             → bl     __aarch64_cas4_acq_rel
               cmp    w0, w19
               mov    w2, w0
               cset   w0, eq  // eq = none
               cmp    w0, #0x0
             ↓ b.ne   54
  2.56         ldr    x1, [sp, #48]
  3.29         str    w2, [x1]
             __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
             }
 33.86   54:   ldr    x19, [sp, #16]
  4.00         ldp    x29, x30, [sp], #64
  5.49       ← ret

and inside __aarch64_cas4_acq_rel:

Disassembly of section .text:

0000000000ab2d30 <__aarch64_cas4_acq_rel>:
__aarch64_cas4_acq_rel():
             cbz w(tmp0), \label
             .endm
             #ifdef L_cas
             STARTFN NAME(cas)
  0.46         hint   #0x22
             JUMP_IF_NOT_LSE 8f
  0.21         adrp   x16, hist_entries+0x1f750
               ldrb   w16, [x16, #2260]
  2.30       ↓ cbz    w16, 18
             # define CAS glue4(cas, A, L, S) s(0), s(1), [x2]
             #else
             # define CAS .inst 0x08a07c41 + B + M
             #endif
             CAS /* s(0), s(1), [x2] */
  0.75         casal  w0, w1, [x2]
             ret
 96.27       ← ret
             8: UXT s(tmp0), s(0)
         18:   mov    w16, w0
             0: LDXR s(0), [x2]
         1c:   ldaxr  w0, [x2]
             cmp s(0), s(tmp0)
               cmp    w0, w16
             bne 1f
             ↓ b.ne   30
             STXR w(tmp1), s(1), [x2]
               stlxr  w17, w1, [x2]
             cbnz w(tmp1), 0b
             ↑ cbnz   w17, 1c
             1: ret
         30: ← ret

You can see the bad ldaxr code at the bottom, never executed.