aarch64 build uses very slow assembly, it is fixable
From | Daniel Farina |
---|---|
Subject | aarch64 build uses very slow assembly, it is fixable |
Date | |
Msg-id | CACN56+P1astF5zvocrT7--Mu2dQWFS0eQ31xNmX=b=98y9fMSw@mail.gmail.com |
List | pgsql-pkg-yum |
So, I was microbenchmarking on AWS Graviton 2, a.k.a. Neoverse N1-ish processors (on c6g and m6g instances), and noticed that TPS was sensitive to the number of clients, dropping to low throughputs particularly around 32 clients:

m6g.16xlarge, scale factor 1, select-only, 386.7K TPS, pgbench -S -j 4 --time=60 --client=34
m6g.16xlarge, scale factor 1, select-only, 596.3K TPS, pgbench -S -j 4 --time=60 --client=33
m6g.16xlarge, scale factor 1, select-only, 670.5K TPS, pgbench -S -j 4 --time=60 --client=32
m6g.16xlarge, scale factor 1, select-only, 641.4K TPS, pgbench -S -j 4 --time=60 --client=30

If you increase clients further, this can decrease to 145K TPS or worse, and the bulk of the time is spent in LWLockAcquire.

This email https://www.postgresql.org/message-id/099F69EE-51D3-4214-934A-1F28C0A1A7A7@amazon.com reports some weaknesses in the instructions generated for aarch64, but it does not suggest an improvement of this magnitude... yet there is one: once "casal" is used (obtainable by a number of means), I can get 591K TPS even with 100 clients, instead of 145K. It doesn't improve the best case by much, but it degrades far more gracefully while offering more throughput.

In the profiler, the difference between using "casal" and "ldaxr"/"stlxr" is whether postgres spends a majority of its time in snapshot acquisition or barely any, and how steep the degradation ramp is as connections are added. Once the atomic operations are out of the way, heavier memory loads and stores in the planner become the new bottleneck... a pretty big improvement.

Okay, so the gains are substantial. How do we get these instructions emitted?

One option is to compile with -march=armv8.2-a. This works on older compilers, but the resulting binary will break on an ARM chip without the right features. This is how I started experimenting.

Another is to use -moutline-atomics, as the previous mailing list post mentions. This is available only in newer GCCs than I can easily get on CentOS 8.
CentOS 8 comes with GCC 8.3.1, and gcc-toolset-9 loads 9.2.1, which also doesn't include it; per https://gcc.gnu.org/gcc-9/changes.html, GCC 9.4 is required. Here's the patch introducing outline-atomics:

commit 3950b229a5ed6710f30241c2ddc3c74909bf4740
Author: Richard Henderson <richard.henderson@linaro.org>
Date:   Thu Sep 19 14:36:43 2019 +0000

    aarch64: Implement -moutline-atomics

More recently, -moutline-atomics became the default:

commit cd4b68527988f42c10c0d6c10e812d299887e0c2
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Thu Apr 30 13:12:13 2020 +0100

    [AArch64] Make -moutline-atomics on by default

Given that I did not identify an easy way to obtain an RPM with any compiler new enough to have -moutline-atomics at all, on or off by default, I compiled a new version of GCC, 10.2, and ran ./configure without additional flags (save symbols, for disassembly). It works, and emits hybrid assembly that selects between "casal" and "ldaxr"/"stlxr" at run time.

I also attempted to measure any gains from using -mtune=neoverse-n1, available in such a new GCC, which not only avoids emitting the generic atomics code, but changes various cost metrics and so on as well. This was worth maybe a 1% gain or less, and I hardly think the small improvement comes from avoiding a couple of branches around the CAS -- there's just too little time spent in that part of the program either way.

So, unfortunately, this does not leave fantastic options for generating good code on CentOS, for lack of handy GCC versions. But I wanted to let you know of these limitations in what's commonly available, and that the problem is solvable... and worth solving.

Included are disassemblies for reference.
With gcc 8, per normal CentOS. Bad, very slow, no indirection, "ldaxr"/"stlxr":

Disassembly of section .text:

0000000000897084 <pg_atomic_compare_exchange_u32_impl>:
pg_atomic_compare_exchange_u32_impl():
             #if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__ATOMIC_INT32_CAS)
             #define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
             static inline bool
             pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
                                                 uint32 *expected, uint32 newval)
             {
  0.00         sub    sp, sp, #0x20
  0.00         str    x0, [sp, #24]
  0.01         str    x1, [sp, #16]
  0.00         str    w2, [sp, #12]
             /* FIXME: we can probably use a lower consistency model */
             return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
  0.00         ldr    x0, [sp, #24]
  0.02         ldr    x1, [sp, #16]
  0.01         ldr    w1, [x1]
  0.01         ldr    w3, [sp, #12]
 26.86   20:   ldaxr  w2, [x0]
 72.04         cmp    w2, w1
             ↓ b.ne   34
  0.04         stlxr  w4, w3, [x0]
  0.04       ↑ cbnz   w4, 20
  0.89   34:   cset   w0, eq  // eq = none
  0.01         cmp    w0, #0x0
  0.00       ↓ b.ne   48
  0.03         ldr    x1, [sp, #16]
  0.01         str    w2, [x1]
             __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
             }
  0.01   48:   add    sp, sp, #0x20
             ← ret

Good, fast, no indirection, "casal", -march=armv8.2-a on an older compiler:

Disassembly of section .text:

0000000000896d58 <pg_atomic_compare_exchange_u32_impl>:
pg_atomic_compare_exchange_u32_impl():
             #if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__ATOMIC_INT32_CAS)
             #define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
             static inline bool
             pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
                                                 uint32 *expected, uint32 newval)
             {
  0.16         sub    sp, sp, #0x20
  0.17         str    x0, [sp, #24]
  0.60         str    x1, [sp, #16]
               str    w2, [sp, #12]
             /* FIXME: we can probably use a lower consistency model */
             return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
  0.33         ldr    x0, [sp, #24]
  1.27         ldr    x1, [sp, #16]
  0.61         ldr    w1, [x1]
  1.82         ldr    w3, [sp, #12]
  1.98         mov    w2, w1
               casal  w2, w3, [x0]
 87.18         cmp    w2, w1
               cset   w0, eq  // eq = none
               cmp    w0, #0x0
             ↓ b.ne   40
  0.22         ldr    x1, [sp, #16]
  0.33         str    w2, [x1]
             __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
             }
  5.23   40:   add    sp, sp, #0x20
  0.11       ← ret

Okay, moving on to the gcc 10.2 disassembly. This is what outline-atomics looks like:

Disassembly of section .text:

00000000008ab5a4 <pg_atomic_compare_exchange_u32_impl>:
pg_atomic_compare_exchange_u32_impl():
             #if !defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32) && defined(HAVE_GCC__ATOMIC_INT32_CAS)
             #define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
             static inline bool
             pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
                                                 uint32 *expected, uint32 newval)
             {
  0.73         stp    x29, x30, [sp, #-64]!
  5.49         mov    x29, sp
               str    x19, [sp, #16]
  1.81         str    x0, [sp, #56]
  5.49         str    x1, [sp, #48]
  1.10         str    w2, [sp, #44]
             /* FIXME: we can probably use a lower consistency model */
             return __atomic_compare_exchange_n(&ptr->value, expected, newval, false,
               ldr    x1, [sp, #56]
  6.58         ldr    x0, [sp, #48]
  3.29         ldr    w19, [x0]
 15.00         ldr    w0, [sp, #44]
 11.33         mov    x2, x1
               mov    w1, w0
               mov    w0, w19
             → bl     __aarch64_cas4_acq_rel
               cmp    w0, w19
               mov    w2, w0
               cset   w0, eq  // eq = none
               cmp    w0, #0x0
             ↓ b.ne   54
  2.56         ldr    x1, [sp, #48]
  3.29         str    w2, [x1]
             __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
             }
 33.86   54:   ldr    x19, [sp, #16]
  4.00         ldp    x29, x30, [sp], #64
  5.49       ← ret

and inside __aarch64_cas4_acq_rel:

Disassembly of section .text:

0000000000ab2d30 <__aarch64_cas4_acq_rel>:
__aarch64_cas4_acq_rel():
             cbz w(tmp0), \label
             .endm
             #ifdef L_cas
             STARTFN NAME(cas)
  0.46         hint   #0x22
             JUMP_IF_NOT_LSE 8f
  0.21         adrp   x16, hist_entries+0x1f750
               ldrb   w16, [x16, #2260]
  2.30       ↓ cbz    w16, 18
             # define CAS glue4(cas, A, L, S) s(0), s(1), [x2]
             #else
             # define CAS .inst 0x08a07c41 + B + M
             #endif
             CAS /* s(0), s(1), [x2] */
  0.75         casal  w0, w1, [x2]
             ret
 96.27       ← ret
             8: UXT s(tmp0), s(0)
         18:   mov    w16, w0
             0: LDXR s(0), [x2]
         1c:   ldaxr  w0, [x2]
             cmp s(0), s(tmp0)
               cmp    w0, w16
             bne 1f
             ↓ b.ne   30
             STXR w(tmp1), s(1), [x2]
               stlxr  w17, w1, [x2]
             cbnz w(tmp1), 0b
             ↑ cbnz   w17, 1c
             1: ret
         30: ← ret

You can see the bad ldaxr code at the bottom, never executed.