Discussion: PSA: New intel MDS vulnerability mitigations cause measurable slowdown
Hi,

There's a new set of CPU vulnerabilities, so far only affecting Intel CPUs. Cribbing from the linux-kernel announcement, I'm referring to https://xenbits.xen.org/xsa/advisory-297.html for details.

The "fix" is for the OS to perform some extra mitigations:
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html
https://www.kernel.org/doc/html/latest/x86/mds.html#mds

*And* SMT/hyperthreading needs to be disabled, to be fully safe. Fun.

I've run a quick pgbench benchmark:

*Without* disabling SMT, for readonly pgbench, I'm seeing regressions between 7-11%, depending on the size of shared_buffers (and some runtime variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU. I'd be surprised if there weren't adversarial loads with bigger slowdowns - what gets more expensive with the mitigations is syscalls.

Most OSs / distributions either have rolled these changes out already, or will do so soon, so it's likely that most of us and our users will be affected by this shortly. At least on Linux, the part of the mitigation that makes syscalls slower (blowing away CPU buffers at the end of a syscall) is enabled by default, but SMT is not disabled by default.

Greetings,

Andres Freund
On Wed, May 15, 2019 at 10:31 AM Andres Freund <andres@anarazel.de> wrote:
> *Without* disabling SMT, for readonly pgbench, I'm seeing regressions
> between 7-11%, depending on the size of shared_buffers (and some runtime
> variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU.
> I'd be surprised if there weren't adversarial loads with bigger
> slowdowns - what gets more expensive with the mitigations is syscalls.

Yikes. This is all in warm shared buffers, right? So effectively this is the cost of recvfrom() and sendto() going up? Did you use -M prepared? If not, there would also be a couple of lseek(SEEK_END) calls in between for planning... I wonder how many more syscall-taxing mitigations we need before relation size caching pays off.

--
Thomas Munro
https://enterprisedb.com
Hi,

On 2019-05-15 12:52:47 +1200, Thomas Munro wrote:
> On Wed, May 15, 2019 at 10:31 AM Andres Freund <andres@anarazel.de> wrote:
> > *Without* disabling SMT, for readonly pgbench, I'm seeing regressions
> > between 7-11%, depending on the size of shared_buffers (and some runtime
> > variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU.
> > I'd be surprised if there weren't adversarial loads with bigger
> > slowdowns - what gets more expensive with the mitigations is syscalls.
>
> Yikes. This all in warm shared buffers, right?

Not initially, but it ought to warm up quite quickly. I ran something boiling down to:

  pgbench -q -i -s 200; psql -c 'vacuum (freeze, analyze, verbose)'; pgbench -n -S -c 32 -j 32 -M prepared -T 100 -P 1

As both pgbench -i's COPY and VACUUM use ringbuffers, initially s_b will effectively be empty.

> So effectively this is the cost of recvfrom() and sendto() going up?

Plus epoll_wait(). And read(), for the cases where s_b was smaller than the data.

> Did you use -M prepared?

Yes.

> If not, there would also be a couple of lseek(SEEK_END) calls in
> between for planning... I wonder how many more syscall-taxing
> mitigations we need before relation size caching pays off.

Yea, I suspect we're going to have to go there soon for a number of reasons.

- Andres
Hi,

On 2019-05-14 15:30:52 -0700, Andres Freund wrote:
> There's a new set of CPU vulnerabilities, so far only affecting intel
> CPUs. Cribbing from the linux-kernel announcement I'm referring to
> https://xenbits.xen.org/xsa/advisory-297.html
> for details.
>
> The "fix" is for the OS to perform some extra mitigations:
> https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html
> https://www.kernel.org/doc/html/latest/x86/mds.html#mds
>
> *And* SMT/hyperthreading needs to be disabled, to be fully safe.
>
> Fun.
>
> I've run a quick pgbench benchmark:
>
> *Without* disabling SMT, for readonly pgbench, I'm seeing regressions
> between 7-11%, depending on the size of shared_buffers (and some runtime
> variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU.
> I'd be surprised if there weren't adversarial loads with bigger
> slowdowns - what gets more expensive with the mitigations is syscalls.

The profile after the mitigations looks like:

+   3.62%  postgres  [kernel.vmlinux]  [k] do_syscall_64
+   2.99%  postgres  postgres          [.] _bt_compare
+   2.76%  postgres  postgres          [.] hash_search_with_hash_value
+   2.33%  postgres  [kernel.vmlinux]  [k] entry_SYSCALL_64
+   1.69%  pgbench   [kernel.vmlinux]  [k] do_syscall_64
+   1.61%  postgres  postgres          [.] AllocSetAlloc
    1.41%  postgres  postgres          [.] PostgresMain
+   1.22%  pgbench   [kernel.vmlinux]  [k] entry_SYSCALL_64
+   1.11%  postgres  postgres          [.] LWLockAcquire
+   0.86%  postgres  postgres          [.] PinBuffer
+   0.80%  postgres  postgres          [.] LockAcquireExtended
+   0.78%  postgres  [kernel.vmlinux]  [k] psi_task_change
    0.76%  pgbench   pgbench           [.] threadRun
    0.69%  postgres  postgres          [.] LWLockRelease
+   0.69%  postgres  postgres          [.] SearchCatCache1
    0.66%  postgres  postgres          [.] LockReleaseAll
+   0.65%  postgres  postgres          [.] GetSnapshotData
+   0.58%  postgres  postgres          [.] hash_seq_search
    0.54%  postgres  postgres          [.] hash_search
+   0.53%  postgres  [kernel.vmlinux]  [k] __switch_to
+   0.53%  postgres  postgres          [.] hash_any
    0.52%  pgbench   libpq.so.5.12     [.] pqParseInput3
    0.50%  pgbench   [kernel.vmlinux]  [k] do_raw_spin_lock

where do_syscall_64 shows this instruction profile:

       │      static __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)
       │      {
       │          asm_volatile_goto("1:"
  1.58 │    ↓ jmpq   bd
       │      mds_clear_cpu_buffers():
       │       * Works with any segment selector, but a valid writable
       │       * data segment is the fastest variant.
       │       *
       │       * "cc" clobber is required because VERW modifies ZF.
       │       */
       │          asm volatile("verw %[ds]" : : [ds] "m" (ds) : "cc");
 77.38 │      verw   0x13fea53(%rip)   # ffffffff82400ee0 <ds.4768>
       │      do_syscall_64():
       │      }
       │
       │          syscall_return_slowpath(regs);
       │      }
 13.18 │ bd:  pop    %rbx
  0.08 │      pop    %rbp
       │    ← retq
       │          nr = syscall_trace_enter(regs);
       │ c0:  mov    %rbp,%rdi
       │    → callq  syscall_trace_enter

where verw is the instruction that was recycled to now have the side-effect of flushing CPU buffers.

Greetings,

Andres Freund
On Wed, May 15, 2019 at 1:13 PM Andres Freund <andres@anarazel.de> wrote:
> > I've run a quick pgbench benchmark:
> >
> > *Without* disabling SMT, for readonly pgbench, I'm seeing regressions
> > between 7-11%, depending on the size of shared_buffers (and some runtime
> > variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU.
> > I'd be surprised if there weren't adversarial loads with bigger
> > slowdowns - what gets more expensive with the mitigations is syscalls.

This stuff landed in my FreeBSD 13.0-CURRENT kernel, so I was curious to measure it with and without the earlier mitigations. On my humble i7-8550U laptop with the new 1.22 microcode installed, with my usual settings of PTI=on and IBRS=off, so far MDS=VERW gives me ~1.5% loss of TPS with a single client, up to 4.3% loss of TPS for 16 clients, but it didn't go higher when I tried 32 clients. This was a tiny scale 10 database, though in a quick test it didn't look like it was worse with scale 100.

With all three mitigations activated, my little dev machine has gone from being able to do ~11.8 million baseline syscalls per second to ~1.6 million, or ~1.4 million with the AVX variant of the mitigation.
Raw getuid() syscalls per second:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off   11798658    4764159   3274043
 off   on     2652564    1941606   1655356
 on    off    4973053    2932906   2339779
 on    on     1988527    1556922   1378798

pgbench read-only transactions per second, 1 client thread:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off      19393      18949     18615
 off   on       17946      17586     17323
 on    off      19381      19015     18696
 on    on       18045      17709     17418

pgbench -M prepared read-only transactions per second, 1 client thread:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off      35020      34049     33200
 off   on       31658      30902     30229
 on    off      35445      34353     33415
 on    on       32415      31599     30712

pgbench -M prepared read-only transactions per second, 4 client threads:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off      79515      76898     76465
 off   on       63608      62220     61952
 on    off      77863      75431     74847
 on    on       62709      60790     60575

pgbench -M prepared read-only transactions per second, 16 client threads:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off     125984     121164    120468
 off   on      112884     108346    107984
 on    off     121032     116156    115462
 on    on      108889     104636    104027

time gmake -s check:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off      16.78      16.85     17.03
 off   on       18.19      18.81     19.08
 on    off      16.67      16.86     17.33
 on    on       18.58      18.83     18.99

--
Thomas Munro
https://enterprisedb.com
Re: PSA: New intel MDS vulnerability mitigations cause measurable slowdown

From: Albert Cervera i Areny
Message from Thomas Munro <thomas.munro@gmail.com> on Thu, May 16, 2019 at 13:09:
>
> On Wed, May 15, 2019 at 1:13 PM Andres Freund <andres@anarazel.de> wrote:
> > > I've run a quick pgbench benchmark:
> > >
> > > *Without* disabling SMT, for readonly pgbench, I'm seeing regressions
> > > between 7-11%, depending on the size of shared_buffers (and some runtime
> > > variations). That's just on my laptop, with an i7-6820HQ / Haswell CPU.
> > > I'd be surprised if there weren't adversarial loads with bigger
> > > slowdowns - what gets more expensive with the mitigations is syscalls.
>
> This stuff landed in my FreeBSD 13.0-CURRENT kernel, so I was curious
> to measure it with and without the earlier mitigations. On my humble
> i7-8550U laptop with the new 1.22 microcode installed, with my usual
> settings of PTI=on and IBRS=off, so far MDS=VERW gives me ~1.5% loss
> of TPS with a single client, up to 4.3% loss of TPS for 16 clients,
> but it didn't go higher when I tried 32 clients. This was a tiny
> scale 10 database, though in a quick test it didn't look like it was
> worse with scale 100.
>
> With all three mitigations activated, my little dev machine has gone
> from being able to do ~11.8 million baseline syscalls per second to

Did you mean "1.8"?

> ~1.6 million, or ~1.4 million with the AVX variant of the mitigation.
>
> Raw getuid() syscalls per second:
>
>  PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
> ===== =====  ========   ========   =======
>  off   off   11798658    4764159   3274043
>  off   on     2652564    1941606   1655356
>  on    off    4973053    2932906   2339779
>  on    on     1988527    1556922   1378798
>
> pgbench read-only transactions per second, 1 client thread:
>
>  PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
> ===== =====  ========   ========   =======
>  off   off      19393      18949     18615
>  off   on       17946      17586     17323
>  on    off      19381      19015     18696
>  on    on       18045      17709     17418
>
> pgbench -M prepared read-only transactions per second, 1 client thread:
>
>  PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
> ===== =====  ========   ========   =======
>  off   off      35020      34049     33200
>  off   on       31658      30902     30229
>  on    off      35445      34353     33415
>  on    on       32415      31599     30712
>
> pgbench -M prepared read-only transactions per second, 4 client threads:
>
>  PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
> ===== =====  ========   ========   =======
>  off   off      79515      76898     76465
>  off   on       63608      62220     61952
>  on    off      77863      75431     74847
>  on    on       62709      60790     60575
>
> pgbench -M prepared read-only transactions per second, 16 client threads:
>
>  PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
> ===== =====  ========   ========   =======
>  off   off     125984     121164    120468
>  off   on      112884     108346    107984
>  on    off     121032     116156    115462
>  on    on      108889     104636    104027
>
> time gmake -s check:
>
>  PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
> ===== =====  ========   ========   =======
>  off   off      16.78      16.85     17.03
>  off   on       18.19      18.81     19.08
>  on    off      16.67      16.86     17.33
>  on    on       18.58      18.83     18.99
>
> --
> Thomas Munro
> https://enterprisedb.com

--
Albert Cervera i Areny
http://www.NaN-tic.com
Tel. 93 553 18 03
On 5/16/19 12:24 PM, Albert Cervera i Areny wrote:
> Message from Thomas Munro <thomas.munro@gmail.com> on Thu, May 16,
> 2019 at 13:09:
>> With all three mitigations activated, my little dev machine has gone
>> from being able to do ~11.8 million baseline syscalls per second to
>
> Did you mean "1.8"?

Not in what I thought I saw:

>> ~1.6 million, or ~1.4 million ...
>>
>>  PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
>> ===== =====  ========   ========   =======
>>  off   off   11798658    4764159   3274043
               ^^^^^^^^
>>  off   on     2652564    1941606   1655356
>>  on    off    4973053    2932906   2339779
>>  on    on     1988527    1556922   1378798
                            ^^^^^^^   ^^^^^^^

-Chap
On Fri, May 17, 2019 at 5:26 AM Chapman Flack <chap@anastigmatix.net> wrote:
> On 5/16/19 12:24 PM, Albert Cervera i Areny wrote:
> > Message from Thomas Munro <thomas.munro@gmail.com> on Thu, May 16,
> > 2019 at 13:09:
> >> With all three mitigations activated, my little dev machine has gone
> >> from being able to do ~11.8 million baseline syscalls per second to
> >
> > Did you mean "1.8"?
>
> Not in what I thought I saw:
>
> >> ~1.6 million, or ~1.4 million ...
> >>
> >>  PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
> >> ===== =====  ========   ========   =======
> >>  off   off   11798658    4764159   3274043
>                ^^^^^^^^
> >>  off   on     2652564    1941606   1655356
> >>  on    off    4973053    2932906   2339779
> >>  on    on     1988527    1556922   1378798
>                             ^^^^^^^   ^^^^^^^

Right. Actually it's worse than that -- after I posted I realised that I had some debug stuff enabled in my kernel that was slowing things down a bit, so I reran the tests overnight with a production kernel, and here is what I see this morning. It's actually ~17.8 million syscalls/sec -> ~1.7 million syscalls/sec if you go from all mitigations off to all mitigations on, or -> ~3.2 million for just PTI + MDS. And the loss of TPS is ~5% for the case I was most interested in, just turning on MDS=VERW if you already had PTI on and IBRS off.
Raw getuid() syscalls per second:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off   17771744    5372032   3575035
 off   on     3060923    2166527   1817052
 on    off    5622591    3150883   2463934
 on    on     2213190    1687748   1475605

pgbench read-only transactions per second, 1 client thread:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off      22414      22103     21571
 off   on       21298      20817     20418
 on    off      22473      22080     21550
 on    on       21286      20850     20386

pgbench -M prepared read-only transactions per second, 1 client thread:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off      43508      42476     41123
 off   on       40729      39483     38555
 on    off      44110      42989     42012
 on    on       41143      39990     38798

pgbench -M prepared read-only transactions per second, 4 client threads:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off     100735      97689     96662
 off   on       80142      77804     77064
 on    off     100540      97010     95827
 on    on       79492      76976     76226

pgbench -M prepared read-only transactions per second, 16 client threads:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off     161015     152978    152556
 off   on      145605     139438    139179
 on    off     155359     147691    146987
 on    on      140976     134978    134177

pgbench -M prepared read-only transactions per second, 16 client threads:

 PTI   IBRS   MDS=off   MDS=VERW   MDS=AVX
===== =====  ========   ========   =======
 off   off     157986     150132    149436
 off   on      142618     136220    135901
 on    off     153482     146214    145839
 on    on      138650     133074    132142

--
Thomas Munro
https://enterprisedb.com
On Fri, May 17, 2019 at 9:42 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> pgbench -M prepared read-only transactions per second, 16 client threads:

(That second "16 client threads" line should read "32 client threads".)

--
Thomas Munro
https://enterprisedb.com