Discussion: use SIMD in GetPrivateRefCountEntry()
(new thread)

On Wed, Sep 03, 2025 at 02:47:25PM -0400, Andres Freund wrote:
>> I see a variety for increased CPU usage:
>>
>> 1) The private ref count infrastructure in bufmgr.c gets a bit slower once
>> more buffers are pinned
>
> The problem mainly seems to be that the branches in the loop at the start of
> GetPrivateRefCountEntry() are entirely unpredictable in this workload. I had
> an old patch that tried to make it possible to use SIMD for the search, by
> using a separate array for the Buffer ids - with that gcc generates fairly
> crappy code, but does make the code branchless.
>
> Here that substantially reduces the overhead of doing prefetching. Afterwards
> it's not a meaningful source of misses anymore.

I quickly hacked together some patches for this. 0001 adds new static
variables so that we have a separate array of the buffers and the index for
the current ReservedRefCountEntry. 0002 optimizes the linear search in
GetPrivateRefCountEntry() using our simd.h routines. This stuff feels
expensive (see vector8_highbit_mask()'s implementation for AArch64), but if
the main goal is to avoid branches, I think this is about as "branchless"
as we can make it. I'm going to stare at this a bit longer, but I figured
I'd get something on the lists while it is fresh in my mind.

--
nathan
Attachments
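For readers following along, the idea of searching a separate array of Buffer ids without data-dependent branches can be sketched in plain C. This is only an illustration under stated assumptions, not the code from the attached patches: the names `REFCOUNT_ARRAY_ENTRIES`, `INVALID_BUFFER`, and `find_buffer_slot` are hypothetical, the array size of 8 is assumed, and the sketch uses scalar mask arithmetic rather than the simd.h routines that 0002 relies on. Whether a compiler actually emits branch-free (or vectorized) code for it depends on the compiler and flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants for this sketch; not taken from the patches. */
#define REFCOUNT_ARRAY_ENTRIES 8
#define INVALID_BUFFER 0

typedef int32_t Buffer;

/*
 * Return the index of "buffer" in "buffers", or -1 if it is not present.
 *
 * Every slot is inspected unconditionally and there is no early exit, so
 * the loop's control flow does not depend on the data.  The result is
 * selected with bit masks instead of an if statement.  If the id occurred
 * more than once, the last matching index would win; the sketch assumes
 * ids in the array are unique.
 */
static int
find_buffer_slot(const Buffer buffers[REFCOUNT_ARRAY_ENTRIES], Buffer buffer)
{
    int slot = -1;

    assert(buffer != INVALID_BUFFER);

    for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
    {
        /* match is all-ones (-1) when the ids are equal, else 0 */
        int match = -(int) (buffers[i] == buffer);

        /* keep the old slot when match == 0, take i when match == -1 */
        slot = (slot & ~match) | (i & match);
    }

    return slot;
}
```

Keeping the ids in their own small array, separate from the refcount entries, means the whole search touches 32 contiguous bytes in this sketch, which is what makes a vector comparison over the ids possible in the first place.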
Sorry for the noise. I fixed x86-64 builds in v2.

--
nathan
Attachments
03.10.2025 23:51, Nathan Bossart wrote:
> Sorry for the noise. I fixed x86-64 builds in v2.
>

Why not just use simplehash for private ref counts?
Without separation on array and overflow parts.
Just single damn simple hash table.

--
regards
Yura Sokolov aka funny-falcon
Hi,

On October 24, 2025 3:43:34 PM GMT+03:00, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>03.10.2025 23:51, Nathan Bossart wrote:
>> Sorry for the noise. I fixed x86-64 builds in v2.
>>
>
>Why not just use simplehash for private ref counts?
>Without separation on array and overflow parts.
>Just single damn simple hash table.

It's too expensive for common access patterns in my benchmarks. Buffer accesses are very very very common and hash tables have no spatial locality.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Oct 3, 2025 at 10:48 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I quickly hacked together some patches for this. 0001 adds new static
> variables so that we have a separate array of the buffers and the index for
> the current ReservedRefCountEntry. 0002 optimizes the linear search in
> GetPrivateRefCountEntry() using our simd.h routines. This stuff feels
> expensive (see vector8_highbit_mask()'s implementation for AArch64), but if
> the main goal is to avoid branches, I think this is about as "branchless"
> as we can make it. I'm going to stare at this a bit longer, but I figured
> I'd get something on the lists while it is fresh in my mind.

I was unable to notice any improvements in any of the microbenchmarks
that I've been using to test the index prefetching patch set. For
whatever reason, these test cases are neither improved nor regressed
by your patch series.

I've never really played around with SIMD before. Is the precise CPU
microarchitecture relevant? Are power management settings important?

--
Peter Geoghegan
On Fri, Oct 24, 2025 at 4:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I was unable to notice any improvements in any of the microbenchmarks
> that I've been using to test the index prefetching patch set. For
> whatever reason, these test cases are neither improved nor regressed
> by your patch series.

Correction: there appears to be a regression at higher client counts with
standard pgbench SELECT + the index prefetching patchset + your v2
patchset. Not a massive one (about a 5% loss in TPS/throughput), and
not one that I can reproduce at lower client counts. There are 16
physical cores on this machine, and that seems to be around the cutoff
for getting these regressions.

I've disabled turboboost and hyperthreading on this machine, since I
find that that leads to more consistent performance, at least at lower
client counts.

--
Peter Geoghegan