Thread: use SIMD in GetPrivateRefCountEntry()


use SIMD in GetPrivateRefCountEntry()

From:
Nathan Bossart
Date:
(new thread)

On Wed, Sep 03, 2025 at 02:47:25PM -0400, Andres Freund wrote:
>> I see a variety of sources of increased CPU usage:
>> 
>> 1) The private ref count infrastructure in bufmgr.c gets a bit slower once
>>    more buffers are pinned
> 
> The problem mainly seems to be that the branches in the loop at the start of
> GetPrivateRefCountEntry() are entirely unpredictable in this workload.  I had
> an old patch that tried to make it possible to use SIMD for the search, by
> using a separate array for the Buffer ids - with that gcc generates fairly
> crappy code, but does make the code branchless.
> 
> Here that substantially reduces the overhead of doing prefetching. Afterwards
> it's not a meaningful source of misses anymore.

I quickly hacked together some patches for this.  0001 adds new static
variables so that we have a separate array of the buffers and the index for
the current ReservedRefCountEntry.  0002 optimizes the linear search in
GetPrivateRefCountEntry() using our simd.h routines.  This stuff feels
expensive (see vector8_highbit_mask()'s implementation for AArch64), but if
the main goal is to avoid branches, I think this is about as "branchless"
as we can make it.  I'm going to stare at this a bit longer, but I figured
I'd get something on the lists while it is fresh in my mind.
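To illustrate the general idea (this is a sketch, not the actual 0001/0002 patches): with the Buffer ids split out into their own small array, the search can be written without data-dependent branches, which compilers can then auto-vectorize much like explicit simd.h routines would. The function name below is hypothetical; the array size matches bufmgr.c's REFCOUNT_ARRAY_ENTRIES (8).

```c
#include <stdint.h>

#define REFCOUNT_ARRAY_ENTRIES 8    /* size of bufmgr.c's fixed array */

typedef int32_t Buffer;

/* Hypothetical helper: return the index of needle in ids, or -1 if it is
 * not present.  Every slot is compared unconditionally, so the loop body
 * has no data-dependent branches; at -O2 compilers typically emit vector
 * compares plus a movemask-style reduction for this shape of loop. */
static int
find_ref_count_index(const Buffer ids[REFCOUNT_ARRAY_ENTRIES], Buffer needle)
{
	int		result = -1;

	for (int i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
	{
		/* mask is all ones on a match, all zeros otherwise */
		int32_t	mask = -(int32_t) (ids[i] == needle);

		/* branchless select: keep result, or take i on a match */
		result = (result & ~mask) | (i & mask);
	}
	return result;
}
```

Unlike the early-exit loop, this always scans all eight slots, trading a few extra compares for predictability, which is the point when the branch pattern is effectively random.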

-- 
nathan

Attachments

Re: use SIMD in GetPrivateRefCountEntry()

From:
Nathan Bossart
Date:
Sorry for the noise.  I fixed x86-64 builds in v2.

-- 
nathan

Attachments

Re: use SIMD in GetPrivateRefCountEntry()

From:
Yura Sokolov
Date:
On 03.10.2025 23:51, Nathan Bossart wrote:
> Sorry for the noise.  I fixed x86-64 builds in v2.
>

Why not just use simplehash for private ref counts?
Without separation on array and overflow parts.
Just single damn simple hash table.

-- 
regards
Yura Sokolov aka funny-falcon



Re: use SIMD in GetPrivateRefCountEntry()

From:
Andres Freund
Date:
Hi,

On October 24, 2025 3:43:34 PM GMT+03:00, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
>On 03.10.2025 23:51, Nathan Bossart wrote:
>> Sorry for the noise.  I fixed x86-64 builds in v2.
>>
>
>Why not just use simplehash for private ref counts?
>Without separation on array and overflow parts.
>Just single damn simple hash table.

It's too expensive for common access patterns in my benchmarks. Buffer
accesses are very very very common and hash tables have no spatial locality.
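The spatial-locality argument can be made concrete with a layout sketch (illustrative sizes and a toy hash, not PostgreSQL code): the array variant packs all candidate entries into a single 64-byte cache line, so a scan touches one line, while a hash table sends each lookup to an effectively random bucket, so a stream of lookups for different buffers keeps pulling in fresh lines.

```c
#include <stdint.h>

/* Each entry is 8 bytes: buffer id plus local pin count. */
typedef struct PrivateRefCountEntry
{
	int32_t	buffer;		/* Buffer id, 0 means unused */
	int32_t	refcount;	/* local pin count */
} PrivateRefCountEntry;

/* Array variant: 8 contiguous entries = 64 bytes = one cache line. */
static PrivateRefCountEntry ref_array[8];

/* Hash variant: 64 buckets spread over 512 bytes = eight cache lines;
 * which line a lookup touches depends on the hash of the buffer id. */
static PrivateRefCountEntry ref_hash[64];

static inline PrivateRefCountEntry *
hash_slot(int32_t buffer)
{
	/* toy Fibonacci-style hash: multiply by a large odd constant, mask */
	uint32_t	h = (uint32_t) buffer * 2654435761u;

	return &ref_hash[h & 63];
}
```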

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: use SIMD in GetPrivateRefCountEntry()

From:
Peter Geoghegan
Date:
On Fri, Oct 3, 2025 at 10:48 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I quickly hacked together some patches for this.  0001 adds new static
> variables so that we have a separate array of the buffers and the index for
> the current ReservedRefCountEntry.  0002 optimizes the linear search in
> GetPrivateRefCountEntry() using our simd.h routines.  This stuff feels
> expensive (see vector8_highbit_mask()'s implementation for AArch64), but if
> the main goal is to avoid branches, I think this is about as "branchless"
> as we can make it.  I'm going to stare at this a bit longer, but I figured
> I'd get something on the lists while it is fresh in my mind.

I was unable to notice any improvements in any of the microbenchmarks
that I've been using to test the index prefetching patch set. For
whatever reason, these test cases are neither improved nor regressed
by your patch series.

I've never really played around with SIMD before. Is the precise CPU
microarchitecture relevant? Are power management settings important?

--
Peter Geoghegan



Re: use SIMD in GetPrivateRefCountEntry()

From:
Peter Geoghegan
Date:
On Fri, Oct 24, 2025 at 4:32 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I was unable to notice any improvements in any of the microbenchmarks
> that I've been using to test the index prefetching patch set. For
> whatever reason, these test cases are neither improved nor regressed
> by your patch series.

Correction: appears to be a regression at higher client counts with
standard pgbench SELECT + the index prefetching patchset + your v2
patchset. Not a massive one (about a 5% loss in TPS/throughput), and
not one that I can reproduce at lower client counts.

There are 16 physical cores on this machine, and that seems to be
around the cutoff for getting these regressions. I've disabled
turbo boost and hyperthreading on this machine, since I find that this
leads to more consistent performance, at least at lower client counts.

--
Peter Geoghegan