Discussion: Add RISC-V Zbb popcount optimization
Hello. Attached is a small patch that enables hardware popcount on RISC-V when available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate. best. -greg
Attachments
Hi,
On 2026-03-21 12:54:10 -0400, Greg Burd wrote:
> Attached is a small patch that enables hardware popcount on RISC-V when
> available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.
Maybe I'm missing something: How is the latter approach safe without a runtime check? Just because it compiled on the build machine with -march=rv64gc_zbb added doesn't mean it runs on either the build machine or any other machine?
If this worked, the compiler could just always specify -march=rv64gc_zbb, no?
Greetings,
Andres Freund
On Sat, Mar 21, 2026, at 2:36 PM, Andres Freund wrote:
> Hi,
>
> On 2026-03-21 12:54:10 -0400, Greg Burd wrote:
>> Attached is a small patch that enables hardware popcount on RISC-V when
>> available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.
>
> Maybe I'm missing something: How is the latter approach safe without a runtime
> check? Just because it compiled on the build machine with -march=rv64gc_zbb
> added doesn't mean it runs on either the build machine or any other machine?
>
> If this worked, the compiler could just always specify -march=rv64gc_zbb, no?
Hey Andres, thanks for taking a look.
You are correct, mea culpa for not catching this before I sent it out. If the second test succeeds the patch will add `-march=rv64gc_zbb` to `CFLAGS` globally, which means that without a runtime check the binary will crash with SIGILL on systems without Zbb. I'll rework... :)
> Greetings,
>
> Andres Freund
best.
-greg
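For readers following the runtime-check point, here is a minimal sketch of the usual shape of such a check: callers go through a function pointer that is resolved on first use. It is not taken from the patch. `cpu_has_zbb()` is a hypothetical probe that is stubbed to return false, and `popcount64_zbb()` merely stands in for a routine that would live in its own file compiled with `-march=rv64gc_zbb`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
static int popcount64_sw(uint64_t x)
{
    int n = 0;

    while (x != 0)
    {
        x &= x - 1;
        n++;
    }
    return n;
}

/*
 * HYPOTHETICAL probe. A real one would ask the operating system whether
 * Zbb is present; this stub always answers "no", so the fallback runs.
 */
static bool cpu_has_zbb(void)
{
    return false;
}

/*
 * Stand-in for a routine kept in a separate file built with
 * -march=rv64gc_zbb, so that only that file may contain Zbb instructions.
 */
static int popcount64_zbb(uint64_t x)
{
    return __builtin_popcountll(x);
}

static int popcount64_choose(uint64_t x);

/* Every call goes through this pointer; it starts out at the chooser. */
static int (*popcount64_impl)(uint64_t) = popcount64_choose;

/* First call: probe once, remember the answer, then forward the call. */
static int popcount64_choose(uint64_t x)
{
    popcount64_impl = cpu_has_zbb() ? popcount64_zbb : popcount64_sw;
    return popcount64_impl(x);
}

static inline int popcount64(uint64_t x)
{
    return popcount64_impl(x);
}
```

The key property is that the rest of the binary is built for the baseline ISA, so a machine without Zbb never executes an instruction it cannot decode.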
On Sat, Mar 21, 2026 at 11:56 PM Greg Burd <greg@burd.me> wrote:
> Attached is a small patch that enables hardware popcount on RISC-V when available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.
I have to ask what the point is -- isn't that like putting a 4-inch exhaust tip on a go-kart?
--
John Naylor
Amazon Web Services
On Sat, Mar 21, 2026, at 10:14 PM, John Naylor wrote:
> On Sat, Mar 21, 2026 at 11:56 PM Greg Burd <greg@burd.me> wrote:
>> Attached is a small patch that enables hardware popcount on RISC-V when available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.
>
> I have to ask what the point is -- isn't that like putting a 4-inch
> exhaust tip on a go-kart?
Hey John,
The point is to go fast, right? And to look cool (with awesome 4-inch exhaust tips) if possible! ;-P
gburd@rv:~/ws/postgres$ gcc -O2 -o popcnt-wo-zbb riscv-popcnt.c
gburd@rv:~/ws/postgres$ gcc -O2 -march=rv64gc_zbb -o popcnt-zbb riscv-popcnt.c
gburd@rv:~/ws/postgres$ ./popcnt-wo-zbb && ./popcnt-zbb
sw popcount: 0.196 sec ( 510.08 MB/s)
hw popcount: 0.293 sec ( 341.48 MB/s)
diff: 0.67x
match: 406261900 bits counted
sw popcount: 0.182 sec ( 548.86 MB/s)
hw popcount: 0.044 sec ( 2279.89 MB/s)
diff: 4.15x
match: 406261900 bits counted
But my first email/patch was incomplete/rushed; I should have followed the pattern used for similar ARM-specific logic. v2 attached along with a test program.
> --
> John Naylor
> Amazon Web Services
best.
-greg
Attachments
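The attached riscv-popcnt.c is not reproduced in this archive, so the following is only a sketch of the kind of comparison the numbers above describe. It assumes the "sw" path is a classic SWAR bit count and the "hw" path is the compiler builtin; all function names here are made up for illustration. Under that assumption, the 0.67x result without `-march=rv64gc_zbb` would be consistent with the builtin being lowered to a library call rather than a single instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Classic SWAR bit count: sum the bits in 2-, 4-, then 8-bit fields. */
static int popcount64_swar(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    /* Multiplying by 0x01...01 adds all eight byte counts into the top byte. */
    return (int) ((x * 0x0101010101010101ULL) >> 56);
}

/* "sw" path: count set bits in a buffer with the SWAR routine. */
static uint64_t count_bits_sw(const uint64_t *words, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += (uint64_t) popcount64_swar(words[i]);
    return total;
}

/*
 * "hw" path: the compiler decides what the builtin becomes. It can only
 * emit a Zbb instruction if the target flags allow it.
 */
static uint64_t count_bits_builtin(const uint64_t *words, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += (uint64_t) __builtin_popcountll(words[i]);
    return total;
}
```

Both paths must return the same total for the same buffer, which is what the "match" line in the output reports.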
Hi,
On 2026-03-22 13:43:43 -0400, Greg Burd wrote:
> On Sat, Mar 21, 2026, at 10:14 PM, John Naylor wrote:
> > On Sat, Mar 21, 2026 at 11:56 PM Greg Burd <greg@burd.me> wrote:
> >> Attached is a small patch that enables hardware popcount on RISC-V when available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.
> >
> > I have to ask what the point is -- isn't that like putting a 4-inch
> > exhaust tip on a go-kart?
> The point is to go fast, right? And to look cool (with awesome 4-inch exhaust tips) if possible! ;-P
>
> gburd@rv:~/ws/postgres$ gcc -O2 -o popcnt-wo-zbb riscv-popcnt.c
> gburd@rv:~/ws/postgres$ gcc -O2 -march=rv64gc_zbb -o popcnt-zbb riscv-popcnt.c
> gburd@rv:~/ws/postgres$ ./popcnt-wo-zbb && ./popcnt-zbb
> sw popcount: 0.196 sec ( 510.08 MB/s)
> hw popcount: 0.293 sec ( 341.48 MB/s)
>
> diff: 0.67x
> match: 406261900 bits counted
> sw popcount: 0.182 sec ( 548.86 MB/s)
> hw popcount: 0.044 sec ( 2279.89 MB/s)
>
> diff: 4.15x
> match: 406261900 bits counted
>
> But my first email/patch was incomplete/rushed, I should have followed the pattern used for similar ARM-specific logic. v2 attached along with a test program.
Sure, but what PG workloads are actually affected to a meaningful degree by this? And are those, on riscv, actually most bottlenecked by popcount performance?
I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent all that effectively - hard to believe there's any real world workloads where that gain is worth the squeeze. At least for aarch64 and x86-64 there's real world use of those platforms, making niche-y perf improvements somewhat worthwhile. Whereas there's afaict not yet a whole lot of riscv production adoption.
Once you add CPU dispatch to the cost it gets a heck of a lot less clearly worthwhile. You need heuristics to decide when the dispatch cost is worth it and even then it's going to slow down your non-worthwhile case somewhat.
That's one of the things that makes riscv's decision to put so many crucial features into optional extensions so annoying for people that write non-embedded software.
- Andres
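One common form of the heuristic mentioned above is a size cutoff: short inputs are handled inline and only longer ones pay for the indirect call. The sketch below illustrates that idea under stated assumptions. `DISPATCH_THRESHOLD` is an invented, unmeasured value, and `popcount_words_bulk()` is a stand-in for whatever tuned routine a runtime probe would select.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMED cutoff, in 64-bit words; a real value would come from measurement. */
#define DISPATCH_THRESHOLD 8

/* Small inline bit count used when the buffer is short. */
static inline int popcount64_inline(uint64_t x)
{
    int n = 0;

    for (; x != 0; x &= x - 1)
        n++;
    return n;
}

/* Stand-in for the tuned bulk routine that would be chosen at startup. */
static uint64_t popcount_words_bulk(const uint64_t *words, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += (uint64_t) __builtin_popcountll(words[i]);
    return total;
}

/* Indirect call target; a runtime probe would set this once. */
static uint64_t (*popcount_words_impl)(const uint64_t *, size_t) =
    popcount_words_bulk;

static uint64_t popcount_words(const uint64_t *words, size_t n)
{
    /* Short input: the extra branch is the only cost over plain inline code. */
    if (n < DISPATCH_THRESHOLD)
    {
        uint64_t total = 0;

        for (size_t i = 0; i < n; i++)
            total += (uint64_t) popcount64_inline(words[i]);
        return total;
    }

    /* Long input: the indirect call is amortized over many words. */
    return popcount_words_impl(words, n);
}
```

Even in this form the short path carries one extra compare and branch, which is the slowdown of the "non-worthwhile case" described above.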
On Sun, Mar 22, 2026, at 2:01 PM, Andres Freund wrote:
> Hi,
>
> On 2026-03-22 13:43:43 -0400, Greg Burd wrote:
>> On Sat, Mar 21, 2026, at 10:14 PM, John Naylor wrote:
>> > On Sat, Mar 21, 2026 at 11:56 PM Greg Burd <greg@burd.me> wrote:
>> >> Attached is a small patch that enables hardware popcount on RISC-V when available and also sets the arch flag to 'rv64gc_zbb' flag when appropriate.
>> >
>> > I have to ask what the point is -- isn't that like putting a 4-inch
>> > exhaust tip on a go-kart?
>> The point is to go fast, right? And to look cool (with awesome 4-inch exhaust tips) if possible! ;-P
>>
>> gburd@rv:~/ws/postgres$ gcc -O2 -o popcnt-wo-zbb riscv-popcnt.c
>> gburd@rv:~/ws/postgres$ gcc -O2 -march=rv64gc_zbb -o popcnt-zbb riscv-popcnt.c
>> gburd@rv:~/ws/postgres$ ./popcnt-wo-zbb && ./popcnt-zbb
>> sw popcount: 0.196 sec ( 510.08 MB/s)
>> hw popcount: 0.293 sec ( 341.48 MB/s)
>>
>> diff: 0.67x
>> match: 406261900 bits counted
>> sw popcount: 0.182 sec ( 548.86 MB/s)
>> hw popcount: 0.044 sec ( 2279.89 MB/s)
>>
>> diff: 4.15x
>> match: 406261900 bits counted
>>
>> But my first email/patch was incomplete/rushed, I should have followed the pattern used for similar ARM-specific logic. v2 attached along with a test program.
>
> Sure, but what PG workloads are actually affected to a meaningful degree by
> this? And are those, on riscv, actually most bottlenecked by popcount
> performance?
>
> I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
> all that effectively - hard to believe there's any real world workloads where
> that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
> world use of those platforms, making niche-y perf improvements somewhat
> worthwhile. Whereas there's afaict not yet a whole lot of riscv production
> adoption.
>
> Once you add CPU dispatch to the cost it gets a heck of a lot less clearly
> worthwhile. You need heuristics to decide when the dispatch cost is worth it
> and even then it's going to slow down your non-worthwhile case somewhat.
>
> That's one of the things that makes riscv's decision to put so many crucial
> features into optional extensions so annoying for people that write
> non-embedded software.
Hey Andres,
All fair points. RISC-V is annoying, the idea of CPU extensions is just one reason. To be honest, I'm not sure it is worth it either! That said, this patch isn't a huge "squeeze" (or unprecedented) and it does provide some "juice" (4x faster). It has the shape of the ARM equivalent, so to me it fell into that category of things we'd commit. But I get it, as I said to start - all fair points.
> - Andres
best.
-greg
On Sun, Mar 22, 2026 at 02:01:50PM -0400, Andres Freund wrote:
> I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
> all that effectively - hard to believe there's any real world workloads where
> that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
> world use of those platforms, making niche-y perf improvements somewhat
> worthwhile. Whereas there's afaict not yet a whole lot of riscv production
> adoption.
That work was partially motivated by vector stuff that used popcount functions pretty heavily, but yeah, the complexity compared to the gains is the main reason I've been pushing to just use simd.h elsewhere (i.e., SSE2 and Neon). I'd still consider using AVX-512, etc. for things if the impact on real-world workloads was huge, though.
--
nathan
On Mon, Mar 23, 2026, at 11:09 AM, Nathan Bossart wrote:
> On Sun, Mar 22, 2026 at 02:01:50PM -0400, Andres Freund wrote:
>> I'm also pretty doubtful all the effort to e.g. add AVX 512 popcount was spent
>> all that effectively - hard to believe there's any real world workloads where
>> that gain is worth the squeeze. At least for aarch64 and x86-64 there's real
>> world use of those platforms, making niche-y perf improvements somewhat
>> worthwhile. Whereas there's afaict not yet a whole lot of riscv production
>> adoption.
Hey Nathan,
> That work was partially motivated by vector stuff that used popcount
> functions pretty heavily, but yeah, the complexity compared to the gains is
> the main reason I've been pushing to just use simd.h elsewhere (i.e., SSE2
> and Neon). I'd still consider using AVX-512, etc. for things if the impact
> on real-world workloads was huge, though.
Yes, that and by research done while trying to understand why my RISC-V build farm animal "greenfly" (OrangePi RV2 with
a VisionFive 2 CPU: RISC-V RV64GC + Zba/Zbb/Zbc/Zbs) is failing consistently.
> --
> nathan
Forgive me, while $subject only mentions popcount I couldn't help myself so I added a few more RISC-V patches including
a bug fix that I hope makes greenfly happy again.
0001 - This is a bug fix for DES/RISC-V/Clang DES initialization.
------> Join me in "the rabbit hole" on this issue if you care to...
The existing software DES (as shown by the build-farm animal "greenfly" [1]) fails because Clang 20 has an
auto-vectorization bug that we trigger in the DES initialization code (des_init() function), not the DES encryption
algorithm itself.
I searched the LLVM issue tracker, here are the issues that caught my eye:
1. Issue #176001 - "RISC-V Wrong code at -O1"
- Vector peephole optimization with vmerge folding
- Fixed by PR #176077 (merged Jan 2026)
- Link: https://github.com/llvm/llvm-project/issues/176001
2. Issue #187458 - "Wrong code for vector.extract.last.active"
- Large index issues with zvl1024b
- Partially fixed, still work ongoing
- Link: https://github.com/llvm/llvm-project/issues/187458
3. Issue #171978 - "RISC-V Wrong code at -O2/O3"
- Illegal instruction from mismatched EEW
- Under investigation
- Link: https://github.com/llvm/llvm-project/issues/171978
4. PR #176105 - "Fix i64 gather/scatter cost on rv32"
- Cost model fixes for scatter/gather (merged Jan 2026)
- Link: https://github.com/llvm/llvm-project/pull/176105
My fix in 0001 is simply adding this in a few places in crypt-des.c:
#if defined(__riscv) && defined(__clang__)
pg_memory_barrier();
#endif
While searching I ran across a different solution: adding `-mllvm -riscv-v-vector-bits-min=0` sets the minimum vector
bit width for the RISC-V vector extension in LLVM to 0, which disables all vectorization and forces scalar code
generation, so no RVV instructions are emitted. This would prevent the DES bug at the cost of any vectorization anywhere
in the binary. While that might also fix the other intermittent bug we'd been seeing on greenfly (not tested), disabling
all RVV optimizations seems too heavy-handed to me.
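To make the failing table concrete, here is a small standalone sketch. It is not the patch and not crypt-des.c itself. It assumes `un_pbox` is built as the inverse of the standard DES P permutation, which agrees with the expected values printed in the test output at the end of this email. `compiler_barrier()` is a GCC/Clang-style stand-in for `pg_memory_barrier()`, and where the real patch places its barriers is not shown in this thread.

```c
#include <stdint.h>

/* The standard DES P permutation, as 1-based bit positions. */
static const uint8_t pbox[32] = {
    16,  7, 20, 21, 29, 12, 28, 17,  1, 15, 23, 26,  5, 18, 31, 10,
     2,  8, 24, 14, 32, 27,  3,  9, 19, 13, 30,  6, 22, 11,  4, 25
};

static uint8_t un_pbox[32];

/*
 * Compiler-only barrier (GCC/Clang syntax). It emits no instruction but
 * tells the optimizer that memory may have changed. This is a stand-in
 * for pg_memory_barrier(), not its definition.
 */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/*
 * Invert the permutation: if P sends position i to pbox[i], then the
 * inverse sends pbox[i] back to i. This scatter-style store loop is the
 * kind of code an auto-vectorizer may try to transform.
 */
static void init_un_pbox(void)
{
    for (int i = 0; i < 32; i++)
        un_pbox[pbox[i] - 1] = (uint8_t) i;

    /* ASSUMED placement: after the loop, before the table is consumed. */
    compiler_barrier();
}
```

A correct build yields `un_pbox[0..4] = 8, 16, 22, 30, 12`, the "expected" column in the failure report.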
------> Moving on.
0002 - (was "0001" in v2) this is unchanged, it implements popcount using Zbb extension on RISC-V
0003 - is a small patch adapted from the Google Abseil project's RISC-V CRC32C implementation [1]. It is *a lot
faster* than the software crc32c we fall back to now (see: riscv-crc32c.c). This algorithm requires the Zbc (or Zbkc)
extension (for clmul), so the patch tests for that at build time and adds the '-march' flag when it is available.
However, as is the case for Zbb and popcnt, the presence of Zbc (or Zbkc) must be detected at runtime. That's done
following the pre-existing pattern used for ARM features. This does introduce some runtime overhead and complexity, not
more than required I hope.
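For context on what is being accelerated, here is a reference sketch. It is neither the Abseil algorithm nor the patch. `crc32c_bitwise()` is the plain bit-at-a-time CRC-32C, and `clmul64_sw()` is a software model of the low half of a 64-bit carry-less multiply, the operation Zbc provides as a single instruction. Both function names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Bit-at-a-time CRC-32C (Castagnoli), reflected form.
 * 0x82F63B78 is the bit-reversed polynomial.
 */
static uint32_t crc32c_bitwise(const void *data, size_t len)
{
    const uint8_t *p = (const uint8_t *) data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
        {
            /* mask is all ones when the low bit is set, else zero */
            uint32_t mask = (uint32_t) -(int32_t) (crc & 1u);

            crc = (crc >> 1) ^ (0x82F63B78u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Software model of carry-less multiplication (low 64 bits of the
 * product): like schoolbook multiplication, but partial products are
 * combined with XOR instead of addition.
 */
static uint64_t clmul64_sw(uint64_t a, uint64_t b)
{
    uint64_t result = 0;

    for (int i = 0; i < 64; i++)
    {
        if ((b >> i) & 1)
            result ^= a << i;
    }
    return result;
}
```

Calling `crc32c_bitwise("123456789", 9)` gives 0xE3069283, the same check value the test programs print. Carry-less multiplication lets an implementation fold many input bytes per step instead of shifting one bit at a time.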
I attached test code, and results at the end of this email:
* riscv-popcnt.c - unchanged
* riscv-crc32c.c - new, based on work in the Google Abseil project
* riscv-des.c - highlights the fix for DES using Clang on RISC-V
I guess the question for 0002 and/or 0003 is if the "juice" is worth the "squeeze" or not. There is a lot of performance
juice to be had IMO. But some might argue that RISC-V isn't widely adopted yet, and they'd be right. Others might
point out that RISC-V is currently showing up in embedded systems more than server/desktop/laptop/cloud, also true.
However, there is some evidence that this is changing, as there are RISC-V servers [2][3], and there is a hosted (cloud)
solution from Scaleway [4]. There exists a 64 core RISC-V desktop [5] and a Framework laptop mainboard [6] sporting a
RISC-V CPU. And there is the OrangePi RV2 [7] I have that is "greenfly".
Is it early days? Certainly! But too early? That's up for debate. :)
If nothing else, these patches can be a durable record and used later when RISC-V is a critical platform for Postgres
or informational to other projects.
best.
-greg
[1] https://github.com/abseil/abseil-cpp/pull/1986 absl/crc/internal/crc_riscv.cc
[2] https://www.firefly.store/products/rs-sra120-risc-v-server-2u-computing-server-cloud-storage-large-model-sg2042
[3] https://edgeaicomputer.com/our-products/servers/risc-v-compute-server-sra1-20/
[4] https://www.scaleway.com/en/news/scaleway-launches-its-risc-v-servers-in-the-cloud-a-world-first-and-a-firm-commitment-to-technological-independence/
[5] https://milkv.io/pioneer and https://www.crowdsupply.com/milk-v/milk-v-pioneer/updates/current-status-of-production
[6] https://deepcomputing.io/product/dc-roma-risc-v-mainboard/
[7] http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-RV2.html
---- TEST PROGRAM OUTPUT:
gburd@rv:~/ws/postgres$ make -f Makefile.RISCV
gcc -O2 riscv-des.c -o des-gcc-sw
gcc -O2 riscv-des.c -march=rv64gcv -o des-gcc-hw
clang-20 -O1 riscv-des.c -o des-clang-o1-sw
clang-20 -O1 -march=rv64gcv riscv-des.c -o des-clang-o1-hw
clang-20 -O2 riscv-des.c -o des-clang-o2-sw
clang-20 -O2 -march=rv64gcv riscv-des.c -o des-clang-o2-hw
gcc -O2 -o popcnt-gcc-o2-sw riscv-popcnt.c
gcc -O2 -march=rv64gc_zbb -o popcnt-gcc-o2-hw riscv-popcnt.c
clang-20 -O2 -o popcnt-clang-o2-sw riscv-popcnt.c
clang-20 -O2 -march=rv64gc_zbb -o popcnt-clang-o2-hw riscv-popcnt.c
gcc -O2 -o crc32c-gcc-o2-sw riscv-crc32c.c
gcc -O2 -march=rv64gc_zbc -o crc32c-gcc-o2-hw riscv-crc32c.c
clang-20 -O2 -o crc32c-clang-o2-sw riscv-crc32c.c
clang-20 -O2 -march=rv64gc_zbc -o crc32c-clang-o2-hw riscv-crc32c.c
gburd@rv:~/ws/postgres$ make -f Makefile.RISCV test
./des-gcc-sw
Compiler: GCC 13.3.0
Target: RISC-V 64-bit
Vector extension: Not enabled
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.409 seconds (409 ns/iter)
With barriers: 0.416 seconds (416 ns/iter)
Overhead: 1.6%
./des-gcc-hw
Compiler: GCC 13.3.0
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.410 seconds (410 ns/iter)
With barriers: 0.410 seconds (410 ns/iter)
Overhead: Negligible
./des-clang-o1-sw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Not enabled
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.517 seconds (517 ns/iter)
With barriers: 0.516 seconds (516 ns/iter)
Overhead: Negligible
./des-clang-o1-hw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.405 seconds (405 ns/iter)
With barriers: 0.405 seconds (405 ns/iter)
Overhead: Negligible
./des-clang-o2-sw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Not enabled
Testing WITHOUT compiler barriers:
PASS: Permutation tables are correct
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.517 seconds (517 ns/iter)
With barriers: 0.518 seconds (518 ns/iter)
Overhead: Negligible
./des-clang-o2-hw
Compiler: Clang 20.1.2
Target: RISC-V 64-bit
Vector extension: Enabled (RVV)
Testing WITHOUT compiler barriers:
ERROR: un_pbox mismatch:
un_pbox[0] = 15, expected 8
un_pbox[1] = 6, expected 16
un_pbox[2] = 19, expected 22
un_pbox[3] = 20, expected 30
un_pbox[4] = 28, expected 12
... and 27 more errors
FAIL: Permutation tables are incorrect
Testing WITH compiler barriers:
PASS: Permutation tables are correct
Performance Comparison (1000000 iterations):
Without barriers: 0.093 seconds (93 ns/iter)
With barriers: 0.407 seconds (407 ns/iter)
Overhead: 335.5%
./popcnt-gcc-o2-sw
sw popcount: 0.183 sec ( 547.89 MB/s)
hw popcount: 0.274 sec ( 365.40 MB/s)
diff: 0.67x
match: 406261900 bits counted
./popcnt-gcc-o2-hw
sw popcount: 0.182 sec ( 548.17 MB/s)
hw popcount: 0.044 sec ( 2287.82 MB/s)
diff: 4.17x
match: 406261900 bits counted
./popcnt-clang-o2-sw
sw popcount: 0.188 sec ( 531.96 MB/s)
hw popcount: 0.207 sec ( 482.84 MB/s)
diff: 0.91x
match: 406261900 bits counted
./popcnt-clang-o2-hw
sw popcount: 0.224 sec ( 446.46 MB/s)
hw popcount: 0.056 sec ( 1794.83 MB/s)
diff: 4.02x
match: 406261900 bits counted
./crc32c-gcc-o2-sw
sw crc32c: 0.651 sec ( 153.68 MB/s)
hw crc32c: 0.651 sec ( 153.72 MB/s)
diff: 1.00x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-gcc-o2-hw
sw crc32c: 0.651 sec ( 153.70 MB/s)
hw crc32c: 0.000 sec ( 308052.33 MB/s)
diff: 2004.21x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-clang-o2-sw
sw crc32c: 0.584 sec ( 171.10 MB/s)
hw crc32c: 0.584 sec ( 171.17 MB/s)
diff: 1.00x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
./crc32c-clang-o2-hw
sw crc32c: 0.584 sec ( 171.15 MB/s)
hw crc32c: 0.000 sec ( 309282.38 MB/s)
diff: 1807.08x
match: 0x0B141F2D
validation: CRC32C("123456789") = 0xE3069283 (correct)
Attachments
[new subject]
On Sat, Mar 28, 2026 at 3:22 AM Greg Burd <greg@burd.me> wrote:
> 0001 - This is a bug fix for DES/RISC-V/Clang DES initialization.
>
> ------> Join me in "the rabbit hole" on this issue if you care to...
>
> The existing software DES (as shown by the build-farm animal "greenfly" [1]) fails because Clang 20 has an auto-vectorization bug that we trigger in the DES initialization code (des_init() function), not the DES encryption algorithm itself.
> [disable vectorization entirely]
> While that might also fix the other intermittent bug we'd been seeing on greenfly (not tested) disabling all RVV optimizations seems too heavy-handed to me.
The first thing I notice is that not very long ago the buildfarm had 3 gcc RISC-V members, but not anymore. If you care about having coverage for this hardware, I'd suggest picking up gcc again if that's still working, and wait and see about clang. Clang has shipped broken code generation for obscure platforms in the past, and it seems here we're not even sure of the extent of the breakage.
--
John Naylor
Amazon Web Services
On Mon, Mar 30, 2026, at 2:39 AM, John Naylor wrote:
> [new subject]
>
> On Sat, Mar 28, 2026 at 3:22 AM Greg Burd <greg@burd.me> wrote:
>
>> 0001 - This is a bug fix for DES/RISC-V/Clang DES initialization.
>>
>> ------> Join me in "the rabbit hole" on this issue if you care to...
>>
>> The existing software DES (as shown by the build-farm animal "greenfly" [1]) fails because Clang 20 has an auto-vectorization bug that we trigger in the DES initialization code (des_init() function), not the DES encryption algorithm itself.
>
>> [disable vectorization entirely]
>> While that might also fix the other intermittent bug we'd been seeing on greenfly (not tested) disabling all RVV optimizations seems too heavy-handed to me.
>
> The first thing I notice is that not very long ago the buildfarm had 3
> gcc RISC-V members, but not anymore. If you care about having coverage
> for this hardware, I'd suggest picking up gcc again if that's still
> working, and wait and see about clang. Clang has shipped broken code
> generation for obscure platforms in the past, and it seems here we're
> not even sure of the extent of the breakage.
Hey John,
All fair points. I've changed greenfly to use GCC 13.3.0, thanks for the suggestion.
> --
> John Naylor
> Amazon Web Services
best.
-greg