Discussion: Re: vectorized CRC on ARM64
On Wed, May 14, 2025 I wrote:
>
> We did something similar for x86 for v18, and here is some progress
> towards Arm support.

Coming back to this, since there's been recent interest in Arm support.

v2 is a rebase, with a few changes.

- I simplified it by leaving out the inlining for "assume CRC" builds,
since I wanted to avoid alignment considerations if I can. I think
always indirecting through a pointer will have less risk of
regressions in a realistic setting than for x86 since Arm chips
typically have low latency for carryless multiplication instructions.
With just a bit of code we can still use the direct call for small
constant inputs, so I did that to avoid regressions under WAL insert
lock.
- One coding idiom for a vector literal in the generated code was
giving pgindent indigestion, so I rewrote it using Neon intrinsics and
verified it in Godbolt.

> 0002: Like 3c6e8c12389 and in fact uses the same program to generate
> the code, by specifying Neon instructions with the Arm "crypto"
> extension instead. There are some interesting differences from x86
> here as well:
> - The upstream implementation chose to use inline assembly instead of
> intrinsics for some reason. I initially thought that was a way to get
> broader compiler support, but it turns out you still need to pass the
> relevant flags to get the assembly to link.

To follow up for curiosity's sake, [1] says that Apple chips can issue
PMULL + EOR as a single uop if they are next to each other in the
instruction stream.

> - I only have Meson support for now, since I used MacOS on CI to test.
> That OS and compiler combination apparently targets the CRC extension,
> but the PMULL instruction runtime check uses Linux-only headers, I
> believe, so previously I hacked the choose function to return true for
> testing. The choose function in 0002 is untested in this form.

This is still true, but now the CI hack lives in a separate
not-for-commit patch for clarity.

autoconf support is a WIP, and I will share that after I do some
testing on an Arm Linux instance.

[1] https://dougallj.github.io/applecpu/firestorm.html

--
John Naylor
Amazon Web Services
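Editor's sketch, not taken from the patch: the message above describes three pieces, namely a function pointer that is resolved on first use by a "choose" function, a Linux-only runtime check for PMULL, and a direct call for small constant-length inputs. The C below is one way those pieces could fit together. All names (crc32c_impl, crc32c_choose, pmull_available, COMPUTE_CRC32C, SMALL_INPUT) are hypothetical, the threshold of 32 bytes is an assumption, and a bitwise CRC-32C stands in for both the direct path and the vectorized routine.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>			/* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>			/* HWCAP_PMULL */
#endif

typedef uint32_t (*crc32c_fn) (uint32_t crc, const void *data, size_t len);

/* Portable bitwise CRC-32C (reflected polynomial 0x82F63B78). */
static uint32_t
crc32c_scalar(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = (const unsigned char *) data;

	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
	}
	return crc;
}

/* Stand-in for a PMULL-based routine; the real one is not reproduced here. */
static uint32_t
crc32c_vectorized(uint32_t crc, const void *data, size_t len)
{
	return crc32c_scalar(crc, data, len);
}

/* Runtime check; only Linux on AArch64 exposes HWCAP_PMULL this way. */
static bool
pmull_available(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_PMULL)
	return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
#else
	return false;				/* assumption: no check on other platforms */
#endif
}

static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len);

/* The pointer starts at the chooser, which replaces it on the first call. */
static crc32c_fn crc32c_impl = crc32c_choose;

static uint32_t
crc32c_choose(uint32_t crc, const void *data, size_t len)
{
	crc32c_impl = pmull_available() ? crc32c_vectorized : crc32c_scalar;
	return crc32c_impl(crc, data, len);
}

/* Small compile-time-constant lengths skip the pointer indirection. */
#define SMALL_INPUT 32			/* hypothetical threshold */
#if defined(__GNUC__)
#define COMPUTE_CRC32C(crc, data, len) \
	((__builtin_constant_p(len) && (len) < SMALL_INPUT) \
	 ? crc32c_scalar((crc), (data), (len)) \
	 : crc32c_impl((crc), (data), (len)))
#else
#define COMPUTE_CRC32C(crc, data, len) crc32c_impl((crc), (data), (len))
#endif
```

A caller would seed the CRC with 0xFFFFFFFF and invert the result at the end, for example `COMPUTE_CRC32C(0xFFFFFFFFu, buf, n) ^ 0xFFFFFFFFu`. On platforms without the Linux headers the sketch simply falls back to the scalar routine, which is why a CI hack would be needed to exercise the vectorized path there.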
Hi John
Thank you for working on this. I had one question about the mixed use of intrinsics and inline asm here.
On Jan 12, 2026, at 1:27 AM, John Naylor <johncnaylorls@gmail.com> wrote:
> On Wed, May 14, 2025 I wrote:
>> We did something similar for x86 for v18, and here is some progress
>> towards Arm support.
> Coming back to this, since there's been recent interest in Arm support.
> v2 is a rebase, with a few changes.
> - I simplified it by leaving out the inlining for "assume CRC" builds,
> since I wanted to avoid alignment considerations if I can. I think
> always indirecting through a pointer will have less risk of
> regressions in a realistic setting than for x86 since Arm chips
> typically have low latency for carryless multiplication instructions.
> With just a bit of code we can still use the direct call for small
> constant inputs, so I did that to avoid regressions under WAL insert
> lock.
> - One coding idiom for a vector literal in the generated code was
> giving pgindent indigestion, so I rewrote it using Neon intrinsics and
> verified it in Godbolt.
>> 0002: Like 3c6e8c12389 and in fact uses the same program to generate
>> the code, by specifying Neon instructions with the Arm "crypto"
>> extension instead. There are some interesting differences from x86
>> here as well:
>> - The upstream implementation chose to use inline assembly instead of
>> intrinsics for some reason. I initially thought that was a way to get
>> broader compiler support, but it turns out you still need to pass the
>> relevant flags to get the assembly to link.
Since the implementation already uses NEON intrinsics such as vld1q_u64, I was wondering why the pmull / pmull2 + eor helpers still need to be inline asm rather than intrinsics.
Is that due to compiler/toolchain support, or because the intrinsic-based version produced noticeably worse code?
> To follow-up for curiosity's sake, [1] says that Apple chips can issue
> PMULL + EOR as a single uop if they are next to each other in the
> instruction stream.
>> - I only have Meson support for now, since I used MacOS on CI to test.
>> That OS and compiler combination apparently targets the CRC extension,
>> but the PMULL instruction runtime check uses Linux-only headers, I
>> believe, so previously I hacked the choose function to return true for
>> testing. The choose function in 0002 is untested in this form.
> This is still true, but now the CI hack lives in a separate
> not-for-commit patch for clarity.
> autoconf support is a WIP, and I will share that after I do some
> testing on an Arm Linux instance.
> [1] https://dougallj.github.io/applecpu/firestorm.html
> --
> John Naylor
> Amazon Web Services
> <v2-0001-Compute-CRC32C-on-ARM-using-the-Crypto-Extension-.patch><v2-0002-Force-testing-on-MacOS-CI-XXX-not-for-commit.patch>
Regards
Haibo
On Wed, Mar 18, 2026 at 10:34 AM Haibo Yan <tristan.yim@gmail.com> wrote:
>
> Hi John
>
> Thank you for working on this. I had one question about the mixed use of intrinsics and inline asm here.
> Since the implementation already uses NEON intrinsics such as vld1q_u64, I was wondering why the pmull / pmull2 + eor helpers still need to be inline asm rather than intrinsics.
>
> Is that due to compiler/toolchain support, or because the intrinsic-based version produced noticeably worse code?

I answered that in the email you replied to, re-quoted here:

> To follow-up for curiosity's sake, [1] says that Apple chips can issue
> PMULL + EOR as a single uop if they are next to each other in the
> instruction stream.

> [1] https://dougallj.github.io/applecpu/firestorm.html

I don't know if that's relevant for current server hardware, so it
could be pointless. I'm personally not a fan of inline assembly, but I
also didn't yet want to put in the effort to alter generated code. I
don't think it would be very hard to do, however.

--
John Naylor
Amazon Web Services
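Editor's sketch, not taken from the patch: to make the intrinsics versus inline asm question concrete, the C below shows what a "PMULL then EOR" helper might look like in both styles. The helper names (clmul_lo_eor_intrin, clmul_lo_eor_asm, clmul64_ref) are hypothetical and the patch's generated helpers may differ in operands and constraints. The inline-asm form pins the two instructions next to each other, while the intrinsics form leaves scheduling to the compiler, so adjacency is not guaranteed there. A portable bitwise carryless multiply is included as a reference for what PMULL computes.

```c
#include <stdint.h>

/* Portable reference: 64x64 -> 128-bit carryless (polynomial) multiply. */
static void
clmul64_ref(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
	uint64_t	l = 0;
	uint64_t	h = 0;

	for (int i = 0; i < 64; i++)
	{
		if ((b >> i) & 1)
		{
			l ^= a << i;
			if (i > 0)
				h ^= a >> (64 - i);
		}
	}
	*lo = l;
	*hi = h;
}

/* The Arm-specific part needs the crypto/AES extension enabled at build time. */
#if defined(__aarch64__) && \
	(defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO))
#include <arm_neon.h>

/* Intrinsics: the compiler is free to schedule PMULL and EOR apart. */
static inline uint64x2_t
clmul_lo_eor_intrin(uint64x2_t a, uint64x2_t b, uint64x2_t c)
{
	poly128_t	prod = vmull_p64((poly64_t) vgetq_lane_u64(a, 0),
								 (poly64_t) vgetq_lane_u64(b, 0));

	return veorq_u64(vreinterpretq_u64_p128(prod), c);
}

/* Inline asm: PMULL and EOR are emitted back to back. */
static inline uint64x2_t
clmul_lo_eor_asm(uint64x2_t a, uint64x2_t b, uint64x2_t c)
{
	uint64x2_t	r;

	/* "=&w": r is written before c is read, so it must not share a register. */
	__asm__("pmull %0.1q, %1.1d, %2.1d\n\t"
			"eor %0.16b, %0.16b, %3.16b"
			: "=&w"(r)
			: "w"(a), "w"(b), "w"(c));
	return r;
}
#endif
```

Comparing the compiler output of the two Arm helpers, for instance in Godbolt, would show whether a given compiler and version keeps the pair adjacent in the intrinsics form; the sketch makes no claim about which form performs better.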
On Tue, Mar 17, 2026 at 11:52 PM John Naylor <johncnaylorls@gmail.com> wrote:
> On Wed, Mar 18, 2026 at 10:34 AM Haibo Yan <tristan.yim@gmail.com> wrote:
>>
>> Hi John
>>
>> Thank you for working on this. I had one question about the mixed use of intrinsics and inline asm here.
>> Since the implementation already uses NEON intrinsics such as vld1q_u64, I was wondering why the pmull / pmull2 + eor helpers still need to be inline asm rather than intrinsics.
>>
>> Is that due to compiler/toolchain support, or because the intrinsic-based version produced noticeably worse code?
> I answered that in the email you replied to, re-quoted here:
>> To follow-up for curiosity's sake, [1] says that Apple chips can issue
>> PMULL + EOR as a single uop if they are next to each other in the
>> instruction stream.
>> [1] https://dougallj.github.io/applecpu/firestorm.html
> I don't know if that's relevant for current server hardware, so it
> could be pointless. I'm personally not a fan of inline assembly, but I
> also didn't yet want to put in the effort to alter generated code. I
> don't think it would be very hard to do, however.
Thanks, that makes sense as an explanation for why the inline asm is there today. But it also sounds like this is more of a temporary implementation choice than a conclusion that intrinsics are unsuitable. If so, I wonder whether it would be better to treat an intrinsics-based version as the preferred end state unless benchmarks show a clear regression.
Regards
Haibo
On Thu, Mar 19, 2026 at 12:17 AM Haibo Yan <tristan.yim@gmail.com> wrote:
>
> On Tue, Mar 17, 2026 at 11:52 PM John Naylor <johncnaylorls@gmail.com> wrote:
>> I don't know if that's relevant for current server hardware, so it
>> could be pointless. I'm personally not a fan of inline assembly, but I
>> also didn't yet want to put in the effort to alter generated code. I
>> don't think it would be very hard to do, however.
>
> Thanks, that makes sense as an explanation for why the inline asm is there today. But it also sounds like this is more of a temporary implementation choice than a conclusion that intrinsics are unsuitable.

I can see how my words imply that, but after a moment's thought I
still don't want to put in that effort without a good reason. For
starters, what I said above about "not very hard" may be wishful
thinking.

> If so, I wonder whether it would be better to treat an intrinsics-based version as the preferred end state unless benchmarks show a clear regression.

To meet your criterion, we'd not only have to rewrite it correctly,
we'd have to test on multiple vendors of non-Apple hardware and
multiple compiler vendors/versions (at least where the binary output
is different) to prove we haven't caused a regression. I wouldn't
recommend that anyone accept that challenge as stated, since the
risk/reward ratio is just not favorable. Especially considering we're
2 1/2 weeks away from feature freeze.

--
John Naylor
Amazon Web Services