Обсуждение: [PATCH] CRC32C optimizations using SVE2 on ARM.

Поиск
Список
Период
Сортировка

[PATCH] CRC32C optimizations using SVE2 on ARM.

От
"Devanga.Susmitha@fujitsu.com"
Дата:
Hello,
This email aims to discuss the contribution of an optimized CRC32C implementation for ARM (aarch64) machines. The CRC32C function is widely used throughout PostgreSQL for checksum workloads such as WAL generation and data integrity validation.
The current CRC32C implementation on ARM relies on scalar hardware instructions (__crc32cb/ch/cw/cd) which process 1, 2, 4, or 8 bytes per iteration. While correct and efficient for smaller inputs, this scalar design becomes a bottleneck for large buffer sizes, leading to noticeable performance degradation.
With the introduction of an SVE2�Cbased implementation, we leverage wide vector intrinsics to process up to 128 bytes per iteration using 8 vectors in parallel. This design significantly accelerates CRC32C execution by reducing the total number of loop iterations, minimizing serial dependency chains, and improving compute and memory throughput.
We have implemented this feature to ensure correctness, compatibility, and safe integration. It includes compile-time and runtime checks to detect SVE2 support on both the compiler and underlying hardware. When SVE2 is unavailable, the function safely falls back to the existing scalar CRC32C path to ensure consistent results across systems.
For architecture-specific functions, we use pg_attribute_target("arch=armv9-a+sve2-aes") to ensure precise compilation control without modifying global CFLAGS, enabling a clean integration within PostgreSQL’s build system.
System Configuration
Machine: AWS EC2 c8g.4xlarge (16 cores, 30 GB RAM)
OS: Ubuntu 22.04.5 LTS
GCC: 13.1.0
Benchmark and Results
Setup:
We used the CRC32C microbenchmark SQL function published on the PostgreSQL mailing list
[0] to evaluate the performance of the SVE2 implementation against the existing scalar ARM version across multiple buffer sizes.
Query:
time SELECT drive_crc32c(1000000, bytes);
The experiment was executed for input sizes ranging from 8 bytes up to 32 MB.
Results:
Significant performance gains are observed starting from 128 bytes. For larger buffer sizes (≥ 1 KB), the SVE2 implementation achieves approximately 2�C3 times speed-ups, with peak improvements observed for multi-megabyte inputs due to parallel folding and polynomial carry-less multiplication using 8 SVE2 vectors.
These improvements make CRC32C computation substantially faster for real PostgreSQL workloads involving large data blocks or WAL buffers.
We would like to contribute this work so that it becomes available to the PostgreSQL community. As part of the process, we are following the guidelines provided in Submitting a Patch - PostgreSQL wiki. Please find the attachment for the patches and performance results.
Please let us know if you have any queries or suggestions.

Вложения

Re: [PATCH] CRC32C optimizations using SVE2 on ARM.

От
John Naylor
Дата:
On Fri, Dec 19, 2025 at 4:20 AM Devanga.Susmitha@fujitsu.com
<Devanga.Susmitha@fujitsu.com> wrote:
> For architecture-specific functions, we use pg_attribute_target("arch=armv9-a+sve2-aes")

There was already a proposal to use armv8-a+crypto, which is more
widely available and works on smaller inputs. Perhaps you'd be
interested in reviewing and testing?

https://www.postgresql.org/message-id/CANWCAZaKhE%2BRD5KKouUFoxx1EbUNrNhcduM1VQ%3DDkSDadNEFng%40mail.gmail.com

> to ensure precise compilation control without modifying global CFLAGS, enabling a clean integration within
PostgreSQL’sbuild system. 

I think the reason we continue to use CFLAGS here was that clang
support for target attributes on Arm is fairly recent. It's probably
too soon to reconsider that.

--
John Naylor
Amazon Web Services



Re: [PATCH] CRC32C optimizations using SVE2 on ARM.

От
"Devanga.Susmitha@fujitsu.com"
Дата:
>There was already a proposal to use armv8-a+crypto, which is more
widely available and works on smaller inputs.

Our implementation with SVE2 is able to gain better performance than
https://www.postgresql.org/message-id/CANWCAZaKhE%2BRD5KKouUFoxx1EbUNrNhcduM1VQ%3DDkSDadNEFng%40mail.gmail.com

I've benchmarked our SVE2 implementation against armv8-a+crypto, and the results show substantial improvements.

Buffer size (bytes)
               armv8+crypto (in ms)
             armv9+SVE2 (in ms)
                   Improvement
512
28.491
19.37
                  32.0% faster
1024
47.145
29.962
                  36.5% faster
2048
86.717
52.841
                  39.1% faster
4096
165.205
105.626
                  36.1% faster
8192
318.103
226.437
                  28.8% faster

These buffer sizes are particularly relevant for PostgreSQL workloads:
  • 8KB: Default page size (28.8% faster checksumming)
  • 4KB: Alternative page size configuration (36.1% faster)
  • 512B-2KB: Typical WAL record sizes (32-39% faster)
  • 2KB: TOAST chunk size (39% faster)


While armv8-a+crypto has broader current deployment, SVE2 is already available in production cloud infrastructure: AWS Graviton 4, Ampere AmpereOne, and NVIDIA Grace (all released 2023). As ARMv9 adoption continues, these gains become increasingly relevant.
Rather than choosing one approach over the other, perhaps we could implement both with runtime CPU detection? Since we already perform runtime detection for crypto extension availability, adding an additional check for SVE2 introduces no performance degradation on systems without SVE2, while providing significant performance gains (28-39%) on systems that do support it. This would provide optimal performance on capable hardware while maintaining broad compatibility. Please let me know your thoughts.


static pg_crc32c (*pg_comp_crc32c_armv8)(pg_crc32c crc, const void *data, size_t len);
void pg_comp_crc32c_choose_armv8(void)
{
    if (pg_cpu_has_sve2())
        pg_comp_crc32c_armv8 = pg_comp_crc32c_armv8_sve2;
    else if (pg_cpu_has_crypto())
        pg_comp_crc32c_armv8 = pg_comp_crc32c_armv8_crypto;
    else
        pg_comp_crc32c_armv8 = pg_comp_crc32c_sb8; // scalar fallback
}



Thanks,
Susmitha Devanga.




From: John Naylor <johncnaylorls@gmail.com>
Sent: Friday, December 19, 2025 08:27
To: Susmitha, Devanga <Devanga.Susmitha@fujitsu.com>
Cc: pgsql-hackers <pgsql-hackers@postgresql.org>; Hajela, Ragesh <Ragesh.Hajela@fujitsu.com>; Bhattacharya, Chiranmoy <Chiranmoy.Bhattacharya@fujitsu.com>
Subject: Re: [PATCH] CRC32C optimizations using SVE2 on ARM.

On Fri, Dec 19, 2025 at 4:20 AM Devanga.Susmitha@fujitsu.com
<Devanga.Susmitha@fujitsu.com> wrote:
> For architecture-specific functions, we use pg_attribute_target("arch=armv9-a+sve2-aes")

There was already a proposal to use armv8-a+crypto, which is more
widely available and works on smaller inputs. Perhaps you'd be
interested in reviewing and testing?

https://www.postgresql.org/message-id/CANWCAZaKhE%2BRD5KKouUFoxx1EbUNrNhcduM1VQ%3DDkSDadNEFng%40mail.gmail.com

> to ensure precise compilation control without modifying global CFLAGS, enabling a clean integration within PostgreSQL’s build system.

I think the reason we continue to use CFLAGS here was that clang
support for target attributes on Arm is fairly recent. It's probably
too soon to reconsider that.

--
John Naylor
Amazon Web Services

Re: [PATCH] CRC32C optimizations using SVE2 on ARM.

От
John Naylor
Дата:

On Mon, Jan 5, 2026 at 4:02 PM Devanga.Susmitha@fujitsu.com <Devanga.Susmitha@fujitsu.com> wrote:
>
> >There was already a proposal to use armv8-a+crypto, which is more
> widely available and works on smaller inputs.
>
> Our implementation with SVE2 is able to gain better performance than
> https://www.postgresql.org/message-id/CANWCAZaKhE%2BRD5KKouUFoxx1EbUNrNhcduM1VQ%3DDkSDadNEFng%40mail.gmail.com
>
> I've benchmarked our SVE2 implementation against armv8-a+crypto, and the results show substantial improvements.

Thanks for testing! For future reference, please don't top post. We prefer to reply in-line to relevant points.

> These buffer sizes are particularly relevant for PostgreSQL workloads:
>
> 8KB: Default page size (28.8% faster checksumming)
> 4KB: Alternative page size configuration (36.1% faster)

We don't use CRC for page checksums. Did you mean FPI in WAL?

> 512B-2KB: Typical WAL record sizes (32-39% faster)

That seems on the high end, but your mileage may vary. Even if it were typical, the difference isn't that impressive, given that the patch seems to use double the number of (twice as wide?) accumulators. In any case, I suspect going from (say) 3x to 4x won't make much difference in profiles for those workloads, aside from having less portability. Plus, it seems more widely useful to allow a cutoff at 64 bytes, rather than 128.

> While armv8-a+crypto has broader current deployment, SVE2 is already available in production cloud infrastructure: AWS Graviton 4, Ampere AmpereOne, and NVIDIA Grace (all released 2023). As ARMv9 adoption continues, these gains become increasingly relevant.

That's pretty recent.

> Rather than choosing one approach over the other, perhaps we could implement both with runtime CPU detection?

That's bad for maintainability.

The broader point is I've already found something that seems good enough (and can be configured differently if necessary), portable enough, and is already used for our x86 vectorized implementation. I just wasn't going to make it a priority until someone showed up with enough interest to test and review it.

--
John Naylor
Amazon Web Services