Re: Popcount optimization using AVX512
From | Nathan Bossart |
---|---|
Subject | Re: Popcount optimization using AVX512 |
Date | |
Msg-id | 20231107022240.GA729644@nathanxps13 |
In response to | Re: Popcount optimization using AVX512 (Matthias van de Meent <boekewurm+postgres@gmail.com>) |
Responses |
Re: Popcount optimization using AVX512
|
List | pgsql-hackers |
On Fri, Nov 03, 2023 at 12:16:05PM +0100, Matthias van de Meent wrote:
> On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:
>> This proposal showcases the speed-up provided to popcount feature when
>> using AVX512 registers. The intent is to share the preliminary results
>> with the community and get feedback for adding avx512 support for
>> popcount.
>>
>> Revisiting the previous discussion/improvements around this feature, I
>> have created a micro-benchmark based on the pg_popcount() in
>> PostgreSQL's current implementations for x86_64 using the newer AVX512
>> intrinsics. Playing with this implementation has improved performance up
>> to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will
>> benefit scenarios relying on popcount.

Nice.  I've been testing out AVX2 support in src/include/port/simd.h, and
the results look promising there, too.  I intend to start a new thread for
that (hopefully soon), but one open question I don't have a great answer
for yet is how to detect support for newer intrinsics.  So far, we've been
able to use function pointers (e.g., popcount, crc32c) or deduce support
via common predefined compiler macros (e.g., we assume SSE2 is supported if
the compiler is targeting 64-bit x86).  But the former introduces a
performance penalty, and we probably want to inline most of this stuff,
anyway.  And the latter limits us to stuff that has been around for a
decade or two.

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

> Apart from the two type functions bytea_bit_count and bit_bit_count
> (which are not accessed in postgres' own systems, but which could want
> to cover bytestreams of >BLCKSZ) the only popcount usages I could find
> were on objects that fit on a page, i.e. <8KiB in size. How does
> performance compare for bitstreams of such sizes, especially after any
> CPU clock implications are taken into account?

Yeah, the previous optimizations in this area appear to have used ANALYZE
as the benchmark, presumably because of visibilitymap_count().  I briefly
attempted to measure the difference with and without AVX512 support, but I
haven't noticed any difference thus far.  One complication for
visibilitymap_count() is that the data passed to pg_popcount64() is masked,
which requires a couple more instructions when you're using the intrinsics.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
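
For readers unfamiliar with the function-pointer approach mentioned in the message, here is a minimal sketch of how such runtime dispatch can look, together with a masked popcount loop of the kind described for visibilitymap_count(). It is an illustration under stated assumptions, not PostgreSQL's actual code: the names popcount64, popcount64_choose, popcount64_fast, popcount64_slow, masked_count and EXAMPLE_MASK64 are all hypothetical, the mask value is chosen only for demonstration, and the CPU check relies on the GCC/Clang builtins __builtin_cpu_init() and __builtin_cpu_supports(), which are only used here when targeting x86.

```c
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
static int
popcount64_slow(uint64_t word)
{
    int count = 0;

    while (word != 0)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_RUNTIME_CHECK 1
/* Stand-in for a hardware-assisted path (assumption: compiler builtin). */
static int
popcount64_fast(uint64_t word)
{
    return __builtin_popcountll(word);
}
#endif

static int popcount64_choose(uint64_t word);

/*
 * Every caller goes through this pointer.  The indirect call cannot be
 * inlined, which is the performance penalty the message refers to.
 */
static int (*popcount64) (uint64_t word) = popcount64_choose;

/* First call lands here, picks an implementation, then forwards the call. */
static int
popcount64_choose(uint64_t word)
{
#ifdef HAVE_X86_RUNTIME_CHECK
    __builtin_cpu_init();
    if (__builtin_cpu_supports("popcnt"))
        popcount64 = popcount64_fast;
    else
#endif
        popcount64 = popcount64_slow;

    return popcount64(word);
}

/* Hypothetical mask selecting every other bit, for illustration only. */
#define EXAMPLE_MASK64 UINT64_C(0x5555555555555555)

/* Masked count: the AND is extra work done per word before counting. */
static uint64_t
masked_count(const uint64_t *words, int nwords)
{
    uint64_t total = 0;

    for (int i = 0; i < nwords; i++)
        total += (uint64_t) popcount64(words[i] & EXAMPLE_MASK64);
    return total;
}
```

The alternative described in the message, deducing support from predefined compiler macros, would replace the pointer with a direct (inlinable) call selected at build time, at the cost of only being usable for instruction sets the build target is guaranteed to have.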