Re: Popcount optimization using AVX512

From	Nathan Bossart
Subject	Re: Popcount optimization using AVX512
Date	7 November 2023, 02:22:40
Msg-id	20231107022240.GA729644@nathanxps13
In reply to	Re: Popcount optimization using AVX512 (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Replies	Re: Popcount optimization using AVX512
List	pgsql-hackers

On Fri, Nov 03, 2023 at 12:16:05PM +0100, Matthias van de Meent wrote:
> On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:
>> This proposal showcases the speed-up provided to popcount feature when
>> using AVX512 registers. The intent is to share the preliminary results
>> with the community and get feedback for adding avx512 support for
>> popcount.
>>
>> Revisiting the previous discussion/improvements around this feature, I
>> have created a micro-benchmark based on the pg_popcount() in
>> PostgreSQL's current implementations for x86_64 using the newer AVX512
>> intrinsics. Playing with this implementation has improved performance up
>> to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will
>> benefit scenarios relying on popcount.

Nice.  I've been testing out AVX2 support in src/include/port/simd.h, and
the results look promising there, too.  I intend to start a new thread for
that (hopefully soon), but one open question I don't have a great answer
for yet is how to detect support for newer intrinsics.  So far, we've been
able to use function pointers (e.g., popcount, crc32c) or deduce support
via common predefined compiler macros (e.g., we assume SSE2 is supported if
the compiler is targeting 64-bit x86).  But the former introduces a
performance penalty, and we probably want to inline most of this stuff,
anyway.  And the latter limits us to stuff that has been around for a
decade or two.
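
To make the trade-off concrete, here is a rough sketch of the two detection styles. It is not the code in the tree: all of the names are made up for illustration, and it assumes GCC or Clang on x86 for the runtime check (__builtin_cpu_supports) and the per-function target attribute. The first style picks an implementation on the first call through a function pointer; the second keys off a predefined compiler macro (here __POPCNT__, which is only defined when the build targets CPUs that have the instruction) and can be inlined.

```c
#include <assert.h>
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
static int
popcount64_slow(uint64_t word)
{
    int count = 0;

    while (word != 0)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_RUNTIME_CPU_CHECK 1
/* The target attribute allows POPCNT in this one function only. */
__attribute__((target("popcnt")))
static int
popcount64_fast(uint64_t word)
{
    return __builtin_popcountll(word);
}
#endif

static int popcount64_choose(uint64_t word);

/* Style 1: every call goes through this pointer, so it cannot be inlined. */
static int (*popcount64_ptr) (uint64_t word) = popcount64_choose;

/* First call lands here, picks an implementation, and repoints the pointer. */
static int
popcount64_choose(uint64_t word)
{
    popcount64_ptr = popcount64_slow;
#ifdef HAVE_RUNTIME_CPU_CHECK
    if (__builtin_cpu_supports("popcnt"))
        popcount64_ptr = popcount64_fast;
#endif
    assert(popcount64_ptr != popcount64_choose);
    return popcount64_ptr(word);
}

/* Style 2: decided at compile time from a predefined macro; inlinable. */
static inline int
popcount64_inline(uint64_t word)
{
#if defined(__POPCNT__)
    return __builtin_popcountll(word);
#else
    return popcount64_slow(word);
#endif
}
```

With the first style, a caller writes popcount64_ptr(x) and pays for an indirect call each time. With the second, popcount64_inline(x) only uses the instruction if the whole build was allowed to assume it.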

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

> Apart from the two type functions bytea_bit_count and bit_bit_count
> (which are not accessed in postgres' own systems, but which could want
> to cover bytestreams of >BLCKSZ) the only popcount usages I could find
> were on objects that fit on a page, i.e. <8KiB in size. How does
> performance compare for bitstreams of such sizes, especially after any
> CPU clock implications are taken into account?

Yeah, the previous optimizations in this area appear to have used ANALYZE
as the benchmark, presumably because of visibilitymap_count().  I briefly
attempted to measure the difference with and without AVX512 support, but I
haven't noticed any difference thus far.  One complication for
visibilitymap_count() is that the data passed to pg_popcount64() is masked,
which requires a couple more instructions when you're using the intrinsics.
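
For anyone who hasn't looked at that code, the masking amounts to something like the sketch below. This is an illustration and not the actual function: the mask values are written from memory on the assumption that the map stores two bits per heap page (all-visible in the low bit of each pair, all-frozen in the high bit), and masked_popcount() and popcount64() are hypothetical helpers. The extra AND per word is the part that a vectorized version would also have to do before counting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: bit pairs, low bit = all-visible, high bit = all-frozen. */
#define VISIBLE_MASK64 UINT64_C(0x5555555555555555)
#define FROZEN_MASK64  UINT64_C(0xaaaaaaaaaaaaaaaa)

/* Portable stand-in for a 64-bit popcount. */
static int
popcount64(uint64_t word)
{
    int count = 0;

    while (word != 0)
    {
        word &= word - 1;       /* clear the lowest set bit */
        count++;
    }
    return count;
}

/*
 * Count the bits selected by mask in a buffer whose length is a multiple of
 * 8 bytes.  Both masks above repeat the same pattern in every byte, so the
 * result does not depend on byte order.
 */
static uint64_t
masked_popcount(const unsigned char *buf, size_t nbytes, uint64_t mask)
{
    uint64_t count = 0;

    assert(buf != NULL && nbytes % sizeof(uint64_t) == 0);

    for (size_t off = 0; off < nbytes; off += sizeof(uint64_t))
    {
        uint64_t word;

        memcpy(&word, buf + off, sizeof(word));     /* avoids unaligned reads */
        count += (uint64_t) popcount64(word & mask);
    }
    return count;
}
```

Calling masked_popcount(map, len, VISIBLE_MASK64) and then again with FROZEN_MASK64 gives the two totals from the same bytes.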

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com