Re: Popcount optimization using AVX512

From: Nathan Bossart
Subject: Re: Popcount optimization using AVX512
Msg-id: 20231107022240.GA729644@nathanxps13
In response to: Re: Popcount optimization using AVX512  (Matthias van de Meent <boekewurm+postgres@gmail.com>)
Responses: Re: Popcount optimization using AVX512  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Fri, Nov 03, 2023 at 12:16:05PM +0100, Matthias van de Meent wrote:
> On Thu, 2 Nov 2023 at 15:22, Amonson, Paul D <paul.d.amonson@intel.com> wrote:
>> This proposal showcases the speed-up provided to popcount feature when
>> using AVX512 registers. The intent is to share the preliminary results
>> with the community and get feedback for adding avx512 support for
>> popcount.
>>
>> Revisiting the previous discussion/improvements around this feature, I
>> have created a micro-benchmark based on the pg_popcount() in
>> PostgreSQL's current implementations for x86_64 using the newer AVX512
>> intrinsics. Playing with this implementation has improved performance up
>> to 46% on Intel's Sapphire Rapids platform on AWS. Such gains will
>> benefit scenarios relying on popcount.

Nice.  I've been testing out AVX2 support in src/include/port/simd.h, and
the results look promising there, too.  I intend to start a new thread for
that (hopefully soon), but one open question I don't have a great answer
for yet is how to detect support for newer intrinsics.  So far, we've been
able to use function pointers (e.g., popcount, crc32c) or deduce support
via common predefined compiler macros (e.g., we assume SSE2 is supported if
the compiler is targeting 64-bit x86).  But the former introduces a
performance penalty, and we probably want to inline most of this stuff,
anyway.  And the latter limits us to stuff that has been around for a
decade or two.
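
As a rough illustration of the two detection styles described above, here is a
minimal sketch, not PostgreSQL's actual code. All function names are
hypothetical. It assumes GCC or Clang, whose __builtin_cpu_supports() and
target attribute are used for the x86-64 runtime check; other compilers and
architectures fall through to the portable loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable fallback: clear the lowest set bit until none remain. */
static int
popcount64_slow(uint64_t word)
{
    int         count = 0;

    while (word != 0)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* The target attribute lets this one function use POPCNT without -mpopcnt. */
static int __attribute__((target("popcnt")))
popcount64_fast(uint64_t word)
{
    return __builtin_popcountll(word);
}
#endif

static int  popcount64_choose(uint64_t word);

/*
 * Style 1: runtime detection through a function pointer.  The first call
 * lands in popcount64_choose(), which repoints the pointer for later calls.
 * Every call is indirect, so the compiler cannot inline it.
 */
static int  (*popcount64_ptr) (uint64_t word) = popcount64_choose;

static int
popcount64_choose(uint64_t word)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("popcnt"))
        popcount64_ptr = popcount64_fast;
    else
#endif
        popcount64_ptr = popcount64_slow;

    return popcount64_ptr(word);
}

/*
 * Style 2: compile-time detection through a predefined compiler macro.
 * This inlines freely, but the fast path is only chosen when the build
 * itself targets a CPU with the instruction (e.g. -mpopcnt).
 */
static inline int
popcount64_inline(uint64_t word)
{
#if defined(__POPCNT__)
    return __builtin_popcountll(word);
#else
    return popcount64_slow(word);
#endif
}
```

Callers of the first style write popcount64_ptr(x) and pay for an indirect
call each time; callers of the second write popcount64_inline(x) and get
whatever the build flags allowed.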

Like I said, I don't have any proposals yet, but assuming we do want to
support newer intrinsics, either open-coded or via auto-vectorization, I
suspect we'll need to gather consensus for a new policy/strategy.

> Apart from the two type functions bytea_bit_count and bit_bit_count
> (which are not accessed in postgres' own systems, but which could want
> to cover bytestreams of >BLCKSZ) the only popcount usages I could find
> were on objects that fit on a page, i.e. <8KiB in size. How does
> performance compare for bitstreams of such sizes, especially after any
> CPU clock implications are taken into account?

Yeah, the previous optimizations in this area appear to have used ANALYZE
as the benchmark, presumably because of visibilitymap_count().  I briefly
attempted to measure the difference with and without AVX512 support, but I
haven't noticed any difference thus far.  One complication for
visibilitymap_count() is that the data passed to pg_popcount64() is masked,
which requires a couple more instructions when you're using the intrinsics.
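
To make the masking point concrete, here is a simplified sketch of a masked
popcount loop. It is not the actual visibilitymap_count() code: the names are
hypothetical, and the mask values assume a layout with two status bits per
heap page, one in the even bit positions and one in the odd ones. The extra
AND on every word is the additional work a vectorized version also has to do.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: even bits carry one flag per page, odd bits the other. */
#define SKETCH_VISIBLE_MASK64 UINT64_C(0x5555555555555555)
#define SKETCH_FROZEN_MASK64  UINT64_C(0xaaaaaaaaaaaaaaaa)

/* Portable per-word popcount: clear the lowest set bit until none remain. */
static int
sketch_popcount64(uint64_t word)
{
    int         count = 0;

    while (word != 0)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

/* Count only the bits selected by mask in each 64-bit word of the buffer. */
static uint64_t
sketch_count_masked(const uint64_t *words, size_t nwords, uint64_t mask)
{
    uint64_t    total = 0;

    for (size_t i = 0; i < nwords; i++)
        total += (uint64_t) sketch_popcount64(words[i] & mask);

    return total;
}
```

Calling sketch_count_masked() once with each mask over the same buffer gives
the two per-flag totals separately.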

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com


