Re: add AVX2 support to simd.h

From Ants Aasma
Subject Re: add AVX2 support to simd.h
Date
Msg-id CANwKhkMvEr+EgRCX5eV39cdCBw_ArcevM0hmJqq-1dNgpAB0cg@mail.gmail.com
In response to Re: add AVX2 support to simd.h  (Peter Eisentraut <peter@eisentraut.org>)
List pgsql-hackers
On Tue, 9 Jan 2024 at 16:03, Peter Eisentraut <peter@eisentraut.org> wrote:
> On 29.11.23 18:15, Nathan Bossart wrote:
> > Using the same benchmark as we did for the SSE2 linear searches in
> > XidInMVCCSnapshot() (commit 37a6e5d) [1] [2], I see the following:
> >
> >    writers    sse2    avx2     %
> >        256    1195    1188    -1
> >        512     928    1054   +14
> >       1024     633     716   +13
> >       2048     332     420   +27
> >       4096     162     203   +25
> >       8192     162     182   +12
>
> AFAICT, your patch merely provides an alternative AVX2 implementation
> for where currently SSE2 is supported, but it doesn't provide any new
> API calls or new functionality.  One might naively expect that these are
> just two different ways to call the underlying primitives in the CPU, so
> these performance improvements are surprising to me.  Or do the CPUs
> actually have completely separate machinery for SSE2 and AVX2, and just
> using the latter to do the same thing is faster?

The AVX2 implementation uses a wider vector register. On most current
processors, the throughput of the instructions in question is the same
on 256-bit vectors as on 128-bit vectors. Basically, the chip has
AVX2's worth of machinery, and using SSE2 leaves half of it unused.
Notable exceptions are the efficiency cores on recent Intel desktop
CPUs and AMD CPUs before Zen 2, where AVX2 instructions are internally
split into two 128-bit-wide instructions.
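
To make the width difference concrete, here is a minimal sketch of the
same linear search written once with SSE2 and once with AVX2
intrinsics. It is not code from the patch: the function names
(search_scalar, search_sse2, search_avx2, search_best) are made up for
illustration, and it assumes GCC or Clang on x86-64 for the target
attribute and __builtin_cpu_supports(). The two vector loops have the
same shape; the AVX2 one simply covers eight 32-bit elements per
compare instead of four.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_SIMD_SKETCH 1
#endif

/* Scalar reference; also handles the tail that does not fill a vector. */
static bool
search_scalar(uint32_t key, const uint32_t *base, size_t nelem)
{
    for (size_t i = 0; i < nelem; i++)
        if (base[i] == key)
            return true;
    return false;
}

#ifdef HAVE_X86_SIMD_SKETCH
/* SSE2: four 32-bit lanes per compare (SSE2 is baseline on x86-64). */
static bool
search_sse2(uint32_t key, const uint32_t *base, size_t nelem)
{
    const __m128i keys = _mm_set1_epi32((int) key);
    size_t      i = 0;

    for (; i + 4 <= nelem; i += 4)
    {
        __m128i     chunk = _mm_loadu_si128((const __m128i *) &base[i]);
        __m128i     eq = _mm_cmpeq_epi32(chunk, keys);

        if (_mm_movemask_epi8(eq) != 0)
            return true;
    }
    return search_scalar(key, base + i, nelem - i);
}

/* AVX2: eight 32-bit lanes per compare; compiled for AVX2 only here. */
__attribute__((target("avx2")))
static bool
search_avx2(uint32_t key, const uint32_t *base, size_t nelem)
{
    const __m256i keys = _mm256_set1_epi32((int) key);
    size_t      i = 0;

    for (; i + 8 <= nelem; i += 8)
    {
        __m256i     chunk = _mm256_loadu_si256((const __m256i *) &base[i]);
        __m256i     eq = _mm256_cmpeq_epi32(chunk, keys);

        if (_mm256_movemask_epi8(eq) != 0)
            return true;
    }
    return search_scalar(key, base + i, nelem - i);
}
#endif

/* Pick the widest variant the running CPU supports. */
static bool
search_best(uint32_t key, const uint32_t *base, size_t nelem)
{
#ifdef HAVE_X86_SIMD_SKETCH
    if (__builtin_cpu_supports("avx2"))
        return search_avx2(key, base, nelem);
    return search_sse2(key, base, nelem);
#else
    return search_scalar(key, base, nelem);
#endif
}
```

The runtime dispatch in search_best is only there so the sketch runs on
any x86-64 machine; whether the real code selects the width at compile
time or at run time is a separate design question.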

For AVX-512 the picture is much more complicated. Some instructions
run at half rate; some run at full rate, but not on all ALU ports; and
some cause aggressive clock rate reduction on some microarchitectures.
AVX-512 also adds mask registers and masked vector instructions, which
enable considerably simpler code in many cases. Interestingly, I have
seen Clang make quite effective use of these masked instructions even
for code written with AVX2 intrinsics, when targeting an AVX-512
capable platform.
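
One case where masking can simplify code is tail handling. The sketch
below is an illustration only, not taken from the patch or from any
compiler output: masked_search_avx512 and masked_search are
hypothetical names, and it assumes GCC or Clang on x86-64. A mask
register marks which lanes are in bounds, so the final partial chunk
goes through the same masked load and compare as the full ones and no
separate scalar tail loop is needed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_MASKED_SKETCH 1

/* AVX-512F: sixteen 32-bit lanes; the mask disables out-of-range lanes. */
__attribute__((target("avx512f")))
static bool
masked_search_avx512(uint32_t key, const uint32_t *base, size_t nelem)
{
    const __m512i keys = _mm512_set1_epi32((int) key);

    for (size_t i = 0; i < nelem; i += 16)
    {
        size_t      remaining = nelem - i;
        /* All sixteen lanes live, or only the low "remaining" lanes. */
        __mmask16   live = remaining >= 16
            ? (__mmask16) 0xFFFF
            : (__mmask16) ((1u << remaining) - 1);
        /* Masked-off lanes are not read from memory and load as zero. */
        __m512i     chunk = _mm512_maskz_loadu_epi32(live, &base[i]);

        /* Only live lanes can report a match. */
        if (_mm512_mask_cmpeq_epi32_mask(live, chunk, keys) != 0)
            return true;
    }
    return false;
}
#endif

/* Falls back to a plain loop when AVX-512F is unavailable. */
static bool
masked_search(uint32_t key, const uint32_t *base, size_t nelem)
{
#ifdef HAVE_MASKED_SKETCH
    if (__builtin_cpu_supports("avx512f"))
        return masked_search_avx512(key, base, nelem);
#endif
    for (size_t i = 0; i < nelem; i++)
        if (base[i] == key)
            return true;
    return false;
}
```

Whether this is actually faster depends on the rate and clock-speed
effects described above; the sketch only shows the code-simplicity
side.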

The vector-width-independent approach used in the patch is nice for
simple cases because it does not need a separate implementation for
each vector width. However, for more complicated cases where
"horizontal" operations are needed, it is going to be much less
useful. But those cases can easily just drop down to using intrinsics
directly.
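
A minimal sketch of what a width-independent layer can look like is
below. The names (SkVector32, sk_load, sk_broadcast, sk_eq, sk_any,
sk_search) are invented for this example and are not the identifiers
used in simd.h or in the patch; the width is assumed to be chosen at
compile time from the compiler's __AVX2__ and __SSE2__ macros, with a
one-lane scalar fallback. The search loop itself never mentions a
vector width.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__AVX2__)
#include <immintrin.h>
typedef __m256i SkVector32;     /* eight 32-bit lanes */
static inline SkVector32 sk_load(const uint32_t *p)
{ return _mm256_loadu_si256((const __m256i *) p); }
static inline SkVector32 sk_broadcast(uint32_t v)
{ return _mm256_set1_epi32((int) v); }
static inline SkVector32 sk_eq(SkVector32 a, SkVector32 b)
{ return _mm256_cmpeq_epi32(a, b); }
static inline bool sk_any(SkVector32 v)
{ return _mm256_movemask_epi8(v) != 0; }
#elif defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i SkVector32;     /* four 32-bit lanes */
static inline SkVector32 sk_load(const uint32_t *p)
{ return _mm_loadu_si128((const __m128i *) p); }
static inline SkVector32 sk_broadcast(uint32_t v)
{ return _mm_set1_epi32((int) v); }
static inline SkVector32 sk_eq(SkVector32 a, SkVector32 b)
{ return _mm_cmpeq_epi32(a, b); }
static inline bool sk_any(SkVector32 v)
{ return _mm_movemask_epi8(v) != 0; }
#else
typedef uint32_t SkVector32;    /* scalar fallback: one lane */
static inline SkVector32 sk_load(const uint32_t *p) { return *p; }
static inline SkVector32 sk_broadcast(uint32_t v) { return v; }
static inline SkVector32 sk_eq(SkVector32 a, SkVector32 b)
{ return a == b ? UINT32_MAX : 0; }
static inline bool sk_any(SkVector32 v) { return v != 0; }
#endif

/* Number of 32-bit lanes in whichever vector type was selected. */
#define SK_LANES (sizeof(SkVector32) / sizeof(uint32_t))

/* Width-agnostic linear search: one body for every vector width. */
static bool
sk_search(uint32_t key, const uint32_t *base, size_t nelem)
{
    const SkVector32 keys = sk_broadcast(key);
    size_t      i = 0;

    for (; i + SK_LANES <= nelem; i += SK_LANES)
        if (sk_any(sk_eq(sk_load(&base[i]), keys)))
            return true;

    /* Elements that do not fill a whole vector. */
    for (; i < nelem; i++)
        if (base[i] == key)
            return true;
    return false;
}
```

Every operation here works lane by lane, plus one "any lane set" test,
which is why it maps cleanly onto any width. A horizontal operation,
such as summing the lanes of a single vector, would need a different
shuffle sequence for each width and does not fit this kind of wrapper
as neatly.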


