Re: Popcount optimization using AVX512

Поиск
Список
Период
Сортировка
От Ants Aasma
Тема Re: Popcount optimization using AVX512
Дата
Msg-id CANwKhkM-YZGE527y00LPU1660GW6zicvX7a=ZOsxnYYU0hng3g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Popcount optimization using AVX512  (Nathan Bossart <nathandbossart@gmail.com>)
Ответы Re: Popcount optimization using AVX512  (Nathan Bossart <nathandbossart@gmail.com>)
Список pgsql-hackers
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
> > On 2024-Mar-31, Nathan Bossart wrote:
> >> +    popcnt = _mm512_reduce_add_epi64(accum);
> >> +    return popcnt + pg_popcount_fast(buf, bytes);
> >
> > Hmm, doesn't this arrangement cause an extra function call to
> > pg_popcount_fast to be used here?  Given the level of micro-optimization
> > being used by this code, I would have thought that you'd have tried to
> > avoid that.  (At least, maybe avoid the call if bytes is 0, no?)
>
> Yes, it does.  I did another benchmark on very small arrays and can see the
> overhead.  This is the time in milliseconds to run pg_popcount() on an
> array 1 billion times:
>
>     size (bytes)  HEAD      AVX512-POPCNT
>     1             1707.685  3480.424
>     2             1926.694  4606.182
>     4             3210.412  5284.506
>     8             1920.703  3640.968
>     16            2936.91   4045.586
>     32            3627.956  5538.418
>     64            5347.213  3748.212
>
> I suspect that anything below 64 bytes will see this regression, as that is
> the earliest point where there are enough bytes for ZMM registers.

What about using the masking capabilities of AVX-512 to handle the
tail in the same code path? Masked out portions of a load instruction
will not generate an exception. To allow byte level granularity
masking, -mavx512bw is needed. Based on wikipedia this will only
disable this fast path on Knights Mill (Xeon Phi), in all other cases
VPOPCNTQ implies availability of BW.

Attached is an example of what I mean. I did not have a machine to
test it with, but the code generated looks sane. I added the clang
pragma because it insisted on unrolling otherwise and based on how the
instruction dependencies look that is probably not too helpful even
for large cases (needs to be tested). The configure check and compile
flags of course need to be amended for BW.

Regards,
Ants Aasma

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Statistics Import and Export
Следующее
От: Corey Huinker
Дата:
Сообщение: Re: Statistics Import and Export