Re: add AVX2 support to simd.h

From: Ants Aasma
Subject: Re: add AVX2 support to simd.h
Date:
Msg-id: CANwKhkNygeEMkme6x_vs6+vY_+Rqjd=eZpfMe_0aA=UwMdPxFw@mail.gmail.com
In reply to: Re: add AVX2 support to simd.h  (Nathan Bossart <nathandbossart@gmail.com>)
List: pgsql-hackers
On Tue, 9 Jan 2024 at 18:20, Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Tue, Jan 09, 2024 at 09:20:09AM +0700, John Naylor wrote:
> > On Tue, Jan 9, 2024 at 12:37 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >>
> >> > I suspect that there could be a regression lurking for some inputs
> >> > that the benchmark doesn't look at: pg_lfind32() currently needs to be
> >> > able to read 4 vector registers worth of elements before taking the
> >> > fast path. There is then a tail of up to 15 elements that are now
> >> > checked one-by-one, but AVX2 would increase that to 31. That's getting
> >> > big enough to be noticeable, I suspect. It would be good to understand
> >> > that case (n*32 + 31), because it may also be relevant now. It's also
> >> > easy to improve for SSE2/NEON for v17.
> >>
> >> Good idea.  If it is indeed noticeable, we might be able to "fix" it by
> >> processing some of the tail with shorter vectors.  But that probably means
> >> finding a way to support multiple vector sizes on the same build, which
> >> would require some work.
> >
> > What I had in mind was an overlapping pattern I've seen in various
> > places: do one iteration at the beginning, then subtract the
> > aligned-down length from the end and do all those iterations. And
> > one-by-one is only used if the total length is small.
>
> Sorry, I'm not sure I understood this.  Do you mean processing the first
> several elements individually or with SSE2 until the number of remaining
> elements can be processed with just the AVX2 instructions (a bit like how
> pg_comp_crc32c_armv8() is structured for memory alignment)?

For some operations (min, max, = ANY) processing the same elements
multiple times doesn't change the result, so the vectors for the first
and/or last iterations can overlap with the main loop. In other cases
it's possible to mask out the invalid elements and replace them with
zeroes. Something along the lines of:

static inline Vector8
vector8_mask_right(int num_valid)
{
    /* seq holds the lane indexes 0..31 (set_epi8 lists bytes high to low) */
    __m256i seq = _mm256_set_epi8(31, 30, 29, 28, 27, 26, 25, 24,
                                  23, 22, 21, 20, 19, 18, 17, 16,
                                  15, 14, 13, 12, 11, 10, 9, 8,
                                  7, 6, 5, 4, 3, 2, 1, 0);

    /* all-ones in the rightmost num_valid lanes, zeroes elsewhere */
    return _mm256_cmpgt_epi8(seq, _mm256_set1_epi8(31 - num_valid));
}

/* final incomplete iteration, overlapping the last full vector */
Vector8 mask = vector8_mask_right(end - cur);
Vector8 final_vec;

vector8_load(&final_vec, end - sizeof(Vector8));
accum = vector8_add(accum, vector8_and(final_vec, mask));
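
For the operations where reprocessing is harmless the mask isn't
needed at all; the final iteration can just re-read the last full
vector's worth of bytes. A sketch of that for an "= ANY" style search
(key here standing in for the byte being searched for), assuming the
main loop has already consumed all full vectors and the total length
is at least sizeof(Vector8):

/*
 * Overlapping final iteration: re-examines up to sizeof(Vector8) - 1
 * elements that the main loop already covered, which is fine because
 * a match is a match no matter how many times we see it.
 */
if (cur < end)
{
    Vector8 chunk;

    vector8_load(&chunk, end - sizeof(Vector8));
    if (vector8_has(chunk, key))
        return true;
}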

It helps that on any halfway recent x86, unaligned loads have only a
minor performance penalty, and only when straddling cache line
boundaries. I'm not sure what the state of this is on ARM. If we don't
care about unaligned loads, then we only need to make sure the load
doesn't cross a page boundary, which could cause a segfault. Though
I'm sure memory sanitizer tools will have plenty to complain about
around such hacks.
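
For illustration, the page crossing check could be as simple as this
(hypothetical helper, assuming pages are at least 4kB):

/*
 * Returns true if a full Vector8 load starting at "p" stays within one
 * 4kB page and therefore can't fault, even when it reads some bytes
 * past the end of the buffer.
 */
static inline bool
vector8_load_is_page_safe(const uint8 *p)
{
    return ((uintptr_t) p & 4095) <= 4096 - sizeof(Vector8);
}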


