Re: Improve CRC32C performance on SSE4.2

From	John Naylor
Subject	Re: Improve CRC32C performance on SSE4.2
Date	11 February 10:02:11
Msg-id	CANWCAZYXk9xWrjEPNsm1oGYZ4CHBSVonn4XcCfWnusg+s_wiEA@mail.gmail.com
In response to	RE: Improve CRC32C performance on SSE4.2 ("Devulapalli, Raghuveer" <raghuveer.devulapalli@intel.com>)
List	pgsql-hackers

On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer
<raghuveer.devulapalli@intel.com> wrote:
> I ran the same benchmark drive_crc32c with the postgres infrastructure and found that your v2 sse42 version from
> corsix is slower than pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.

That matches my findings as well.

> I think the reason is that postgres is not using the -O3 flag to build the crc32c source files and the compiler generates
> less than optimal code. Adding that flag fixes the regression for buffers with 64 bytes – 128 bytes. Could you confirm
> that behavior on your end too?

On my machine that still regresses compared to master in that range
(although not by as much), so I still think 128 bytes is the right
threshold.

The effect of -O3 with gcc 14.2 is that the single-block loop (after
the 4-block loop) is unrolled. Unrolling adds branches and binary
space, so it'd be nice to avoid that even for systems that build with
-O3. I tried leaving out the single block loop, so that the tail is
called for the remaining <63 bytes, and it's actually better:

v2:

128
latency average = 10.256 ms
144
latency average = 11.393 ms
160
latency average = 12.977 ms
176
latency average = 14.364 ms
192
latency average = 12.627 ms

remove single-block loop:

128
latency average = 10.211 ms
144
latency average = 10.987 ms
160
latency average = 12.094 ms
176
latency average = 12.927 ms
192
latency average = 12.466 ms

Keeping the extra loop is probably better on this benchmark on newer
machines, but I don't think it's worth it. Microbenchmarks with
fixed-size input don't take branch mispredicts into account.
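
For those following along, the control flow being compared can be
sketched roughly as below. This is only an illustration, not the patch:
the function names are hypothetical, a "block" is assumed to be 16
bytes, and a bitwise CRC-32C stands in for both the folding step and
the hardware crc32 tail so that the sketch stays portable.

```c
#include <stddef.h>
#include <stdint.h>

#define BULK_THRESHOLD 128	/* from the thread: shorter inputs skip the bulk path */
#define BLOCK_BYTES 16		/* assumption: one "block" is a 16-byte vector */

/*
 * Bitwise CRC-32C (reflected polynomial 0x82F63B78). Stand-in for the
 * existing tail code; works on the raw CRC state, no init/final XOR.
 */
static uint32_t
crc32c_tail(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/*
 * Stand-in for one iteration of the 4-block loop (64 bytes). A real
 * implementation would fold with pclmul here instead.
 */
static uint32_t
crc32c_four_blocks(uint32_t crc, const unsigned char *p)
{
	return crc32c_tail(crc, p, 4 * BLOCK_BYTES);
}

uint32_t
crc32c_sketch(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	if (len >= BULK_THRESHOLD)
	{
		while (len >= 4 * BLOCK_BYTES)
		{
			crc = crc32c_four_blocks(crc, p);
			p += 4 * BLOCK_BYTES;
			len -= 4 * BLOCK_BYTES;
		}
		/* No single-block loop: at most 63 bytes remain for the tail. */
	}
	return crc32c_tail(crc, p, len);
}
```

With the single-block loop gone there is one fewer loop for the
compiler to unroll at -O3 and one fewer data-dependent branch between
the bulk loop and the tail.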

> > I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.
>
> May I ask which processor your older machine has? I am benchmarking on a Tigerlake processor.

It's an i7-7500U / Kaby Lake.

> > It's probably okay to fold these together in the same compile-time
> > check, since both are fairly old by now, but for those following
> > along, pclmul is not in SSE 4.2 and is a bit newer. So this would
> > cause machines building on Nehalem (2008) to fail the check and go
> > back to slicing-by-8 with it written this way.
>
> Technically, the current version of the patch does not have a runtime cpuid check for pclmul and so would cause it to
> crash with SIGILL on Nehalem (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid
> check for pclmul but it would fall back to slicing by 8 on Nehalem and use the latest version on Westmere and above. If
> you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right
> pointer accordingly. Let me know which one you would prefer.

Okay, Nehalem is 17 years old, and the additional cpuid check would
still work on hardware 14-15 years old, so I think it's fine to bump
the requirement for runtime hardware support.
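
As a rough sketch of what the combined runtime check could look like
(not the patch itself): the enum and function names below are
hypothetical, and __get_cpuid() is the GCC/Clang helper from <cpuid.h>,
so other compilers would need their own equivalent. The bit positions
are the CPUID leaf 1 ECX flags for PCLMULQDQ (bit 1) and SSE4.2
(bit 20).

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>				/* GCC/Clang wrapper around CPUID */
#endif

/* CPUID leaf 1, ECX feature bits */
#define ECX_PCLMULQDQ	(1u << 1)
#define ECX_SSE42		(1u << 20)

/* Hypothetical names for the two code paths discussed in the thread. */
typedef enum
{
	CRC32C_SLICING_BY_8,
	CRC32C_SSE42_PCLMUL
} crc32c_impl;

/*
 * Pure decision function: require both SSE4.2 and pclmul. A CPU with
 * SSE4.2 but no pclmul (e.g. Nehalem) falls back to slicing-by-8.
 */
crc32c_impl
choose_crc32c_impl(uint32_t ecx)
{
	if ((ecx & ECX_SSE42) && (ecx & ECX_PCLMULQDQ))
		return CRC32C_SSE42_PCLMUL;
	return CRC32C_SLICING_BY_8;
}

/* Query the CPU once at startup; non-x86 builds always fall back. */
crc32c_impl
detect_crc32c_impl(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return choose_crc32c_impl(ecx);
#endif
	return CRC32C_SLICING_BY_8;
}
```

Keeping the decision separate from the CPUID query makes the fallback
behavior easy to test with synthetic flag values.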

--
John Naylor
Amazon Web Services
