Re: Improve CRC32C performance on SSE4.2
From | John Naylor |
---|---|
Subject | Re: Improve CRC32C performance on SSE4.2 |
Date | |
Msg-id | CANWCAZYXk9xWrjEPNsm1oGYZ4CHBSVonn4XcCfWnusg+s_wiEA@mail.gmail.com |
In response to | RE: Improve CRC32C performance on SSE4.2 ("Devulapalli, Raghuveer" <raghuveer.devulapalli@intel.com>) |
List | pgsql-hackers |
On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer <raghuveer.devulapalli@intel.com> wrote:

> I ran the same benchmark drive_crc32c with the postgres infrastructure and found that your v2 sse42 version from corsix is slower than pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.

That matches my findings as well.

> I think the reason is that postgres is not using -O3 flag to build the crc32c source files and the compiler generates less than optimal code. Adding that flag fixes the regression for buffers with 64 bytes – 128 bytes. Could you confirm that behavior on your end too?

On my machine that still regresses compared to master in that range (although by not as much) so I still think 128 bytes is the right threshold.

The effect of -O3 with gcc14.2 is that the single-block loop (after the 4-block loop) is unrolled. Unrolling adds branches and binary space, so it'd be nice to avoid that even for systems that build with -O3. I tried leaving out the single block loop, so that the tail is called for the remaining <63 bytes, and it's actually better:

v2:
128 latency average = 10.256 ms
144 latency average = 11.393 ms
160 latency average = 12.977 ms
176 latency average = 14.364 ms
192 latency average = 12.627 ms

remove single-block loop:
128 latency average = 10.211 ms
144 latency average = 10.987 ms
160 latency average = 12.094 ms
176 latency average = 12.927 ms
192 latency average = 12.466 ms

Keeping the extra loop is probably better at this benchmark on newer machines, but I don't think it's worth it. Microbenchmarks with fixed sized input don't take branch mispredicts into account.

> > I did the benchmarks on my older machine, which I believe has a latency of 7 cycles for this instruction.
>
> May I ask which processor does your older machine have? I am benchmarking on a Tigerlake processor.

It's an i7-7500U / Kaby Lake.
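For those following along, the control flow being discussed (a 128-byte threshold, a loop over four 16-byte blocks, and no single-block loop afterwards) can be sketched roughly as below. This is only an illustration of the structure, not the patch itself: the names `crc32c_sketch`, `crc32c_fold64` and `crc32c_tail` are hypothetical, and `crc32c_fold64` is a placeholder that reuses a portable bitwise CRC-32C update instead of doing any pclmul folding.

```c
#include <stddef.h>
#include <stdint.h>

#define FOLD_THRESHOLD 128		/* cutoff discussed in this thread */
#define FOLD_STRIDE     64		/* four 16-byte blocks per iteration */

/*
 * Portable bitwise CRC-32C update (reflected polynomial 0x82F63B78).
 * Stands in for the existing crc32-instruction loop used for short
 * inputs and for the tail.
 */
static uint32_t
crc32c_tail(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len--)
	{
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/*
 * Hypothetical placeholder for processing one 64-byte stride.  A real
 * implementation would fold four 16-byte blocks with pclmul; here we
 * simply reuse the bitwise update so the sketch stays portable.
 */
static uint32_t
crc32c_fold64(uint32_t crc, const unsigned char *p)
{
	return crc32c_tail(crc, p, FOLD_STRIDE);
}

uint32_t
crc32c_sketch(uint32_t crc, const void *data, size_t len)
{
	const unsigned char *p = data;

	/* Inputs below the threshold never enter the folding path. */
	if (len >= FOLD_THRESHOLD)
	{
		while (len >= FOLD_STRIDE)
		{
			crc = crc32c_fold64(crc, p);
			p += FOLD_STRIDE;
			len -= FOLD_STRIDE;
		}
	}

	/*
	 * No single-block (16-byte) loop here: whatever is left, which is
	 * fewer than 64 bytes on the folding path, goes straight to the tail.
	 */
	return crc32c_tail(crc, p, len);
}
```

With the usual CRC-32C convention (initial value 0xFFFFFFFF, final XOR with 0xFFFFFFFF), a caller would compute `crc32c_sketch(0xFFFFFFFF, buf, len) ^ 0xFFFFFFFF`. The point of the shape is that there is one fewer loop for the compiler to unroll under -O3.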
> > It's probably okay to fold these together in the same compile-time
> > check, since both are fairly old by now, but for those following
> > along, pclmul is not in SSE 4.2 and is a bit newer. So this would
> > cause machines building on Nehalem (2008) to fail the check and go
> > back to slicing-by-8 with it written this way.
>
> Technically, the current version of the patch does not have a runtime cpuid check for pclmul and so would cause it to crash with SIGILL on Nehalem (currently we only check for sse4.2). This needs to be fixed by adding an additional cpuid check for pclmul but it would fall back to slicing by 8 on Nehalem and use the latest version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the right pointer accordingly. Let me know which one you would prefer.

Okay, Nehalem is 17 years old, and the additional cpuid check would still work on hardware 14-15 years old, so I think it's fine to bump the requirement for runtime hardware support.

--
John Naylor
Amazon Web Services
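As a rough illustration of the runtime check discussed above, the selection could look something like the following. This is a sketch under assumptions, not the code in the patch: the enum and the function names `crc32c_impl_from_ecx` and `crc32c_detect_impl` are hypothetical, and it relies on `__get_cpuid()` from `<cpuid.h>` as shipped with GCC and Clang. The bit positions are those of CPUID leaf 1, ECX: bit 1 for PCLMULQDQ and bit 20 for SSE4.2.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>				/* GCC/Clang helper for the cpuid instruction */
#endif

/* Feature bits in ECX of CPUID leaf 1 */
#define CPUID1_ECX_PCLMULQDQ	(1u << 1)
#define CPUID1_ECX_SSE42		(1u << 20)

/* Hypothetical names for the two code paths discussed in this thread */
typedef enum
{
	CRC32C_SLICING_BY_8,		/* portable fallback */
	CRC32C_SSE42_PCLMUL			/* needs both SSE4.2 and pclmul */
} crc32c_impl;

/*
 * Pure selection logic: require both feature bits, otherwise fall back.
 * A CPU reporting SSE4.2 without pclmul (Nehalem) gets slicing-by-8.
 */
crc32c_impl
crc32c_impl_from_ecx(uint32_t ecx)
{
	const uint32_t need = CPUID1_ECX_SSE42 | CPUID1_ECX_PCLMULQDQ;

	return ((ecx & need) == need) ? CRC32C_SSE42_PCLMUL : CRC32C_SLICING_BY_8;
}

/* Query the running CPU; non-x86 builds always take the fallback. */
crc32c_impl
crc32c_detect_impl(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 if the requested leaf is unsupported */
	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return crc32c_impl_from_ecx(ecx);
#endif
	return CRC32C_SLICING_BY_8;
}
```

A choose function would call `crc32c_detect_impl()` once and store the matching function pointer. Keeping Nehalem on the SSE4.2-only path would instead need a third enum value and a separate test of the SSE4.2 bit alone.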