Re: Popcount optimization using AVX512
From | Nathan Bossart |
---|---|
Subject | Re: Popcount optimization using AVX512 |
Date | |
Msg-id | 20240405153811.GA9352@nathanxps13 |
In reply to | Re: Popcount optimization using AVX512 (Nathan Bossart <nathandbossart@gmail.com>) |
Replies | Re: Popcount optimization using AVX512 |
List | pgsql-hackers |
On Fri, Apr 05, 2024 at 07:58:44AM -0500, Nathan Bossart wrote:
> On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote:
>> The main issue I saw was that clang was able to peel off the first
>> iteration of the loop and then eliminate the mask assignment and
>> replace masked load with a memory operand for vpopcnt. I was not able
>> to convince gcc to do that regardless of optimization options.
>> Generated code for the inner loop:
>>
>> clang:
>> <L2>:
>>   50: add rdx, 64
>>   54: cmp rdx, rdi
>>   57: jae <L1>
>>   59: vpopcntq zmm1, zmmword ptr [rdx]
>>   5f: vpaddq zmm0, zmm1, zmm0
>>   65: jmp <L2>
>>
>> gcc:
>> <L1>:
>>   38: kmovq k1, rdx
>>   3d: vmovdqu8 zmm0 {k1} {z}, zmmword ptr [rax]
>>   43: add rax, 64
>>   47: mov rdx, -1
>>   4e: vpopcntq zmm0, zmm0
>>   54: vpaddq zmm0, zmm0, zmm1
>>   5a: vmovdqa64 zmm1, zmm0
>>   60: cmp rax, rsi
>>   63: jb <L1>
>>
>> I'm not sure how much that matters in practice. Attached is a patch to
>> do this manually giving essentially the same result in gcc. As most
>> distro packages are built using gcc I think it would make sense to
>> have the extra code if it gives a noticeable benefit for large cases.
>
> Yeah, I did see this, but I also wasn't sure if it was worth further
> complicating the code. I can test with and without your fix and see if it
> makes any difference in the benchmarks.

This seems to provide a small performance boost, so I've incorporated it
into v27.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
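The loop peeling discussed in the quoted text can be illustrated with a portable scalar analogue. This is only a rough sketch and not the AVX-512 code from the patch: it works on 8-byte words instead of 64-byte vectors, and it handles the partial first and last chunks byte by byte instead of using masked loads. The names popcount_peeled and popcount_bytewise are hypothetical, and the code assumes a GCC- or clang-compatible compiler for __builtin_popcount and __builtin_popcountll. The point it shows is the same: once the partial first chunk is handled outside the loop, the loop body no longer needs any masking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference implementation: count set bits one byte at a time. */
static uint64_t
popcount_bytewise(const unsigned char *buf, size_t len)
{
    uint64_t count = 0;

    for (size_t i = 0; i < len; i++)
        count += (uint64_t) __builtin_popcount(buf[i]);
    return count;
}

/*
 * Hypothetical scalar analogue of the peeled loop.  The partial chunk
 * before the first 8-byte boundary is handled once, up front, so the
 * steady-state loop only ever sees full chunks.
 */
static uint64_t
popcount_peeled(const unsigned char *buf, size_t len)
{
    const unsigned char *end = buf + len;
    uint64_t count = 0;
    size_t head;

    assert(buf != NULL);

    /* Peeled first step: bytes up to the next 8-byte boundary. */
    head = (size_t) (-(uintptr_t) buf & 7);
    if (head > len)
        head = len;
    count += popcount_bytewise(buf, head);
    buf += head;

    /* Steady state: full 8-byte words, no per-iteration masking. */
    while ((size_t) (end - buf) >= sizeof(uint64_t))
    {
        uint64_t word;

        memcpy(&word, buf, sizeof(word));   /* alignment-safe load */
        count += (uint64_t) __builtin_popcountll(word);
        buf += sizeof(word);
    }

    /* Remaining tail bytes, if any. */
    count += popcount_bytewise(buf, (size_t) (end - buf));
    return count;
}
```

Calling popcount_peeled(buf, len) should return the same value as popcount_bytewise(buf, len) for any starting offset and length. Whether the peeled form is actually faster depends on the compiler and the input size, which is the question the benchmarks mentioned above were meant to answer.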
Attachments