Re: Popcount optimization for the slow-path lookups
| От | Nathan Bossart |
|---|---|
| Тема | Re: Popcount optimization for the slow-path lookups |
| Дата | |
| Msg-id | aTL8x7hylAkptvBi@nathan обсуждение исходный текст |
| Ответ на | Popcount optimization for the slow-path lookups (Andrew Pogrebnoi <andrew.pogrebnoi@percona.com>) |
| Ответы |
Re: Popcount optimization for the slow-path lookups
|
| Список | pgsql-hackers |
On Fri, Dec 05, 2025 at 03:07:07PM +0200, Andrew Pogrebnoi wrote: > I want to propose an optimization for pg_popcount32_slow() and > pg_popcount64_slow() where lookups into pg_number_of_ones[] are made > branchless. It shows speedup around 58% for uint64 and 35% for uint32 words > compared to the current, looped version. This is on x86. It is much more > significant on Arm64 (Apple M1): ~x2.6 for uint64 and ~x2.25 for uint32 > words. > > I probably have to guard the lookup in pg_popcount64_slow() with `#if > SIZEOF_LONG == 8` like it's done for __builtin_popcountl() in the same > function. What do you think? > > For benchmarks, I used functions with the copies of the lookup loop from > pg_popcount32/64_slow(), measuring them against branchless counterparts. > Then in C++ I pre generated two static vectors with random uint64's and > uint32's, each of 5M size and ran lookups using google/benchmark [0]. I've > run it on M1 and x86 Macs. I don't think the proposed improvements are relevant for either of the machines you used for your benchmarks. For x86, we've optimized our popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it to use Neon or SVE. And for other systems, we still try to use __builtin_popcount() and friends in the fallback paths, which IIUC are available on both gcc and clang (and maybe elsewhere). IMHO we need to run the benchmarks on a compiler/architecture combination where it would actually be used in practice. -- nathan
В списке pgsql-hackers по дате отправления: