Обсуждение: refactor architecture-specific popcount code
Right now, the organization of this code is weird. All AArch64-specific implementations live in an AArch64-specific file, the AVX-512 implementations live in their own file, and the architecture-agnostic and SSE4.2 implementations live in pg_bitutils.c. The attached patches move the SSE4.2 implementations to the AVX-512 file (which is renamed appropriately), and they update some function names to be more descriptive, i.e., "fast" is replaced with "sse42" and "slow" is replaced with "generic". I probably should've done this a while ago... -- nathan
Вложения
On Thu, Jan 15, 2026 at 3:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > Right now, the organization of this code is weird. All AArch64-specific > implementations live in an AArch64-specific file, the AVX-512 > implementations live in their own file, and the architecture-agnostic and > SSE4.2 implementations live in pg_bitutils.c. The attached patches move > the SSE4.2 implementations to the AVX-512 file (which is renamed > appropriately), and they update some function names to be more descriptive, > i.e., "fast" is replaced with "sse42" and "slow" is replaced with > "generic". Thanks for taking on some technical debt! 0001 --- a/src/port/pg_popcount_avx512.c +++ b/src/port/pg_popcount_x86_64.c Can we get away with just "x86" for brevity? We generally don't target 32-bit CPUs for this kind of work, so there's no chance of confusion. 0002 ``` -#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK +#include "port/pg_bitutils.h" + +#ifdef TRY_POPCNT_X86_64 #if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) #include <cpuid.h> #endif ``` With the above in the x86 .c file, I wonder we can get rid of this stanza and the "try" symbol and gate only on HAVE_X86_64_POPCNTQ: #ifdef HAVE_X86_64_POPCNTQ #if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID) #define TRY_POPCNT_X86_64 1 #endif #endif If we have to be cautious, we could just turn the #error on no CPUID symbol into "return false". 0003 s/fast/sse42/: Seems okay in this file, but this isn't the best name, either. Maybe a comment to head off future "corrections", something like: "Technically, POPCNT is not part of SSE 4.2, and is not even a vector operation, but many compilers emit the popcnt instruction with -msse4.2 anyway." s/slow/generic/: I'm ambivalent about this. The "slow" designation is flat-out wrong since at least Power and aarch64 can emit a single instruction here without prodding the compiler. On the other hand, "generic" seems wrong too, since e.g. pg_popcount64_slow() has three configure symbols and two compiler builtins. :-D A possible future project would be to have a truly generic simple fallback in pure C and put all the fancy stuff in the header for architectures that have unconditional hardware support. It would make more sense to revisit the name then. -- John Naylor Amazon Web Services
On 15/01/2026 11:07, John Naylor wrote: > 0003 > > s/fast/sse42/: > > Seems okay in this file, but this isn't the best name, either. Maybe a > comment to head off future "corrections", something like: > "Technically, POPCNT is not part of SSE 4.2, and is not even a vector > operation, but many compilers emit the popcnt instruction with > -msse4.2 anyway." > > s/slow/generic/: > > I'm ambivalent about this. The "slow" designation is flat-out wrong > since at least Power and aarch64 can emit a single instruction here > without prodding the compiler. On the other hand, "generic" seems > wrong too, since e.g. pg_popcount64_slow() has three configure symbols > and two compiler builtins. :-D "fallback", or "portable" ? > A possible future project would be to have a truly generic simple > fallback in pure C and put all the fancy stuff in the header for > architectures that have unconditional hardware support. It would make > more sense to revisit the name then. Yeah, I noticed that on x86_64, pg_popcount_optimized is always a function pointer with runtime check, even if you use compiler flags to target a CPU where the special instructions are available unconditionally. - Heikki
On Thu, Jan 15, 2026 at 04:07:51PM +0700, John Naylor wrote:
> Thanks for taking on some technical debt!
Thanks for reviewing.
> --- a/src/port/pg_popcount_avx512.c
> +++ b/src/port/pg_popcount_x86_64.c
>
> Can we get away with just "x86" for brevity? We generally don't target
> 32-bit CPUs for this kind of work, so there's no chance of confusion.
WFM.
> ```
> -#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
> +#include "port/pg_bitutils.h"
> +
> +#ifdef TRY_POPCNT_X86_64
>
> #if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
> #include <cpuid.h>
> #endif
> ```
>
> With the above in the x86 .c file, I wonder we can get rid of this
> stanza and the "try" symbol and gate only on HAVE_X86_64_POPCNTQ:
>
> #ifdef HAVE_X86_64_POPCNTQ
> #if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
> #define TRY_POPCNT_X86_64 1
> #endif
> #endif
>
> If we have to be cautious, we could just turn the #error on no CPUID
> symbol into "return false".
Yeah, the CPUID macro checks do seem overly cautious to me, especially
since we've just #error'd when the CPUID intrinsics are missing in
pg_crc32c_sse42_choose.c since 2015. That seems to suggest that nobody is
trying to build Postgres with a compiler that knows about SSE4.2/POPCNT but
not CPUID. For reference, CPUID was introduced in 1993.
I bet we could also convert this bit into a configuration-time check:
#if defined(_MSC_VER) && defined(_M_AMD64)
#define HAVE_X86_64_POPCNTQ
#endif
> s/fast/sse42/:
>
> Seems okay in this file, but this isn't the best name, either. Maybe a
> comment to head off future "corrections", something like:
> "Technically, POPCNT is not part of SSE 4.2, and is not even a vector
> operation, but many compilers emit the popcnt instruction with
> -msse4.2 anyway."
Makes sense.
--
nathan
On Thu, Jan 15, 2026 at 11:42:14AM +0200, Heikki Linnakangas wrote: > On 15/01/2026 11:07, John Naylor wrote: >> s/slow/generic/: >> >> I'm ambivalent about this. The "slow" designation is flat-out wrong >> since at least Power and aarch64 can emit a single instruction here >> without prodding the compiler. On the other hand, "generic" seems >> wrong too, since e.g. pg_popcount64_slow() has three configure symbols >> and two compiler builtins. :-D > > "fallback", or "portable" ? I've no strong opinions, but "portable" seems reasonable to me. > Yeah, I noticed that on x86_64, pg_popcount_optimized is always a function > pointer with runtime check, even if you use compiler flags to target a CPU > where the special instructions are available unconditionally. I wonder how close we are to being able to just require SSE4.2/POPCNT for x86-64 builds. I suppose there's always a chance that someone will try to run Postgres 19 on a CPU from the aughts... In any case, avoiding the function pointer when possible seems like a good follow-up. -- nathan
On 15/01/2026 18:08, Nathan Bossart wrote: > On Thu, Jan 15, 2026 at 11:42:14AM +0200, Heikki Linnakangas wrote: >> Yeah, I noticed that on x86_64, pg_popcount_optimized is always a function >> pointer with runtime check, even if you use compiler flags to target a CPU >> where the special instructions are available unconditionally. > > I wonder how close we are to being able to just require SSE4.2/POPCNT for > x86-64 builds. I suppose there's always a chance that someone will try to > run Postgres 19 on a CPU from the aughts... In any case, avoiding the > function pointer when possible seems like a good follow-up. It's not really our decision. Packagers choose which architecture to target and which compiler options to build with. We ought to just respect those choices. - Heikki
Here is a new patch set. Notably, I've added a 0004 that does the
following:
* Removes TRY_POPCNT_X86_64. We now assume that the required CPUID
intrinsics are available, as we have long done in some of the CRC-32C code.
* Moves the MSVC check for HAVE_X86_64_POPCNTQ to configuration-time. This
way, we set it for all relevant platforms in one place.
* Moves the #defines for USE_SSE2 and USE_NEON to c.h so that they can be
used elsewhere without simd.h. Consequently, we can remove POPCNT_AARCH64.
* Moves the #includes for pg_bitutils.h to below the system headers in
pg_popcount_{aarch64,x86}.c (since we no longer depend on macros defined in
pg_bitutils.h).
--
nathan
Вложения
On Fri, Jan 16, 2026 at 2:07 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Here is a new patch set. Notably, I've added a 0004 that does the
> following:
>
> * Removes TRY_POPCNT_X86_64. We now assume that the required CPUID
> intrinsics are available, as we have long done in some of the CRC-32C code.
>
> * Moves the MSVC check for HAVE_X86_64_POPCNTQ to configuration-time. This
> way, we set it for all relevant platforms in one place.
LGTM.
> * Moves the #defines for USE_SSE2 and USE_NEON to c.h so that they can be
> used elsewhere without simd.h. Consequently, we can remove POPCNT_AARCH64.
Seems reasonable.
> * Moves the #includes for pg_bitutils.h to below the system headers in
> pg_popcount_{aarch64,x86}.c (since we no longer depend on macros defined in
> pg_bitutils.h).
Good.
v2-0003
+ * XXX Technically, POPCNT is not part of SSE4.2, and isn't even a vector
+ * operation, but in practice this is close enough, and "sse42" seems easier to
+ * follow than "popcnt" for these names.
Quibble -- XXX is a bit loud for a side note.
On Thu, Jan 15, 2026 at 11:08 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Thu, Jan 15, 2026 at 11:42:14AM +0200, Heikki Linnakangas wrote:
> > "fallback", or "portable" ?
>
> I've no strong opinions, but "portable" seems reasonable to me.
I have a mild preference for "fallback" since it's a noun and we
already have comments in src/(include/)port referring to fallbacks. No
objection to "portable", though.
--
John Naylor
Amazon Web Services
Committed, thanks for reviewing. -- nathan
On Thu, Jan 22, 2026 at 3:23 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > Committed, thanks for reviewing. Sure. Now that that's in place, I wanted to brainstorm more refactoring/rationalization ideas that seem on-topic for the thread but have less clear payoff: 1) Nowadays, the only global call sites of the word-sized functions are select_best_grantor() and in bitmapsets. The latter calls the word-sized functions in a loop (could be just one word). It may be more efficient to calculate the size in bytes and call pg_popcount(). Then we could get rid of all the pointer indirection for the word-sized functions. 2) The x86 byte buffer variants expend a lot of effort to detect whether the buffer is aligned on both 64- and 32-bit platforms, with an optimized path for each. At least 64-bit doesn't care about alignment, and 32-bit doesn't warrant anything fancier than pure C. Simultaneously, the aarch64 equivalent doesn't seem to take care about alignment. (I think Nathan mentioned he didn't see a difference during testing, but I wonder how universal that is). 3) There is repeated code for the <8 bytes case, and the tail of the "optimized" functions. I'm also not sure why the small case is inlined everywhere. -- John Naylor Amazon Web Services
On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote: > 1) Nowadays, the only global call sites of the word-sized functions > are select_best_grantor() and in bitmapsets. The latter calls the > word-sized functions in a loop (could be just one word). It may be > more efficient to calculate the size in bytes and call pg_popcount(). Yeah, these seem like obvious places to use pg_popcount(). Note that bms_member_index() does a final popcount on a masked version of the last word. We could swap that with pg_popcount(), too, but it might be slower than just calling the word-sized function. However, it could be hard to tell the difference, as we'd be trading a function or function pointer call with an inlined loop over pg_number_of_ones. And even if it is slower, I'm not sure it matters all that much in the grand scheme of things. In any case, 0001 gets the easy ones out of the way. > Then we could get rid of all the pointer indirection for the > word-sized functions. Do you mean that we'd just keep the portable ones around? I see some code in pgvector that might be negatively impacted by that, but if I understand correctly it would require an unusual setup. > 2) The x86 byte buffer variants expend a lot of effort to detect > whether the buffer is aligned on both 64- and 32-bit platforms, with > an optimized path for each. At least 64-bit doesn't care about > alignment, and 32-bit doesn't warrant anything fancier than pure C. > Simultaneously, the aarch64 equivalent doesn't seem to take care about > alignment. (I think Nathan mentioned he didn't see a difference during > testing, but I wonder how universal that is). 0002 makes these changes for pg_popcount_sse42() and pg_popcount_masked_sse42(). It does seem strange to prefer a loop over pg_number_of_ones instead of using POPCNTQ when unaligned, but perhaps it's worth testing. I do recall the alignment stuff in the AVX-512 code showing benefits in tests because it avoids double-load overhead, so I think we should keep that for now. > 3) There is repeated code for the <8 bytes case, and the tail of the > "optimized" functions. I'm also not sure why the small case is inlined > everywhere. This is intended to help avoid function call and SIMD instruction overhead when it doesn't make sense to take it. I recall this showing a rather big difference in benchmarks when we were working on the AVX-512 versions. Regarding the duplicated code, sure, we could add some static inline functions or something. I think the only reason I haven't done so is because it's ~2 lines of code. -- nathan
Вложения
On Thu, Jan 22, 2026 at 11:50:38AM -0600, Nathan Bossart wrote: > On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote: >> 1) Nowadays, the only global call sites of the word-sized functions >> are select_best_grantor() and in bitmapsets. The latter calls the >> word-sized functions in a loop (could be just one word). It may be >> more efficient to calculate the size in bytes and call pg_popcount(). > > Yeah, these seem like obvious places to use pg_popcount(). Note that > bms_member_index() does a final popcount on a masked version of the last > word. We could swap that with pg_popcount(), too, but it might be slower > than just calling the word-sized function. However, it could be hard to > tell the difference, as we'd be trading a function or function pointer call > with an inlined loop over pg_number_of_ones. And even if it is slower, I'm > not sure it matters all that much in the grand scheme of things. I added a 0003 that swaps that final popcount with pg_popcount(). >> Then we could get rid of all the pointer indirection for the >> word-sized functions. > > Do you mean that we'd just keep the portable ones around? I see some code > in pgvector that might be negatively impacted by that, but if I understand > correctly it would require an unusual setup. I added a 0004 that removes the architecture-specific word-sized functions. -- nathan
Вложения
On Mon, Jan 26, 2026 at 10:41 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Thu, Jan 22, 2026 at 11:50:38AM -0600, Nathan Bossart wrote: > > On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote: > >> 1) Nowadays, the only global call sites of the word-sized functions > >> are select_best_grantor() and in bitmapsets. The latter calls the > >> word-sized functions in a loop (could be just one word). It may be > >> more efficient to calculate the size in bytes and call pg_popcount(). > > > > Yeah, these seem like obvious places to use pg_popcount(). Note that > > bms_member_index() does a final popcount on a masked version of the last > > word. We could swap that with pg_popcount(), too, but it might be slower > > than just calling the word-sized function. However, it could be hard to > > tell the difference, as we'd be trading a function or function pointer call > > with an inlined loop over pg_number_of_ones. And even if it is slower, I'm > > not sure it matters all that much in the grand scheme of things. > > I added a 0003 that swaps that final popcount with pg_popcount(). I'm not sure either if this part matters much, but it makes more sense to me to continue using single word functions for that last part. Since they have very few call sites anymore, we can make them inline without bloating the binary on x86. > >> Then we could get rid of all the pointer indirection for the > >> word-sized functions. > > > > Do you mean that we'd just keep the portable ones around? I see some code > > in pgvector that might be negatively impacted by that, but if I understand > > correctly it would require an unusual setup. > > I added a 0004 that removes the architecture-specific word-sized functions. Right, just the portable ones. Here, too, inlining them everywhere would mitigate any impact. Further, -int -pg_popcount64(uint64 word) +static inline int +pg_popcount64_neon(uint64 word) ...if they were inlined from the header, I think we wouldn't need this separate neon function in this file at all. Currently, we rely on __builtin_popcountl for the portable function outside this file. We could either keep using that or switch to neon if there's a portability difference. 0001, 2, and 4 look good to me otherwise. Looking at commit a8c09daa8bb1, there are some planner benchmarks in the discussion thread that stress bitmapsets, but I'm not sure if they hit the paths affected by 0001 at all. -- John Naylor Amazon Web Services
On Thu, Jan 29, 2026 at 06:31:53PM +0700, John Naylor wrote: > On Mon, Jan 26, 2026 at 10:41 PM Nathan Bossart > <nathandbossart@gmail.com> wrote: >> I added a 0003 that swaps that final popcount with pg_popcount(). > > I'm not sure either if this part matters much, but it makes more sense > to me to continue using single word functions for that last part. > Since they have very few call sites anymore, we can make them inline > without bloating the binary on x86. Okay, I abandoned that patch. > Right, just the portable ones. Here, too, inlining them everywhere > would mitigate any impact. Done. > +static inline int > +pg_popcount64_neon(uint64 word) > > ...if they were inlined from the header, I think we wouldn't need this > separate neon function in this file at all. Currently, we rely on > __builtin_popcountl for the portable function outside this file. We > could either keep using that or switch to neon if there's a > portability difference. Done. -- nathan
Вложения
On Fri, Jan 30, 2026 at 12:06 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > [v5] 0001 - I'm pretty sure this is comparable to HEAD if the optimized function is pg_popcount_sse42(). Has the AVX512 version been tested with 8-byte inputs? That seems to have a lot of pre- and post-processing involved. The inline wrapper only bypasses for 7 or less bytes. 0002 - I tried running this on x86-64 with alignment sanitizer and no alarms went off during "make check", but adding pg_attribute_no_sanitize_alignment() would prevent surprises in the future. - I imagine that the old SIZEOF_VOID_P check is superfluous now, since the whole file is gated by HAVE_X86_64_POPCNTQ. - Maybe we can remove the aligned 32-bit path in pg_popcount_(masked_)portable(), since that's on-topic for this patch and would simplify things further. -- John Naylor Amazon Web Services
On Fri, Jan 30, 2026 at 03:22:45PM +0700, John Naylor wrote: > 0001 - I'm pretty sure this is comparable to HEAD if the optimized > function is pg_popcount_sse42(). Has the AVX512 version been tested > with 8-byte inputs? That seems to have a lot of pre- and > post-processing involved. The inline wrapper only bypasses for 7 or > less bytes. Here [0] is the latest perf data I see for the AVX-512 popcount patch, although that's comparing to v16, which IIRC lacks a few other inlining tricks. There's a chance the SSE4.2 version is faster at that particular length. I'm not sure we need to worry about that, but I can do a bit of testing if you'd like. > 0002 > - I tried running this on x86-64 with alignment sanitizer and no > alarms went off during "make check", but adding > pg_attribute_no_sanitize_alignment() would prevent surprises in the > future. Done. > - I imagine that the old SIZEOF_VOID_P check is superfluous now, since > the whole file is gated by HAVE_X86_64_POPCNTQ. I think you're right. There was some concern about this when I was first adding the SSE4.2-specific pg_popcount() [1], but all the configure-time checks for HAVE_X86_64_POPCNTQ are restricted to 64-bit x86, so I bet we could safely assume SIZEOF_VOID_P == 8 in that file. > - Maybe we can remove the aligned 32-bit path in > pg_popcount_(masked_)portable(), since that's on-topic for this patch > and would simplify things further. IMHO that's a reasonable thing for us to do. [0] https://postgr.es/m/20240404171828.GA3866970%40nathanxps13 [1] https://postgr.es/m/CAApHDvojPyh6dLKooqjXSZE%3D0Ed590Lq1BxF7WQ9knSggyuJEA%40mail.gmail.com -- nathan
Вложения
On Sat, Jan 31, 2026 at 4:33 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Fri, Jan 30, 2026 at 03:22:45PM +0700, John Naylor wrote: > > 0001 - I'm pretty sure this is comparable to HEAD if the optimized > > function is pg_popcount_sse42(). Has the AVX512 version been tested > > with 8-byte inputs? That seems to have a lot of pre- and > > post-processing involved. The inline wrapper only bypasses for 7 or > > less bytes. > > Here [0] is the latest perf data I see for the AVX-512 popcount patch, > although that's comparing to v16, which IIRC lacks a few other inlining > tricks. There's a chance the SSE4.2 version is faster at that particular > length. I'm not sure we need to worry about that, but I can do a bit of > testing if you'd like. It might be a good idea to do a little new testing, and I see a use for a special 8-byte path independent of AVX512: v6 seems to regress a little for single-words. But, it turns out that when gcc turns __builtin_popcountl into a single instruction, it's inline, but if it emits portable bitwise ops, it does so in a function called __popcountdi2(). That can be avoided by hand-coding in C for normal builds (and for 32-bit looks cleaner anyway), as in the attached 0005. My laptop here is really too old to make decisions that are micro-architecture dependent, but with that caveat, I dusted off the popcount benchmark and added a test for counting bitmapsets (v7-0004, applies on top of v6): select drive_bms_num_members(10000000, 1); master: 13.2 ticks per call v6: 15.3 v6+v7-0005 10.8 Again, take this with a grain of salt, but 0005 seems worth looking at. -- John Naylor Amazon Web Services
Вложения
On Mon, Feb 02, 2026 at 09:16:42PM +0700, John Naylor wrote: > It might be a good idea to do a little new testing, and I see a use > for a special 8-byte path independent of AVX512: v6 seems to regress a > little for single-words. But, it turns out that when gcc turns > __builtin_popcountl into a single instruction, it's inline, but if it > emits portable bitwise ops, it does so in a function called > __popcountdi2(). That can be avoided by hand-coding in C for normal > builds (and for 32-bit looks cleaner anyway), as in the attached 0005. Oh, interesting. I looked into this a little more [0]. Both gcc and clang generate cnt instructions for aarch64, so we're good there. However, clang on x86-64 generates the bit-twiddling version, and gcc on x86-64 generates a call to __popcountdi2() (which I imagine does something similar). It's not until you provide a compiler flag like -march=x86-64-v2 that gcc/clang start generating popcnt instructions for x86-64, which makes sense. 0005 seems like the correct move to me... [0] https://godbolt.org/z/he3WozG3E -- nathan
On Mon, Feb 2, 2026, at 5:51 PM, Nathan Bossart wrote: > On Mon, Feb 02, 2026 at 09:16:42PM +0700, John Naylor wrote: >> It might be a good idea to do a little new testing, and I see a use >> for a special 8-byte path independent of AVX512: v6 seems to regress a >> little for single-words. But, it turns out that when gcc turns >> __builtin_popcountl into a single instruction, it's inline, but if it >> emits portable bitwise ops, it does so in a function called >> __popcountdi2(). That can be avoided by hand-coding in C for normal >> builds (and for 32-bit looks cleaner anyway), as in the attached 0005. > > Oh, interesting. I looked into this a little more [0]. Both gcc and clang > generate cnt instructions for aarch64, so we're good there. However, clang > on x86-64 generates the bit-twiddling version, and gcc on x86-64 generates > a call to __popcountdi2() (which I imagine does something similar). It's > not until you provide a compiler flag like -march=x86-64-v2 that gcc/clang > start generating popcnt instructions for x86-64, which makes sense. 0005 > seems like the correct move to me... > > [0] https://godbolt.org/z/he3WozG3E > > -- > nathan Nathan, John, Thanks for the focus on this area of the code. I've been looking into what to do with popcnt when building Win11/ARM64/MSVC. I know that when _MSC_VER and _M_ARM64 are defined we can make use of the __popcnt(unsigned int) and __popcnt64(unsigned__int64) intrinsics which have been available since VS 2022 17.11+. I thought I'd check that combo outand it turns out that it is identical to clang/gcc on that platform [0]. I'll wait for your work to land before proposing a patch to add these unless it is really easy to fit it and you feel likegiving it a go. :) best. -greg [0] https://godbolt.org/z/TrxjzcPGE
On Tue, Feb 03, 2026 at 12:19:48PM -0500, Greg Burd wrote:
> Thanks for the focus on this area of the code. I've been looking into
> what to do with popcnt when building Win11/ARM64/MSVC. I know that when
> _MSC_VER and _M_ARM64 are defined we can make use of the
> __popcnt(unsigned int) and __popcnt64(unsigned __int64) intrinsics which
> have been available since VS 2022 17.11+. I thought I'd check that combo
> out and it turns out that it is identical to clang/gcc on that platform
> [0].
>
> I'll wait for your work to land before proposing a patch to add these
> unless it is really easy to fit it and you feel like giving it a go. :)
We should probably just add something like
#ifdef _MSC_VER
return __popcnt(word);
for the new inlined versions of pg_popcount{32,64}. We're already doing
that today for x86-64, and the AArch64-specific versions use intrinsics
that in theory compile to the same thing. Plus, popcnt is required for
Windows these days.
I'm working on polishing/benchmarking these patches at the moment, so I
will work this change in. Thanks!
--
nathan
On Feb 3 2026, at 12:41 pm, Nathan Bossart <nathandbossart@gmail.com> wrote:
> On Tue, Feb 03, 2026 at 12:19:48PM -0500, Greg Burd wrote:
>> Thanks for the focus on this area of the code. I've been looking into
>> what to do with popcnt when building Win11/ARM64/MSVC. I know that when
>> _MSC_VER and _M_ARM64 are defined we can make use of the
>> __popcnt(unsigned int) and __popcnt64(unsigned __int64) intrinsics which
>> have been available since VS 2022 17.11+. I thought I'd check that combo
>> out and it turns out that it is identical to clang/gcc on that platform
>> [0].
>>
>> I'll wait for your work to land before proposing a patch to add these
>> unless it is really easy to fit it and you feel like giving it a go. :)
>
> We should probably just add something like
>
> #ifdef _MSC_VER
> return __popcnt(word);
Right, it seemed to me after reviewing the patch to fit right in.
> for the new inlined versions of pg_popcount{32,64}. We're already doing
> that today for x86-64, and the AArch64-specific versions use intrinsics
> that in theory compile to the same thing. Plus, popcnt is required for
> Windows these days.
>
> I'm working on polishing/benchmarking these patches at the moment, so I
> will work this change in. Thanks!
Not a problem, thank you!
> --
> nathan
-greg
Okay, here is an updated patch set with first drafts of the commit
messages. I'm reasonably happy with these patches, but I'll admit my
justification for ripping out the 32-bit optimizations feels a bit flimsy.
I don't get the idea that we are all that concerned about things like
micro-regressions for popcount on 32-bit builds, but OTOH it isn't hard to
imagine someone objecting to these changes.
I ran the bms_num_members() benchmark on a couple of machines I had nearby:
apple-m3 (neon) intel-i5-13500T (sse4.2)
words HEAD v8 words HEAD v8
1 40 25 1 26 10
2 57 51 2 37 29
4 75 57 4 55 45
8 105 56 8 88 51
16 154 59 16 158 68
32 265 73 32 296 102
64 545 103 64 577 209
128 1027 178 128 1212 423
I was going to run it on machines with SVE/AVX-512, but John already tested
the AVX-512 case [0], and I have no reason to believe that we'll see
regressions on machines with SVE.
[0] https://postgr.es/m/CANWCAZbWLX%3DEDd1Bq-8oGK2ZLVNR4m4BkGe%3D288t2V5oLcqeZA%40mail.gmail.com
--
nathan
Вложения
Sorry for the noise. cfbot dislikes the lack of parentheses in pg_popcount32(). -- nathan
Вложения
On Wed, Feb 4, 2026 at 3:44 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Okay, here is an updated patch set with first drafts of the commit
> messages. I'm reasonably happy with these patches, but I'll admit my
> justification for ripping out the 32-bit optimizations feels a bit flimsy.
> I don't get the idea that we are all that concerned about things like
> micro-regressions for popcount on 32-bit builds, but OTOH it isn't hard to
> imagine someone objecting to these changes.
I actually believe the changes improve even on 32-bit, at least for
bitmapsets. The previous 32-bit optimizations are just mirrors of the
64-bit ones, which we found weren't always great to begin with, so I
don't buy that we're losing much here.
> I was going to run it on machines with SVE/AVX-512, but John already tested
> the AVX-512 case [0], and I have no reason to believe that we'll see
> regressions on machines with SVE.
Actually, that was the SSE4.2 case. Sorry if I wasn't clear.
For v9:
0001:
- /*
- * We set the threshold to the point at which we'll first use special
- * instructions in the optimized version.
- */
-#if SIZEOF_VOID_P >= 8
- int threshold = 8;
-#else
- int threshold = 4;
-#endif
-
- if (bytes < threshold)
+ if (bytes < 8)
It seems like we should still have some kind of comment here. Even
just to say the 8 value was found through testing. (It was some time
ago, right?)
0002:
+static inline int
+pg_popcount32(uint32 word)
+{
+ /*
+ * On x86, gcc generates a function call for this built-in unless the
+ * popcnt instruction is available, so we use the plain C version in that
+ * case to ensure inlining.
+ */
+#if defined(HAVE__BUILTIN_POPCOUNT) && (defined(__POPCNT__) ||
!defined(__x86_64__))
+ return __builtin_popcount(word);
It was intentional that my PoC pg_popcount32 was simple, pure C and
didn't have #ifdefs anymore, and I'd prefer to go back to that. This
function is now only used for single-word bitmapsets on 32-bit
platforms and in an assert, and even if that changes we've shown that
inlining bitwise ops for a single word is already pretty good for
performance. Plus, doesn't this cause gcc generate a function call on
32-bit x86 because "!defined(__x86_64__)" is true? That defeats the
whole purpose of inlining in the first place. Simple is good here.
+#elif defined(_MSC_VER)
+ return __popcnt64(word);
The commit message says "converts the portable ones to inlined
functions", but this was copied from a architecture specific file with
a runtime check. I've seen an assertion in this thread that the
hardware instruction is required for some Windows version, but it
would be nice to have a link to documentation for the archives. More
worryingly, this is almost certainly broken on 32-bit, and the
buildfarm won't tell us -- please see commit
53ea2b7ad050ce4ad95c89bb55197209b65886a1 and bug report that led to
it. Seems like material for a separate commit.
0003:
- for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
- cnt += bmw_popcount(n256->isset[i]);
+ cnt += pg_popcount((const char *) n256->isset,
+ RT_NODE_MAX_SLOTS / BITS_PER_BYTE);
This can now be "cnt =". The initialization to zero is now
unnecessary, but it's also harmless.
LGTM otherwise.
"perhaps-over-optimized" -- In more neutral language, they're better
optimized for longer inputs. Hence the need to shortcut for the common
case.
"This fast-path could arguably go in pg_popcount() itself (with an
appropriate alignment check), but that is left as a future exercise."
I wondered about that too, but that adds branches and code everywhere
it's called.
--
John Naylor
Amazon Web Services
On Wed, Feb 04, 2026 at 12:13:35PM +0700, John Naylor wrote:
> - /*
> - * We set the threshold to the point at which we'll first use special
> - * instructions in the optimized version.
> - */
> [...]
>
> It seems like we should still have some kind of comment here. Even
> just to say the 8 value was found through testing. (It was some time
> ago, right?)
Yeah, it needs a comment. I recall doing a lot of testing for small
inputs since those are what tended to regress.
> It was intentional that my PoC pg_popcount32 was simple, pure C and
> didn't have #ifdefs anymore, and I'd prefer to go back to that. This
> function is now only used for single-word bitmapsets on 32-bit
> platforms and in an assert, and even if that changes we've shown that
> inlining bitwise ops for a single word is already pretty good for
> performance. Plus, doesn't this cause gcc generate a function call on
> 32-bit x86 because "!defined(__x86_64__)" is true? That defeats the
> whole purpose of inlining in the first place. Simple is good here.
Agreed and done.
> +#elif defined(_MSC_VER)
> + return __popcnt64(word);
>
> The commit message says "converts the portable ones to inlined
> functions", but this was copied from a architecture specific file with
> a runtime check. I've seen an assertion in this thread that the
> hardware instruction is required for some Windows version, but it
> would be nice to have a link to documentation for the archives. More
> worryingly, this is almost certainly broken on 32-bit, and the
> buildfarm won't tell us -- please see commit
> 53ea2b7ad050ce4ad95c89bb55197209b65886a1 and bug report that led to
> it. Seems like material for a separate commit.
Sure. I'm tempted to suggest that we only use the plain C version here,
too. The SSE4.2 bms_num_members() test I did yesterday used it and showed
improvement at one word. If we do that, we can rip out even more code
since we no longer need the popcount built-ins.
* tests plain C version on an Apple M3 *
Yeah, the plain C version might be marginally slower than the built-in
version for that test, but it still seems quite a bit faster than HEAD.
HEAD v8 v10
40 25 29
We probably want to re-add the Neon version of pg_popcount64() so that
pg_popcount_neon() and pg_popcount_masked_neon() can use it, but that's
easy enough.
> - for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
> - cnt += bmw_popcount(n256->isset[i]);
> + cnt += pg_popcount((const char *) n256->isset,
> + RT_NODE_MAX_SLOTS / BITS_PER_BYTE);
>
> This can now be "cnt =". The initialization to zero is now
> unnecessary, but it's also harmless.
Fixed. The only reason I didn't make that change earlier was to keep the
patch tidy.
I haven't updated the commit messages yet. Once the code is ready to go,
I'll give those another try.
[0] https://postgr.es/m/aYJeGs-xz3NEEdSe%40nathan
--
nathan
Вложения
On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > Sure. I'm tempted to suggest that we only use the plain C version here, > too. The SSE4.2 bms_num_members() test I did yesterday used it and showed > improvement at one word. If we do that, we can rip out even more code > since we no longer need the popcount built-ins. Unlike the 32-bit case, people do run production on 64-bit platforms that are not Arm/x86, so that would require effort to see if the builtins are worth it for them. That seems like a separate effort. I can help with that, but let's get the tested stuff in first. > * tests plain C version on an Apple M3 * > > Yeah, the plain C version might be marginally slower than the built-in > version for that test, but it still seems quite a bit faster than HEAD. > > HEAD v8 v10 > 40 25 29 That's good to know, and maybe it'll be true elsewhere. -- John Naylor Amazon Web Services
On Thu, Feb 05, 2026 at 02:48:44PM +0700, John Naylor wrote: > On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote: >> Sure. I'm tempted to suggest that we only use the plain C version here, >> too. The SSE4.2 bms_num_members() test I did yesterday used it and showed >> improvement at one word. If we do that, we can rip out even more code >> since we no longer need the popcount built-ins. > > Unlike the 32-bit case, people do run production on 64-bit platforms > that are not Arm/x86, so that would require effort to see if the > builtins are worth it for them. That seems like a separate effort. I > can help with that, but let's get the tested stuff in first. Alright. I moved that to a new 0004 patch that we can consider separately once 0001-0003 have been committed. -- nathan
Вложения
On Fri, Feb 6, 2026 at 11:13 PM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Thu, Feb 05, 2026 at 02:48:44PM +0700, John Naylor wrote: > > On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > >> Sure. I'm tempted to suggest that we only use the plain C version here, > >> too. The SSE4.2 bms_num_members() test I did yesterday used it and showed > >> improvement at one word. If we do that, we can rip out even more code > >> since we no longer need the popcount built-ins. > > > > Unlike the 32-bit case, people do run production on 64-bit platforms > > that are not Arm/x86, so that would require effort to see if the > > builtins are worth it for them. That seems like a separate effort. I > > can help with that, but let's get the tested stuff in first. > > Alright. I moved that to a new 0004 patch that we can consider separately > once 0001-0003 have been committed. Okay, this is looking good. I have just one more suggestion: For 0002, just copy the word-wise functions verbatim. That way, it's a pure refactoring commit and the exception doesn't need explaining. With that, I'd say go ahead and commit 0001/2. Then after a bit more research, the final form of the inline functions can be visible in a single commit. I've tested S390X already and hope to test one other platform. -- John Naylor Amazon Web Services
On Sat, Feb 07, 2026 at 03:54:31PM +0700, John Naylor wrote: > Okay, this is looking good. I have just one more suggestion: For 0002, > just copy the word-wise functions verbatim. That way, it's a pure > refactoring commit and the exception doesn't need explaining. With > that, I'd say go ahead and commit 0001/2. Seems reasonable. Here is an updated patch set. I've also swapped 0003 and 0004. > Then after a bit more research, the final form of the inline functions > can be visible in a single commit. I've tested S390X already and hope > to test one other platform. Thanks. Looking forward to the results. -- nathan
Вложения
On Wed, Feb 11, 2026 at 1:45 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Sat, Feb 07, 2026 at 03:54:31PM +0700, John Naylor wrote:
> > Okay, this is looking good. I have just one more suggestion: For 0002,
> > just copy the word-wise functions verbatim. That way, it's a pure
> > refactoring commit and the exception doesn't need explaining. With
> > that, I'd say go ahead and commit 0001/2.
>
> Seems reasonable. Here is an updated patch set. I've also swapped 0003
> and 0004.
0002: I just noticed another thing that could be done in a separate
follow-up patch:
@@ -220,17 +170,6 @@ pg_popcount_masked_portable(const char *buf, int
bytes, bits8 mask)
* actual external functions. The compiler should be able to inline the
* portable versions here.
*/
-int
-pg_popcount32(uint32 word)
-{
- return pg_popcount32_portable(word);
-}
The last sentence doesn't seem to be true anymore for the functions
that remain here:
(Needed to add "#undef HAVE_X86_64_POPCNTQ" in a few places to demonstrate)
objdump --no-show-raw-insn --no-addresses -drwC -M intel
/path/to/postgres | grep -C 5 pg_popcount_portable
<pg_popcount_optimized>:
jmp <pg_popcount_portable>
I don't think we need to worry about this, but the attached looks
nicer to me. There might be a disadvantage of using macros here, but
I'm not sure what it would be.
--
John Naylor
Amazon Web Services
Вложения
On Wed, Feb 11, 2026 at 02:10:37PM +0700, John Naylor wrote: > I don't think we need to worry about this, but the attached looks > nicer to me. There might be a disadvantage of using macros here, but > I'm not sure what it would be. The only problem that I see with this is that a compiler that understands Neon intrinsics but not SVE ones will probably fail because we build external functions that call the Neon versions in that case. But that's easy enough to work around by adding an extra #elif defined(USE_NEON) section, and it still saves a handful of lines of code. Perhaps someday this situation will be rare enough that we can remove that hack and just point to the portable versions if the compiler doesn't understand both Neon and SVE, but I'm not sure we're there yet. -- nathan
Вложения
On Thu, Feb 12, 2026 at 12:39 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Wed, Feb 11, 2026 at 02:10:37PM +0700, John Naylor wrote: > > I don't think we need to worry about this, but the attached looks > > nicer to me. There might be a disadvantage of using macros here, but > > I'm not sure what it would be. > > The only problem that I see with this is that a compiler that understands > Neon intrinsics but not SVE ones will probably fail because we build > external functions that call the Neon versions in that case. Ah, right. > But that's > easy enough to work around by adding an extra #elif defined(USE_NEON) > section, and it still saves a handful of lines of code. Perhaps someday > this situation will be rare enough that we can remove that hack and just > point to the portable versions if the compiler doesn't understand both Neon > and SVE, but I'm not sure we're there yet. I have a better idea, but it depends on re-thinking feature detection and compile-time vs run-time checks, and that's not quite ready to share, so let's shelve this for now. Anyway, I think 0001 and 0002 are ready for commit. -- John Naylor Amazon Web Services
On Thu, Feb 12, 2026 at 11:58:37AM +0700, John Naylor wrote: > I have a better idea, but it depends on re-thinking feature detection > and compile-time vs run-time checks, and that's not quite ready to > share, so let's shelve this for now. Sounds good. > Anyway, I think 0001 and 0002 are ready for commit. Committed. I've attached rebased versions of the remaining patches under consideration. -- nathan