Обсуждение: refactor architecture-specific popcount code

Поиск
Список
Период
Сортировка

refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
Right now, the organization of this code is weird.  All AArch64-specific
implementations live in an AArch64-specific file, the AVX-512
implementations live in their own file, and the architecture-agnostic and
SSE4.2 implementations live in pg_bitutils.c.  The attached patches move
the SSE4.2 implementations to the AVX-512 file (which is renamed
appropriately), and they update some function names to be more descriptive,
i.e., "fast" is replaced with "sse42" and "slow" is replaced with
"generic".

I probably should've done this a while ago...

-- 
nathan

Вложения

Re: refactor architecture-specific popcount code

От
John Naylor
Дата:
On Thu, Jan 15, 2026 at 3:40 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Right now, the organization of this code is weird.  All AArch64-specific
> implementations live in an AArch64-specific file, the AVX-512
> implementations live in their own file, and the architecture-agnostic and
> SSE4.2 implementations live in pg_bitutils.c.  The attached patches move
> the SSE4.2 implementations to the AVX-512 file (which is renamed
> appropriately), and they update some function names to be more descriptive,
> i.e., "fast" is replaced with "sse42" and "slow" is replaced with
> "generic".

Thanks for taking on some technical debt!

0001

--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_x86_64.c

Can we get away with just "x86" for brevity? We generally don't target
32-bit CPUs for this kind of work, so there's no chance of confusion.

0002

```
-#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
+#include "port/pg_bitutils.h"
+
+#ifdef TRY_POPCNT_X86_64

 #if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
 #include <cpuid.h>
 #endif
```

With the above in the x86 .c file, I wonder we can get rid of this
stanza and the "try" symbol and gate only on HAVE_X86_64_POPCNTQ:

#ifdef HAVE_X86_64_POPCNTQ
#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
#define TRY_POPCNT_X86_64 1
#endif
#endif

If we have to be cautious, we could just turn the #error on no CPUID
symbol into "return false".

0003

s/fast/sse42/:

Seems okay in this file, but this isn't the best name, either. Maybe a
comment to head off future "corrections", something like:
"Technically, POPCNT is not part of SSE 4.2, and is not even a vector
operation, but many compilers emit the popcnt instruction with
-msse4.2 anyway."

s/slow/generic/:

I'm ambivalent about this. The "slow" designation is flat-out wrong
since at least Power and aarch64 can emit a single instruction here
without prodding the compiler. On the other hand, "generic" seems
wrong too, since e.g. pg_popcount64_slow() has three configure symbols
and two compiler builtins. :-D

A possible future project would be to have a truly generic simple
fallback in pure C and put all the fancy stuff in the header for
architectures that have unconditional hardware support. It would make
more sense to revisit the name then.

--
John Naylor
Amazon Web Services



Re: refactor architecture-specific popcount code

От
Heikki Linnakangas
Дата:
On 15/01/2026 11:07, John Naylor wrote:
> 0003
> 
> s/fast/sse42/:
> 
> Seems okay in this file, but this isn't the best name, either. Maybe a
> comment to head off future "corrections", something like:
> "Technically, POPCNT is not part of SSE 4.2, and is not even a vector
> operation, but many compilers emit the popcnt instruction with
> -msse4.2 anyway."
> 
> s/slow/generic/:
> 
> I'm ambivalent about this. The "slow" designation is flat-out wrong
> since at least Power and aarch64 can emit a single instruction here
> without prodding the compiler. On the other hand, "generic" seems
> wrong too, since e.g. pg_popcount64_slow() has three configure symbols
> and two compiler builtins. :-D

"fallback", or "portable" ?

> A possible future project would be to have a truly generic simple
> fallback in pure C and put all the fancy stuff in the header for
> architectures that have unconditional hardware support. It would make
> more sense to revisit the name then.
Yeah, I noticed that on x86_64, pg_popcount_optimized is always a 
function pointer with runtime check, even if you use compiler flags to 
target a CPU where the special instructions are available unconditionally.

- Heikki




Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Thu, Jan 15, 2026 at 04:07:51PM +0700, John Naylor wrote:
> Thanks for taking on some technical debt!

Thanks for reviewing.

> --- a/src/port/pg_popcount_avx512.c
> +++ b/src/port/pg_popcount_x86_64.c
> 
> Can we get away with just "x86" for brevity? We generally don't target
> 32-bit CPUs for this kind of work, so there's no chance of confusion.

WFM.

> ```
> -#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
> +#include "port/pg_bitutils.h"
> +
> +#ifdef TRY_POPCNT_X86_64
> 
>  #if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
>  #include <cpuid.h>
>  #endif
> ```
> 
> With the above in the x86 .c file, I wonder we can get rid of this
> stanza and the "try" symbol and gate only on HAVE_X86_64_POPCNTQ:
> 
> #ifdef HAVE_X86_64_POPCNTQ
> #if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
> #define TRY_POPCNT_X86_64 1
> #endif
> #endif
> 
> If we have to be cautious, we could just turn the #error on no CPUID
> symbol into "return false".

Yeah, the CPUID macro checks do seem overly cautious to me, especially
since we've just #error'd when the CPUID intrinsics are missing in
pg_crc32c_sse42_choose.c since 2015.  That seems to suggest that nobody is
trying to build Postgres with a compiler that knows about SSE4.2/POPCNT but
not CPUID.  For reference, CPUID was introduced in 1993.

I bet we could also convert this bit into a configuration-time check:

    #if defined(_MSC_VER) && defined(_M_AMD64)
    #define HAVE_X86_64_POPCNTQ
    #endif

> s/fast/sse42/:
> 
> Seems okay in this file, but this isn't the best name, either. Maybe a
> comment to head off future "corrections", something like:
> "Technically, POPCNT is not part of SSE 4.2, and is not even a vector
> operation, but many compilers emit the popcnt instruction with
> -msse4.2 anyway."

Makes sense.

-- 
nathan



Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Thu, Jan 15, 2026 at 11:42:14AM +0200, Heikki Linnakangas wrote:
> On 15/01/2026 11:07, John Naylor wrote:
>> s/slow/generic/:
>> 
>> I'm ambivalent about this. The "slow" designation is flat-out wrong
>> since at least Power and aarch64 can emit a single instruction here
>> without prodding the compiler. On the other hand, "generic" seems
>> wrong too, since e.g. pg_popcount64_slow() has three configure symbols
>> and two compiler builtins. :-D
> 
> "fallback", or "portable" ?

I've no strong opinions, but "portable" seems reasonable to me.

> Yeah, I noticed that on x86_64, pg_popcount_optimized is always a function
> pointer with runtime check, even if you use compiler flags to target a CPU
> where the special instructions are available unconditionally.

I wonder how close we are to being able to just require SSE4.2/POPCNT for
x86-64 builds.  I suppose there's always a chance that someone will try to
run Postgres 19 on a CPU from the aughts...  In any case, avoiding the
function pointer when possible seems like a good follow-up.

-- 
nathan



Re: refactor architecture-specific popcount code

От
Heikki Linnakangas
Дата:
On 15/01/2026 18:08, Nathan Bossart wrote:
> On Thu, Jan 15, 2026 at 11:42:14AM +0200, Heikki Linnakangas wrote:
>> Yeah, I noticed that on x86_64, pg_popcount_optimized is always a function
>> pointer with runtime check, even if you use compiler flags to target a CPU
>> where the special instructions are available unconditionally.
> 
> I wonder how close we are to being able to just require SSE4.2/POPCNT for
> x86-64 builds.  I suppose there's always a chance that someone will try to
> run Postgres 19 on a CPU from the aughts...  In any case, avoiding the
> function pointer when possible seems like a good follow-up.

It's not really our decision. Packagers choose which architecture to 
target and which compiler options to build with. We ought to just 
respect those choices.

- Heikki




Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
Here is a new patch set.  Notably, I've added a 0004 that does the
following:

* Removes TRY_POPCNT_X86_64.  We now assume that the required CPUID
intrinsics are available, as we have long done in some of the CRC-32C code.

* Moves the MSVC check for HAVE_X86_64_POPCNTQ to configuration-time.  This
way, we set it for all relevant platforms in one place.

* Moves the #defines for USE_SSE2 and USE_NEON to c.h so that they can be
used elsewhere without simd.h.  Consequently, we can remove POPCNT_AARCH64.

* Moves the #includes for pg_bitutils.h to below the system headers in
pg_popcount_{aarch64,x86}.c (since we no longer depend on macros defined in
pg_bitutils.h).

-- 
nathan

Вложения

Re: refactor architecture-specific popcount code

От
John Naylor
Дата:
On Fri, Jan 16, 2026 at 2:07 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Here is a new patch set. Notably, I've added a 0004 that does the
> following:
>
> * Removes TRY_POPCNT_X86_64.  We now assume that the required CPUID
> intrinsics are available, as we have long done in some of the CRC-32C code.
>
> * Moves the MSVC check for HAVE_X86_64_POPCNTQ to configuration-time.  This
> way, we set it for all relevant platforms in one place.

LGTM.

> * Moves the #defines for USE_SSE2 and USE_NEON to c.h so that they can be
> used elsewhere without simd.h.  Consequently, we can remove POPCNT_AARCH64.

Seems reasonable.

> * Moves the #includes for pg_bitutils.h to below the system headers in
> pg_popcount_{aarch64,x86}.c (since we no longer depend on macros defined in
> pg_bitutils.h).

Good.

v2-0003

+ * XXX Technically, POPCNT is not part of SSE4.2, and isn't even a vector
+ * operation, but in practice this is close enough, and "sse42" seems easier to
+ * follow than "popcnt" for these names.

Quibble -- XXX is a bit loud for a side note.

On Thu, Jan 15, 2026 at 11:08 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Thu, Jan 15, 2026 at 11:42:14AM +0200, Heikki Linnakangas wrote:

> > "fallback", or "portable" ?
>
> I've no strong opinions, but "portable" seems reasonable to me.

I have a mild preference for "fallback" since it's a noun and we
already have comments in src/(include/)port referring to fallbacks. No
objection to "portable", though.

--
John Naylor
Amazon Web Services



Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
Committed, thanks for reviewing.

-- 
nathan



Re: refactor architecture-specific popcount code

От
John Naylor
Дата:
On Thu, Jan 22, 2026 at 3:23 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Committed, thanks for reviewing.

Sure. Now that that's in place, I wanted to brainstorm more
refactoring/rationalization ideas that seem on-topic for the thread
but have less clear payoff:

1) Nowadays, the only global call sites of the word-sized functions
are select_best_grantor() and in bitmapsets. The latter calls the
word-sized functions in a loop (could be just one word). It may be
more efficient to calculate the size in bytes and call pg_popcount().
Then we could get rid of all the pointer indirection for the
word-sized functions.

2) The x86 byte buffer variants expend a lot of effort to detect
whether the buffer is aligned on both 64- and 32-bit platforms, with
an optimized path for each. At least 64-bit doesn't care about
alignment, and 32-bit doesn't warrant anything fancier than pure C.
Simultaneously, the aarch64 equivalent doesn't seem to take care about
alignment. (I think Nathan mentioned he didn't see a difference during
testing, but I wonder how universal that is).

3) There is repeated code for the <8 bytes case, and the tail of the
"optimized" functions. I'm also not sure why the small case is inlined
everywhere.

--
John Naylor
Amazon Web Services



Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote:
> 1) Nowadays, the only global call sites of the word-sized functions
> are select_best_grantor() and in bitmapsets. The latter calls the
> word-sized functions in a loop (could be just one word). It may be
> more efficient to calculate the size in bytes and call pg_popcount().

Yeah, these seem like obvious places to use pg_popcount().  Note that
bms_member_index() does a final popcount on a masked version of the last
word.  We could swap that with pg_popcount(), too, but it might be slower
than just calling the word-sized function.  However, it could be hard to
tell the difference, as we'd be trading a function or function pointer call
with an inlined loop over pg_number_of_ones.  And even if it is slower, I'm
not sure it matters all that much in the grand scheme of things.

In any case, 0001 gets the easy ones out of the way.

> Then we could get rid of all the pointer indirection for the
> word-sized functions.

Do you mean that we'd just keep the portable ones around?  I see some code
in pgvector that might be negatively impacted by that, but if I understand
correctly it would require an unusual setup.

> 2) The x86 byte buffer variants expend a lot of effort to detect
> whether the buffer is aligned on both 64- and 32-bit platforms, with
> an optimized path for each. At least 64-bit doesn't care about
> alignment, and 32-bit doesn't warrant anything fancier than pure C.
> Simultaneously, the aarch64 equivalent doesn't seem to take care about
> alignment. (I think Nathan mentioned he didn't see a difference during
> testing, but I wonder how universal that is).

0002 makes these changes for pg_popcount_sse42() and
pg_popcount_masked_sse42().  It does seem strange to prefer a loop over
pg_number_of_ones instead of using POPCNTQ when unaligned, but perhaps it's
worth testing.  I do recall the alignment stuff in the AVX-512 code showing
benefits in tests because it avoids double-load overhead, so I think we
should keep that for now.

> 3) There is repeated code for the <8 bytes case, and the tail of the
> "optimized" functions. I'm also not sure why the small case is inlined
> everywhere.

This is intended to help avoid function call and SIMD instruction overhead
when it doesn't make sense to take it.  I recall this showing a rather big
difference in benchmarks when we were working on the AVX-512 versions.
Regarding the duplicated code, sure, we could add some static inline
functions or something.  I think the only reason I haven't done so is
because it's ~2 lines of code.

-- 
nathan

Вложения

Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Thu, Jan 22, 2026 at 11:50:38AM -0600, Nathan Bossart wrote:
> On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote:
>> 1) Nowadays, the only global call sites of the word-sized functions
>> are select_best_grantor() and in bitmapsets. The latter calls the
>> word-sized functions in a loop (could be just one word). It may be
>> more efficient to calculate the size in bytes and call pg_popcount().
> 
> Yeah, these seem like obvious places to use pg_popcount().  Note that
> bms_member_index() does a final popcount on a masked version of the last
> word.  We could swap that with pg_popcount(), too, but it might be slower
> than just calling the word-sized function.  However, it could be hard to
> tell the difference, as we'd be trading a function or function pointer call
> with an inlined loop over pg_number_of_ones.  And even if it is slower, I'm
> not sure it matters all that much in the grand scheme of things.

I added a 0003 that swaps that final popcount with pg_popcount().

>> Then we could get rid of all the pointer indirection for the
>> word-sized functions.
> 
> Do you mean that we'd just keep the portable ones around?  I see some code
> in pgvector that might be negatively impacted by that, but if I understand
> correctly it would require an unusual setup.

I added a 0004 that removes the architecture-specific word-sized functions.

-- 
nathan

Вложения

Re: refactor architecture-specific popcount code

От
John Naylor
Дата:
On Mon, Jan 26, 2026 at 10:41 PM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Thu, Jan 22, 2026 at 11:50:38AM -0600, Nathan Bossart wrote:
> > On Thu, Jan 22, 2026 at 04:50:26PM +0700, John Naylor wrote:
> >> 1) Nowadays, the only global call sites of the word-sized functions
> >> are select_best_grantor() and in bitmapsets. The latter calls the
> >> word-sized functions in a loop (could be just one word). It may be
> >> more efficient to calculate the size in bytes and call pg_popcount().
> >
> > Yeah, these seem like obvious places to use pg_popcount().  Note that
> > bms_member_index() does a final popcount on a masked version of the last
> > word.  We could swap that with pg_popcount(), too, but it might be slower
> > than just calling the word-sized function.  However, it could be hard to
> > tell the difference, as we'd be trading a function or function pointer call
> > with an inlined loop over pg_number_of_ones.  And even if it is slower, I'm
> > not sure it matters all that much in the grand scheme of things.
>
> I added a 0003 that swaps that final popcount with pg_popcount().

I'm not sure either if this part matters much, but it makes more sense
to me to continue using single word functions for that last part.
Since they have very few call sites anymore, we can make them inline
without bloating the binary on x86.

> >> Then we could get rid of all the pointer indirection for the
> >> word-sized functions.
> >
> > Do you mean that we'd just keep the portable ones around?  I see some code
> > in pgvector that might be negatively impacted by that, but if I understand
> > correctly it would require an unusual setup.
>
> I added a 0004 that removes the architecture-specific word-sized functions.

Right, just the portable ones. Here, too, inlining them everywhere
would mitigate any impact. Further,

-int
-pg_popcount64(uint64 word)
+static inline int
+pg_popcount64_neon(uint64 word)

...if they were inlined from the header, I think we wouldn't need this
separate neon function in this file at all. Currently, we rely on
__builtin_popcountl for the portable function outside this file. We
could either keep using that or switch to neon if there's a
portability difference.

0001, 2, and 4 look good to me otherwise. Looking at commit
a8c09daa8bb1, there are some planner benchmarks in the discussion
thread that stress bitmapsets, but I'm not sure if they hit the paths
affected by 0001 at all.

--
John Naylor
Amazon Web Services



Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Thu, Jan 29, 2026 at 06:31:53PM +0700, John Naylor wrote:
> On Mon, Jan 26, 2026 at 10:41 PM Nathan Bossart
> <nathandbossart@gmail.com> wrote:
>> I added a 0003 that swaps that final popcount with pg_popcount().
> 
> I'm not sure either if this part matters much, but it makes more sense
> to me to continue using single word functions for that last part.
> Since they have very few call sites anymore, we can make them inline
> without bloating the binary on x86.

Okay, I abandoned that patch.

> Right, just the portable ones. Here, too, inlining them everywhere
> would mitigate any impact.

Done.

> +static inline int
> +pg_popcount64_neon(uint64 word)
> 
> ...if they were inlined from the header, I think we wouldn't need this
> separate neon function in this file at all. Currently, we rely on
> __builtin_popcountl for the portable function outside this file. We
> could either keep using that or switch to neon if there's a
> portability difference.

Done.

-- 
nathan

Вложения

Re: refactor architecture-specific popcount code

От
John Naylor
Дата:
On Fri, Jan 30, 2026 at 12:06 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
> [v5]

0001 - I'm pretty sure this is comparable to HEAD if the optimized
function is pg_popcount_sse42(). Has the AVX512 version been tested
with 8-byte inputs? That seems to have a lot of pre- and
post-processing involved. The inline wrapper only bypasses for 7 or
less bytes.

0002
- I tried running this on x86-64 with alignment sanitizer and no
alarms went off during "make check", but adding
pg_attribute_no_sanitize_alignment() would prevent surprises in the
future.
- I imagine that the old SIZEOF_VOID_P check is superfluous now, since
the whole file is gated by HAVE_X86_64_POPCNTQ.
- Maybe we can remove the aligned 32-bit path in
pg_popcount_(masked_)portable(), since that's on-topic for this patch
and would simplify things further.

--
John Naylor
Amazon Web Services



Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Fri, Jan 30, 2026 at 03:22:45PM +0700, John Naylor wrote:
> 0001 - I'm pretty sure this is comparable to HEAD if the optimized
> function is pg_popcount_sse42(). Has the AVX512 version been tested
> with 8-byte inputs? That seems to have a lot of pre- and
> post-processing involved. The inline wrapper only bypasses for 7 or
> less bytes.

Here [0] is the latest perf data I see for the AVX-512 popcount patch,
although that's comparing to v16, which IIRC lacks a few other inlining
tricks.  There's a chance the SSE4.2 version is faster at that particular
length.  I'm not sure we need to worry about that, but I can do a bit of
testing if you'd like.

> 0002
> - I tried running this on x86-64 with alignment sanitizer and no
> alarms went off during "make check", but adding
> pg_attribute_no_sanitize_alignment() would prevent surprises in the
> future.

Done.

> - I imagine that the old SIZEOF_VOID_P check is superfluous now, since
> the whole file is gated by HAVE_X86_64_POPCNTQ.

I think you're right.  There was some concern about this when I was first
adding the SSE4.2-specific pg_popcount() [1], but all the configure-time
checks for HAVE_X86_64_POPCNTQ are restricted to 64-bit x86, so I bet we
could safely assume SIZEOF_VOID_P == 8 in that file.

> - Maybe we can remove the aligned 32-bit path in
> pg_popcount_(masked_)portable(), since that's on-topic for this patch
> and would simplify things further.

IMHO that's a reasonable thing for us to do.

[0] https://postgr.es/m/20240404171828.GA3866970%40nathanxps13
[1] https://postgr.es/m/CAApHDvojPyh6dLKooqjXSZE%3D0Ed590Lq1BxF7WQ9knSggyuJEA%40mail.gmail.com

-- 
nathan

Вложения

Re: refactor architecture-specific popcount code

От
John Naylor
Дата:
On Sat, Jan 31, 2026 at 4:33 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Fri, Jan 30, 2026 at 03:22:45PM +0700, John Naylor wrote:
> > 0001 - I'm pretty sure this is comparable to HEAD if the optimized
> > function is pg_popcount_sse42(). Has the AVX512 version been tested
> > with 8-byte inputs? That seems to have a lot of pre- and
> > post-processing involved. The inline wrapper only bypasses for 7 or
> > less bytes.
>
> Here [0] is the latest perf data I see for the AVX-512 popcount patch,
> although that's comparing to v16, which IIRC lacks a few other inlining
> tricks.  There's a chance the SSE4.2 version is faster at that particular
> length.  I'm not sure we need to worry about that, but I can do a bit of
> testing if you'd like.

It might be a good idea to do a little new testing, and I see a use
for a special 8-byte path independent of AVX512: v6 seems to regress a
little for single-words. But, it turns out that when gcc turns
__builtin_popcountl into a single instruction, it's inline, but if it
emits portable bitwise ops, it does so in a function called
__popcountdi2(). That can be avoided by hand-coding in C for normal
builds (and for 32-bit looks cleaner anyway), as in the attached 0005.

My laptop here is really too old to make decisions that are
micro-architecture dependent, but with that caveat, I dusted off the
popcount benchmark and added a test for counting bitmapsets (v7-0004,
applies on top of v6):

select drive_bms_num_members(10000000, 1);

master:    13.2 ticks per call
v6:        15.3
v6+v7-0005 10.8

Again, take this with a grain of salt, but 0005 seems worth looking at.

--
John Naylor
Amazon Web Services

Вложения

Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Mon, Feb 02, 2026 at 09:16:42PM +0700, John Naylor wrote:
> It might be a good idea to do a little new testing, and I see a use
> for a special 8-byte path independent of AVX512: v6 seems to regress a
> little for single-words. But, it turns out that when gcc turns
> __builtin_popcountl into a single instruction, it's inline, but if it
> emits portable bitwise ops, it does so in a function called
> __popcountdi2(). That can be avoided by hand-coding in C for normal
> builds (and for 32-bit looks cleaner anyway), as in the attached 0005.

Oh, interesting.  I looked into this a little more [0].  Both gcc and clang
generate cnt instructions for aarch64, so we're good there.  However, clang
on x86-64 generates the bit-twiddling version, and gcc on x86-64 generates
a call to __popcountdi2() (which I imagine does something similar).  It's
not until you provide a compiler flag like -march=x86-64-v2 that gcc/clang
start generating popcnt instructions for x86-64, which makes sense.  0005
seems like the correct move to me...

[0] https://godbolt.org/z/he3WozG3E

-- 
nathan



Re: refactor architecture-specific popcount code

От
"Greg Burd"
Дата:
On Mon, Feb 2, 2026, at 5:51 PM, Nathan Bossart wrote:
> On Mon, Feb 02, 2026 at 09:16:42PM +0700, John Naylor wrote:
>> It might be a good idea to do a little new testing, and I see a use
>> for a special 8-byte path independent of AVX512: v6 seems to regress a
>> little for single-words. But, it turns out that when gcc turns
>> __builtin_popcountl into a single instruction, it's inline, but if it
>> emits portable bitwise ops, it does so in a function called
>> __popcountdi2(). That can be avoided by hand-coding in C for normal
>> builds (and for 32-bit looks cleaner anyway), as in the attached 0005.
>
> Oh, interesting.  I looked into this a little more [0].  Both gcc and clang
> generate cnt instructions for aarch64, so we're good there.  However, clang
> on x86-64 generates the bit-twiddling version, and gcc on x86-64 generates
> a call to __popcountdi2() (which I imagine does something similar).  It's
> not until you provide a compiler flag like -march=x86-64-v2 that gcc/clang
> start generating popcnt instructions for x86-64, which makes sense.  0005
> seems like the correct move to me...
>
> [0] https://godbolt.org/z/he3WozG3E
>
> -- 
> nathan

Nathan, John,

Thanks for the focus on this area of the code.  I've been looking into what to do with popcnt when building
Win11/ARM64/MSVC. I know that when _MSC_VER and _M_ARM64 are defined we can make use of the __popcnt(unsigned int) and
__popcnt64(unsigned__int64) intrinsics which have been available since VS 2022 17.11+.  I thought I'd check that combo
outand it turns out that it is identical to clang/gcc on that platform [0].
 

I'll wait for your work to land before proposing a patch to add these unless it is really easy to fit it and you feel
likegiving it a go. :)
 

best.

-greg

[0] https://godbolt.org/z/TrxjzcPGE



Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Tue, Feb 03, 2026 at 12:19:48PM -0500, Greg Burd wrote:
> Thanks for the focus on this area of the code.  I've been looking into
> what to do with popcnt when building Win11/ARM64/MSVC.  I know that when
> _MSC_VER and _M_ARM64 are defined we can make use of the
> __popcnt(unsigned int) and __popcnt64(unsigned __int64) intrinsics which
> have been available since VS 2022 17.11+.  I thought I'd check that combo
> out and it turns out that it is identical to clang/gcc on that platform
> [0].
> 
> I'll wait for your work to land before proposing a patch to add these
> unless it is really easy to fit it and you feel like giving it a go. :)

We should probably just add something like

    #ifdef _MSC_VER
        return __popcnt(word);

for the new inlined versions of pg_popcount{32,64}.  We're already doing
that today for x86-64, and the AArch64-specific versions use intrinsics
that in theory compile to the same thing.  Plus, popcnt is required for
Windows these days.

I'm working on polishing/benchmarking these patches at the moment, so I
will work this change in.  Thanks!

-- 
nathan



Re: refactor architecture-specific popcount code

От
Greg Burd
Дата:
On Feb 3 2026, at 12:41 pm, Nathan Bossart <nathandbossart@gmail.com> wrote:

> On Tue, Feb 03, 2026 at 12:19:48PM -0500, Greg Burd wrote:
>> Thanks for the focus on this area of the code.  I've been looking into
>> what to do with popcnt when building Win11/ARM64/MSVC.  I know that when
>> _MSC_VER and _M_ARM64 are defined we can make use of the
>> __popcnt(unsigned int) and __popcnt64(unsigned __int64) intrinsics which
>> have been available since VS 2022 17.11+.  I thought I'd check that combo
>> out and it turns out that it is identical to clang/gcc on that platform
>> [0].
>> 
>> I'll wait for your work to land before proposing a patch to add these
>> unless it is really easy to fit it and you feel like giving it a go. :)
> 
> We should probably just add something like
> 
>     #ifdef _MSC_VER
>         return __popcnt(word);

Right, it seemed to me after reviewing the patch to fit right in.

> for the new inlined versions of pg_popcount{32,64}.  We're already doing
> that today for x86-64, and the AArch64-specific versions use intrinsics
> that in theory compile to the same thing.  Plus, popcnt is required for
> Windows these days.
> 
> I'm working on polishing/benchmarking these patches at the moment, so I
> will work this change in.  Thanks!

Not a problem, thank you!

> -- 
> nathan

-greg



Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
Okay, here is an updated patch set with first drafts of the commit
messages.  I'm reasonably happy with these patches, but I'll admit my
justification for ripping out the 32-bit optimizations feels a bit flimsy.
I don't get the idea that we are all that concerned about things like
micro-regressions for popcount on 32-bit builds, but OTOH it isn't hard to
imagine someone objecting to these changes.

I ran the bms_num_members() benchmark on a couple of machines I had nearby:

    apple-m3 (neon)         intel-i5-13500T (sse4.2)
    words  HEAD   v8        words  HEAD   v8
        1    40   25            1    26   10
        2    57   51            2    37   29
        4    75   57            4    55   45
        8   105   56            8    88   51
       16   154   59           16   158   68
       32   265   73           32   296  102
       64   545  103           64   577  209
      128  1027  178          128  1212  423

I was going to run it on machines with SVE/AVX-512, but John already tested
the AVX-512 case [0], and I have no reason to believe that we'll see
regressions on machines with SVE.

[0] https://postgr.es/m/CANWCAZbWLX%3DEDd1Bq-8oGK2ZLVNR4m4BkGe%3D288t2V5oLcqeZA%40mail.gmail.com

-- 
nathan

Вложения

Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:

Re: refactor architecture-specific popcount code

От
John Naylor
Дата:
On Wed, Feb 4, 2026 at 3:44 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> Okay, here is an updated patch set with first drafts of the commit
> messages.  I'm reasonably happy with these patches, but I'll admit my
> justification for ripping out the 32-bit optimizations feels a bit flimsy.
> I don't get the idea that we are all that concerned about things like
> micro-regressions for popcount on 32-bit builds, but OTOH it isn't hard to
> imagine someone objecting to these changes.

I actually believe the changes improve even on 32-bit, at least for
bitmapsets. The previous 32-bit optimizations are just mirrors of the
64-bit ones, which we found weren't always great to begin with, so I
don't buy that we're losing much here.

> I was going to run it on machines with SVE/AVX-512, but John already tested
> the AVX-512 case [0], and I have no reason to believe that we'll see
> regressions on machines with SVE.

Actually, that was the SSE4.2 case. Sorry if I wasn't clear.

For v9:

0001:

- /*
- * We set the threshold to the point at which we'll first use special
- * instructions in the optimized version.
- */
-#if SIZEOF_VOID_P >= 8
- int threshold = 8;
-#else
- int threshold = 4;
-#endif
-
- if (bytes < threshold)
+ if (bytes < 8)

It seems like we should still have some kind of comment here. Even
just to say the 8 value was found through testing. (It was some time
ago, right?)

0002:

+static inline int
+pg_popcount32(uint32 word)
+{
+ /*
+ * On x86, gcc generates a function call for this built-in unless the
+ * popcnt instruction is available, so we use the plain C version in that
+ * case to ensure inlining.
+ */
+#if defined(HAVE__BUILTIN_POPCOUNT) && (defined(__POPCNT__) ||
!defined(__x86_64__))
+ return __builtin_popcount(word);

It was intentional that my PoC pg_popcount32 was simple, pure C and
didn't have #ifdefs anymore, and I'd prefer to go back to that. This
function is now only used for single-word bitmapsets on 32-bit
platforms and in an assert, and even if that changes we've shown that
inlining bitwise ops for a single word is already pretty good for
performance. Plus, doesn't this cause gcc generate a function call on
32-bit x86 because "!defined(__x86_64__)" is true? That defeats the
whole purpose of inlining in the first place. Simple is good here.

+#elif defined(_MSC_VER)
+   return __popcnt64(word);

The commit message says "converts the portable ones to inlined
functions", but this was copied from a architecture specific file with
a runtime check. I've seen an assertion in this thread that the
hardware instruction is required for some Windows version, but it
would be nice to have a link to documentation for the archives. More
worryingly, this is almost certainly broken on 32-bit, and the
buildfarm won't tell us -- please see commit
53ea2b7ad050ce4ad95c89bb55197209b65886a1 and bug report that led to
it. Seems like material for a separate commit.

0003:

- for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
- cnt += bmw_popcount(n256->isset[i]);
+ cnt += pg_popcount((const char *) n256->isset,
+    RT_NODE_MAX_SLOTS / BITS_PER_BYTE);

This can now be "cnt =". The initialization to zero is now
unnecessary, but it's also harmless.

LGTM otherwise.

"perhaps-over-optimized" -- In more neutral language, they're better
optimized for longer inputs. Hence the need to shortcut for the common
case.

"This fast-path could arguably go in pg_popcount() itself (with an
appropriate alignment check), but that is left as a future exercise."

I wondered about that too, but that adds branches and code everywhere
it's called.

--
John Naylor
Amazon Web Services



Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Wed, Feb 04, 2026 at 12:13:35PM +0700, John Naylor wrote:
> - /*
> - * We set the threshold to the point at which we'll first use special
> - * instructions in the optimized version.
> - */
> [...]
> 
> It seems like we should still have some kind of comment here. Even
> just to say the 8 value was found through testing. (It was some time
> ago, right?)

Yeah, it needs a comment.  I recall doing a lot of testing for small
inputs since those are what tended to regress.

> It was intentional that my PoC pg_popcount32 was simple, pure C and
> didn't have #ifdefs anymore, and I'd prefer to go back to that. This
> function is now only used for single-word bitmapsets on 32-bit
> platforms and in an assert, and even if that changes we've shown that
> inlining bitwise ops for a single word is already pretty good for
> performance. Plus, doesn't this cause gcc generate a function call on
> 32-bit x86 because "!defined(__x86_64__)" is true? That defeats the
> whole purpose of inlining in the first place. Simple is good here.

Agreed and done.

> +#elif defined(_MSC_VER)
> +   return __popcnt64(word);
> 
> The commit message says "converts the portable ones to inlined
> functions", but this was copied from a architecture specific file with
> a runtime check. I've seen an assertion in this thread that the
> hardware instruction is required for some Windows version, but it
> would be nice to have a link to documentation for the archives. More
> worryingly, this is almost certainly broken on 32-bit, and the
> buildfarm won't tell us -- please see commit
> 53ea2b7ad050ce4ad95c89bb55197209b65886a1 and bug report that led to
> it. Seems like material for a separate commit.

Sure.  I'm tempted to suggest that we only use the plain C version here,
too.  The SSE4.2 bms_num_members() test I did yesterday used it and showed
improvement at one word.  If we do that, we can rip out even more code
since we no longer need the popcount built-ins.

* tests plain C version on an Apple M3 *

Yeah, the plain C version might be marginally slower than the built-in
version for that test, but it still seems quite a bit faster than HEAD.

    HEAD  v8  v10
      40  25   29

We probably want to re-add the Neon version of pg_popcount64() so that
pg_popcount_neon() and pg_popcount_masked_neon() can use it, but that's
easy enough.

> - for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
> - cnt += bmw_popcount(n256->isset[i]);
> + cnt += pg_popcount((const char *) n256->isset,
> +    RT_NODE_MAX_SLOTS / BITS_PER_BYTE);
> 
> This can now be "cnt =". The initialization to zero is now
> unnecessary, but it's also harmless.

Fixed.  The only reason I didn't make that change earlier was to keep the
patch tidy.

I haven't updated the commit messages yet.  Once the code is ready to go,
I'll give those another try.

[0] https://postgr.es/m/aYJeGs-xz3NEEdSe%40nathan

-- 
nathan

Вложения

Re: refactor architecture-specific popcount code

От
John Naylor
Дата:
On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Sure.  I'm tempted to suggest that we only use the plain C version here,
> too.  The SSE4.2 bms_num_members() test I did yesterday used it and showed
> improvement at one word.  If we do that, we can rip out even more code
> since we no longer need the popcount built-ins.

Unlike the 32-bit case, people do run production on 64-bit platforms
that are not Arm/x86, so that would require effort to see if the
builtins are worth it for them. That seems like a separate effort. I
can help with that, but let's get the tested stuff in first.

> * tests plain C version on an Apple M3 *
>
> Yeah, the plain C version might be marginally slower than the built-in
> version for that test, but it still seems quite a bit faster than HEAD.
>
>     HEAD  v8  v10
>       40  25   29

That's good to know, and maybe it'll be true elsewhere.

--
John Naylor
Amazon Web Services



Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Thu, Feb 05, 2026 at 02:48:44PM +0700, John Naylor wrote:
> On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>> Sure.  I'm tempted to suggest that we only use the plain C version here,
>> too.  The SSE4.2 bms_num_members() test I did yesterday used it and showed
>> improvement at one word.  If we do that, we can rip out even more code
>> since we no longer need the popcount built-ins.
> 
> Unlike the 32-bit case, people do run production on 64-bit platforms
> that are not Arm/x86, so that would require effort to see if the
> builtins are worth it for them. That seems like a separate effort. I
> can help with that, but let's get the tested stuff in first.

Alright.  I moved that to a new 0004 patch that we can consider separately
once 0001-0003 have been committed.

-- 
nathan

Вложения

Re: refactor architecture-specific popcount code

От
John Naylor
Дата:
On Fri, Feb 6, 2026 at 11:13 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Thu, Feb 05, 2026 at 02:48:44PM +0700, John Naylor wrote:
> > On Thu, Feb 5, 2026 at 4:43 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> >> Sure.  I'm tempted to suggest that we only use the plain C version here,
> >> too.  The SSE4.2 bms_num_members() test I did yesterday used it and showed
> >> improvement at one word.  If we do that, we can rip out even more code
> >> since we no longer need the popcount built-ins.
> >
> > Unlike the 32-bit case, people do run production on 64-bit platforms
> > that are not Arm/x86, so that would require effort to see if the
> > builtins are worth it for them. That seems like a separate effort. I
> > can help with that, but let's get the tested stuff in first.
>
> Alright.  I moved that to a new 0004 patch that we can consider separately
> once 0001-0003 have been committed.

Okay, this is looking good. I have just one more suggestion: For 0002,
just copy the word-wise functions verbatim. That way, it's a pure
refactoring commit and the exception doesn't need explaining. With
that, I'd say go ahead and commit 0001/2.

Then after a bit more research, the final form of the inline functions
can be visible in a single commit. I've tested S390X already and hope
to test one other platform.

--
John Naylor
Amazon Web Services



Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Sat, Feb 07, 2026 at 03:54:31PM +0700, John Naylor wrote:
> Okay, this is looking good. I have just one more suggestion: For 0002,
> just copy the word-wise functions verbatim. That way, it's a pure
> refactoring commit and the exception doesn't need explaining. With
> that, I'd say go ahead and commit 0001/2.

Seems reasonable.  Here is an updated patch set.  I've also swapped 0003
and 0004.

> Then after a bit more research, the final form of the inline functions
> can be visible in a single commit. I've tested S390X already and hope
> to test one other platform.

Thanks.  Looking forward to the results.

-- 
nathan

Вложения

Re: refactor architecture-specific popcount code

От
John Naylor
Дата:
On Wed, Feb 11, 2026 at 1:45 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
>
> On Sat, Feb 07, 2026 at 03:54:31PM +0700, John Naylor wrote:
> > Okay, this is looking good. I have just one more suggestion: For 0002,
> > just copy the word-wise functions verbatim. That way, it's a pure
> > refactoring commit and the exception doesn't need explaining. With
> > that, I'd say go ahead and commit 0001/2.
>
> Seems reasonable.  Here is an updated patch set.  I've also swapped 0003
> and 0004.

0002: I just noticed another thing that could be done in a separate
follow-up patch:

@@ -220,17 +170,6 @@ pg_popcount_masked_portable(const char *buf, int
bytes, bits8 mask)
  * actual external functions.  The compiler should be able to inline the
  * portable versions here.
  */
-int
-pg_popcount32(uint32 word)
-{
-   return pg_popcount32_portable(word);
-}

The last sentence doesn't seem to be true anymore for the functions
that remain here:

(Needed to add "#undef HAVE_X86_64_POPCNTQ" in a few places to demonstrate)
objdump --no-show-raw-insn --no-addresses -drwC -M intel
/path/to/postgres | grep -C 5 pg_popcount_portable

<pg_popcount_optimized>:
        jmp    <pg_popcount_portable>

I don't think we need to worry about this, but the attached looks
nicer to me. There might be a disadvantage of using macros here, but
I'm not sure what it would be.

--
John Naylor
Amazon Web Services

Вложения

Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Wed, Feb 11, 2026 at 02:10:37PM +0700, John Naylor wrote:
> I don't think we need to worry about this, but the attached looks
> nicer to me. There might be a disadvantage of using macros here, but
> I'm not sure what it would be.

The only problem that I see with this is that a compiler that understands
Neon intrinsics but not SVE ones will probably fail because we build
external functions that call the Neon versions in that case.  But that's
easy enough to work around by adding an extra #elif defined(USE_NEON)
section, and it still saves a handful of lines of code.  Perhaps someday
this situation will be rare enough that we can remove that hack and just
point to the portable versions if the compiler doesn't understand both Neon
and SVE, but I'm not sure we're there yet.

-- 
nathan

Вложения

Re: refactor architecture-specific popcount code

От
John Naylor
Дата:
On Thu, Feb 12, 2026 at 12:39 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:
>
> On Wed, Feb 11, 2026 at 02:10:37PM +0700, John Naylor wrote:
> > I don't think we need to worry about this, but the attached looks
> > nicer to me. There might be a disadvantage of using macros here, but
> > I'm not sure what it would be.
>
> The only problem that I see with this is that a compiler that understands
> Neon intrinsics but not SVE ones will probably fail because we build
> external functions that call the Neon versions in that case.

Ah, right.

> But that's
> easy enough to work around by adding an extra #elif defined(USE_NEON)
> section, and it still saves a handful of lines of code.  Perhaps someday
> this situation will be rare enough that we can remove that hack and just
> point to the portable versions if the compiler doesn't understand both Neon
> and SVE, but I'm not sure we're there yet.

I have a better idea, but it depends on re-thinking feature detection
and compile-time vs run-time checks, and that's not quite ready to
share, so let's shelve this for now.

Anyway, I think 0001 and 0002 are ready for commit.

--
John Naylor
Amazon Web Services



Re: refactor architecture-specific popcount code

От
Nathan Bossart
Дата:
On Thu, Feb 12, 2026 at 11:58:37AM +0700, John Naylor wrote:
> I have a better idea, but it depends on re-thinking feature detection
> and compile-time vs run-time checks, and that's not quite ready to
> share, so let's shelve this for now.

Sounds good.

> Anyway, I think 0001 and 0002 are ready for commit.

Committed.  I've attached rebased versions of the remaining patches under
consideration.

-- 
nathan

Вложения