Thread: Popcount optimization for the slow-path lookups

Popcount optimization for the slow-path lookups

From: Andrew Pogrebnoi
Hello hackers,

I want to propose an optimization for pg_popcount32_slow() and pg_popcount64_slow() that makes the lookups into pg_number_of_ones[] branchless. It shows a speedup of around 58% for uint64 and 35% for uint32 words over the current looped version on x86, and the gain is much larger on Arm64 (Apple M1): ~2.6x for uint64 and ~2.25x for uint32 words.
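
To illustrate the idea (the actual patch is attached): the data-dependent loop is unrolled into a fixed sequence of byte lookups, so the compiler can emit straight-line code with nothing to mispredict. A sketch of the uint32 case (pg_popcount32_branchless is an illustrative name, not necessarily what the patch uses):
```
/*
 * Illustrative sketch of the branchless uint32 lookup (see the attached
 * patch for the real thing).  Relies on pg_number_of_ones[256] from
 * port/pg_bitutils.h; four unconditional byte lookups replace the
 * data-dependent while loop.
 */
static inline int
pg_popcount32_branchless(uint32 word)
{
    return pg_number_of_ones[word & 0xFF] +
        pg_number_of_ones[(word >> 8) & 0xFF] +
        pg_number_of_ones[(word >> 16) & 0xFF] +
        pg_number_of_ones[(word >> 24) & 0xFF];
}
```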

I probably have to guard the lookup in pg_popcount64_slow() with `#if SIZEOF_LONG == 8` like it's done for __builtin_popcountl() in the same function. What do you think?

For the benchmarks, I copied the lookup loop from pg_popcount32/64_slow() into standalone functions and measured them against their branchless counterparts. In C++, I pre-generated two static vectors of random uint64 and uint32 values, 5M elements each, and ran the lookups over them with google/benchmark [0]. I ran the tests on M1 and x86 Macs.
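
Each benchmark boils down to the following condensed sketch (names are illustrative, not the exact ones from my program):
```
#include <benchmark/benchmark.h>
#include <cstdint>
#include <random>
#include <vector>

// 256-entry bit-count table, a stand-in for pg_number_of_ones[].
static uint8_t nbits[256];

// Mirrors the lookup loop in pg_popcount64_slow().
static int popcount64_loop(uint64_t word)
{
    int result = 0;
    while (word != 0)
    {
        result += nbits[word & 255];
        word >>= 8;
    }
    return result;
}

static std::vector<uint64_t> make_input()
{
    for (int i = 0; i < 256; i++)   // nbits[i] = popcount(i)
        nbits[i] = nbits[i / 2] + (i & 1);

    std::mt19937_64 rng(42);
    std::vector<uint64_t> v(5000000);
    for (size_t i = 0; i < v.size(); i++)
        v[i] = rng();
    return v;
}

static void Popcnt64_PG(benchmark::State &state)
{
    static const std::vector<uint64_t> input = make_input();
    size_t i = 0;
    for (auto _ : state)
    {
        // DoNotOptimize() keeps the result live so the call isn't elided.
        benchmark::DoNotOptimize(popcount64_loop(input[i]));
        if (++i == input.size())
            i = 0;
    }
}
BENCHMARK(Popcnt64_PG);
BENCHMARK_MAIN();
```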

Results:

X86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       17.2 ns         17.2 ns     39247568
Popcnt64_PG                     27.2 ns         27.0 ns     25415636
Popcnt32_BranchlessLookup       14.5 ns         14.4 ns     48441566
Popcnt32_PG                     19.5 ns         19.4 ns     36125676
```

Apple M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include   -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
  L1 Data 128 KiB (x10)
  L1 Instruction 192 KiB (x10)
  L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       6.52 ns         6.52 ns    113974186
Popcnt64_PG                     17.1 ns         17.1 ns     43015334
Popcnt32_BranchlessLookup       4.34 ns         4.34 ns    145674483
Popcnt32_PG                     9.79 ns         9.79 ns     75893374
```


I also tried parallel summing of adjacent bits (see Hacker's Delight [1], Chapter 5). This is the code for uint64:
```
#define POPCOUNT_M0 0x5555555555555555ULL // 01010101 ...
#define POPCOUNT_M1 0x3333333333333333ULL // 00110011 ...
#define POPCOUNT_M2 0x0F0F0F0F0F0F0F0FULL // 00001111 ...

static inline int
pg_popcount_word64(uint64 word)
{
    word = word - ((word >> 1) & POPCOUNT_M0);
    word = ((word >> 2) & POPCOUNT_M1) + (word & POPCOUNT_M1);
    word = ((word >> 4) + word) & POPCOUNT_M2;
    word += word >> 8;
    word += word >> 16;
    word += word >> 32; // this step is omitted for the uint32 version

    return (int) (word & 0x3F);
}
```
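
As a quick sanity check (not part of the patch), the function can be compared against the compiler builtin on random words; compile this together with the definition above, with uint64 typedef'd to uint64_t:
```
// Quick sanity check: paste the pg_popcount_word64() definition above this,
// preceded by "typedef uint64_t uint64;", then compare it against the
// compiler builtin on random words.
#include <cassert>
#include <cstdint>
#include <random>

int main()
{
    std::mt19937_64 rng(7);
    for (int i = 0; i < 1000000; i++)
    {
        uint64_t w = rng();
        assert(pg_popcount_word64(w) == __builtin_popcountll(w));
    }
    return 0;
}
```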

However, it shows no improvement over the branchless lookup on M1, and only a slight gain for uint64 on x86.

The same tests with the summing of adjacent bits:
x86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T10:13:34+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB
Load Average: 3.23, 4.56, 2.92
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       17.2 ns         17.2 ns     39247568
Popcnt64_AdjacentBits           16.8 ns         16.7 ns     41481236
Popcnt64_PG                     27.2 ns         27.0 ns     25415636
Popcnt32_BranchlessLookup       14.5 ns         14.4 ns     48441566
Popcnt32_AdjacentBits           15.5 ns         15.4 ns     44454323
Popcnt32_PG                     19.5 ns         19.4 ns     36125676
```

M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include   -L../benchmark/build/src -lbenchmark -lpthread -o mybenchmark && ./mybenchmark
2025-12-05T08:59:12+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
  L1 Data 128 KiB (x10)
  L1 Instruction 192 KiB (x10)
  L2 Unified 12288 KiB (x10)
Load Average: 0.28, 0.07, 0.02
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       6.52 ns         6.52 ns    113974186
Popcnt64_AdjacentBits           8.05 ns         8.05 ns     86988490
Popcnt64_PG                     17.1 ns         17.1 ns     43015334
Popcnt32_BranchlessLookup       4.34 ns         4.34 ns    145674483
Popcnt32_AdjacentBits           7.27 ns         7.27 ns     96050714
Popcnt32_PG                     9.79 ns         9.79 ns     75893374
```



[0] https://github.com/google/benchmark
[1] https://dl.acm.org/doi/book/10.5555/2462741

-----
Cheers,
Andy
Attachments

Re: Popcount optimization for the slow-path lookups

From: David Geier
Hi Andy!

On 05.12.2025 14:07, Andrew Pogrebnoi wrote:
> Hello hackers,
> 
> I want to propose an optimization for pg_popcount32_slow() and
> pg_popcount64_slow() that makes the lookups into pg_number_of_ones[]
> branchless. [...]

I would like to test if I can reproduce your results. Could you share
your test program?

You also didn't specify an optimization level, which means the default -O0 was used. Does it make a difference if you run with e.g. -O3?

--
David Geier



Re: Popcount optimization for the slow-path lookups

From: Andrew Pogrebnoi
Hi David,

Thanks for looking at it!

> I would like to test if I can reproduce your results. Could you share
> your test program?

Here you go: https://gist.github.com/dAdAbird/1480ff15764f5a6301174806d8512a3a

> You also don't specify an optimization level. That means the default
> level -O0 is used. Does it make it a difference if you run e.g. with -O3?

Yes, good point. I got carried away with the experiments and totally forgot about that.

I've re-run the tests with -O3, and it seems the summing of adjacent bits might make sense for uint64:

X86
```
% g++ pop.cc -std=c++11 -isystem benchmark/include -L../benchmark/build/src -lbenchmark -lpthread -O3 -o mybenchmark && ./mybenchmark
2025-12-05T16:22:43+02:00
Running ./mybenchmark
Run on (12 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB
Load Average: 2.67, 4.72, 2.90
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       5.92 ns         5.90 ns    113871130
Popcnt64_AdjacentBits           5.30 ns         5.27 ns    115084258
Popcnt64_PG                     8.34 ns         8.32 ns     80622869
Popcnt32_BranchlessLookup       3.62 ns         3.61 ns    197816110
Popcnt32_AdjacentBits           4.56 ns         4.55 ns    154932383
Popcnt32_PG                     5.32 ns         5.31 ns    121845083
```

M1
```
$ g++ popcnt.cc -std=c++11 -isystem benchmark/include   -L../benchmark/build/src -lbenchmark -lpthread -O3 -o mybenchmark && ./mybenchmark
2025-12-05T14:29:18+00:00
Running ./mybenchmark
Run on (10 X 48 MHz CPU s)
CPU Caches:
  L1 Data 128 KiB (x10)
  L1 Instruction 192 KiB (x10)
  L2 Unified 12288 KiB (x10)
Load Average: 0.01, 0.01, 0.00
--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
Popcnt64_BranchlessLookup       2.13 ns         2.13 ns    342101937
Popcnt64_AdjacentBits           1.89 ns         1.89 ns    370316571
Popcnt64_PG                     3.83 ns         3.83 ns    190990088
Popcnt32_BranchlessLookup       1.19 ns         1.19 ns    591797267
Popcnt32_AdjacentBits           1.73 ns         1.73 ns    409381266
Popcnt32_PG                     2.01 ns         2.01 ns    344969625
```

---
Cheers,
Andy

Re: Popcount optimization for the slow-path lookups

From: Nathan Bossart
On Fri, Dec 05, 2025 at 03:07:07PM +0200, Andrew Pogrebnoi wrote:
> I want to propose an optimization for pg_popcount32_slow() and
> pg_popcount64_slow() where lookups into pg_number_of_ones[] are made
> branchless. It shows speedup around 58% for uint64 and 35% for uint32 words
> compared to the current, looped version. This is on x86. It is much more
> significant on Arm64 (Apple M1): ~x2.6 for uint64 and ~x2.25 for uint32
> words.
> 
> I probably have to guard the lookup in pg_popcount64_slow() with `#if
> SIZEOF_LONG == 8` like it's done for __builtin_popcountl() in the same
> function. What do you think?
> 
> For benchmarks, I used functions with the copies of the lookup loop from
> pg_popcount32/64_slow(), measuring them against branchless counterparts.
> Then in C++ I pre generated two static vectors with random uint64's and
> uint32's, each of 5M size and ran lookups using google/benchmark [0]. I've
> run it on M1 and x86 Macs.

I don't think the proposed improvements are relevant for either of the
machines you used for your benchmarks.  For x86, we've optimized our
popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
to use Neon or SVE.  And for other systems, we still try to use
__builtin_popcount() and friends in the fallback paths, which IIUC are
available on both gcc and clang (and maybe elsewhere).  IMHO we need to run
the benchmarks on a compiler/architecture combination where it would
actually be used in practice. 

-- 
nathan



Re: Popcount optimization for the slow-path lookups

From: Andrew Pogrebnoi
On Fri, Dec 5, 2025 at 5:40 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I don't think the proposed improvements are relevant for either of the
> machines you used for your benchmarks.  For x86, we've optimized our
> popcount code to use SSE4.2 or AVX-512, and for AArch64, we've optimized it
> to use Neon or SVE.  And for other systems, we still try to use
> __builtin_popcount() and friends in the fallback paths, which IIUC are
> available on both gcc and clang (and maybe elsewhere).  IMHO we need to run
> the benchmarks on a compiler/architecture combination where it would
> actually be used in practice.

Yes, I saw that the code is on a rather obscure path, but those machines were my only options for quick benchmarks.
I reasoned that since the code path still exists, eliminating the branching there would (most probably) be beneficial anyway. But you are right, we need to test it on the target architectures/compilers. I'll try to do that.

---
Cheers,
Andy