RE: Popcount optimization using AVX512

Поиск
Список
Период
Сортировка
От Shankaran, Akash
Тема RE: Popcount optimization using AVX512
Дата
Msg-id PH0PR11MB5000C2258BF2804AAF7AAE27F27A2@PH0PR11MB5000.namprd11.prod.outlook.com
обсуждение исходный текст
Ответ на Re: Popcount optimization using AVX512  (Nathan Bossart <nathandbossart@gmail.com>)
Ответы Re: Popcount optimization using AVX512  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Список pgsql-hackers
Sorry for the late response. We did some further testing and research on our end, and ended up modifying the AVX512
basedalgorithm for popcount. We removed a scalar dependency and accumulate the results of popcnt instruction in a zmm
register,only performing the reduce add at the very end, similar to [0].
 

With the updated patch, we observed significant improvements and handily beat the previous popcount algorithm
performance.No regressions in any scenario are observed:
 
Platform: Intel Xeon Platinum 8360Y (Icelake) for data sizes 1kb - 64kb.
Microbenchmark: 2x - 3x gains presently vs 19% previously, on the same microbenchmark described initially in this
thread.
 

PG testing: 
SQL bit_count() calls popcount. Using a Postgres benchmark calling "select bit_count(bytea(col1)) from mytable" on a
tablewith ~2M text rows, each row 1-12kb in size, we observe (only comparing with 64bit PG implementation, which is the
fastest):
 

1. Entire benchmark using AVX512 implementation vs PG 64-bit impl runs 6-13% faster. 
2. Reduce time spent on pg_popcount() method in postgres server during the benchmark: 
    o    64bit (current PG):           29.5% 
    o    AVX512:                      3.3%
3. Reduce number of samples processed by popcount: 
    o    64bit (current PG):          2.4B samples
    o    AVX512:                    285M samples

Compile above patch (on a machine supporting AVX512 vpopcntdq) using: make all CFLAGS_AVX512="-DHAVE__HW_AVX512_POPCNT
-mavx-mavx512vpopcntdq -mavx512f -march=native
 
Attaching flamegraphs and patch for above observations. 

[0] https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx512-vpopcnt.cpp

Thanks,
Akash Shankaran 

-----Original Message-----
From: Nathan Bossart <nathandbossart@gmail.com> 
Sent: Wednesday, November 15, 2023 1:49 PM
To: Shankaran, Akash <akash.shankaran@intel.com>
Cc: Noah Misch <noah@leadboat.com>; Amonson, Paul D <paul.d.amonson@intel.com>; Tom Lane <tgl@sss.pgh.pa.us>; Matthias
vande Meent <boekewurm+postgres@gmail.com>; pgsql-hackers@lists.postgresql.org
 
Subject: Re: Popcount optimization using AVX512

On Wed, Nov 15, 2023 at 08:27:57PM +0000, Shankaran, Akash wrote:
> AVX512 has light and heavy instructions. While the heavy AVX512 
> instructions have clock frequency implications, the light instructions 
> not so much. See [0] for more details. We captured EMON data for the 
> benchmark used in this work, and see that the instructions are using 
> the licensing level not meant for heavy AVX512 operations. This means 
> the instructions for popcount : _mm512_popcnt_epi64(),
> _mm512_reduce_add_epi64() are not going to have any significant impact 
> on CPU clock frequency.
>
> Clock frequency impact aside, we measured the same benchmark for gains 
> on older Intel hardware and observe up to 18% better performance on 
> Intel Icelake. On older intel hardware, the popcntdq 512 instruction 
> is not present so it won’t work. If clock frequency is not affected, 
> rest of workload should not be impacted in the case of mixed workloads.

Thanks for sharing your analysis.

> Testing this on smaller block sizes < 8KiB shows that AVX512 compared 
> to the current 64bit behavior shows slightly lower performance, but 
> with a large variance. We cannot conclude much from it. The testing 
> with ANALYZE benchmark by Nathan also points to no visible impact as a 
> result of using AVX512. The gains on larger dataset is easily evident, 
> with less variance.
>
> What are your thoughts if we introduce AVX512 popcount for smaller 
> sizes as an optional feature initially, and then test it more 
> thoroughly over time on this particular use case?

I don't see any need to rush this.  At the very earliest, this feature would go into v17, which doesn't enter feature
freezeuntil April 2024.
 
That seems like enough time to complete any additional testing you'd like to do.  However, if you are seeing worse
performancewith this patch, then it seems unlikely that we'd want to proceed.
 

> Thoughts or feedback on the approach in the patch? This solution 
> should not impact anyone who doesn’t use the feature i.e. AVX512. Open 
> to additional ideas if this doesn’t seem like the right approach here.

It's true that it wouldn't impact anyone not using the feature, but there's also a decent chance that this code goes
virtuallyuntested.  As I've stated elsewhere [0], I think we should ensure there's buildfarm coverage for this kind of
architecture-specificstuff.
 

[0] https://postgr.es/m/20230726043707.GB3211130%40nathanxps13

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Masahiko Sawada
Дата:
Сообщение: Re: Make COPY format extendable: Extract COPY TO format implementations
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: Remove pthread_is_threaded_np() checks in postmaster