Re: Popcount optimization using AVX512
| От | Nathan Bossart | 
|---|---|
| Тема | Re: Popcount optimization using AVX512 | 
| Дата | |
| Msg-id | 20240401155338.GA2296072@nathanxps13 обсуждение исходный текст | 
| Ответ на | Re: Popcount optimization using AVX512 (Alvaro Herrera <alvherre@alvh.no-ip.org>) | 
| Ответы | Re: Popcount optimization using AVX512 | 
| Список | pgsql-hackers | 
On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote:
> On 2024-Mar-31, Nathan Bossart wrote:
>> +    popcnt = _mm512_reduce_add_epi64(accum);
>> +    return popcnt + pg_popcount_fast(buf, bytes);
> 
> Hmm, doesn't this arrangement cause an extra function call to
> pg_popcount_fast to be used here?  Given the level of micro-optimization
> being used by this code, I would have thought that you'd have tried to
> avoid that.  (At least, maybe avoid the call if bytes is 0, no?)
Yes, it does.  I did another benchmark on very small arrays and can see the
overhead.  This is the time in milliseconds to run pg_popcount() on an
array 1 billion times:
    size (bytes)  HEAD      AVX512-POPCNT
    1             1707.685  3480.424
    2             1926.694  4606.182
    4             3210.412  5284.506
    8             1920.703  3640.968
    16            2936.91   4045.586
    32            3627.956  5538.418
    64            5347.213  3748.212
I suspect that anything below 64 bytes will see this regression, as that is
the earliest point where there are enough bytes for ZMM registers.
We could avoid the call if there are no remaining bytes, but the numbers
for the smallest arrays probably wouldn't improve much, and that might
actually add some overhead due to branching.  The other option to avoid
this overhead is to put most of pg_bitutils.c into its header file so that
we can inline the call.
Reviewing the current callers of pg_popcount(), IIUC the only ones that are
passing very small arrays are the bit_count() implementations and a call in
the syslogger for a single byte.  I don't know how much to worry about the
overhead for bit_count() since there's presumably a bunch of other
overhead, and the syslogger one could probably be fixed via an inline
function that pulled the value from pg_number_of_ones (which would probably
be an improvement over the status quo, anyway).  But this is all to save a
couple of nanoseconds...
-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
		
	В списке pgsql-hackers по дате отправления: