Re: Change GUC hashtable to use simplehash?

Поиск
Список
Период
Сортировка
От Ants Aasma
Тема Re: Change GUC hashtable to use simplehash?
Дата
Msg-id CANwKhkP7pCiW_5fAswLhs71-JKGEz1c1+PC0a_w1fwY4iGMqUA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Change GUC hashtable to use simplehash?  (Jeff Davis <pgsql@j-davis.com>)
Ответы Re: Change GUC hashtable to use simplehash?  (John Naylor <johncnaylorls@gmail.com>)
Список pgsql-hackers
On Sun, 21 Jan 2024 at 03:06, Jeff Davis <pgsql@j-davis.com> wrote:
> Yes, thank you. I don't think we need to change the algorithm.

Jumping in here at a random point just to share my findings from
poking around this on and off. I am concentrating here on cstring
hashing as that is the most complicated one.

One thing that caught my eye in testing was that the unaligned cstring
code was unexpectedly faster for short strings (3-18B uniform
distribution). Looking into it the cause was  fasthash_accum() called
in the final iteration. In the unaligned case compiler (clang-15)
unrolled the inner loop which allowed it to jump directly into the
correct place in the switch. In the unaligned case clang decided to
use a data dependent jump which then mispredicts all of the time.

But given that we know the data length and we have it in a register
already, it's easy enough to just mask out data past the end with a
shift. See patch 1. Performance benefit is about 1.5x Measured on a
small test harness that just hashes and finalizes an array of strings,
with a data dependency between consecutive hashes (next address
depends on the previous hash output).

Unaligned case can actually take advantage of the same trick as the
aligned case, it just has to shuffle the data from two consecutive
words before applying the combine function. Patch 2 implements this.
It makes the unaligned case almost as fast as the aligned one, both on
short and long strings. 10% benefit on short strings, 50% on long
ones.

Not sure if the second one is worth the extra code. A different
approach would be to use the simple word at a time hashing for the
unaligned case too and handle word accesses that straddle a page
boundary as a special case. Obviously this only makes sense for
platforms that support unaligned access. On x86 unaligned access
within a cache line is basically free, and across cache lines is only
slightly more expensive. On benchmarks calling the aligned code on
unaligned strings only has a 5% penalty on long strings, short ones
are indistinguishable.

I also took a look at using SIMD for implementing the hash using the
same aligned access + shuffle trick. The good news is that the
shuffling works well enough that neither it nor checking for string
end are the longest chain. The bad news is that the data load,
alignment, zero finding and masking form a big dependency chain on the
first iteration. Mixing and finalization is even worse, fasthash uses
64bit imul instruction that has a 3 cycle latency, the iteration to
iteration chain is imul + xor, for 4 cycles or 2 B/cycle (in practice
a bit less due to ALU port contention). In SIMD registers there is no
64bit multiply, and 32 bit multiply has a terrible 10 cycle latency on
Intel. AES instructions are an interesting option, but it seems that 2
are needed for good enough mixing, at 4 cycles each, we again end up
at 2B/cycle. Finalization needs another 3 AES instructions, a shuffle
and a xor fold to pass SMHasher, for 17 cycles. The mix latency issue
could be worked around by doing more mixing in parallel, potentially
up to 8x faster, but this does not help short strings at all and would
make the code way bigger. SIMD code does use fewer instructions so it
interleaves better with nearby code that is not dependent on it, not
sure if that matters anywhere.

The short version is that for very long (4k+) strings the attached
SIMD code is 35% faster, for short strings it is 35% slower, and this
is very much x86-64-v3 only and would need a fallback when AVX and
AES-NI are not available. Basically a dead end for the use cases this
hash function is used for.

Regards,
Ants Aasma

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jelte Fennema-Nio
Дата:
Сообщение: Re: meson + libpq_pipeline
Следующее
От: Nathan Bossart
Дата:
Сообщение: Re: cleanup patches for incremental backup