Re: tweaking perfect hash multipliers

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: tweaking perfect hash multipliers
Дата
Msg-id 20200330183146.nfvqclsy73tkxuwd@alap3.anarazel.de
обсуждение исходный текст
Ответ на tweaking perfect hash multipliers  (John Naylor <john.naylor@2ndquadrant.com>)
Ответы Re: tweaking perfect hash multipliers  (John Naylor <john.naylor@2ndquadrant.com>)
Re: tweaking perfect hash multipliers  (John Naylor <john.naylor@2ndquadrant.com>)
Список pgsql-hackers
Hi,

On 2020-03-30 21:33:14 +0800, John Naylor wrote:
> Then I used the attached program to measure various combinations of
> compiled instructions using two constant multipliers iterating over
> bytes similar to a generated hash function.

It looks like you didn't attach the program?


> <cc> -O2 -Wall test-const-mult.c test-const-mult-2.c
> ./a.out
> Median of 3 with clang 10:
> 
>             lea, lea 0.181s
> 
>         lea, lea+add 0.248s
>       lea, shift+add 0.251s
> 
>   lea+add, shift+add 0.273s
> shift+add, shift+add 0.276s
> 
>       2 leas, 2 leas 0.290s
>      shift+add, imul 0.329s
> 
> Taking this with a grain of salt, it nonetheless seems plausible that
> a single lea could be faster than any two instructions here.

It's a bit complicated by the fact that there's more execution ports to
execute shift/add than there ports to compute some form of leas. And
some of that won't easily be measurable in a micro-benchmark, because
there'll be dependencies between the instruction preventing any
instruction level parallelism.

I think the form of lea generated here is among the ones that can only
be executed on port 1. Whereas e.g. an register+register/immediate add
can be executed on four different ports.

There's also a significant difference in latency that you might not see
in your benchmark. E.g. on coffee lake the relevant form of lea has a
latency of three cycles, but one independent lea can be "started" per
cycle (agner calls this "reciprocal throughput). Whereas a shift has a
latency of 1 cycle and a reciprocal throughput of 0.5 (lower is better),
add has a latency o 1 and a reciprocal throughput of 0.25.

See the tables in  https://www.agner.org/optimize/instruction_tables.pdf

I'm not really sure my musings above matter terribly much, but I just
wanted to point out why I'd not take too much stock in the above timings
in isolation. Even a very high latency wouldn't necessarily be penalized
in a benchmark with one loop iteration independent from each other, but
would matter in the real world.


Cool work!

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Recognizing superuser in pg_hba.conf
Следующее
От: Justin Pryzby
Дата:
Сообщение: Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace onthe fly