Greg Stark <gsstark@mit.edu> writes:
> Well it was a bit of a pain but I filled an array with (1/1000 scaled
> down) values and then permuted them. I also went ahead and set the
> low-order bits to random values since the lookup table based algorithm
> might be affected by it.
> The results are a bit disappointing on my machine, only the CLZ and
> lookup table come out significantly ahead:
> clz 1.530s
> lookup table 1.720s
> float hack 4.424s
> unrolled 5.280s
> normal 5.369s
It strikes me that we could assume that the values are < 64K and hence
drop the first case in the lookup table code. I've added that variant
and get these results on my machines:
x86_64 (Xeon):
clz 15.357s   lookup table 16.582s   small lookup table 16.705s
float hack 25.138s   unrolled 64.630s   normal 79.025s

PPC:
clz 3.842s   lookup table 7.298s   small lookup table 8.799s
float hack 19.418s   unrolled 7.656s   normal 8.949s

HPPA:
clz (n/a)   lookup table 11.515s   small lookup table 10.803s
float hack 16.502s   unrolled 17.632s   normal 19.754s
Not sure why the "small lookup table" variant actually seems slower
than the original on two of these machines; it can hardly be slower in
reality since it's strictly less code. Maybe some weird code-alignment
issue?
It also seems weird that the x86_64 is now showing a much bigger gap
between clz and "normal" than before. I don't see how branch prediction
would do much for the "normal" code.
regards, tom lane