At 2015-01-02 16:46:29 +0200, hlinnakangas@vmware.com wrote:
>
> In the slicing-by-8 version, I wonder if it would be better to do
> single-byte loads to c0-c7, instead of two 4-byte loads and shifts.
Nope. I did some tests, and the sb8 code is slightly slower if I remove
the 0-7byte alignment loop, and significantly slower if I switch to one
byte loads for the whole thing. So I think we should leave that part as
it is, but:
> Would it even make sense to keep the crc variable in different byte
> order, and only do the byte-swap once in END_CRC32() ?
…this certainly does make a noticeable difference. Will investigate.
> The comments need some work. I note that there is no mention of the
> slicing-by-8 algorithm anywhere in the comments (in the first patch).
Will fix. (Unfortunately the widely cited original Intel paper about
slice-by-8 seems to have gone AWOL, but I'll find something.)
> Instead of checking for "defined(__GNUC__) || defined(__clang__)",
> should add an explicit configure test for __builtin_bswap32().
Will do.
Thanks again.
-- Abhijit