On Mon, Jun 12, 2023, at 21:58, Tomas Vondra wrote:
> But those are numbers for large keys - if you restrict the input to
> 4B-16B (which is what we planned to do by focusing on int, bigint and
> uuid), there's no significant difference:
Oh, sorry, I completely misread the meaning of the columns.
> lookup3:
>
> Small key speed test - 4-byte keys - 30.17 cycles/hash
> Small key speed test - 8-byte keys - 31.00 cycles/hash
> Small key speed test - 16-byte keys - 49.00 cycles/hash
>
> xxh3low:
>
> Small key speed test - 4-byte keys - 29.00 cycles/hash
> Small key speed test - 8-byte keys - 29.58 cycles/hash
> Small key speed test - 16-byte keys - 37.00 cycles/hash
The winner of the "Small key speed test" competition seems to be:
ahash64 "ahash 64bit":
Small key speed test - 4-byte keys - 24.00 cycles/hash
Small key speed test - 8-byte keys - 24.00 cycles/hash
Small key speed test - 16-byte keys - 26.98 cycles/hash
Looks like it's a popular one, e.g. it's the default hasher of Rust's
hashbrown crate, which underlies std::collections::HashMap and HashSet
(std itself defaults to SipHash-1-3, but many users switch to ahash).
Another notable property of ahash64 is that it's DoS-resistant,
though that isn't crucial for our use case, since we exclusively target
auto-generated primary keys that aren't influenced by end users.
> Not sure what you mean by "optimizing for space" - I imagined the
> on-disk format would just use the same hash table with tiny amount of
> free space (say 10% and not ~50%).
With "optimizing for space" I meant trying to find some alternative or
intermediate data structure that is more likely to be compressible,
like your idea of sorting the data.
What I wondered was whether the on-disk format would be affected by
the choice of hash function. I guess it wouldn't, if the hashset
is rebuilt by reading the on-disk format and adding the
elements one by one.
But I thought maybe something more advanced could be
done, where conversion between the on-disk format
and the in-memory format avoids naively
iterating through all elements, i.e. something cheaper
than O(n).
No idea what that mechanism would be, though.
> My suggestion is to be lazy, just use the lookup3 we have in hashfn.c
> (through hash_bytes or something), and at the same time make it possible
> to switch to a different function in the future. I'd store and ID of the
> hash function in the set, so that we can support a different algorithm
> in the future, if we choose to.
Sounds good to me.
Smart idea to include the hash function algorithm ID in the set,
so that a different algorithm can be adopted in the future!
/Joel