Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> I suspect you're right the hash is biased to lohalf bits, as you wrote
> in the 19/12 message.

I don't see any bias in what it's doing, which is basically xoring the
two halves and hashing the result. It's possible though that Todd's
data set contains values in which corresponding bits of the high and
low halves are correlated somehow, in which case the xor would produce
a lot of cancellation and a relatively small number of distinct outputs.

If we weren't bound by backwards compatibility, we could consider changing
to logic more like "if the value is within the int4 range, apply int4hash,
otherwise hash all 8 bytes normally". But I don't see how we can change
that now that hash indexes are first-class citizens.

In any case, we still need a fix for the behavior that the hash table size
is blown out by lots of collisions, because that can happen no matter what
the hash function is. Andres seems to have dropped the ball on doing
something about that.

			regards, tom lane