On 01/27/2018 12:22 AM, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
>> I suspect you're right the hash is biased to lohalf bits, as you wrote
>> in the 19/12 message.
>
> I don't see any bias in what it's doing, which is basically xoring the
> two halves and hashing the result. It's possible though that Todd's
> data set contains values in which corresponding bits of the high and
> low halves are correlated somehow, in which case the xor would produce
> a lot of cancellation and a relatively small number of distinct outputs.
>
Hmm, that makes more sense than what I wrote. Probably time to get some
sleep or drink more coffee, I guess.
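To convince myself of the cancellation effect, here is a minimal sketch in Python (not the actual C implementation). The fold64 helper is hypothetical: it only mirrors the "xor the two halves" step as I understand it, including the complemented high half for negative values, and it skips the final hashing of the result. The data is synthetic, with the high half equal to the low half, which is the most extreme correlation possible and certainly not Todd's real data set.

```python
def fold64(val: int) -> int:
    """XOR-fold a signed 64-bit integer into 32 bits (hypothetical helper)."""
    lo = val & 0xFFFFFFFF
    hi = (val >> 32) & 0xFFFFFFFF
    # Assumed sign handling: negative values xor with the complemented high
    # half, so that small negative numbers fold to their own low 32 bits.
    if val < 0:
        hi ^= 0xFFFFFFFF
    return lo ^ hi


# Synthetic data: high half == low half, so the xor cancels completely.
correlated = [(k << 32) | k for k in range(1, 1001)]
# Control: same low halves, but a constant high half.
uncorrelated = [(7 << 32) | k for k in range(1, 1001)]

distinct_correlated = len({fold64(v) for v in correlated})      # 1
distinct_uncorrelated = len({fold64(v) for v in uncorrelated})  # 1000
```

All 1000 correlated inputs fold to the same 32-bit value before the hash is even applied, so no hash function on top of the fold can separate them.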

BTW what do you think about the fact that we only really generate ~63%
of the possible hash values (see my message from 11/12)? That seems a
bit unfortunate, although not unexpected for a simple hash function.

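For what it's worth, ~63% is close to what a purely random mapping would give: mapping N inputs to N possible outputs uniformly at random hits an expected fraction 1 - (1 - 1/N)^N of them, which tends to 1 - 1/e, about 63.2%. That is only a back-of-the-envelope model and an assumption on my part, not an analysis of the actual hash function. A quick sketch in Python (the function names are made up for illustration):

```python
import math
import random


def expected_coverage(n: int) -> float:
    """Expected fraction of n outputs hit by n uniform draws: 1 - (1 - 1/n)**n.

    Requires n >= 2. log1p/expm1 keep precision when 1/n is tiny.
    """
    return -math.expm1(n * math.log1p(-1.0 / n))


def simulated_coverage(n: int, seed: int = 0) -> float:
    """Fraction of n outputs hit when n inputs are mapped uniformly at random."""
    rng = random.Random(seed)
    hit = {rng.randrange(n) for _ in range(n)}
    return len(hit) / n


model = expected_coverage(2**32)      # very close to 1 - 1/e
sample = simulated_coverage(1 << 16)  # small simulation, not the real hash
```

If that model applies, the ~63% coverage is roughly what any function behaving like a random mapping (rather than a permutation) of the 32-bit space would produce.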
> If we weren't bound by backwards compatibility, we could consider
> changing to logic more like "if the value is within the int4 range,
> apply int4hash, otherwise hash all 8 bytes normally". But I don't see
> how we can change that now that hash indexes are first-class
> citizens.
>
Yeah, I've been thinking about that too. But I think it's an issue only
for pg_upgraded clusters, which may have hash indexes (and also
hash-partitioned tables). So couldn't we use new hash functions in fresh
clusters and use the backwards-compatible ones in pg_upgraded ones?

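Just to make the alternative Tom describes concrete, here is a rough sketch in Python. All names are hypothetical, and BLAKE2 is only a stand-in for a generic byte hash (it is not what PostgreSQL uses). The point is merely that int8 values within the int4 range keep hashing exactly like int4, while everything else hashes all 8 bytes.

```python
import hashlib
import struct

INT4_MIN, INT4_MAX = -2**31, 2**31 - 1


def _hash_bytes(data: bytes) -> int:
    # Stand-in 32-bit hash over raw bytes; an assumption for illustration only.
    return int.from_bytes(hashlib.blake2b(data, digest_size=4).digest(), "little")


def hash_int4(val: int) -> int:
    """Hash a 32-bit signed integer via its 4-byte representation."""
    return _hash_bytes(struct.pack("<i", val))


def hash_int8_proposed(val: int) -> int:
    """Hash like int4 inside the int4 range, else hash all 8 bytes."""
    if INT4_MIN <= val <= INT4_MAX:
        return hash_int4(val)
    return _hash_bytes(struct.pack("<q", val))


# Values whose high and low halves are equal no longer collapse to one
# output, because both halves feed the hash instead of cancelling.
correlated = [(k << 32) | k for k in range(1, 1001)]
distinct = len({hash_int8_proposed(v) for v in correlated})
```

Cross-type equality (int4 vs. int8) is preserved for values in the int4 range, which is the only place it matters, but the outputs for larger values would differ from what existing hash indexes store.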
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services