Discussion: Combining hash values

Combining hash values

From:
Thomas Munro
Date:
Hi hackers,

Andres Freund asked me off list what I thought about this part of
execGrouping.c, which builds a hash from the hashes of several datums
in a loop:
    /* rotate hashkey left 1 bit at each step */
    hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
    ...
    hashkey ^= hkey;

I think we should consider improving it and putting it somewhere central.

1.  The same code also appears in nodeHash.c, and we also need to do
the same thing in a couple of new projects that my EDB colleagues and
I are working on for proposal as 10.0 features, based on DSM-backed
hash tables.  So I think it would make sense to define a hash_combine
function or macro to be reused by all such places rather than
repeating it everywhere.
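
To make that concrete, here is a minimal sketch of what such a shared
helper could look like (the name, signature and eventual location are
just illustrative, not a worked-out patch):

    /*
     * Sketch only: the rotate-and-XOR step currently repeated in
     * execGrouping.c and nodeHash.c, factored into one inline helper.
     */
    static inline uint32
    hash_combine(uint32 hashkey, uint32 hkey)
    {
        /* rotate hashkey left 1 bit at each step */
        hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
        return hashkey ^ hkey;
    }

Callers in execGrouping.c and nodeHash.c would then just do
hashkey = hash_combine(hashkey, hkey) for each datum.  That deals with
the duplication, but not with the quality question below.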

2.  I suspect that this algorithm for combining hashes is weak, and
could amplify weaknesses in the hash functions feeding it.  Compare
Boost's hash_combine, which does some more work before XORing:
   seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);

That constant approximates the golden ratio (as a fraction of the 32
bit hash space), and it also appears in our hash_any and hash_uint32
functions.  I don't claim to understand the mathematics but I think
this may have to do with TAOCP section 6, theorem S and exercises 8 and
9, though I'm not sure if it's being used for the same purpose in
algorithms that add it (is it just some random bits? [1][2]) and
algorithms that multiply by it [3][4].  Perhaps we could write a
reusable hash_combine function/macro using existing code from
hashfunc.c, or use the formula from Boost, or something else, to
improve the quality of our hash combinations.  It would be very
interesting to hear from someone with expertise in this area.
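
For comparison, a Boost-style combiner transcribed into our types would
look roughly like this (just the formula above in C, with wrapping
32-bit arithmetic; a sketch, not a tested proposal):

    /*
     * Sketch of boost::hash_combine on uint32.  0x9e3779b9 is 2^32
     * divided by the golden ratio, the same constant that already
     * appears in hash_any() and hash_uint32().
     */
    static inline uint32
    hash_combine_boost(uint32 seed, uint32 hash)
    {
        seed ^= hash + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }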

Hash indexes don't currently support multiple column keys.  If they
ever do in future, they would also benefit from high quality
combination.  Assuming we'd use the same algorithm there too, changing
the combination algorithm after we start persisting the results to
disk in hash indexes would obviously be difficult.  There doesn't seem
to be any reason why we can't change the algorithm before then.

[1] http://stackoverflow.com/questions/4948780/magic-number-in-boosthash-combine
[2] https://xkcd.com/221/
[3] http://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a
[4] http://community.haskell.org/~simonmar/base/src/Data-HashTable.html

-- 
Thomas Munro
http://www.enterprisedb.com



Re: Combining hash values

From:
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> 2.  I suspect that this algorithm for combining hashes is weak, and
> could amplify weaknesses in the hash functions feeding it.

Very possibly, but ...

> Compare
> Boost's hash_combine, which does some more work before XORing:

>     seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);

I can't help being reminded of Knuth's story about how he tried to invent
the world's best random number generator, and was disappointed when
it almost immediately converged to a short repeating sequence.  If
there's any actual theoretical basis to the above, I'd be interested
to see it.  But as-is, the use of addition rather than XOR looks fishy,
and the way the old seed is shifted around looks more likely to result
in cancellation than anything else.

> That constant approximates the golden ratio (as a fraction of the 32
> bit hash space), and it also appears in our hash_any and hash_uint32
> functions.

I think it's just somebody's idea of a magic random number.  Your link
> [3] http://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a
provides some reasons to think it might be a good choice for a very
specific application, but this is not that application --- in particular,
it doesn't involve multiplying by that constant.
        regards, tom lane



Re: Combining hash values

From:
Thomas Munro
Date:
On Mon, Aug 1, 2016 at 4:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@enterprisedb.com> writes:
>> 2.  I suspect that this algorithm for combining hashes is weak, and
>> could amplify weaknesses in the hash functions feeding it.
>
> Very possibly, but ...

Concrete example: suppose a clever data type author defines a perfect
hash function that maps values to small integers.  In terms of hash
collisions, this performs optimally by definition in a single-column
hash join, aggregate or index, and seems like an entirely reasonable
thing to want to do, but it performs terribly with Postgres's hash combination
algorithm as soon as you use several columns.  The stuffing is all up
one end of the cushion, which is OK for a perfect hash on its own, but
we do very little to spread it around when combining.  For two columns
that hash to 8-bit integers, we map 16 bits of information to only 9
bits:

postgres=# select count(*) as distinct_outputs,
postgres-#        min(collisions),
postgres-#        max(collisions),
postgres-#        avg(collisions),
postgres-#        stddev(collisions)
postgres-#   from (
postgres(#     select s1.i # (s2.i << 1),
postgres(#            count(*) as collisions
postgres(#       from generate_series(0, 255) as s1(i)
postgres(# cross join generate_series(0, 255) as s2(i)
postgres(#      group by 1
postgres(#         ) ss;
┌──────────────────┬─────┬─────┬──────────────────────┬────────┐
│ distinct_outputs │ min │ max │         avg          │ stddev │
├──────────────────┼─────┼─────┼──────────────────────┼────────┤
│              512 │ 128 │ 128 │ 128.0000000000000000 │      0 │
└──────────────────┴─────┴─────┴──────────────────────┴────────┘
(1 row)

The Boost combiner does better.  If I have this right:

postgres=# create or replace function hash_combine(seed bigint, hash bigint)
postgres-# returns bigint as
postgres-# $$
postgres$#   select (seed # (hash + 2654435769 + (seed << 6) + (seed >> 2)))
postgres$#          % 4294967296;
postgres$# $$ language sql;
CREATE FUNCTION
postgres=#
postgres=# select count(*) as distinct_outputs,
postgres-#        min(collisions),
postgres-#        max(collisions),
postgres-#        avg(collisions),
postgres-#        stddev(collisions)
postgres-#   from (
postgres(#     select hash_combine(hash_combine(0, s1.i), s2.i),
postgres(#            count(*) as collisions
postgres(#       from generate_series(0, 255) as s1(i)
postgres(# cross join generate_series(0, 255) as s2(i)
postgres(#      group by 1
postgres(#         ) ss;
┌──────────────────┬─────┬─────┬────────────────────┬────────────────────────┐
│ distinct_outputs │ min │ max │        avg         │         stddev         │
├──────────────────┼─────┼─────┼────────────────────┼────────────────────────┤
│            16583 │   1 │   9 │ 3.9519990351564856 │ 0.70952439864334926106 │
└──────────────────┴─────┴─────┴────────────────────┴────────────────────────┘
(1 row)

Not as good as I hoped (maybe there is a bug in my query?), but not a
128-car pile-up.

There are probably other kinds of specialised (or poorly written) hash
functions that will work well or acceptably on their own but not so
well when combined without good mixing.  Like Boost, we are dealing
with hashes produced by arbitrary hash functions that can be supplied
by user code, not just by the high quality built-in function from Bob
Jenkins.

>> Compare
>> Boost's hash_combine, which does some more work before XORing:
>
>>     seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
>
> I can't help being reminded of Knuth's story about how he tried to invent
> the world's best random number generator, and was disappointed when
> it almost immediately converged to a short repeating sequence.  If
> there's any actual theoretical basis to the above, I'd be interested
> to see it.  But as-is, the use of addition rather than XOR looks fishy,
> and the way the old seed is shifted around looks more likely to result
> in cancellation than anything else.

I provided boost::hash_combine as an example only because it's widely
known and used in C++ circles and familiar to me, but so far I
have only taken it on authority that it works.  Here's what I managed
to dig up on it this morning:  According to C++ N3876[1], a proposal
to add std::hash_combine (but not a specific algorithm) to the C++
standard library, Boost's algorithm comes from a paper called "Methods
for Identifying Versioned and Plagiarised Documents (2003)"[2], where
you can see this recipe for shifting and summing as equation 1, with
some discussion and pointers to other papers.  But I see in N3876's
feedback (see the comment from Jeffrey Yasskin) that Bob Jenkins'
mixer (or at least an aspect of it: not requiring intermediate state
to fit in the final output size) is discussed as a potential
implementation.  We already have that in our source tree... perhaps we
should use it.
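
To illustrate what reusing that code might look like, here is a rough
sketch of a combiner built from the final() mixing rounds of lookup3
(the rounds themselves are Jenkins'; wrapping them up as a two-argument
combiner this way is just my sketch, not something from the tree):

    #define rot(x, k) (((x) << (k)) | ((x) >> (32 - (k))))

    /*
     * Sketch only: combine two 32-bit hash values using the final()
     * mixing rounds from Bob Jenkins' lookup3, which hashfunc.c already
     * contains for hash_any().
     */
    static inline uint32
    hash_combine_jenkins(uint32 a, uint32 b)
    {
        uint32 c = 0x9e3779b9;  /* arbitrary third word */

        c ^= b; c -= rot(b, 14);
        a ^= c; a -= rot(c, 11);
        b ^= a; b -= rot(a, 25);
        c ^= b; c -= rot(b, 16);
        a ^= c; a -= rot(c, 4);
        b ^= a; b -= rot(a, 14);
        c ^= b; c -= rot(b, 24);

        return c;
    }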

>> That constant approximates the golden ratio (as a fraction of the 32
>> bit hash space), and it also appears in our hash_any and hash_uint32
>> functions.
>
> I think it's just somebody's idea of a magic random number.  Your link
>> [3] http://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a
> provides some reasons to think it might be a good choice for a very
> specific application, but this is not that application --- in particular,
> it doesn't involve multiplying by that constant.

Yeah.  About its use in his 1997 algorithm, Bob Jenkins says:

"The golden ratio really is an arbitrary value. Its purpose is to
avoid mapping all zeros to all zeros."

[1] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf
[2] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.2680

-- 
Thomas Munro
http://www.enterprisedb.com

Re: Combining hash values

From:
Greg Stark
Date:
<p dir="ltr">Surely combining multiple hashes is the same problem as hashing a block of memory? Shouldn't we just use
thesame algorithm as hash_any()?<p dir="ltr">I was originally going to suggest using a crc  to combine but iirc we
changedhash_any() a while back and decided against using crc. I don't know if that was wise but wouldn't want to
suggestrelitigating that. 

Re: Combining hash values

From:
Dean Rasheed
Date:
On 1 August 2016 at 08:19, Greg Stark <stark@mit.edu> wrote:
> Surely combining multiple hashes is the same problem as hashing a block of
> memory? Shouldn't we just use the same algorithm as hash_any()?
>

Yes, I imagine that should work well. I suspect that Thomas's proposal
would also probably work well, as would a number of other hashing
algorithms with reasonable pedigree, such as the one used for array
hashing. I don't have any particular preference, but I do know that
what usually turns out to be disastrous is an arbitrary made-up
formula like rotating and xor'ing. The last thing we should attempt to
do is invent our own hashing algorithms.

On that subject, while looking at hashfunc.c, I spotted that
hashint8() has a very obvious deficiency, which causes disastrous
performance with certain inputs:

create table foo (a bigint);
insert into foo select (x*2^32)+x from generate_series(1,1000000) g(x);
analyse foo;
select * from foo f1, foo f2 where f1.a=f2.a;

With random values that query (using a hash join) takes around 2
seconds on my machine, but with the values above I estimate it will
take 20 hours (although I didn't wait that long!). The reason is
pretty obvious -- all the bigint values above hash to the same value.
I'd suggest using hash_uint32() for values that fit in a 32-bit
integer and hash_any() otherwise.
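
For illustration, a rough sketch of what I mean (untested, and of course
it would change the values stored in existing hash indexes):

    #include "postgres.h"
    #include "fmgr.h"
    #include "access/hash.h"

    /*
     * Sketch only: hash int8 values that fit in 32 bits exactly as int4
     * does (needed so cross-type hash joins among int2/int4/int8 still
     * work), and feed the full 8 bytes through hash_any() otherwise.
     */
    Datum
    hashint8(PG_FUNCTION_ARGS)
    {
        int64       val = PG_GETARG_INT64(0);

        if (val >= PG_INT32_MIN && val <= PG_INT32_MAX)
            return hash_uint32((uint32) val);

        return hash_any((unsigned char *) &val, sizeof(val));
    }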

Regards,
Dean



Re: Combining hash values

From:
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> I was originally going to suggest using a crc  to combine but iirc we
> changed hash_any() a while back and decided against using crc. I don't know
> if that was wise but wouldn't want to suggest relitigating that.

Nah, CRCs are designed to solve a different problem, ie detecting
single-bit and burst errors with high probability.  In particular, I don't
think they make any particular promises with respect to spreading changes
into all bits of the result.  That's important for our hash functions
because we usually take just the lowest N bits of the result as being
adequately randomized.
        regards, tom lane



Re: Combining hash values

From:
Tom Lane
Date:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On that subject, while looking at hashfunc.c, I spotted that
> hashint8() has a very obvious deficiency, which causes disastrous
> performance with certain inputs:

Well, if you're trying to squeeze 64 bits into a 32-bit result, there
are always going to be collisions somewhere.

> I'd suggest using hash_uint32() for values that fit in a 32-bit
> integer and hash_any() otherwise.

Perhaps, but this'd break existing hash indexes.  That might not be
a fatal objection, but if we're going to put users through that
I'd like to think a little bigger in terms of the benefits we get.
I've thought for some time that we needed to move to 64-bit hash function
results, because the size of problem that's reasonable to use a hash join
or hash aggregation for keeps increasing.  Maybe we should do that and fix
hashint8 as a side effect.
        regards, tom lane



Re: Combining hash values

From:
Robert Haas
Date:
On Mon, Aug 1, 2016 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
>> On that subject, while looking at hashfunc.c, I spotted that
>> hashint8() has a very obvious deficiency, which causes disastrous
>> performance with certain inputs:
>
> Well, if you're trying to squeeze 64 bits into a 32-bit result, there
> are always going to be collisions somewhere.
>
>> I'd suggest using hash_uint32() for values that fit in a 32-bit
>> integer and hash_any() otherwise.
>
> Perhaps, but this'd break existing hash indexes.  That might not be
> a fatal objection, but if we're going to put users through that
> I'd like to think a little bigger in terms of the benefits we get.
> I've thought for some time that we needed to move to 64-bit hash function
> results, because the size of problem that's reasonable to use a hash join
> or hash aggregation for keeps increasing.  Maybe we should do that and fix
> hashint8 as a side effect.

Well, considering that Amit is working on making hash indexes
WAL-logged in v10 [1], this seems like an awfully good time to get any
breakage we want to do out of the way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] https://www.postgresql.org/message-id/CAA4eK1LfzcZYxLoXS874Ad0+S-ZM60U9bwcyiUZx9mHZ-KCWhw@mail.gmail.com



Re: Combining hash values

From:
Robert Haas
Date:
On Mon, Aug 1, 2016 at 7:24 AM, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> On 1 August 2016 at 08:19, Greg Stark <stark@mit.edu> wrote:
>> Surely combining multiple hashes is the same problem as hashing a block of
>> memory? Shouldn't we just use the same algorithm as hash_any()?
>
> Yes, I imagine that should work well. I suspect that Thomas's proposal
> would also probably work well, as would a number of other hashing
> algorithms with reasonable pedigree, such as the one used for array
> hashing. I don't have any particular preference, but I do know that
> what usually turns out to be disastrous is an arbitrary made-up
> formula like rotating and xor'ing. The last thing we should attempt to
> do is invent our own hashing algorithms.

+1.  (x << 1) | y isn't the stupidest way of combining hash values
anybody's ever invented, but there are surely others that are better.
I don't much care whether we adopt Thomas's proposal or Greg's or
something else, but I can't see why we'd stick with (x << 1) | y when
better approaches are known.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Combining hash values

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 1, 2016 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Perhaps, but this'd break existing hash indexes.  That might not be
>> a fatal objection, but if we're going to put users through that
>> I'd like to think a little bigger in terms of the benefits we get.
>> I've thought for some time that we needed to move to 64-bit hash function
>> results, because the size of problem that's reasonable to use a hash join
>> or hash aggregation for keeps increasing.  Maybe we should do that and fix
>> hashint8 as a side effect.

> Well, considering that Amit is working on making hash indexes
> WAL-logged in v10 [1], this seems like an awfully good time to get any
> breakage we want to do out of the way.

Yeah, the compatibility stakes would go up quite a bit as soon as that
happens, because then users might start using hash indexes without the
expectation of having to rebuild them at random.

(BTW, this line of reasoning also says that if Amit finds he needs to
twiddle the on-disk format a bit to make WAL logging work, it would not
be a problem.)
        regards, tom lane