Discussion: Hash support for arrays


Hash support for arrays

From
Tom Lane
Date:
I was reminded today that I'd promised to do $SUBJECT a while back.
It's worth having so that hash joins and hash aggregation can work
on array values.  I believe this is a fairly straightforward
exercise:

1. Add a hash opclass that accepts ANYARRAY, similar to the existing
btree opclass for ANYARRAY.  It will work for any array type whose
element type has a hash opclass.

2. Coding is much like the array_cmp support code, including caching
the lookup of the element type's hash function.

3. To hash, apply the element type's hash function to each array
element.  Combine these values by rotating the accumulator left
one bit at each step and xor'ing in the next element's hash value.
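
In C terms, the combining step would look something like this (just a
sketch; the function name is illustrative):

    #include <stdint.h>

    static uint32_t
    hash_array_combine(uint32_t accum, uint32_t elem_hash)
    {
        /* rotate the accumulator left one bit ... */
        accum = (accum << 1) | (accum >> 31);
        /* ... and xor in the next element's hash value */
        return accum ^ elem_hash;
    }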

Thoughts?  In particular, is anyone aware of a better method
for combining the element hash values?
        regards, tom lane


Re: Hash support for arrays

From
marcin mank
Date:
On Sat, Oct 30, 2010 at 6:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> 3. To hash, apply the element type's hash function to each array
> element.  Combine these values by rotating the accumulator left
> one bit at each step and xor'ing in the next element's hash value.
>
> Thoughts?  In particular, is anyone aware of a better method
> for combining the element hash values?
>

This would make the hash the same for arrays with elements 32 apart swapped.

This is what boost does:
http://www.systomath.com/include/Boost-1_34/doc/html/boost/hash_combine.html
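
The mixing step there, transcribed to C (boost applies it to size_t;
this is the 32-bit form):

    #include <stdint.h>

    /* boost::hash_combine's mixing step */
    static uint32_t
    boost_hash_combine(uint32_t seed, uint32_t value)
    {
        seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }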

Greetings


Re: Hash support for arrays

From
Tom Lane
Date:
marcin mank <marcin.mank@gmail.com> writes:
> On Sat, Oct 30, 2010 at 6:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 3. To hash, apply the element type's hash function to each array
>> element.  Combine these values by rotating the accumulator left
>> one bit at each step and xor'ing in the next element's hash value.
>> 
>> Thoughts?  In particular, is anyone aware of a better method
>> for combining the element hash values?

> This would make the hash the same for arrays with elements 32 apart swapped.

Well, there are *always* going to be cases where you get the same hash
value for two different inputs; it's unavoidable given that you have to
combine N 32-bit hash values into only one 32-bit output.

> This is what boost does:
> http://www.systomath.com/include/Boost-1_34/doc/html/boost/hash_combine.html

Hmm.  I am reminded of Knuth's famous dictum: "never generate random
numbers with a method chosen at random".  Is there any actual theory
behind that algorithm, and if so what is it?  The combination of
shifting with addition (not xor) seems more likely to lead to weird
cancellations than any improvement in the hash behavior.
        regards, tom lane


Re: Hash support for arrays

From
Greg Stark
Date:
On Sat, Oct 30, 2010 at 9:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thoughts?  In particular, is anyone aware of a better method
> for combining the element hash values?
>

The obvious thing to do seems like it would be to feed the individual
values back into the regular hash function.

--
greg


Re: Hash support for arrays

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> On Sat, Oct 30, 2010 at 9:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thoughts?  In particular, is anyone aware of a better method
>> for combining the element hash values?

> The obvious thing to do seems like it would be to feed the individual
> values back into the regular hash function.

Hmm, you mean use hash_any on the sequence of hash values?  Interesting
idea; but hash_any doesn't look very amenable to being invoked
incrementally, and I don't think I want to construct a temp array with
enough space for the hashes of all the items in the input array ...
        regards, tom lane


Re: Hash support for arrays

From
Greg Stark
Date:
On Sat, Oct 30, 2010 at 1:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm, you mean use hash_any on the sequence of hash values?  Interesting
> idea; but hash_any doesn't look very amenable to being invoked
> incrementally, and I don't think I want to construct a temp array with
> enough space for the hashes of all the items in the input array ...

I had rather hoped without evidence that it would be easy enough to
use incrementally. Looking now I see your point; it doesn't lend
itself to that at all.

I suppose you could use CRC, where our interface does allow incremental use.

I'm not really familiar enough with the math to know whether CRC is
any better than your XOR mixing. I suspect they're pretty similar. I
suppose if you have arrays of various lengths containing repetitions
of a single value then the non-CRC would produce a hash of 0 for all
arrays with a length that's a multiple of 32. I'm not sure what CRC
would do.

Oh, one question that occurred to me: you didn't mention including the
bounds of the array or the null bitmap in the hash function. I suppose
it's correct with or without them (or, in the case of the null bitmap,
another way to put it is that the choice of the hash value to represent
null is arbitrary).


--
greg


Re: Hash support for arrays

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> I suppose you could use CRC, where our interface does allow incremental use.

Hm, that's a thought.  I'm not sure though whether CRC really has the
behavior you want from a hash function, in particular that all the bits
are independent (a/k/a taking N low-order bits is a reasonable way to
cut the hash down to a suitable bucket index).  Anybody know?

> I'm not really familiar enough with the math to know whether CRC is
> any better than your XOR mixing. I suspect they're pretty similar. I
> suppose if you have arrays of various lengths containing repetitions
> of a single value then the non-CRC would produce a hash of 0 for all
> arrays with a length that's a multiple of 32. I'm not sure what CRC
> would do.

Some quick experimenting suggests that you get -1 from an array of 32 of
the same value, then 0 from 64 of the same, etc.  This doesn't bother me
particularly though.  There are always going to be some situations that
produce degenerate hash values.
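
(For illustration, a quick standalone test of the rotate-and-xor
combiner on repeated values; it assumes 32-bit unsigned:)

    #include <stdio.h>

    int
    main(void)
    {
        unsigned    h = 1;      /* an element hash with odd bit parity */
        unsigned    acc = 0;
        int         n;

        for (n = 1; n <= 64; n++)
        {
            /* rotate left one bit, then xor in the element hash */
            acc = ((acc << 1) | (acc >> 31)) ^ h;
            if (n == 32 || n == 64)
                printf("n = %d: %08x\n", n, acc);
        }
        return 0;
    }

This prints ffffffff (i.e. -1) at n = 32 and 00000000 at n = 64: after
32 steps the accumulator is the xor of all 32 rotations of h, which is
all-ones when h has odd bit parity and zero when it has even parity.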

> Oh, one question that occurred to me: you didn't mention including the
> bounds of the array or the null bitmap in the hash function. I suppose
> it's correct with or without them (or, in the case of the null bitmap,
> another way to put it is that the choice of the hash value to represent
> null is arbitrary).

Yeah.  I have it coded to treat a null element as if it had a hash value
of zero.  I don't see a great need to consider the array bounds per se.
        regards, tom lane


Re: Hash support for arrays

From
hernan gonzalez
Date:
> Hmm.  I am reminded of Knuth's famous dictum: "never generate random
> numbers with a method chosen at random".  Is there any actual theory
> behind that algorithm, and if so what is it?  The combination of
> shifting with addition (not xor) seems more likely to lead to weird
> cancellations than any improvement in the hash behavior.

For what it's worth: the standard way in Java is multiplying by 31, which amounts to a 5-bit shift and a subtraction.
In general, a prime number is recommended, though one wishes Knuth had made some analysis here... it all seems mostly empirical.

Perhaps the 5-bit shift is related to the spread of ASCII code points, as that hashCode() implementation was mainly tested and implemented for String. But now it's also suggested for general objects, and it's even specified for the standard collections, for example:
http://download.oracle.com/javase/6/docs/api/java/util/List.html#hashCode()
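
That contract, transcribed to C over precomputed element hashes (null
elements contribute 0):

    #include <stddef.h>
    #include <stdint.h>

    /* java.util.List.hashCode()'s combining rule */
    static uint32_t
    java_list_hash(const uint32_t *elem_hashes, size_t n)
    {
        uint32_t    hash = 1;
        size_t      i;

        for (i = 0; i < n; i++)
            hash = 31 * hash + elem_hashes[i];
        return hash;
    }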


Hernán J. González

Re: Hash support for arrays

From
Robert Haas
Date:
On Sat, Oct 30, 2010 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> marcin mank <marcin.mank@gmail.com> writes:
>> On Sat, Oct 30, 2010 at 6:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> 3. To hash, apply the element type's hash function to each array
>>> element.  Combine these values by rotating the accumulator left
>>> one bit at each step and xor'ing in the next element's hash value.
>>>
>>> Thoughts?  In particular, is anyone aware of a better method
>>> for combining the element hash values?
>
>> This would make the hash the same for arrays with elements 32 apart swapped.
>
> Well, there are *always* going to be cases where you get the same hash
> value for two different inputs; it's unavoidable given that you have to
> combine N 32-bit hash values into only one 32-bit output.

Sure.  The goal is to make those hard to predict, though.  I think
"multiply by 31 and add the next value" is a fairly standard way of
getting that behavior.  It mixes up the bits a lot more than just
left-shifting by a variable offset.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Hash support for arrays

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Oct 30, 2010 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> marcin mank <marcin.mank@gmail.com> writes:
>>> This would make the hash the same for arrays with elements 32 apart swapped.
>> 
>> Well, there are *always* going to be cases where you get the same hash
>> value for two different inputs; it's unavoidable given that you have to
>> combine N 32-bit hash values into only one 32-bit output.

> Sure.  The goal is to make those hard to predict, though.

Really?  I think "I don't understand when this fails" isn't obviously
better than being able to predict when it fails ...

> I think
> "multiply by 31 and add the next value" is a fairly standard way of
> getting that behavior.  It mixes up the bits a lot more than just
> left-shifting by a variable offset.

What concerns me about that is that it tends to push the bits to the
left --- I think the effects of the earlier inputs are going to overflow
out as you incorporate a lot of newer inputs.  What you want is a scheme
where every input item affects all the bits of the final result about
equally, and my gut feeling is this doesn't provide that.

I'd be happier about this approach if there were some actual theory
behind it ... maybe there's some analysis out there, but the one link
that was given was just about entirely unconvincing.
        regards, tom lane


Re: Hash support for arrays

From
Alvaro Herrera
Date:
Excerpts from Tom Lane's message of Tue Nov 02 15:21:31 -0300 2010:

> What concerns me about that is that it tends to push the bits to the
> left --- I think the effects of the earlier inputs are going to overflow
> out as you incorporate a lot of newer inputs.  What you want is a scheme
> where every input item affects all the bits of the final result about
> equally, and my gut feeling is this doesn't provide that.

Ahh, now I understand what you meant with "rotating" the bits in the
original email in this thread.  You're not simply shifting.  Clever.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: Hash support for arrays

From
Robert Haas
Date:
On Tue, Nov 2, 2010 at 11:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sat, Oct 30, 2010 at 10:01 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> marcin mank <marcin.mank@gmail.com> writes:
>>>> This would make the hash the same for arrays with elements 32 apart swapped.
>>>
>>> Well, there are *always* going to be cases where you get the same hash
>>> value for two different inputs; it's unavoidable given that you have to
>>> combine N 32-bit hash values into only one 32-bit output.
>
>> Sure.  The goal is to make those hard to predict, though.
>
> Really?  I think "I don't understand when this fails" isn't obviously
> better than being able to predict when it fails ...

Isn't that the whole point of hash functions?  Collisions are
inevitable, but you want them to be unpredictable.  If you want a hash
function with predictable collision behavior, just hash the first
element.  It'll be highly predictable AND wicked fast.

>> I think
>> "multiply by 31 and add the next value" is a fairly standard way of
>> getting that behavior.  It mixes up the bits a lot more than just
>> left-shifting by a variable offset.
>
> What concerns me about that is that it tends to push the bits to the
> left --- I think the effects of the earlier inputs are going to overflow
> out as you incorporate a lot of newer inputs.  What you want is a scheme
> where every input item affects all the bits of the final result about
> equally, and my gut feeling is this doesn't provide that.

I don't think so.  That would be a problem if you multiplied by an
even number.  You can see that you don't get dups if you just add in
the same value over and over:

#include <stdio.h>

int
main(int argc, char **argv)
{
    unsigned    hv = 1;
    unsigned    i;

    for (i = 0; i < 1000000; ++i)
    {
        hv = hv * 31 + 1;
        printf("%08x\n", hv);
    }
    return 0;
}

No dups.  Also, if you change that first hv = 1 line to hv = 2 and
rerun, that sequence also has no dups, and no collisions with the hv =
1 sequence.

> I'd be happier about this approach if there were some actual theory
> behind it ... maybe there's some analysis out there, but the one link
> that was given was just about entirely unconvincing.

I think it's from Knuth, though unfortunately I don't have a copy to
check.  There are probably better algorithms out there, but this one's
pretty simple.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Hash support for arrays

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> The goal is to make those hard to predict, though.
>>
>> Really?  I think "I don't understand when this fails" isn't
>> obviously better than being able to predict when it fails ...
> 
> Isn't that the whole point of hash functions?  Collisions are
> inevitable, but you want them to be unpredictable.

Are you sure you're not confusing attributes of a good random number
generator with hash generation?  (They have much in common, but I've
only seen concern with "hard to predict" when working on random
number algorithms, as for financial audits or jury selection.)

> If you want a hash function with predictable collision behavior,
> just hash the first element.  It'll be highly predictable AND
> wicked fast.

You're confusing "unpredictable" with "widely distributed", I think.
There's no reason that the hash value of an integer of the same size
as the hash shouldn't be the integer itself, for example.  It's hard
to get more predictable than that, yet it works well for hash
lookups.

I know that for a hash of a character string, the total = (total *
31) + nextcharacter has been shown to be effective.  That's hardly
random or hard to predict, but it does tend to spread out the hash
values well.  Whether it is as good for arrays, I don't know.  It
seems like a reasonable place to start, though, since a character
string is rather like an array of characters.

>> What concerns me about that is that it tends to push the bits to
>> the left --- I think the effects of the earlier inputs are going
>> to overflow out as you incorporate a lot of newer inputs.  What
>> you want is a scheme where every input item affects all the bits
>> of the final result about equally, and my gut feeling is this
>> doesn't provide that.
>
> I don't think so.  That would be a problem if you multiplied by an
> even number.  You can see that you don't get dups if you just add
> in the same value over and over:

Yeah, that 1 in the low order position ensures that the impact of
any value which is ever added into the total is never entirely lost.

-Kevin


Re: Hash support for arrays

From
Robert Haas
Date:
On Tue, Nov 2, 2010 at 12:34 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
>> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>
>>>> The goal is to make those hard to predict, though.
>>>
>>> Really?  I think "I don't understand when this fails" isn't
>>> obviously better than being able to predict when it fails ...
>>
>> Isn't that the whole point of hash functions?  Collisions are
>> inevitable, but you want them to be unpredictable.
>
> Are you sure you're not confusing attributes of a good random number
> generator with hash generation?  (They have much in common, but I've
> only seen concern with "hard to predict" when working on random
> number algorithms, as for financial audits or jury selection.)
>
>> If you want a hash function with predictable collision behavior,
>> just hash the first element.  It'll be highly predictable AND
>> wicked fast.
>
> You're confusing "unpredictable" with "widely distributed", I think.
> There's no reason that the hash value of an integer of the same size
> as the hash shouldn't be the integer itself, for example.  It's hard
> to get more predictable than that, yet it works well for hash
> lookups.

Well, no, not really.  For example, it may be that you have a hash
table with N buckets and values of the form N+k, 2N+k, 3N+k, ...
and everything will collide.

If you do some mixing of the bits, that case is far less likely to
occur in practice.

See for example http://www.azillionmonkeys.com/qed/hash.html
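
A trivial illustration of that failure mode with an identity hash:
keys of the form i*N + k all land in bucket k.

    #include <stdio.h>

    int
    main(void)
    {
        unsigned    N = 1024;   /* number of buckets */
        unsigned    k = 7;
        unsigned    i;

        /* identity hash, bucket = key mod N: every key maps to 7 */
        for (i = 0; i < 4; i++)
            printf("key %u -> bucket %u\n", i * N + k, (i * N + k) % N);
        return 0;
    }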

> I know that for a hash of a character string, the total = (total *
> 31) + nextcharacter has been shown to be effective.  That's hardly
> random or hard to predict, but it does tend to spread out the hash
> values well.  Whether it is as good for arrays, I don't know.  It
> seems like a reasonable place to start, though, since a character
> string is rather like an array of characters.

That was my thought.

>>> What concerns me about that is that it tends to push the bits to
>>> the left --- I think the effects of the earlier inputs are going
>>> to overflow out as you incorporate a lot of newer inputs.  What
>>> you want is a scheme where every input item affects all the bits
>>> of the final result about equally, and my gut feeling is this
>>> doesn't provide that.
>>
>> I don't think so.  That would be a problem if you multiplied by an
>> even number.  You can see that you don't get dups if you just add
>> in the same value over and over:
>
> Yeah, that 1 in the low order position ensures that the impact of
> any value which is ever added into the total is never entirely lost.

Right...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Hash support for arrays

From
"Kevin Grittner"
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
>> There's no reason that the hash value of an integer of the same
>> size as the hash shouldn't be the integer itself, for example. 
>> It's hard to get more predictable than that, yet it works well
>> for hash lookups.
> 
> Well, no, not really.  For example, it may be that you have a hash
> table with N buckets and values of the form N+k, 2N+k, 3N+k,
> ... and everything will collide.

That hardly seems convincing if the integer is a sequentially
assigned number, as with many ID columns; but if you want an
algorithm which will work well with numbers which might be assigned
in a pattern with a high correlation with some unpredictable number
of buckets -- sure, it won't work well without some scrambling.  For
sequentially assigned numbers, that scrambling would have a cost and
would be of no value.  I guess it depends a bit on your use case and
goals.

-Kevin


Re: Hash support for arrays

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Nov 2, 2010 at 11:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Really?  I think "I don't understand when this fails" isn't obviously
>> better than being able to predict when it fails ...

> Isn't that the whole point of hash functions?  Collisions are
> inevitable, but you want them to be unpredictable.  If you want a hash
> function with predictable collision behavior, just hash the first
> element.  It'll be highly predictable AND wicked fast.

That seems like a rather poor straw man, since it suffers from exactly
the defect I'm complaining about, namely failing to consider all the
input values equally.

>> I'd be happier about this approach if there were some actual theory
>> behind it ... maybe there's some analysis out there, but the one link
>> that was given was just about entirely unconvincing.

> I think it's from Knuth, though unfortunately I don't have a copy to
> check.  There are probably better algorithms out there, but this one's
> pretty simple.

I don't see anything in Knuth suggesting a multiplier of 31.  His
recommendation for a multiplier, if you're going to use multiplicative
hashing, is wordsize/phi (phi being the golden ratio) ... and he also
wants you to keep the high order not the low order bits of the product.
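
In C terms, that recipe is roughly (a sketch; 2654435769 is
floor(2^32/phi)):

    #include <stdint.h>

    /* Knuth's multiplicative hashing: multiply by 2^32/phi and keep
     * the high-order bits (table_bits must be between 1 and 31) */
    static uint32_t
    knuth_mult_hash(uint32_t key, int table_bits)
    {
        return (key * 2654435769u) >> (32 - table_bits);
    }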

However, this is largely beside the point, because that theory, as well
as the Java code you're arguing from, has to do with the initial hashing
of a raw sequence of input items.  Not with combining some existing hash
values.  The rotate-and-xor method I suggested for that is borrowed
exactly from section 6.4 of Knuth (page 512, in the first edition of
volume 3).
        regards, tom lane


Re: Hash support for arrays

From
Kenneth Marshall
Date:
On Tue, Nov 02, 2010 at 04:42:19PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Tue, Nov 2, 2010 at 11:21 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Really?  I think "I don't understand when this fails" isn't obviously
> >> better than being able to predict when it fails ...
> 
> > Isn't that the whole point of hash functions?  Collisions are
> > inevitable, but you want them to be unpredictable.  If you want a hash
> > function with predictable collision behavior, just hash the first
> > element.  It'll be highly predictable AND wicked fast.
> 
> That seems like a rather poor straw man, since it suffers from exactly
> the defect I'm complaining about, namely failing to consider all the
> input values equally.
> 
> >> I'd be happier about this approach if there were some actual theory
> >> behind it ... maybe there's some analysis out there, but the one link
> >> that was given was just about entirely unconvincing.
> 
> > I think it's from Knuth, though unfortunately I don't have a copy to
> > check.  There are probably better algorithms out there, but this one's
> > pretty simple.
> 
> I don't see anything in Knuth suggesting a multiplier of 31.  His
> recommendation for a multiplier, if you're going to use multiplicative
> hashing, is wordsize/phi (phi being the golden ratio) ... and he also
> wants you to keep the high order not the low order bits of the product.
> 
> However, this is largely beside the point, because that theory, as well
> as the Java code you're arguing from, has to do with the initial hashing
> of a raw sequence of input items.  Not with combining some existing hash
> values.  The rotate-and-xor method I suggested for that is borrowed
> exactly from section 6.4 of Knuth (page 512, in the first edition of
> volume 3).
> 
>             regards, tom lane
> 

Given that our hash implementation mixes the input data well (It does.
I tested it.) then a simple rotate-and-xor method is all that should
be needed to maintain all of the needed information. The original
hash function has done the heavy lifting in this case.

Regards,
Ken


Re: Hash support for arrays

From
Robert Haas
Date:
On Nov 2, 2010, at 1:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, this is largely beside the point, because that theory, as well
> as the Java code you're arguing from, has to do with the initial hashing
> of a raw sequence of input items.  Not with combining some existing hash
> values.  The rotate-and-xor method I suggested for that is borrowed
> exactly from section 6.4 of Knuth (page 512, in the first edition of
> volume 3).

It seems undesirable to me to have a situation where transposing two
array elements doesn't change the hash value.  But I just work here.

...Robert

Re: Hash support for arrays

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Nov 2, 2010, at 1:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, this is largely beside the point, because that theory, as well
>> as the Java code you're arguing from, has to do with the initial hashing
>> of a raw sequence of input items.  Not with combining some existing hash
>> values.  The rotate-and-xor method I suggested for that is borrowed
>> exactly from section 6.4 of Knuth (page 512, in the first edition of
>> volume 3).

> It seems undesirable to me to have a situation where transposing two
> array elements doesn't change the hash value.  But I just work here.

[ shrug... ]  There are always going to be collisions, and you're
overstating the importance of this one (only some transpositions will
result in a duplicate hash, not any transposition).

What's actually *important*, for our purposes, is that all bits of the
final hash value depend in a reasonably uniform way on all bits of the
input hash values.  If we don't have that property, then bucket numbers
(which we form by taking the low-order N bits of the final hash, where
N isn't known in advance) won't be as well distributed as we'd like.
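
That is, bucket selection is just a mask over the low-order bits,
along these lines:

    #include <stdint.h>

    /* bucket number from a finished hash, for a table of 2^nbits buckets */
    static uint32_t
    hash_to_bucket(uint32_t hash, int nbits)
    {
        return hash & (((uint32_t) 1 << nbits) - 1);
    }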

It's possible that the multiply-by-31 method is as good as the
rotate-and-xor method by that measure, or even better; but it's far from
obvious that it's better.  And I'm not convinced that the multiply
method has a pedigree that should encourage me to take it on faith.
        regards, tom lane


Re: Hash support for arrays

From
Sam Mason
Date:
On Sat, Oct 30, 2010 at 01:01:44PM -0400, Tom Lane wrote:
> marcin mank <marcin.mank@gmail.com> writes:
> > This is what boost does:
> > http://www.systomath.com/include/Boost-1_34/doc/html/boost/hash_combine.html
> 
> Hmm.  I am reminded of Knuth's famous dictum: "never generate random
> numbers with a method chosen at random".  Is there any actual theory
> behind that algorithm, and if so what is it?  The combination of
> shifting with addition (not xor) seems more likely to lead to weird
> cancellations than any improvement in the hash behavior.

As far as I can tell the boost combiner comes from:
 http://goanna.cs.rmit.edu.au/~jz/fulltext/jasist-tch.pdf

Which seems to be solving a completely different problem and I can't see
how relevant it is to the design of a hash combiner.  The fact that they
get trivial details wrong, like 32-bit integers going up to 2^32 rather
than 2^32-1, doesn't inspire confidence.

One subtle point I noticed on the boost mailing list was that they
didn't want [[1],2] hashing to the same value as [1,2].  I'm pretty sure
that this sort of construction isn't possible to express in PG, but
thought I should mention it just in case.

--
  Sam  http://samason.me.uk/


Re: Hash support for arrays

From
Ron Mayer
Date:
Tom Lane wrote:
> It's possible that the multiply-by-31 method is as good as the
> rotate-and-xor method by that measure, or even better; but it's far from
> obvious that it's better.  And I'm not convinced that the multiply
> method has a pedigree that should encourage me to take it on faith.

Short summary: some historical trivia follows (nothing here suggests
it's better - just old and common and cheap).


Longer - some trivia regarding its pedigree:

It (or at least a *31 variant) seems to have a history of advocacy
going back to Chris Torek in 1990:
http://groups.google.com/group/comp.lang.c/browse_thread/thread/28c2095282f0c1b5/193be99e9836791b?q=#193be99e9836791b
  #define HASH(str, h, p) \
          for (p = str, h = 0; *p;) h = (h << 5) - h + *p++

and gets referred to in Usenet papers in the early 90's as well:
http://www.usenix.com/publications/library/proceedings/sa92/salz.pdf


Regarding "why the magic number 31 [or 33 which also often comes
up]" apparently the only thing magic about it is that it's an
odd number != 1.    The rest of the odd numbers work about as well
according to this guy who tried to explain it:

http://svn.eu.apache.org/repos/asf/apr/apr/trunk/tables/apr_hash.c

    * The magic of number 33, i.e. why it works better than many other
    * constants, prime or not, has never been adequately explained by
    * anyone. So I try an explanation: if one experimentally tests all
    * multipliers between 1 and 256 (as I did while writing a low-level
    * data structure library some time ago) one detects that even
    * numbers are not useable at all. The remaining 128 odd numbers
    * (except for the number 1) work more or less all equally well.
    * They all distribute in an acceptable way and this way fill a hash
    * table with an average percent of approx. 86%.
    *
    * If one compares the chi^2 values of the variants (see
    * Bob Jenkins ``Hashing Frequently Asked Questions'' at
    * http://burtleburtle.net/bob/hash/hashfaq.html for a description
    * of chi^2), the number 33 not even has the best value. But the
    * number 33 and a few other equally good numbers like 17, 31, 63,
    * 127 and 129 have nevertheless a great advantage to the remaining
    * numbers in the large set of possible multipliers: their multiply
    * operation can be replaced by a faster operation based on just one
    * shift plus either a single addition or subtraction operation. And
    * because a hash function has to both distribute good _and_ has to
    * be very fast to compute, those few numbers should be preferred.
    ...
 


Re: Hash support for arrays

From
Nicolas Barbier
Date:
2010/11/2 Kenneth Marshall <ktm@rice.edu>:

> Given that our hash implementation mixes the input data well (It does.
> I tested it.) then a simple rotate-and-xor method is all that should
> be needed to maintain all of the needed information. The original
> hash function has done the heavy lifting in this case.

Even with the perfect hash function for the elements, certain
combinations of elements could still lead to massive collisions. E.g.,
if repeated values are typical in the input data we are talking about,
then the rotate-and-xor method would still lead to collisions between
arrays of the same value at certain lengths, regardless of the
value. In Tom's implementation, as he mentioned before, those
problematical lengths would be multiples of 32 (e.g., an array of 32
1s would collide with an array of 32 2s, which would collide with an
array of 32 3s, etc.).

Nicolas


Re: Hash support for arrays

From
Kenneth Marshall
Date:
On Wed, Nov 03, 2010 at 10:24:16AM +0100, Nicolas Barbier wrote:
> 2010/11/2 Kenneth Marshall <ktm@rice.edu>:
> 
> > Given that our hash implementation mixes the input data well (It does.
> > I tested it.) then a simple rotate-and-xor method is all that should
> > be needed to maintain all of the needed information. The original
> > hash function has done the heavy lifting in this case.
> 
> Even with the perfect hash function for the elements, certain
> combinations of elements could still lead to massive collisions. E.g.,
> if repeated values are typical in the input data we are talking about,
> then the rotate-and-xor method would still lead to collisions between
> arrays of the same value at certain lengths, regardless of the
> value. In Tom's implementation, as he mentioned before, those
> problematical lengths would be multiples of 32 (e.g., an array of 32
> 1s would collide with an array of 32 2s, which would collide with an
> array of 32 3s, etc.).
> 
> Nicolas
> 

True. I just took another look at our defined hash functions and it
looks like we can make a simple variant of hash_uint32() that we
can use as a stream checksum. The only thing missing is the ability
to pass in the current 32-bit hash value as a starting seed for adding
the next 32-bit value. Something like this should work:

Datum
hash_uint32(uint32 k, uint32 initval)
{
    register uint32 a,
                    b,
                    c;

    a = b = c = 0x9e3779b9 + (uint32) sizeof(uint32) + 3923095 + initval;
    a += k;

    final(a, b, c);

    /* report the result */
    return UInt32GetDatum(c);
}

Then if you pass in the current value as the initval, it should mix
well each additional 32-bit hash value.
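
For example (hypothetical usage, with illustrative names, folding an
array of precomputed element hashes through the seeded variant):

    static uint32
    hash_element_stream(const uint32 *elem_hashes, int nelems)
    {
        uint32      running = 0;
        int         i;

        for (i = 0; i < nelems; i++)
            running = DatumGetUInt32(hash_uint32(elem_hashes[i],
                                                 running));
        return running;
    }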

Regards,
Ken


Re: Hash support for arrays

From
Dean Rasheed
Date:
On 3 November 2010 09:24, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
> 2010/11/2 Kenneth Marshall <ktm@rice.edu>:
>
>> Given that our hash implementation mixes the input data well (It does.
>> I tested it.) then a simple rotate-and-xor method is all that should
>> be needed to maintain all of the needed information. The original
>> hash function has done the heavy lifting in this case.
>
> Even with the perfect hash function for the elements, certain
> combinations of elements could still lead to massive collisions. E.g.,
> if repeated values are typical in the input data we are talking about,
> then the rotate-and-xor method would still lead to collisions between
> arrays of the same value at certain lengths, regardless of the
> value. In Tom's implementation, as he mentioned before, those
> problematical lengths would be multiples of 32 (e.g., an array of 32
> 1s would collide with an array of 32 2s, which would collide with an
> array of 32 3s, etc.).
>

Yeah, rotate-and-xor is a pretty weak hashing algorithm, since any
array of 32 identical elements will hash to either 0 or -1. Similarly
various permutations or multiples of that array length will cause it
to perform badly.

The multiply-by-m algorithm doesn't have that weakness, provided m is
chosen carefully. There are a couple of qualities a good algorithm
should possess:

1). The bits from the individual element hash values should be
distributed evenly so that no 2 different hash values would result in
the same contribution to the final value. This is easy to achieve -
just make sure that m is odd.

2). The way that each element's hash value bits are distributed should
be different from the way that every other element's hash value bits
are distributed. m=31 achieves this pretty well, although there are
plenty of other equally valid choices.
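
A minimal demonstration of point 1 (an odd multiplier is invertible
mod 2^32, so distinct hash values make distinct contributions; an even
multiplier loses the top bits):

    #include <stdio.h>

    int
    main(void)
    {
        unsigned    x = 0x12345678;
        unsigned    y = x ^ 0x80000000;     /* differs only in the top bit */

        /* odd m: the two contributions stay distinct */
        printf("m=31: %08x vs %08x\n", 31u * x, 31u * y);
        /* even m: the two contributions collide (mod 2^32) */
        printf("m=30: %08x vs %08x\n", 30u * x, 30u * y);
        return 0;
    }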

Regards,
Dean


Re: Hash support for arrays

From
Kenneth Marshall
Date:
On Thu, Nov 04, 2010 at 10:00:40AM +0000, Dean Rasheed wrote:
> On 3 November 2010 09:24, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
> > 2010/11/2 Kenneth Marshall <ktm@rice.edu>:
> >
> >> Given that our hash implementation mixes the input data well (It does.
> >> I tested it.) then a simple rotate-and-xor method is all that should
> >> be needed to maintain all of the needed information. The original
> >> hash function has done the heavy lifting in this case.
> >
> > Even with the perfect hash function for the elements, certain
> > combinations of elements could still lead to massive collisions. E.g.,
> > if repeated values are typical in the input data we are talking about,
> > then the rotate-and-xor method would still lead to collisions between
> > arrays of the same value at certain lengths, regardless of the
> > value. In Tom's implementation, as he mentioned before, those
> > problematical lengths would be multiples of 32 (e.g., an array of 32
> > 1s would collide with an array of 32 2s, which would collide with an
> > array of 32 3s, etc.).
> >
> 
> Yeah, rotate-and-xor is a pretty weak hashing algorithm, since any
> array of 32 identical elements will hash to either 0 or -1. Similarly
> various permutations or multiples of that array length will cause it
> to perform badly.
> 
> The multiply-by-m algorithm doesn't have that weakness, provided m is
> chosen carefully. There are a couple of qualities a good algorithm
> should possess:
> 
> 1). The bits from the individual element hash values should be
> distributed evenly so that no 2 different hash values would result in
> the same contribution to the final value. This is easy to achieve -
> just make sure that m is odd.
> 
> 2). The way that each element's hash value bits are distributed should
> be different from the way that every other element's hash value bits
> are distributed. m=31 achieves this pretty well, although there are
> plenty of other equally valid choices.
> 
> Regards,
> Dean
> 
Hi Dean,

In my comment yesterday, I included a simple function that would
allow us to leverage our current hash function's mixing process to
scramble the bits effectively while retaining the maximum amount of
information in the hash.

Regards,
Ken