Re: Hash support for arrays

Поиск
Список
Период
Сортировка
От Kenneth Marshall
Тема Re: Hash support for arrays
Дата
Msg-id 20101104123539.GE27429@aart.is.rice.edu
обсуждение исходный текст
Ответ на Re: Hash support for arrays  (Dean Rasheed <dean.a.rasheed@gmail.com>)
Список pgsql-hackers
On Thu, Nov 04, 2010 at 10:00:40AM +0000, Dean Rasheed wrote:
> On 3 November 2010 09:24, Nicolas Barbier <nicolas.barbier@gmail.com> wrote:
> > 2010/11/2 Kenneth Marshall <ktm@rice.edu>:
> >
> >> Given that our hash implimentation mixes the input data well (It does.
> >> I tested it.) then a simple rotate-and-xor method is all that should
> >> be needed to maintain all of the needed information. The original
> >> hash function has done the heavy lifting in this case.
> >
> > Even with the perfect hash function for the elements, certain
> > combinations of elements could still lead to massive collisions. E.g.,
> > if repeated values are typical in the input data we are talking about,
> > then the rotate-and-xor method would still lead to collisions between
> > any array of the same values of certain lengths, regardless of the
> > value. In Tom's implementation, as he mentioned before, those
> > problematical lengths would be multiples of 32 (e.g., an array of 32
> > 1s would collide with an array of 32 2s would collide with an array of
> > 32 3s, etc).
> >
> 
> Yeah, rotate-and-xor is a pretty weak hashing algorithm, since any
> array of 32 identical elements will hash to either 0 or -1. Similarly
> various permutations or multiples of that array length will cause it
> to perform badly.
> 
> The multiply-by-m algorithm doesn't have that weakness, provided m is
> chosen carefully. There are a couple of qualities a good algorithm
> should possess:
> 
> 1). The bits from the individual element hash values should be
> distributed evenly so that no 2 different hash values would result in
> the same contribution to the final value. This is easy to achieve -
> just make sure that m is odd.
> 
> 2). The way that each element's hash value bits are distributed should
> be different from the way that every other element's hash value bits
> are distributed. m=31 achieves this pretty well, although there are
> plenty of other equally valid choices.
> 
> Regards,
> Dean
> 
Hi Dean,

In my comment yesterday, I included a simple function that would
allow us to leverage our current hash functions mixing process to
scramble the bits effectively and retaining the maximum amount of
information in the hash.

Regards,
Ken


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thom Brown
Дата:
Сообщение: Alter column to type serial
Следующее
От: KaiGai Kohei
Дата:
Сообщение: contrib: auth_delay module