Re: updated hash functions for postgresql v1

Поиск
Список
Период
Сортировка
От Kenneth Marshall
Тема Re: updated hash functions for postgresql v1
Дата
Msg-id 20080829211536.GA16964@it.is.rice.edu
обсуждение исходный текст
Ответ на Re: updated hash functions for postgresql v1  (Simon Riggs <simon@2ndquadrant.com>)
Список pgsql-patches
On Sun, Oct 28, 2007 at 08:06:58PM +0000, Simon Riggs wrote:
> On Sun, 2007-10-28 at 13:05 -0500, Kenneth Marshall wrote:
> > On Sun, Oct 28, 2007 at 05:27:38PM +0000, Simon Riggs wrote:
> > > On Sat, 2007-10-27 at 15:15 -0500, Kenneth Marshall wrote:
> > > > Its features include a better and faster hash function.
> > >
> > > Looks very promising. Do you have any performance test results to show
> > > it really is faster, when compiled into Postgres? Better probably needs
> > > some definition also; in what way are the hash functions better?
> > >
> > > --
> > >   Simon Riggs
> > >   2ndQuadrant  http://www.2ndQuadrant.com
> > >
> > The new hash function is roughly twice as fast as the old function in
> > terms of straight CPU time. It uses the same design as the current
> > hash but provides code paths for aligned and unaligned access as well
> > as separate mixing functions for different blocks in the hash run
> > instead of having one general purpose block. I think the speed will
> > not be an obvious win with smaller items, but will be very important
> > when hashing larger items (up to 32kb).
> >
> > Better in this case means that the new hash mixes more thoroughly
> > which results in less collisions and more even bucket distribution.
> > There is also a 64-bit varient which is still faster since it can
> > take advantage of the 64-bit processor instruction set.
>
> Ken, I was really looking for some tests that show both of the above
> were true. We've had some trouble proving the claims of other algorithms
> before, so I'm less inclined to take those things at face value.
>
> I'd suggest tests with Integers, BigInts, UUID, CHAR(20) and CHAR(100).
> Others may have different concerns.
>
> --
>   Simon Riggs
>   2ndQuadrant  http://www.2ndQuadrant.com
>
Hi,

I have finally had a chance to do some investigation on
the performance of the old hash mix() function versus
the updated mix()/final() in the new hash function. Here
is a table of my current results for both the old and the
new hash function. In this case cracklib refers to the
cracklib-dict containing 1648379 unique words massaged
in various ways to generate input strings for the hash
functions. The result is the number of collisions in the
hash values generated.

hash input                            old    new
----------                            ---    ---
cracklib                              338    316
cracklib x 2 (i.e. clibclib)          305    319
cracklib x 3 (clibclibclib)           323    329
cracklib x 10                         302    310
cracklib x 100                        350    335
cracklib x 1000                       314    309
cracklib x 100 truncated to char(100) 311    327

uint32 from 1-1648379                 309    319
(uint32 1-1948379)*256                309    314
(uint32 1-1948379)*16                 310    314
"a"uint32 (i.e. a00001,a0002...)      320    321

uint32uint32 (i.e. uint64)            321    287

In addition to these tests, the new mixing functions allow
the hash to pass the frog2.c test by Bob Jenkins. Here is
his comment from http://burtleburtle.net/bob/hash/doobs.html:

  lookup3.c does a much more thorough job of mixing than any
  of my previous hashes (lookup2.c, lookup.c, One-at-a-time).
  All my previous hashes did a more thorough job of mixing
  than Paul Hsieh's hash. Paul's hash does a good enough job
  of mixing for most practical purposes.

  The most evil set of keys I know of are sets of keys that are
  all the same length, with all bytes zero, except with a few
  bits set. This is tested by frog.c.. To be even more evil, I
  had my hashes return b and c instead of just c, yielding a
  64-bit hash value. Both lookup.c and lookup2.c start seeing
  collisions after 2^53 frog.c keypairs. Paul Hsieh's hash sees
  collisions after 2^17 keypairs, even if we take two hashes with
  different seeds. lookup3.c is the only one of the batch that
  passes this test. It gets its first collision somewhere beyond
  2^63 keypairs, which is exactly what you'd expect from a completely
  random mapping to 64-bit values.

If anyone has any other data for me to test with, please let me
know. I think this is a reasonable justification for including the
new mixing process (mix() and final()) as well as the word-at-a-time
processing in our hash function. I will be putting a small patch
together to add the new mixing process back in to the updated hash
function this weekend in time for the September commit-fest unless
there are objections. Both the old and the new hash functions meet
the strict avalanche conditions as well.
(http://home.comcast.net/~bretm/hash/3.html)

I have used an Inline::C perl driver for these tests and can post
it if others would like to use it as a testbed.

Regards,
Ken
avalance

В списке pgsql-patches по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: psql command setting
Следующее
От: Karl Schnaitter
Дата:
Сообщение: fixing bug in combocid.c