Re: [PATCHES] updated hash functions for postgresql v1

Поиск

Список

Период

Сортировка

От	Kenneth Marshall
Тема	Re: [PATCHES] updated hash functions for postgresql v1
Дата	4 ноября 2008 г. 17:15:52
Msg-id	20081104211544.GQ18362@it.is.rice.edu обсуждение
Ответ на	Re: [PATCHES] updated hash functions for postgresql v1 (Oleg Bartunov <oleg@sai.msu.su>)
Ответы	Re: [PATCHES] updated hash functions for postgresql v1
Список	pgsql-hackers

Дерево обсуждения

On Tue, Nov 04, 2008 at 11:32:47PM +0300, Oleg Bartunov wrote:
> Just interested if you repeat your tests not with cracklib-dict,
> but using 8-bit words. From our experience we found many hash functions
> are optimized for 7-bit words and produce too many collisions
> for 8-bit words. That's why we use crc32.
>
> Oleg

I think that the lines:

uint32 from 1-1648379                 309    319  347
(uint32 1-1948379)*256                309    314  304
(uint32 1-1948379)*16                 310    314  324
"a"uint32 (i.e. a00001,a0002...)      320    321  312

uint32uint32 (i.e. uint64)            321    287  309

can count as 8-bit words if taken a byte at a time. In fact
that is how hash_any() treats them, as a character string
and a length. One of the design goals of the original 1997
hash function in lookup2 and the 2006 update in lookup3 is
to support keys of arbitrary arrangements of bits. I can run
any additional checks that you want since the test harness
is perl with Inline::C. If you are using crc32 his article in
Dr. Dobbs shows that CRC has a "2 into 1" funnel-15 and an
"11 into 10" funnel-100 unless you are using a generalized
CRC. Also, unless you can inline your CRC the Jenkins lookup3
is 5n+20 where CRC is 9n+3.

Regards,
Ken

> On Tue, 4 Nov 2008, Kenneth Marshall wrote:
>
>> Sorry about the delay for this update to the new hash
>> index implementation. I was trying to get the WAL logging
>> in place and forgot to post the actual patch. The WAL
>> for hash indexes will need to wait for 8.5, but I did
>> want to add back in the piece of the Bob Jenkins 2006
>> hash function that was stripped out of the initial
>> patch on application due to concerns about the randomness
>> of the resulting hash values. Here is a re-post of my
>> initial findings comparing the old/new Jenkins hash
>> from lookup2 and lookup3. I have added a third column
>> containing the results for the hash_any() resulting
>> from the attached patch as well as simple timing test
>> for a DB reindex both before and after patching.
>>
>> Also attached is a simple documentation patch updating
>> the note attached to the hash index description.
>>
>> Regards,
>> Ken
>> ----------------------------------------------------
>> Hi,
>>
>> I have finally had a chance to do some investigation on
>> the performance of the old hash mix() function versus
>> the updated mix()/final() in the new hash function. Here
>> is a table of my current results for both the old and the
>> new hash function. In this case cracklib refers to the
>> cracklib-dict containing 1648379 unique words massaged
>> in various ways to generate input strings for the hash
>> functions. The result is the number of collisions in the
>> hash values generated.
>>
>> hash input                            old    new  newv2
>> ----------                            ---    ---  -----
>> cracklib                              338    316  338
>> cracklib x 2 (i.e. clibclib)          305    319  300
>> cracklib x 3 (clibclibclib)           323    329  315
>> cracklib x 10                         302    310  329
>> cracklib x 100                        350    335  298
>> cracklib x 1000                       314    309  315
>> cracklib x 100 truncated to char(100) 311    327  320
>>
>> uint32 from 1-1648379                 309    319  347
>> (uint32 1-1948379)*256                309    314  304
>> (uint32 1-1948379)*16                 310    314  324
>> "a"uint32 (i.e. a00001,a0002...)      320    321  312
>>
>> uint32uint32 (i.e. uint64)            321    287  309
>>
>> The different result columns are old = Jenkins 1996 hash
>> function(lookup2.c), new = Jenkins 2006 hash function
>> (lookup3.c), and newv2 = adaptation of current hash_any()
>> to incorporate the separate mix()/final() functions. As
>> you can see from the results, spliting the mix() and final()
>> apart does not result in any perceptible loss of randomness
>> in the hash assignment. I also ran a crude timing for a
>> reindex of the following database:
>>
>> CREATE TABLE dict (word text);
>> CREATE INDEX wordhash ON dict USING hash (word);
>> INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo');
>> INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict);
>> ... (21 times)
>>
>> REINDEX TABLE
>> ...
>>
>> The average time to reindex the table using our current
>> hash_any() without the separate mix()/final() was 1696ms
>> and 1482ms with the separate mix()/final() stages giving
>> almost 13% better performance for this stupid metric.
>>
>
>     Regards,
>         Oleg
> _____________________________________________________________
> Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
> Sternberg Astronomical Institute, Moscow University, Russia
> Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
> phone: +007(495)939-16-83, +007(495)939-23-83
>
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: [PATCHES] updated hash functions for postgresql v1