Re: [PATCHES] updated hash functions for postgresql v1
От | Kenneth Marshall |
---|---|
Тема | Re: [PATCHES] updated hash functions for postgresql v1 |
Дата | |
Msg-id | 20081104211544.GQ18362@it.is.rice.edu обсуждение исходный текст |
Ответ на | Re: [PATCHES] updated hash functions for postgresql v1 (Oleg Bartunov <oleg@sai.msu.su>) |
Ответы |
Re: [PATCHES] updated hash functions for postgresql v1
(Kenneth Marshall <ktm@rice.edu>)
|
Список | pgsql-hackers |
On Tue, Nov 04, 2008 at 11:32:47PM +0300, Oleg Bartunov wrote: > Just interested if you repeat your tests not with cracklib-dict, > but using 8-bit words. From our experience we found many hash functions > are optimized for 7-bit words and produce too many collisions > for 8-bit words. That's why we use crc32. > > Oleg I think that the lines: uint32 from 1-1648379 309 319 347 (uint32 1-1948379)*256 309 314 304 (uint32 1-1948379)*16 310 314 324 "a"uint32 (i.e. a00001,a0002...) 320 321 312 uint32uint32 (i.e. uint64) 321 287 309 can count as 8-bit words if taken a byte at a time. In fact that is how hash_any() treats them, as a character string and a length. One of the design goals of the original 1997 hash function in lookup2 and the 2006 update in lookup3 is to support keys of arbitrary arrangements of bits. I can run any additional checks that you want since the test harness is perl with Inline::C. If you are using crc32 his article in Dr. Dobbs shows that CRC has a "2 into 1" funnel-15 and an "11 into 10" funnel-100 unless you are using a generalized CRC. Also, unless you can inline your CRC the Jenkins lookup3 is 5n+20 where CRC is 9n+3. Regards, Ken > On Tue, 4 Nov 2008, Kenneth Marshall wrote: > >> Sorry about the delay for this update to the new hash >> index implementation. I was trying to get the WAL logging >> in place and forgot to post the actual patch. The WAL >> for hash indexes will need to wait for 8.5, but I did >> want to add back in the piece of the Bob Jenkins 2006 >> hash function that was stripped out of the initial >> patch on application due to concerns about the randomness >> of the resulting hash values. Here is a re-post of my >> initial findings comparing the old/new Jenkins hash >> from lookup2 and lookup3. I have added a third column >> containing the results for the hash_any() resulting >> from the attached patch as well as simple timing test >> for a DB reindex both before and after patching. >> >> Also attached is a simple documentation patch updating >> the note attached to the hash index description. >> >> Regards, >> Ken >> ---------------------------------------------------- >> Hi, >> >> I have finally had a chance to do some investigation on >> the performance of the old hash mix() function versus >> the updated mix()/final() in the new hash function. Here >> is a table of my current results for both the old and the >> new hash function. In this case cracklib refers to the >> cracklib-dict containing 1648379 unique words massaged >> in various ways to generate input strings for the hash >> functions. The result is the number of collisions in the >> hash values generated. >> >> hash input old new newv2 >> ---------- --- --- ----- >> cracklib 338 316 338 >> cracklib x 2 (i.e. clibclib) 305 319 300 >> cracklib x 3 (clibclibclib) 323 329 315 >> cracklib x 10 302 310 329 >> cracklib x 100 350 335 298 >> cracklib x 1000 314 309 315 >> cracklib x 100 truncated to char(100) 311 327 320 >> >> uint32 from 1-1648379 309 319 347 >> (uint32 1-1948379)*256 309 314 304 >> (uint32 1-1948379)*16 310 314 324 >> "a"uint32 (i.e. a00001,a0002...) 320 321 312 >> >> uint32uint32 (i.e. uint64) 321 287 309 >> >> The different result columns are old = Jenkins 1996 hash >> function(lookup2.c), new = Jenkins 2006 hash function >> (lookup3.c), and newv2 = adaptation of current hash_any() >> to incorporate the separate mix()/final() functions. As >> you can see from the results, spliting the mix() and final() >> apart does not result in any perceptible loss of randomness >> in the hash assignment. I also ran a crude timing for a >> reindex of the following database: >> >> CREATE TABLE dict (word text); >> CREATE INDEX wordhash ON dict USING hash (word); >> INSERT INTO dict (word) VALUES('s;lhkjdpyoijxfg;lktjgh;sdlhkjo'); >> INSERT INTO dict (SELECT MAX(word)||MAX(word) FROM dict); >> ... (21 times) >> >> REINDEX TABLE >> ... >> >> The average time to reindex the table using our current >> hash_any() without the separate mix()/final() was 1696ms >> and 1482ms with the separate mix()/final() stages giving >> almost 13% better performance for this stupid metric. >> > > Regards, > Oleg > _____________________________________________________________ > Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), > Sternberg Astronomical Institute, Moscow University, Russia > Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ > phone: +007(495)939-16-83, +007(495)939-23-83 > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
В списке pgsql-hackers по дате отправления:
Следующее
От: Hannu KrosingДата:
Сообщение: Re: ARRAY vars (was Enable pl/python to return records based on multiple OUT params)