How to make hash indexes fast

From: Craig James
Subject: How to make hash indexes fast
Msg-id: 4E769759.7060101@emolecules.com
In reply to: Re: Hash index use presently(?) discouraged since 2005: revive or bury it?  (Stefan Keller <sfkeller@gmail.com>)
Replies: Re: How to make hash indexes fast
Re: How to make hash indexes fast
List: pgsql-performance
Regarding the recent discussion about hash versus B-trees: Here is a trick I invented years ago to make hash indexes
REALLY fast.  It eliminates the need for a second disk access to check the data in almost all cases, at the cost of an
additional 32-bit integer in the hash-table data structure.

Using this technique, we were able to load a hash-indexed database with data transfer rates that matched a cp (copy)
command of the same data on Solaris, HP-UX and IBM AIX systems.

You build a normal hash table with hash-collision chains.  But you add another 32-bit integer "signature" field to the
hash-collision record (call it "DBsig").  You also create a function:

   signature = sig(key)

that produces a 32-bit signature for the key.  The critical property of sig() is that, on average, only about 9 of the
32 bits are set (i.e. it is somewhat "sparse" on bits).
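
The post doesn't say how sig() was actually computed, so here is only a minimal sketch of one way to get a sparse
32-bit signature.  Everything in it is an assumption for illustration: the use of SHA-256 (picked because it ships
with Python, not because it is fast), the constant names, and the choice to stop at exactly 9 distinct bits rather
than merely averaging 9.

```python
import hashlib

SIG_BITS = 32   # width of the signature field (from the post)
BITS_SET = 9    # target number of set bits (from the post)

def sig(key: bytes) -> int:
    """Return a sparse SIG_BITS-wide signature for key (illustrative only)."""
    digest = hashlib.sha256(key).digest()      # 32 pseudo-random bytes
    signature = 0
    for byte in digest:
        signature |= 1 << (byte % SIG_BITS)    # each byte selects one bit position
        if bin(signature).count("1") == BITS_SET:
            break                              # stop once 9 distinct bits are set
    return signature
```

The same key always yields the same signature, which is the only property the rest of the scheme depends on besides
sparseness.  A real index would want a much cheaper mixing function than a cryptographic hash.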

DBsig for a hash-collision chain is always the bitwise OR of the signatures of every record in that chain.  When you
add a record to the hash table, you bitwise-OR its signature into the existing DBsig.  If you delete a record, you
erase DBsig and rebuild it by recomputing the signature of each remaining record in the hash-collision chain and ORing
them together again.
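
The add and delete rules above might be sketched like this.  The class name Chain, its attributes, and the idea of
passing sig in as a callable are hypothetical choices made for the example; the original kept the chain on disk,
whereas this keeps the keys in a Python list purely to show the bookkeeping.

```python
from typing import Callable, List

class Chain:
    """One hash-collision chain plus its combined signature (DBsig)."""

    def __init__(self, sig: Callable[[bytes], int]) -> None:
        self.sig = sig                 # signature function, supplied by the caller
        self.keys: List[bytes] = []    # stand-in for the records on disk
        self.dbsig = 0                 # bitwise OR of the signatures of all keys

    def add(self, key: bytes) -> None:
        self.keys.append(key)
        self.dbsig |= self.sig(key)    # adding is cheap: OR the new signature in

    def delete(self, key: bytes) -> None:
        self.keys.remove(key)          # raises ValueError if the key is absent
        # A bit may be shared by several records, so it cannot simply be
        # cleared; rebuild DBsig from the records that remain.
        self.dbsig = 0
        for remaining in self.keys:
            self.dbsig |= self.sig(remaining)
```

Insertion is a single OR, while deletion costs one signature computation per remaining record in the chain, which
is tolerable as long as chains stay short.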

That means that for any key K, if K is actually on the disk, then all of the bits of sig(K) are always set in the
hash-table record's "DBsig".  If any one bit in sig(K) isn't set in "DBsig", then K is not in the database and you don't
have to do a disk access to verify it.  More formally, if

    (sig(K) AND DBsig) != sig(K)

then K is definitely not in the database.
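
As a sketch, the test is a single mask-and-compare.  The function name definitely_absent and the example bit
patterns are made up for illustration.

```python
def definitely_absent(sig_k: int, dbsig: int) -> bool:
    """True if at least one bit of sig_k is missing from dbsig.

    True means the key cannot be in the chain, so no disk access is needed.
    False means the key may be present (or it is a false positive), so the
    record still has to be fetched and compared.
    """
    return (sig_k & dbsig) != sig_k

dbsig = 0b10110110                                   # bits 7, 5, 4, 2, 1 set
maybe_present = not definitely_absent(0b00100100, dbsig)   # bits 5 and 2: both set
skip_disk = definitely_absent(0b01000100, dbsig)           # bit 6 is missing
```

The parentheses are kept on purpose: Python would evaluate the expression correctly without them, but in C and
related languages != binds more tightly than &, so the unparenthesized form would compute something else.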

A typical hash table implementation might operate with a hash table that's 50-90% full, which means that the majority
of accesses to a hash index will return a record and require a disk access to check whether the key K is actually in the
database. With the signature method, you can eliminate over 99.9% of these disk accesses -- you only have to access the
data when you actually want to read or update it.  The hash table can usually fit easily in memory even for large
tables, so it is blazingly fast.

Furthermore, performance degrades gracefully as the hash table becomes overloaded.  Since each signature has 9 bits
set, you can typically have 5-10 hash collisions (a lot of signatures ORed together in each record's DBsig) before the
false-positive rate of the signature test gets too high.  As the hash table gets overloaded and needs to be resized, the
false positives increase gradually and performance decreases due to the resulting unnecessary disk fetches to check the
key. In the worst case (e.g. a hash table that's overloaded by a factor of 10 or more), performance degrades to what it
would be without the signatures.

For much higher selectivity, you could use a 64-bit signature and make sig() set an average of 13 bits.  You'd get
very good selectivity even in a badly overloaded hash table (long hash-collision chains).
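
To get a feel for how the false-positive rate grows with chain length, here is a rough back-of-the-envelope model.
It is not from the original work.  It assumes every signature sets exactly bits_set bits chosen uniformly at
random, and it treats the bits of DBsig as independent, which is only an approximation.  The function and parameter
names are hypothetical.

```python
def false_positive_rate(chain_len: int, sig_bits: int = 32, bits_set: int = 9) -> float:
    """Estimate how often an absent key still passes the DBsig test."""
    # Chance that one particular bit of DBsig is set after ORing
    # chain_len random signatures together.
    p_bit = 1.0 - (1.0 - bits_set / sig_bits) ** chain_len
    # The absent key passes only if all of its own bits_set bits are set
    # in DBsig (independence between bits is assumed here).
    return p_bit ** bits_set

# Compare the 32-bit/9-bit and 64-bit/13-bit variants for a few chain lengths.
for n in (1, 2, 5, 10):
    print(n, false_positive_rate(n), false_positive_rate(n, sig_bits=64, bits_set=13))
```

Under these assumptions a chain holding a single record gives a false-positive rate on the order of 1e-5, the rate
climbs steadily as more signatures are ORed in, and the 64-bit variant stays well below the 32-bit one at every
chain length.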

This technique was never patented, and it was disclosed at several user-group meetings back in the early 1990s, so
there are no restrictions on its use.  If this is of any use to anyone, maybe you could stick my name in the code
somewhere.

Craig James (the other Craig)
