Building Hash Index by Presorting Tuples

Поиск
Список
Период
Сортировка
От twraney@comcast.net
Тема Building Hash Index by Presorting Tuples
Дата
Msg-id 073020072236.19529.46AE67DC0004F62700004C492206998499970A020E9D999B@comcast.net
обсуждение исходный текст
Список pgsql-hackers
Hi,

We are trying to sort the index tuples before inserting them into hash buckets, to improve build speed.  

Here is our plan:

1. Build a spool that contains all the index tuples to be inserted into the buckets. - this is done.

2. sort the index tuples in the spool according to the bucket number to which they should belong. This results in
accessinga bucket once and only once.  
 

3. For (2) to work, we need an estimate of the number of buckets. This is done.

4. After sorting the index tuples, insert them into hash in bucket order. 

Our challenge: we need to determine the final bucket number for the itup (index tuple). 

1. to do the above, we need to apply a mask to the hash value of the index tuple. first, we calculate the hash value of
theindex tuple. then, we calculate the mask using:
 

(1 << (ceiling(log 2 (Estimate of buckets needed))))-1

So, if we need 6 buckets, the mask would be 7 or binary 111.  If we needed 100 buckets, the mask would be 127 or binary
1111111.  If we AND this mask to the hash of the key, we only recognize the least   sig. bits needed to do the
compare.

A 32 bit hash value may look like:  10110101001010101000010101010101

Let's say we just need 6 buckets, apply the mask 111 and we get:

10110101001010101000010101010101 (the hash value of the key)  
00000000000000000000000000000111 (the mask &)--------------------------------  
00000000000000000000000000000101 (the resulting bucket number = 5)

If we needed 100 buckets, the calculation would look like:

10110101001010101000010101010101 (the hash value of the key)   
00000000000000000000000001111111 (the mask &)
--------------------------------    
00000000000000000000000001010101 (the resulting bucket number = 85) 

2. however, in practice when we apply a mask of value say, 1111(binary) our resulting bucket number is not evenly
distrubuted.

3. do we look for a better hash function? or can we modify the existing hash?  Comments are welcome.

-Tom



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Simon Riggs"
Дата:
Сообщение: Reducing Transaction Start/End Contention
Следующее
От: Bruce Momjian
Дата:
Сообщение: Re: default_text_search_config and expression indexes