Re: [GSoC08]some detail plan of improving hash index
От | Kenneth Marshall |
---|---|
Тема | Re: [GSoC08]some detail plan of improving hash index |
Дата | |
Msg-id | 20080516130442.GP28382@it.is.rice.edu обсуждение исходный текст |
Ответ на | [GSoC08]some detail plan of improving hash index ("Xiao Meng" <mx.cogito@gmail.com>) |
Список | pgsql-hackers |
Hi Xiao Meng, I am glad that you are making some progress. I have added a couple of comments below. Your phased approach is a good way to get it in a position for testing. I had a very basic test for creation time, query time for a simple lookup, and index size that I would like to re-run when you have a proto-type working. Regards, Ken On Fri, May 16, 2008 at 10:42:05AM +0800, Xiao Meng wrote: > Hi, hackers. > > I'm reading the source codes of hash and reviewing Neil's old patch of > improving hash index. > Here is some detail plan. I'm trying to adjust Neil's patch to the current > version of PostgreSQL first. I'm not quite familar with the code yet, so > please make some comment. > > * Phase 1. Just store hash value instead of hash keys > > First, define a macro to make it optional. > Good. > Second, add a new function _hash_form_item to construct IndexTuple with hash > code to replace index_form_tuple uaed in hash access method. It seems easy > since we did'nt need to deal with TOAST. > > Third, modify _hash_checkqual. We can first compare the hash value; if it's > the same, we compare the real key value. I think the changes to the system catalog cause this to happen automatically for an access method with the re-check flag set. You just need to return all of the tuples that satisfy the hash1 == hash2 criteria and the system will check them against the heap. This will need to be done for support of a unique index, but that should wait until we have demonstrated the performance of the new approach. > Also, HashScanOpaqueData adds an element hashso_sk_hash to hold scan key's > hash value to support scan function. > > Finally, it seems the system catalog pg_amop also need to be modified. > In Neil's patch, he set the amopreqcheck to be True. > In the documents, it means index hit must be rechecked in the document. But > I'm not so clear. Does it just means we need to recheck the value when use > _hash_chechqual? This means that the system will perform the re-check for you so you do not have to access the heap to check for yourself. > > > * Phase 2. Sort the hash value when insert into the bucket and use binary > search when scan > Add a function _hash_binsearch to support binary search in a bucket; > It involved in all functions when we need to search, insert and delete. I would wait on this piece, or at least make it a separate option so we can test whether or not the overhead is a worthwhile trade-off performance-wise. If we can make a smaller bucket-size work, than for a bucket size of a cacheline or two just reading the entire bucket and re-writing should be faster. It may be of value for buckets with many items with the same hash value. > > * Phase 3. When it's necessary, store the real value. > When we insert a new index tuple , for example tp , to a bucket, we can > check whether there's the same hash code. > If there's already an index tuple with such a hash code, we store both the > hash code and real key of tp. I would always store the hash code and not the value. One of the big wins is the reduction in index size to improve the ability to index very large items and tables efficiently. The btree index already handles the case of storing the actual value in the index. Since a hash code is a non-unique mapping, you will always need to check the value in the heap. So let the system do that and then the index does not need to carry that overhead. > Alternatively, we add a flag to represent the tuple stores a real value or > just hash code. It seems a little complex. > See above. > Phase 1 seems extremely easy. I'm trying to do it first. > Additionally, I need a benchmark to test the performance. It seems there's > some tools list in http://wiki.postgresql.org/wiki/Performances_QA_testing . > Any advice? > > -- > Have a good day;-) > Best Regards, > Xiao Meng > > ????????????????????????????????????????????????????????????????????? > Data and Knowledge Engineering Research Center > Harbin Institute of Technology, China > Gtalk: mx.cogito@gmail.com > MSN: cnEnder@live.com > <http://xiaomeng.yo2.cn>
В списке pgsql-hackers по дате отправления: