Re: Hash index todo list item

Поиск
Список
Период
Сортировка
От Kenneth Marshall
Тема Re: Hash index todo list item
Дата
Msg-id 20070907140059.GF19403@it.is.rice.edu
обсуждение исходный текст
Ответ на Re: Hash index todo list item  (Mark Mielke <mark@mark.mielke.cc>)
Ответы Re: Hash index todo list item  (Mark Mielke <mark@mark.mielke.cc>)
Re: Hash index todo list item  (Brian Hurt <bhurt@janestcapital.com>)
Список pgsql-hackers
On Fri, Sep 07, 2007 at 09:50:07AM -0400, Mark Mielke wrote:
> Kenneth Marshall wrote:
>> On Thu, Sep 06, 2007 at 11:56:25PM -0700, Neil Conway wrote:
>>   
>>> You might find this patch useful:
>>>
>>>     http://archives.postgresql.org/pgsql-patches/2005-05/msg00164.php
>>> ...
>>>
>>> Unfortunately, the patch doesn't apply cleanly to HEAD, but I can merge
>>> it up to HEAD if you'd like.
>>>     
>> This is a great starting point. I would appreciate it if you have the
>> time and could make it apply cleanly to HEAD. I remember when you first
>> posted it but had forgotten, probably because of the lack-luster results.
>> Just a quick glance at the patch and from what I can tell, tagging the
>> index as lossy causes a lot more work to be done than should be needed
>> in theory. Currently the index-scan machinery will recheck the value
>> against the original value for lossy indexes. However, given that we
>> are using a good hash function with a low chance of collision, if we
>> mark the unique items in the index then they do not actually have to
>> be rechecked during the scan. Do you have any suggestions for implementing
>> that optimization or is there any option to tell the scan machinery to
>> only re-check a certain list of tuples? Thank you again for pointing
>> this patch out and please let me know when you have a version against
>> HEAD.
>>   
> What do you mean by "mark the unique items in the index then they do not 
> actually have to be rechecked during the scan." Even if there is a unique 
> hash value mapping to a unique key, there is no guarantee that a new value 
> won't result in that same hash value. Such is the nature of hashes. A hash 
> key map does not mean a value match. The value must be checked. The 
> opposite, however, may be true. If the hash key is not found, then we know 
> the row for the value does not exist.
>
> Have you measured the performance of re-checking? I have always assumed the 
> performance of re-checking was near free when compared to the cost of 
> looking up the tuples in the table to determine whether or not they are 
> "live". This is why I have not been upset that bitmap index scans often 
> re-check.
>
> Cheers,
> mark

I understand that a hash value is a many-to-one mapping. That is the
point of the flag in the index. The flag means that there is only one
item in the heap corresponding to that hash value. In this case we
know that the value in the heap is the correct one and a possibly
very expensive string comparison can be skipped. Given that the hash
function is doing its job, almost every string comparison can be skipped.
How long would it take to compare 1-32K of data? How much CPU usage?
With this field in place, you only need to check tuple visibility.

Regards,
Ken


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Rainer Bauer
Дата:
Сообщение: Re: [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)
Следующее
От: Andrew Dunstan
Дата:
Сообщение: Re: [FEATURE REQUEST] Streaming Onlinebackup (Maybe OFFTOPIC)