Re: Next Steps with Hash Indexes

From: Tomas Vondra
Subject: Re: Next Steps with Hash Indexes
Date:
Msg-id: 1e173c75-aece-a17e-b6cd-7ef4dd6520db@enterprisedb.com
In reply to: Re: Next Steps with Hash Indexes  (Simon Riggs <simon.riggs@enterprisedb.com>)
Responses: Re: Next Steps with Hash Indexes
List: pgsql-hackers

On 10/5/21 18:28, Simon Riggs wrote:
> On Tue, 5 Oct 2021 at 12:24, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> On Tue, Oct 5, 2021 at 4:08 PM Simon Riggs <simon.riggs@enterprisedb.com> wrote:
>>>
>>> On Mon, 27 Sept 2021 at 06:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>
>>>> On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>>>
>>>>> On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sadhu@gmail.com> wrote:
>>>>>>
>>>>>> And to get the multi-column hash index selected, we may set
>>>>>> enable_hashjoin = off, to avoid any condition becoming a join
>>>>>> condition; I have seen similar behavior in other DBs as well...
>>>>>
>>>>> This may be related to Tom's point: if some of the quals are
>>>>> removed by optimization or converted to join quals, then even
>>>>> though the user has given quals on all the key columns, the index
>>>>> scan will not be selected, because we would be insisting that the
>>>>> hash index can only be selected when it has quals on all the key
>>>>> attributes.
>>>>>
>>>>> I don't think suggesting enable_hashjoin = off is a solution,
>>>>>
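
A minimal sketch of the workaround being debated here, with a
hypothetical schema (the multi-column hash index assumes the proposed
patch):

    CREATE TABLE t (a int, b int);
    CREATE TABLE u (a int, b int);
    CREATE INDEX t_ab_hash ON t USING hash (a, b);
    -- keep the planner from claiming t.a = u.a AND t.b = u.b as
    -- hash-join clauses, so they stay available as index quals
    SET enable_hashjoin = off;
    EXPLAIN SELECT * FROM t JOIN u ON t.a = u.a AND t.b = u.b;
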
>>>>
>>>> Yeah, this doesn't sound like a good idea. How about instead trying
>>>> to explore the idea where the hash (bucket assignment and search)
>>>> is based on the first index key and the other columns are stored as
>>>> payload? I think this might pose some difficulty in the subsequent
>>>> patch to enable a unique index, because it will increase the chance
>>>> of traversing more buckets for uniqueness checks. If we see such
>>>> problems, then I have another idea to minimize the number of
>>>> buckets that we need to lock during the uniqueness check: lock
>>>> chaining, as used during hash bucket cleanup, where we never need
>>>> to lock more than two buckets at a time.
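
A sketch of that "bucket from the first key only" behaviour at the SQL
level (hashint4() is the built-in int4 hash function; the modulo is a
simplification of the real bucket mapping):

    -- Two tuples that differ only in the second column land in the
    -- same bucket, because only column a feeds the bucket choice.
    SELECT hashint4(1) % 1024 AS bucket_a1_b10,   -- tuple (a=1, b=10)
           hashint4(1) % 1024 AS bucket_a1_b20;   -- tuple (a=1, b=20)
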
>>>
>>> I have presented a simple, almost trivial, patch to allow multi-col
>>> hash indexes. It hashes the first column only, which can be a downside
>>> in *some* cases. If that is clearly documented, it would not cause
>>> many issues, IMHO. However, it does not have any optimization issues
>>> or complexities, which is surely a very good thing.
>>>
>>> Trying to involve *all* columns in the hash index is a secondary
>>> optimization. It requires subtle changes in optimizer code, as Tom
>>> points out. It also needs fine tuning to make the all-column approach
>>> beneficial for the additional cases without losing against what the
>>> "first column" approach gives.
>>>
>>> I did consider both approaches and after this discussion I am still in
>>> favour of committing the very simple "first column" approach to
>>> multi-col hash indexes now.
>>
>> But what about the other approach suggested by Tom: we hash only the
>> first column to identify the bucket, but we also store the hash
>> values of the other columns? With that, we don't need changes in the
>> optimizer, and we can also avoid a lot of disk fetches, because after
>> finding the bucket we can match the secondary columns before fetching
>> the heap tuple. I agree, one downside of this approach is that we
>> will increase the index size.
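
At the SQL level, the idea looks roughly like this (illustrative only,
not the actual on-disk format; hashint4() is the built-in int4 hash):

    SELECT hashint4(a) % 1024 AS bucket,      -- chooses the bucket, as today
           hashint4(b)        AS stored_hash  -- extra payload, compared
                                              -- in-index to reject entries
                                              -- without a heap fetch
    FROM (VALUES (1, 10), (1, 20)) AS v(a, b);
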
> 
> Identifying the bucket is the main part of a hash index's work, so
> that part would be identical.
> 
> Once we have identified the bucket, we sort the bucket page by hash,
> so having an all-col hash would help de-duplicate multi-col hash
> collisions, but not enough to be worth it, IMHO, given that storing an
> extra 4 bytes per index tuple is a significant size increase which
> would cause extra overflow pages, etc. The same thought applies to
> 8-byte hashes.
> 

IMO it'd be nice to show some numbers to support the claims that storing 
the extra hashes and/or 8B hashes is not worth it ...

I'm sure there are cases where it'd be a net loss, but imagine for 
example a case where the first column has a lot of duplicate values. 
Which is not all that unlikely - duplicates seem like one of the natural 
reasons why people want multi-column hash indexes. And those duplicates 
are quite expensive, due to having to access the heap. Being able to 
eliminate those extra accesses cheaply might be a clear win, even if it 
makes the index a bit larger (shorter hashes might be enough).
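
As a hypothetical illustration of that duplicate-heavy case:

    -- 100 distinct customers over 1M rows => ~10,000 rows per customer
    CREATE TABLE orders (customer_id int, order_id int);
    INSERT INTO orders
      SELECT i % 100, i FROM generate_series(1, 1000000) i;
    -- With only hash(customer_id) picking the bucket, a query like
    --   SELECT * FROM orders WHERE customer_id = 42 AND order_id = 123456;
    -- matches ~10,000 index entries, each needing a heap visit to test
    -- order_id; a stored hash of order_id would reject nearly all of
    -- them inside the index.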


> IMHO, multi-column hash collisions are a secondary issue, given that
> we can already select the column order for an index and hash indexes
> would only be used by explicit user choice. If there are some minor
> sub-optimal aspects of using hash indexes, then btree was already the
> default and a great choice for many cases.
> 

But we can't select arbitrary column order, because the first column is 
used to select the bucket. Which means it's also about what columns are 
used by the queries. If the query is not using the first column, it 
can't use the index.
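
A quick sketch of that restriction, reusing the hypothetical t(a, b)
and t_ab_hash index from the earlier example:

    SELECT * FROM t WHERE a = 1 AND b = 2;  -- can use the index
    SELECT * FROM t WHERE b = 2;            -- cannot: without a qual on a,
                                            -- the bucket cannot be computed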


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


