Re: Hash Joins vs. Bloom Filters / take 2

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: Hash Joins vs. Bloom Filters / take 2
Дата
Msg-id 67862a16-e48d-e918-5fac-e988e24d19f7@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: Hash Joins vs. Bloom Filters / take 2  (Claudio Freire <klaussfreire@gmail.com>)
Ответы Re: Hash Joins vs. Bloom Filters / take 2
Список pgsql-hackers

On 02/22/2018 09:52 PM, Claudio Freire wrote:
> On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 02/22/2018 08:33 PM, Claudio Freire wrote:
>>> That's kinda slow to do per-item. I tried to "count" distinct items by
>>> checking the BF before adding (don't add redundantly), but that's less
>>> precise than a HLL in my experience.
>>
>> But you don't need to do that for each item you try to add - you know
>> that with M bits and K hash functions you can't reach 50% before at
>> least M/(2*K) additions. And you don't really need to track the number
>> of bits exactly - if it gets 55% full, it's not going to explode.
>>
>> So just wait until M/(2*K) additions, see how many bits remain until the
>> threshold - rinse and repeat. And use some reasonable minimum distance
>> (say, 5% of the capacity), not to do the check too often.
> 
> Ok, that works too.
> 
> Point is, SBFs help to keep the BF size as small as possible, keep it
> below work_mem, and to avoid caring about the total number of distinct
> items.
> 
> You may want the planner to try and estimate that to figure out 
> whether it's worth trying BFs or not, but once you decide to try,
> SBFs remove any dependency on the accuracy of that estimate.
> 

OK, thanks for reminding me about SBF and for the discussion.

At this point I'll probably focus on the other parts though -
determining selectivity of the join, etc. Which I think is crucial, and
we need to get that right even for accurate estimates. It's good to know
that we have a solution for that part, though.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: Add PGDLLIMPORT to enable_hashagg
Следующее
От: Robert Haas
Дата:
Сообщение: Re: [HACKERS] Partition-wise aggregation/grouping