Re: Hash Joins vs. Bloom Filters / take 2

Поиск

Список

Период

Сортировка

От	Claudio Freire
Тема	Re: Hash Joins vs. Bloom Filters / take 2
Дата	21 февраля 2018 г. 02:17:28
Msg-id	CAGTBQpYioi71inVQSFMtP2BVBE_PUd6Goh2nfTU+wQHG36Yzrw@mail.gmail.com обсуждение исходный текст
Ответ на	Re: Hash Joins vs. Bloom Filters / take 2 (Peter Geoghegan <pg@bowt.ie>)
Ответы	Re: Hash Joins vs. Bloom Filters / take 2
Список	pgsql-hackers

Дерево обсуждения

On Tue, Feb 20, 2018 at 8:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> You should try to exploit the fact that a Bloom filter can summarize a
> large set reasonably well with a very compact, simple representation.
> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
> for cases where Bloom probes will save a lot of work, it probably
> doesn't make all that much difference -- the hash join is still much
> faster. If you're willing to accept a 10% false positive rate, then
> you can summarize a set of 40 million distinct values with only
> 22.85MB of memory and 3 hash functions. I think that the smallest
> possible amount of memory that any hash table of 40 million elements
> would require a much greater amount of memory than 22.85MB --
> certainly closer to 20x than to 8x. Even that is pretty conservative.
> I bet there are plenty of hash joins were the ratio could be 30x or
> more. Not to mention cases with multiple batches.

I've worked a lot with bloom filters, and for large false positive
rates and large sets (multi-million entries), you get bloom filter
sizes of about 10 bits per distinct item.

That's efficient, but it's not magic. It can still happen that the
whole set can't fit in work_mem with an acceptable false positive
rate.

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Hash Joins vs. Bloom Filters / take 2