Re: Hash Joins vs. Bloom Filters / take 2

Поиск
Список
Период
Сортировка
От Claudio Freire
Тема Re: Hash Joins vs. Bloom Filters / take 2
Дата
Msg-id CAGTBQpYioi71inVQSFMtP2BVBE_PUd6Goh2nfTU+wQHG36Yzrw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Hash Joins vs. Bloom Filters / take 2  (Peter Geoghegan <pg@bowt.ie>)
Ответы Re: Hash Joins vs. Bloom Filters / take 2
Список pgsql-hackers
On Tue, Feb 20, 2018 at 8:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> You should try to exploit the fact that a Bloom filter can summarize a
> large set reasonably well with a very compact, simple representation.
> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
> for cases where Bloom probes will save a lot of work, it probably
> doesn't make all that much difference -- the hash join is still much
> faster. If you're willing to accept a 10% false positive rate, then
> you can summarize a set of 40 million distinct values with only
> 22.85MB of memory and 3 hash functions. I think that the smallest
> possible amount of memory that any hash table of 40 million elements
> would require a much greater amount of memory than 22.85MB --
> certainly closer to 20x than to 8x. Even that is pretty conservative.
> I bet there are plenty of hash joins were the ratio could be 30x or
> more. Not to mention cases with multiple batches.

I've worked a lot with bloom filters, and for large false positive
rates and large sets (multi-million entries), you get bloom filter
sizes of about 10 bits per distinct item.

That's efficient, but it's not magic. It can still happen that the
whole set can't fit in work_mem with an acceptable false positive
rate.


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: Hash Joins vs. Bloom Filters / take 2
Следующее
От: Peter Geoghegan
Дата:
Сообщение: Re: Hash Joins vs. Bloom Filters / take 2