Re: Hash Joins vs. Bloom Filters / take 2

From: Claudio Freire
Subject: Re: Hash Joins vs. Bloom Filters / take 2
Date:
Msg-id: CAGTBQpYzRCUi9v-WKb9pO7J24GJjs0gU70ExH8CBt1LeerDABQ@mail.gmail.com
In reply to: Re: Hash Joins vs. Bloom Filters / take 2  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: Hash Joins vs. Bloom Filters / take 2
List: pgsql-hackers
On Thu, Feb 22, 2018 at 12:45 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 02/22/2018 12:44 PM, Claudio Freire wrote:
>> Let me reiterate, you can avoid both issues with scalable bloom filters[1].
>>
>
> I'm afraid it's not as straight-forward as "Use scalable bloom filters!"
>
> This is not merely a question of unreliable estimates of number of
> entries. That could have been solved by scalable bloom filters, which
> are essentially a sequence of larger and larger bloom filters, added
> when the smaller bloom filter "fills up" (1/2 full).
>
> The problem is twofold:
>
> (a) we need to decide what false positive rate to use (i.e. what is a
> reasonable trade-off between filter size and how much work it saves)
>
> (b) we also need to consider work_mem (which I assume we all agree we
> must respect)
> So for example we can't size the first bloom filter to just perfectly
> fit into work_mem, only to add larger bloom filters later (each 2x the
> size of the previous one). Not only will that increase the memory usage
> to 7x the initial estimate

Aside from (a), (b) is exactly the problem SBFs solve.

You decide how much of work_mem you're willing to devote to BFs, and
then keep creating bigger and bigger BFs until you've allocated that
much.

Then you either keep updating the biggest filter, or give up entirely.

You can use a rather conservative initial bloom filter size for that,
and let scalability expand until you hit the limit.
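A minimal sketch of that scheme in Python (the class, the 3-hash setup,
and the "half full" growth trigger are all illustrative choices, not the
paper's exact algorithm):

```python
class ScalableBloomFilter:
    """Sketch: a sequence of Bloom filters, each 2x the last, capped
    by a total bit budget (standing in for the work_mem slice)."""

    K = 3  # hash functions per filter (illustrative)

    def __init__(self, initial_bits=64, budget_bits=1024):
        self.budget_bits = budget_bits
        self.total_bits = 0
        self.filters = []            # list of [set_of_bit_positions, nbits]
        self._grow(initial_bits)

    def _grow(self, nbits):
        if self.total_bits + nbits > self.budget_bits:
            return False             # budget exhausted: stop adding filters
        self.filters.append([set(), nbits])
        self.total_bits += nbits
        return True

    def add(self, item):
        bits, nbits = self.filters[-1]
        # "fills up" heuristic: grow once half the bits are set
        if 2 * len(bits) >= nbits:
            if self._grow(2 * nbits):
                bits, nbits = self.filters[-1]
            # else: keep updating the biggest filter (FPR degrades)
        for i in range(self.K):
            bits.add(hash((i, item)) % nbits)

    def __contains__(self, item):
        return any(all(hash((i, item)) % nbits in bits
                       for i in range(self.K))
                   for bits, nbits in self.filters)
```

Note that a membership test has to probe every filter in the sequence,
which is where the extra probing cost comes from.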

> but it will also make the bloom filter less
> efficient (having to probe larger and larger filters, likely not fitting
> into CPU cache).

Again, you take that into account when choosing the limit.

>> An HLL can be used to estimate set size, the paper makes no mention of
>> it, probably assuming only distinct items are added to the set.
>>
>
> The problem with HLL is, it's only an estimate of how many entries you
> saw so far. It only tells you that after observing the items, and it
> only tells you how many items you saw so far. What we need for sizing a
> bloom filter is an estimate of number of distinct values in advance.
>
> In other words, HLL is entirely useless for sizing Bloom Filters.

Normal BFs, yes. But that's exactly what you need for scalable BFs:
an estimate of the number of distinct entries you've added to the
current filter, not the total set size.
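For reference, that per-filter estimate could come from a small
HyperLogLog kept alongside the current filter; a toy sketch (the
constants and register sizing are textbook values, not a tuned
implementation):

```python
import hashlib
import math

class TinyHLL:
    """Toy HyperLogLog: approximate count of distinct items added."""

    def __init__(self, p=12):
        self.p = p                      # 2^p registers
        self.m = 1 << p
        self.regs = [0] * self.m

    def add(self, item):
        h = int.from_bytes(
            hashlib.blake2b(str(item).encode(), digest_size=8).digest(),
            "big")
        idx = h >> (64 - self.p)        # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.regs[idx]:
            self.regs[idx] = rank

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.regs)
        zeros = self.regs.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw
```

When the estimate crosses the current filter's design capacity, that
would be the signal to start the next, larger filter (and reset the HLL).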

> Furthermore, we could estimate number of observed distinct values from
> the number of 1s in the bloom filter

That's kinda slow to do per-item. I tried to "count" distinct items by
checking the BF before adding (don't add redundantly), but that's less
precise than an HLL in my experience.

