Re: A better way than tweaking NTUP_PER_BUCKET

From Stephen Frost
Subject Re: A better way than tweaking NTUP_PER_BUCKET
Date
Msg-id 20130626134257.GA5952@tamriel.snowman.net
In response to Re: A better way than tweaking NTUP_PER_BUCKET  (Atri Sharma <atri.jiit@gmail.com>)
Responses Re: A better way than tweaking NTUP_PER_BUCKET  (Atri Sharma <atri.jiit@gmail.com>)
List pgsql-hackers
Atri,

* Atri Sharma (atri.jiit@gmail.com) wrote:
> I just popped in here on Simon's advice to put an idea I had about
> optimizing hash joins on this thread.

I'd encourage reading the thread a bit first, in the future.. :)

> Essentially, I was thinking of using bloom filters in the part where
> we match the actual hash values of the outer relation's tuple and the
> hash table. We could do a bloom filter lookup first, and if it gives a
> positive, then we can proceed like we do currently, since we could
> have a false positive. However, if we have a negative from the bloom
> filter, then, we can skip that tuple since bloom filters never give
> false negatives.

I suggested this up-thread already, but it's not really a bloom filter,
as there's only one hash function available - I can't see us requiring
every data type to provide multiple hash functions.  Still, I do think
breaking the single 32-bit hash key space up into fixed-sized chunks and
then having a bitfield array which we test against (very similar to how
the visibility map works) to see if there's any chance that a given hash
key exists might be valuable.  The problem is that, because we don't
have multiple hash functions, it's not clear how much "empty" space we'd
actually end up with.
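
To make that concrete, here is a minimal sketch of the kind of
single-hash bitfield filter described above, assuming the 32-bit hash
key space is split into a power-of-two number of fixed-sized chunks
with one bit per chunk.  The HashFilter type and hash_filter_*
functions are hypothetical names for illustration, not existing
PostgreSQL code:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct HashFilter
{
    uint32_t    nbits;      /* number of bits, a power of two */
    uint32_t    shift;      /* 32 - log2(nbits), maps a hash to its bit */
    uint8_t    *bits;
} HashFilter;

static HashFilter *
hash_filter_create(uint32_t nbits)
{
    HashFilter *f = malloc(sizeof(HashFilter));
    uint32_t    n = nbits;

    f->nbits = nbits;
    f->shift = 32;
    while (n >>= 1)
        f->shift--;
    f->bits = calloc((nbits + 7) / 8, 1);
    return f;
}

/* Build side: set the bit for every inner-tuple hash value. */
static void
hash_filter_add(HashFilter *f, uint32_t hashvalue)
{
    uint32_t    bit = hashvalue >> f->shift;

    f->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
}

/* Probe side: a clear bit means no inner tuple can possibly match. */
static bool
hash_filter_test(const HashFilter *f, uint32_t hashvalue)
{
    uint32_t    bit = hashvalue >> f->shift;

    return (f->bits[bit / 8] & (1 << (bit % 8))) != 0;
}

An outer tuple whose bit is clear can be discarded without touching the
hash table at all; a set bit only says "maybe", exactly like a
one-hash-function bloom filter, so the regular bucket lookup still has
to follow.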

This needs to be tested with different data sets and different sizes for
the bitfield (leading to correspondingly different sizes for the amount
of key space covered by a single bit), along with performance metrics
which determine how expensive it is to build and use the bitfield.
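
As a rough illustration of that size-versus-coverage relation (assumed
arithmetic only, not measurements from any patch): with a 2^32 hash key
space, a bitfield of nbits bits means each bit covers 2^32 / nbits hash
values, e.g.:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* Example bitfield sizes; each bit covers 2^32 / nbits hash values. */
    uint64_t    keyspace = UINT64_C(1) << 32;
    uint32_t    sizes[] = {8192, 65536, 1048576};

    for (int i = 0; i < 3; i++)
        printf("%8u bits (%7u bytes): %10llu hash values per bit\n",
               (unsigned) sizes[i], (unsigned) (sizes[i] / 8),
               (unsigned long long) (keyspace / sizes[i]));
    return 0;
}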

> Another thing we could potentially look at is that in the case when
> the hash table doesn't fit into memory, and we have to do disk lookups,
> then, we could do bloom filter lookups first in order to select the
> tuples to be read from disk (only the positive ones).

We could have a bitfield filter (as I described above) created for each
bucket and then test against that before considering if we actually have
to go look in that bucket, yes.  I'm not sure if that's quite what you
were thinking, but I can see how a bitfield per bucket might work.  If
you were suggesting something else, please clarify.
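
If it helps make the per-bucket idea concrete, a sketch along those
lines might look like this; BucketFilter and the function names are
hypothetical, and the filter would be built while the bucket's tuples
are written out to its batch file:

#include <stdbool.h>
#include <stdint.h>

/* One small filter kept in memory per spilled bucket (hypothetical). */
typedef struct BucketFilter
{
    uint32_t    mask;       /* nbits - 1, with nbits a power of two */
    uint8_t    *bits;
} BucketFilter;

/* Build side: called for each inner tuple written to the bucket. */
static void
bucket_filter_add(BucketFilter *f, uint32_t hashvalue)
{
    uint32_t    bit = hashvalue & f->mask;

    f->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
}

/*
 * Probe side: if the outer tuple's bit is clear, nothing in the spilled
 * bucket can match it and the disk lookup can be skipped; a set bit may
 * be a false positive, so the bucket is searched as usual.
 */
static bool
bucket_filter_maybe_match(const BucketFilter *f, uint32_t hashvalue)
{
    uint32_t    bit = hashvalue & f->mask;

    return (f->bits[bit / 8] & (1 << (bit % 8))) != 0;
}
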
Thanks,
    Stephen
