Bloom Indexes - bit array length and the total number of bits (orhash functions ?? ) !

Поиск

Список

Период

Сортировка

От	Avinash Kumar
Тема	Bloom Indexes - bit array length and the total number of bits (orhash functions ?? ) !
Дата	8 июня 2019 г. 05:43:08
Msg-id	CAN0TujeD-DxQ=TjdQkEPXdVSxtt6MpAhJ=TazsAHxuRQ4_E5qQ@mail.gmail.com обсуждение исходный текст
Ответы	Re: Bloom Indexes - bit array length and the total number of bits(or hash functions ?? ) ! (Fabien COELHO <coelho@cri.ensmp.fr>)
Список	pgsql-hackers

Дерево обсуждения

Hi All,

I was testing bloom indexes today. I understand bloom indexes uses bloom filters.

As i understand, a bloom filter is a bit array of m bits and a constant "k" number of hash functions are used to generate hashes for the data. And then the appropriate bits are set to 1.

I was doing the following test where i was generating 10 million records and testing bloom indexes.

CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5 int,id6 int,id7 int,details text, zipcode int);

INSERT INTO foo.bar SELECT (random() * 1000000)::int, (random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int, (random() * 1000000)::int,(random() * 1000000)::int,md5(g::text), floor(random()* (20000-9999 + 1) + 9999) from generate_series(1,100*1e6) g;

As per the documentation, bloom indexes accepts 2 parameters. length and the number of bits for each column.

Here is the problem or the question i have.

I have created the following 2 Indexes.

Index 1
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4, col8=4);

Index 2
CREATE INDEX idx_bloom_bar ON foo.bar
USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=48, col1=2, col2=2, col3=2, col4=2, col5=2, col6=2, col7=2, col8=2);

With change in length, we of course see a difference in the Index size which is understandable. Here i have the same length for both indexes. But, i reduced the number of bits per each index column from 4 to 2. Both the above indexes are of the same size. But, there is a very slight difference in the execution time between Index 1 and Index 2 but with the same cost.

So the question here is -
I assume - number of bits = k. Where k is the total number of hash functions used on top of the data that needs to validated. Is that correct ? If yes, why do we see the Index 1 performing better than Index 2 ? Because, the data has to go through more hash functions (4 vs 2) in Index 1 than Index 2. So, with Index 1 it should take more time.
Also, both the indexes have ZERO false positives.
Please let me know if there is anything simple that i am missing here.

Query

EXPLAIN (ANALYZE, BUFFERS, VERBOSE) select * from foo.bar where id = 736833 and dept = 89861 and id2 = 573221 and id3 = 675911 and id4 = 943394 and id5 = 326756 and id6 = 597560 and zipcode = 10545;

With Index 1

QUERY PLAN
----------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo.bar (cost=2647060.00..2647061.03 rows=1 width=69) (actual time=307.000..307.000 rows=0 loops=1)
Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
-> Bitmap Index Scan on idx_bloom_bar (cost=0.00..2647060.00 rows=1 width=0) (actual time=306.997..306.997 rows=0 loops=1)
Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
Planning Time: 0.106 ms
Execution Time: 307.030 ms
(9 rows)

With Index 2

QUERY PLAN
----------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foo.bar (cost=2647060.00..2647061.03 rows=1 width=69) (actual time=420.881..420.881 rows=0 loops=1)
Output: id, dept, id2, id3, id4, id5, id6, id7, details, zipcode
Recheck Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
-> Bitmap Index Scan on idx_bloom_bar (cost=0.00..2647060.00 rows=1 width=0) (actual time=420.878..420.878 rows=0 loops=1)
Index Cond: ((bar.id = 736833) AND (bar.dept = 89861) AND (bar.id2 = 573221) AND (bar.id3 = 675911) AND (bar.id4 = 943394) AND (bar.id5 = 326756) AND (bar.id6 = 597560) AND (bar.zipcode = 10545))
Buffers: shared hit=147059
Planning Time: 0.104 ms
Execution Time: 420.913 ms
(9 rows)

Thanks,
Avinash.

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Michael Paquier
Дата: 08 июня 2019 г., 04:45:40
Сообщение: Re: Temp table handling after anti-wraparound shutdown (Was: BUG#15840)

Следующее

От: Arthur Zakirov
Дата: 08 июня 2019 г., 07:16:49
Сообщение: Re: [PROPOSAL] Drop orphan temp tables in single-mode

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Bloom Indexes - bit array length and the total number of bits (orhash functions ?? ) !

Предыдущее

Следующее