Re: [HACKERS] WIP: BRIN bloom indexes

From: Sokolov Yura
Subject: Re: [HACKERS] WIP: BRIN bloom indexes
Date:
Msg-id: 94c173b54a0aef6ae9b18157ef52f03e@postgrespro.ru
In reply to: [HACKERS] WIP: BRIN bloom indexes (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: [HACKERS] WIP: BRIN bloom indexes (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On 2017-10-19 23:15, Tomas Vondra wrote:
> Hi,
> 
> The BRIN minmax opclasses work well only for data where the column is
> somewhat correlated to physical location in a table. So it works great
> for timestamps in append-only log tables, for example. When that is not
> the case (non-correlated columns) the minmax ranges get very "wide" and
> we end up scanning large portions of the tables.
> 
> A typical example are columns with types that are inherently random (or
> rather non-correlated) like for example UUIDs, IP or MAC addresses, and
> so on. But it can just as easily happen even with simple IDs generated
> from a sequence.
> 
> This WIP patch implements another type of BRIN index, with the "summary"
> represented by a bloom filter instead of a [min,max] range for each
> page range. The index is of course larger than the regular "minmax"
> BRIN index, but it's still orders of magnitude smaller than a btree
> index.
> 
> Note: This is different from the index type implemented in the "bloom"
> extension. That one is not related to BRIN at all; it builds a bloom
> filter on individual rows (combining values from different columns).
> 
> BTW kudos to everyone who worked on the BRIN infrastructure and API. I
> found it very intuitive and well-documented. Adding the new opclass was
> extremely straightforward.
> 
> To demonstrate this on a small example, consider this table:
> 
>     CREATE TABLE bloom_test (id uuid, padding text);
> 
>     INSERT INTO bloom_test
>     SELECT md5((mod(i,1000000)/100)::text)::uuid, md5(i::text)
>       FROM generate_series(1,200000000) s(i);
> 
>     VACUUM ANALYZE bloom_test;
> 
>                          List of relations
>      Schema |    Name    | Type  | Owner | Size  | Description
>     --------+------------+-------+-------+-------+-------------
>      public | bloom_test | table | tomas | 16 GB |
>     (1 row)
> 
> The table was built so that each range contains a relatively small
> number of distinct UUID values - this is typical e.g. for user activity
> with "bursts" of actions from a particular user separated by long
> periods of inactivity (with actions from other users).
> 
> Now, let's build BRIN "minmax", BRIN "bloom" and BTREE indexes on the
> id column.
> 
>     create index test_brin_idx on bloom_test
>      using brin (id);
> 
>     create index test_bloom_idx on bloom_test
>      using brin (id uuid_bloom_ops);
> 
>     create index test_btree_idx on bloom_test (id);
> 
> which gives us this:
> 
>                           List of relations
>      Schema |      Name      | Type  | Owner |   Table    |  Size
>     --------+----------------+-------+-------+------------+---------
>      public | test_bloom_idx | index | tomas | bloom_test | 12 MB
>      public | test_brin_idx  | index | tomas | bloom_test | 832 kB
>      public | test_btree_idx | index | tomas | bloom_test | 6016 MB
>     (3 rows)
> 
> So yeah - the "bloom" index is larger than the default "minmax" index,
> but it's still orders of magnitude smaller than the regular btree one.
> 
> Now, what about query performance? Unlike the "minmax" indexes, the
> "bloom" filter can only handle equality conditions.
> 
> Let's see a query like this:
> 
>     select * from bloom_test
>      where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
> 
> The minmax index produces this plan
> 
>                                 QUERY PLAN
> ------------------------------------------------------------------------
>  Bitmap Heap Scan on bloom_test
>      (cost=390.31..2130910.82 rows=20593 width=49)
>      (actual time=197.974..22707.311 rows=20000 loops=1)
>    Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>    Rows Removed by Index Recheck: 199980000
>    Heap Blocks: lossy=2061856
>    ->  Bitmap Index Scan on test_brin_idx
>            (cost=0.00..385.16 rows=5493161 width=0)
>            (actual time=133.463..133.463 rows=20619520 loops=1)
>          Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>  Planning time: 0.165 ms
>  Execution time: 22707.891 ms
> (8 rows)
> 
> Which is not that great, and the long duration is a direct consequence
> of "wide" ranges - the bitmap heap scan had to scan pretty much the
> whole table. (A sequential scan takes only about 15 seconds.)
> 
> Now, the bloom index:
> 
>                                 QUERY PLAN
> ------------------------------------------------------------------------
>  Bitmap Heap Scan on bloom_test
>      (cost=5898.31..2136418.82 rows=20593 width=49)
>      (actual time=24.229..338.324 rows=20000 loops=1)
>    Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>    Rows Removed by Index Recheck: 2500448
>    Heap Blocks: lossy=25984
>    ->  Bitmap Index Scan on test_bloom_idx
>          (cost=0.00..5893.16 rows=5493161 width=0)
>          (actual time=22.811..22.811 rows=259840 loops=1)
>          Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>  Planning time: 0.201 ms
>  Execution time: 338.946 ms
> (8 rows)
> 
> So, way better.
> 
> For comparison, a simple index scan / bitmap index scan using the btree
> takes only about 10 ms in this case, but that's mostly thanks to the
> whole dataset being entirely in-memory.
> 
> There are some remaining open questions.
> 
> 1) sizing the bloom filter
> 
> Firstly, the size of the filter is currently static, based on 1000
> distinct values per range with a 5% false-positive rate. Those are
> mostly arbitrary values, and I don't have any clear idea how to
> determine the optimal values.
> 
> Maybe we could do the sizing based on the ndistinct value for the
> indexed column? It also depends on the pages_per_range value, so
> perhaps it should be a user-defined option too.
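> 
> For illustration, the textbook formulas would size the filter roughly
> like this (just a sketch, not the patch's code; the function name and
> signature are made up):
> 
>     #include <math.h>
> 
>     /*
>      * Bits and hash count for an expected number of distinct values and
>      * a target false-positive rate:
>      *     m = -n * ln(p) / (ln 2)^2,   k = (m / n) * ln 2
>      */
>     static void
>     bloom_filter_size(double ndistinct, double fp_rate,
>                       int *nbits, int *nhashes)
>     {
>         double  m = ceil(-ndistinct * log(fp_rate) / pow(log(2.0), 2));
> 
>         *nbits = (int) m;
>         *nhashes = (int) round(m / ndistinct * log(2.0));
>     }
> 
> With the current defaults (1000 values, 5%) that gives roughly 6240
> bits, i.e. ~780 bytes per range; with the default 128 pages per range
> the 16 GB table has about 16k ranges, which lines up with the 12 MB
> index size above.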
> 
> An interesting feature is that the bloom indexes "degrade locally". If
> a page range has significantly more distinct values, the bloom filter
> may be way too small and will suffer from a high false-positive rate.
> But it only affects that one page range, and the rest may be perfectly
> fine.
> 
> I was thinking about disabling the bloom filter when it gets "too full"
> (too many bits set, resulting in a high false-positive rate). But that
> seems like a bad idea - the false-positive rate automatically jumps to
> 100% for that range, and we only save a much smaller amount of space in
> the index. So even if the false-positive rate is 50%, it still seems
> efficient to keep the bloom filter.
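> 
> For what it's worth, the actual false-positive rate of an over-full
> filter can be estimated from the number of bits set: with k hash probes
> and a fraction f of the bits set, a random non-member matches with
> probability about f^k. A sketch, with made-up names:
> 
>     #include <math.h>
> 
>     /* estimated false-positive rate of one range's filter */
>     static double
>     bloom_false_positive_rate(int nbits_set, int nbits, int nhashes)
>     {
>         double  fill = (double) nbits_set / nbits;
> 
>         return pow(fill, nhashes);
>     }
> 
> With 4 hashes, a filter with 50% of its bits set yields ~6% false
> positives and one with 90% set yields ~66% - degraded, but still much
> better than the 100% you get by dropping the filter.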
> 
> Another thing to consider is what happens when there are very few
> distinct values in a given page range. The current code tries to be a
> bit smart in this case - instead of building the bloom filter right
> away, it initially keeps the exact values and only switches to a bloom
> filter once the exact values would fill the same space.
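> 
> The switch could look something like this (a sketch with made-up names
> and sizes, not the patch's actual code):
> 
>     #include <stdbool.h>
>     #include <stdint.h>
> 
>     #define SUMMARY_SPACE_BYTES 780     /* space the filter would use */
>     #define MAX_EXACT_HASHES    (SUMMARY_SPACE_BYTES / sizeof(uint64_t))
> 
>     typedef struct BloomSummary
>     {
>         bool        is_exact;   /* still storing exact hashes? */
>         uint32_t    nvalues;    /* exact hashes stored so far */
>         uint64_t    values[MAX_EXACT_HASHES];
>     } BloomSummary;
> 
>     /* returns true once the summary has switched to the bloom filter */
>     static bool
>     summary_add_hash(BloomSummary *summary, uint64_t hash)
>     {
>         if (summary->is_exact && summary->nvalues < MAX_EXACT_HASHES)
>         {
>             /* exact phase: no false positives, same space as the filter */
>             summary->values[summary->nvalues++] = hash;
>             return false;
>         }
> 
>         if (summary->is_exact)
>         {
>             /* out of space: feed the stored hashes into the bloom filter
>              * and continue with the lossy representation */
>             summary->is_exact = false;
>         }
> 
>         /* bloom filter insertion would go here */
>         return true;
>     }
> 
> The nice property is that ranges with few distinct values never pay any
> false-positive penalty at all.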
> 
> 2) costing
> 
> The other thing is costing of BRIN indexes. At this point it's rather
> simplistic and independent of the operator class, so the only thing
> that matters is the size of the BRIN index. That means a "minmax" index
> is always considered cheaper than a "bloom" index, because the
> optimizer has no idea that "minmax" indexes are more vulnerable to
> "wide" ranges.
> 
> But perhaps this is a non-issue, as bloom indexes are meant for cases
> where minmax indexes don't work. And when minmax indexes work, they
> work better than bloom. So people are unlikely to build both of them at
> the same time.
> 
> 
> regards

Hi, Tomas

The BRIN bloom index is a really cool feature that definitely should be
in the core distribution (either in contrib or built-in)!

A small suggestion for the algorithm:

It is a well-known practice not to calculate a whole separate hash
function for every probe, but to use the double hashing approach
instead. A paper proving that double hashing works for bloom filters:
https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

So, instead of:

    for (i = 0; i < filter->nhashes; i++)
    {
        /* compute the hashes with a seed, used for the bloom filter */
        uint32 h = ((uint32) DatumGetInt64(hash_uint32_extended(value, i))) % filter->nbits;
        /* set or check bit h */
    }

such a construction could be used:

    /* compute the hash only once; the seed (0 here) just has to be fixed */
    uint32 big_h = (uint32) DatumGetInt64(hash_uint32_extended(value, 0));
    uint32 h = big_h % filter->nbits;
    uint32 d = big_h % (filter->nbits - 1);   /* ensures d < filter->nbits */

    for (i = 0; i < filter->nhashes; i++)
    {
        /* set or check bit h */
        /* compute next bit h */
        h += d++;
        if (h >= filter->nbits) h -= filter->nbits;
        if (d == filter->nbits) d = 0;
    }

Taking one 64-bit hash result modulo two coprime numbers (`nbits` and
`nbits-1`) gives two good-quality hash functions usable for double
hashing.
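
Put together, a self-contained version of the probe loop could look like
this (just a sketch outside the patch; the struct and function names are
simplified stand-ins):

    #include <stdint.h>

    typedef struct BloomFilter
    {
        uint32_t    nhashes;    /* probes per value */
        uint32_t    nbits;      /* size of the bit array */
        uint8_t     bitmap[];   /* nbits bits, rounded up to bytes */
    } BloomFilter;

    static void
    bloom_add_hash(BloomFilter *filter, uint64_t hash)
    {
        uint32_t    h = hash % filter->nbits;          /* start position */
        uint32_t    d = hash % (filter->nbits - 1);    /* step, < nbits */
        uint32_t    i;

        for (i = 0; i < filter->nhashes; i++)
        {
            filter->bitmap[h / 8] |= (1 << (h % 8));   /* set bit h */

            /* advance to the next probe position */
            h += d++;
            if (h >= filter->nbits)
                h -= filter->nbits;
            if (d == filter->nbits)
                d = 0;
        }
    }

A membership check is the same loop testing the bits instead of setting
them.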

With regards,
-- 
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
