[HACKERS] WIP: BRIN bloom indexes
От | Tomas Vondra |
---|---|
Тема | [HACKERS] WIP: BRIN bloom indexes |
Дата | |
Msg-id | 5d78b774-7e9c-c94e-12cf-fef51cc89b1a@2ndquadrant.com обсуждение исходный текст |
Ответы |
Re: [HACKERS] WIP: BRIN bloom indexes
(Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] WIP: BRIN bloom indexes (Sokolov Yura <y.sokolov@postgrespro.ru>) Re: [HACKERS] WIP: BRIN bloom indexes (Nico Williams <nico@cryptonector.com>) |
Список | pgsql-hackers |
Hi, The BRIN minmax opclasses work well only for data where the column is somewhat correlated to physical location in a table. So it works great for timestamps in append-only log tables, for example. When that is not the case (non-correlated columns) the minmax ranges get very "wide" and we end up scanning large portions of the tables. A typical example are columns with types that are inherently random (or rather non-correlated) like for example UUIDs, IP or MAC addresses, and so on. But it can just as easily happen even with simple IDs generated from a sequence. This WIP patch implements another type of BRIN index, with "summary" being represented by a bloom filter. So instead of building [min,max] range for each page range. The index is of course larger than the regular "minmax" BRIN index, but it's still orders of magnitude smaller than a btree index. Note: This is different from the index type implemented in the "bloom" extension. Those are not related to BRIN at all, and the index builds a bloom filter on individual rows (values in different columns). BTW kudos to everyone who worked on the BRIN infrastructure and API. I found it very intuitive and well-documented. Adding the new opclass was extremely straight-forward, and To demonstrate this on a small example, consider this table: CREATE TABLE bloom_test (id uuid, padding text); INSERT INTO bloom_test SELECT md5((mod(i,1000000)/100)::text)::uuid, md5(i::text) FROM generate_series(1,200000000) s(i); VACUUM ANALYZE bloom_test; List of relations Schema | Name | Type | Owner | Size | Description --------+------------+-------+-------+-------+------------- public | bloom_test | table | tomas | 16 GB | (1 row) The table was built so that each range contains relatively small number of distinct UUID values - this is typical e.g. for user activity with "bursts" of actions from a particular user separated by long periods of inactivity (with actions from other users). Now, let's build BRIN "minmax", BRIN "bloom" and BTREE indexes on the id column. create index test_brin_idx on bloom_test using brin (id); create index test_bloom_idx on bloom_test using brin (id uuid_bloom_ops); create index test_btree_idx on bloom_test (id); which gives us this: List of relations Schema | Name | Type | Owner | Table | Size --------+----------------+-------+-------+------------+--------- public | test_bloom_idx | index | tomas | bloom_test | 12 MB public | test_brin_idx | index | tomas | bloom_test | 832 kB public | test_btree_idx | index | tomas | bloom_test | 6016 MB (3 rows) So yeah - the "bloom" index is larger than the default "minmax" index, but it's still orders of magnitude smaller than the regular btree one. Now, what about query performance? Unlike the "minmax" indexes, the "bloom" filter can only handle equality conditions. Let's see a query like this: select * from bloom_test where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'; The minmax index produces this plan QUERY PLAN ------------------------------------------------------------------------ Bitmap Heap Scan on bloom_test (cost=390.31..2130910.82 rows=20593 width=49) (actual time=197.974..22707.311 rows=20000 loops=1) Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid) Rows Removed by Index Recheck: 199980000 Heap Blocks: lossy=2061856 -> Bitmap Index Scan on test_brin_idx (cost=0.00..385.16 rows=5493161 width=0) (actual time=133.463..133.463 rows=20619520 loops=1) Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid) Planning time: 0.165 ms Execution time: 22707.891 ms (8 rows) Which is not that great, and the long duration is a direct consequence of "wide" ranges - the bitmap heap scan had to scan pretty much the whole table. (A sequential scan takes only about 15 seconds.) Now, the bloom index: QUERY PLAN ------------------------------------------------------------------------ Bitmap Heap Scan on bloom_test (cost=5898.31..2136418.82 rows=20593 width=49) (actual time=24.229..338.324 rows=20000 loops=1) Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid) Rows Removed by Index Recheck: 2500448 Heap Blocks: lossy=25984 -> Bitmap Index Scan on test_bloom_idx (cost=0.00..5893.16 rows=5493161 width=0) (actual time=22.811..22.811 rows=259840 loops=1) Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid) Planning time: 0.201 ms Execution time: 338.946 ms (8 rows) So, way better. For comparison, a simple index scan / bitmap index scan using the btree take only about ~10ms in this case, but that's mostly thanks to the whole dataset being entirely in-memory. There are some remaining open questions. 1) sizing the bloom filter Firstly, the size of the filter is currently static, based on 1000 distinct values per range, with 5% false-positive rate. Those are mostly arbitrary values, and I don't have any clear idea how to determine optimal values. Maybe we could do the sizing based on ndistinct value for the indexed column? It also depends on the pages_per_range value, so perhaps it should be a user-defined option too. An interesting feature is that the bloom indexes "degrade locally". If a page range has significantly more distinct values, the bloom filter may be way too small and will suffer by high false-positive rate. But it only affects that one page range, and the rest may be perfectly fine. I was thinking about disabling the bloom filter when it gets "too full" (too many bits set, resulting in high false-positive cases). But that seems like a bad idea - the false-positive rate automatically jumps to 100% for that range, and we only save much smaller amount of space in the index. So even if the false-positive rate is 50% it still seems efficient to keep the bloom filter. Another thing to consider is what happens when there are very few distinct values in a given page range. The current code tries to be a bit smart in this case - instead of building the bloom filter right away, it initially keeps the exact values and only switches to bloom filter when filling the same space. 2) costing The other thing is costing of BRIN indexes. At this point it's rather simplistic and independent of the operator class, so the only thing that matters is size of the BRIN index. That means a "minmax" index is always considered cheaper than "bloom" index, because the optimizer has no idea the "minmax" indexes are more vulnerable to "wide ranges". But perhaps this is a non-issue, as the bloom index are meant for cases when minmax indexes don't work. And when minmax indexes work, they work better than bloom. So people are unlikely to build both of them at the same time. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Вложения
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Nico WilliamsДата:
Сообщение: Re: [HACKERS] [PATCH] Add ALWAYS DEFERRED option for constraints
Следующее
От: Tomasz OstrowskiДата:
Сообщение: [HACKERS] Queuing all tables for analyze after recovery