Re: An inverted index using roaring bitmaps

Поиск
Список
Период
Сортировка
От Chinmay Kanchi
Тема Re: An inverted index using roaring bitmaps
Дата
Msg-id CAJqPDh8cfA9JGy52moQzAJF2PQrj2wGKgL2uq2nfex5AvPmidw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: An inverted index using roaring bitmaps  (Peter Geoghegan <pg@bowt.ie>)
Ответы Re: An inverted index using roaring bitmaps  (Peter Geoghegan <pg@bowt.ie>)
Список pgsql-hackers
I personally don't think this is a great replacement for a BTree index - for one thing, it isn't really possible to use this approach beyond equality comparisons (for scalars) or "contains"-type operations for arrays (or tsvectors, jsonb, etc). I see this more as "competing" with GIN, though I think GIN solves a different use-case. The primary thought here is that we could build lightning fast inverted indexes for the cases where these really help. 

I played with using roaring bitmaps in production to build rollup tables, for instance - where a single bitmap per key could satisfy count() queries and count(*) ... GROUP BY with multiple WHERE conditions way faster than even an index-only scan could, and without the overhead of multi-column indexes. In our particular case, there were about 2 dozen columns with around 30-40 million rows, and we were able to run these queries in single-digit milliseconds. We ultimately abandoned that project because of the difficulty of keeping the bitmaps in sync with changing data, which would no longer be an issue, if this was built as an index.

I think your point about data warehouse-style bitmap indexes hits the nail on the head here. This would be pretty much just that, a very efficient way to accelerate such queries.

Cheers,
Chinmay


On Tue, Jun 7, 2022 at 11:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Jun 6, 2022 at 10:42 PM Chinmay Kanchi <cgkanchi@gmail.com> wrote:
> The simulated index in this case is outrageously fast, up to ~150x on the GROUP BY.

Couldn't you make a similar argument in favor of adding a B-Tree index
on "country"? This probably won't be effective in practice, but the
reasons for this have little to do with how a B-Tree index represents
TIDs. A GIN index can compress TIDs much more effectively, but the
same issues apply there.

The main reason why it won't work with a B-Tree is that indexes in
Postgres are not transactionally consistent structures, in general.
Whereas your cities_rb table is transactionally consistent (or perhaps
just simulates a transactionally consistent index). Maybe it could
work in cases where an index-only scan could be used, which is roughly
comparable to having a transactionally consistent index. But that
depends on having the visibility map set most or all heap pages
all-visible.

GIN indexes don't support index-only scans, and I don't see that
changing. So it's possible that just adding TID compression to B-Tree
indexes would significantly speedup this kind of query, just by making
low cardinality indexes much smaller. Though that's a hard project,
for many subtle reasons. This really amounts to building a bitmap
index, of the kind that are typically used for data warehousing, which
is something that has been discussed plenty of times on this list. GIN
indexes were really built for things like full-text search, not for
data warehousing.

B-Tree deduplication makes B-Tree indexes a lot smaller, but it tends
to "saturate" at about 3.5x smaller (relative to the same index with
deduplication disabled) once there are about 10 or so distinct keys
per row (the exception is indexes that happen to have huge keys, which
aren't very interesting). There are many B-Tree indexes (with typical
sized keys) that are similar in size to an "equivalent" GIN index --
the ability to compress TIDs isn't very valuable when you don't have
that many TIDs per key anyway. It's different when you have many TIDs
per key, of course. GIN indexes alone don't "saturate" at the same
point -- there is often a big size difference between low cardinality
and ultra low cardinality data. There are bound to be cases where not
having that level of space efficiency matters, especially with B-Tree
index-only scans that scan a significant fraction of the entire index,
or even the entire index.

--
Peter Geoghegan

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: broken regress tests on fedora 36
Следующее
От: "Jonathan S. Katz"
Дата:
Сообщение: Re: How about a psql backslash command to show GUCs?