Re: An inverted index using roaring bitmaps

Поиск
Список
Период
Сортировка
От Peter Geoghegan
Тема Re: An inverted index using roaring bitmaps
Дата
Msg-id CAH2-Wz=PfzyWW7PYcB8JmZ0A=sN8uPf4Aspw076=j5WRsoQcMA@mail.gmail.com
обсуждение исходный текст
Ответ на An inverted index using roaring bitmaps  (Chinmay Kanchi <cgkanchi@gmail.com>)
Ответы Re: An inverted index using roaring bitmaps  (Chinmay Kanchi <cgkanchi@gmail.com>)
Список pgsql-hackers
On Mon, Jun 6, 2022 at 10:42 PM Chinmay Kanchi <cgkanchi@gmail.com> wrote:
> The simulated index in this case is outrageously fast, up to ~150x on the GROUP BY.

Couldn't you make a similar argument in favor of adding a B-Tree index
on "country"? This probably won't be effective in practice, but the
reasons for this have little to do with how a B-Tree index represents
TIDs. A GIN index can compress TIDs much more effectively, but the
same issues apply there.

The main reason why it won't work with a B-Tree is that indexes in
Postgres are not transactionally consistent structures, in general.
Whereas your cities_rb table is transactionally consistent (or perhaps
just simulates a transactionally consistent index). Maybe it could
work in cases where an index-only scan could be used, which is roughly
comparable to having a transactionally consistent index. But that
depends on having the visibility map set most or all heap pages
all-visible.

GIN indexes don't support index-only scans, and I don't see that
changing. So it's possible that just adding TID compression to B-Tree
indexes would significantly speedup this kind of query, just by making
low cardinality indexes much smaller. Though that's a hard project,
for many subtle reasons. This really amounts to building a bitmap
index, of the kind that are typically used for data warehousing, which
is something that has been discussed plenty of times on this list. GIN
indexes were really built for things like full-text search, not for
data warehousing.

B-Tree deduplication makes B-Tree indexes a lot smaller, but it tends
to "saturate" at about 3.5x smaller (relative to the same index with
deduplication disabled) once there are about 10 or so distinct keys
per row (the exception is indexes that happen to have huge keys, which
aren't very interesting). There are many B-Tree indexes (with typical
sized keys) that are similar in size to an "equivalent" GIN index --
the ability to compress TIDs isn't very valuable when you don't have
that many TIDs per key anyway. It's different when you have many TIDs
per key, of course. GIN indexes alone don't "saturate" at the same
point -- there is often a big size difference between low cardinality
and ultra low cardinality data. There are bound to be cases where not
having that level of space efficiency matters, especially with B-Tree
index-only scans that scan a significant fraction of the entire index,
or even the entire index.

-- 
Peter Geoghegan



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: pgcon unconference / impact of block size on performance
Следующее
От: Kaiting Chen
Дата:
Сообщение: Allow foreign keys to reference a superset of unique columns