Grouped index items (for discussion for 8.3)
От | Heikki Linnakangas |
---|---|
Тема | Grouped index items (for discussion for 8.3) |
Дата | |
Msg-id | 45546E25.70705@enterprisedb.com обсуждение исходный текст |
Список | pgsql-patches |
Since the discussion in September on Block B-Tree concept, I've worked on that and it has evolved to what I call "Grouped Index Tuples". That means that one index tuple in a B-tree can represent many heap tuples that are on the same page, and have monotonic keys. By monotonic, I mean that if A and B are the smallest and largest keys of the heap tuples represented by the grouped index tuple, there is no indexed tuple with key A < X <= B. The format of a grouped index tuple is the same as a normal index tuple, except that instead of storing the offset of a single heap tuple, a bitmap of offsets is stored. Essentially, it's a lossy compression scheme for B-tree indexes that takes advantage of clustering in the table. On a fully clustered table, it can achieve impressive compression by effectively storing only one index tuple per heap page instead of one per heap tuple. That can be a 100-fold reduction if in the best case of very narrow rows and full clustering. On a completely non-ordered table, it degrades to a normal B-tree. Smaller index size is important because it means less I/O which means more overall throughput. The cost of not storing the key of every heap tuple is that when scanning the index, all heap tuples represented by a matching grouped index tuple need to be fetched, index keys recalculated, and compared against the scan quals. However, because we only group tuples that are on the same heap page, that doesn't involve more I/O, just more CPU usage. (of course, if you're database already fits in memory and you're at 100% CPU usage, you're better off without this). Just to be clear: This doesn't incur any extra cost (nor savings) to scans or inserts when no grouped tuples are involved. And you can have a mixture of grouped and normal index tuples in the same index. As a side note: Can you come up with particularly bad scenario where this would lead to a dramatic performance loss? I can't immediately think of one, though I'm sure there's scenarios where this would lead to a moderate decrease in performance. It needs to be considered if we want to enable this by default. A patch against 8.2 beta3 that implements this is attached (gzipped). It's fully functional, but not very well tested, and there's a lot of small refactorings and improvements that could be made. It's a very long patch, so I'll try to summarize the changes in it (see also the additions to nbtree/README in it): Search logic ------------ * When a normal index scan encounters a grouped tuple, all the heap tuples represented by it need to be fetched, scan quals re-evaluated, and sorted before returning them to the executor. * When a bitmap scan encounters a grouped tuple, it returns a lossy pointer to the whole heap page, and the bitmap heap scan code takes care of scanning the heap page and re-evaluating scan quals. If the amgetmulti API had a way to return "candidate" matches, the bitmap heap scan code could scan just the potentially matching tuples. * The logic of locating the starting key in a scan needs to know that if no exact match is found, the previous item might be a grouped tuple that contains a matching key. * A grouped index tuple that doesn't match a non-boundary scan qual can't be skipped, because some of the heap tuples it represents might still match. Insertion logic --------------- * When inserting a new tuple, if the key of the new tuple falls within the range of tuples represented by a grouped tuple, the grouped tuple needs to be split to two tuples. For example, consider a heap and index like this: Heap 10 20 30 40 50 -- 60 70 key | heap blk | heap offsets ----+----------+------------- 10 | 1 | 1 2 3 4 5 60 | 2 | 1 2 An insert of key 35 to heap page 2 would need to split the first tuple into two halves: key | heap blk | heap offsets ----+----------+------------- 10 | 1 | 1 2 3 35 | 2 | 3 40 | 1 | 4 5 60 | 2 | 1 2 The changes in nbtinsert.c are to support this groupsplit operation. Splitting the original tuple to two halves needs to be done atomically in one WAL record, which is why I had to modify the _bt_insertonpg and _bt_split functions to support that. * Uniqueness check needs to fetch heap tuples represented by grouped tuples and re-calculate index keys just like the search logic. * All new tuples are initially inserted as normal index tuples, but when a page gets full, they're coalesced to grouped tuples to make room in a "groupify" operation. Originally I tried to eagerly set new bits in grouped tuples, but doing it lazily as a bulk operation simplifies the insertion logic quite a bit. Besides, we're making a better use of the space on a page this way. Vacuum ------ * Vacuum code hasn't changed much. Unlike in my first proposal, the index still stores a reference to all indexed heap tuples in the form of a bitmap, so they can be vacuumed as usual. The offset bitmaps are to be decoded, the callback function is called on each bit or offset individually, and the bits representing dead tuples are cleared. I've refrained from refactoring old code and modifying the indexam API as much as possible, to make the patch less invasive and easier to review. However, before applying, there is some things I'd like to do: * add support for candidate matches in the amgetmulti API. This should be thought together with the bitmap index code; I think supporting range encoded bitmap indexes is going to need that capability as well. * add support for candidate matches and partial ordering to the normal index scan code, thus moving the logic of scanning a heap page, re-evaluating index keys and sorting tuples out of B-tree code. Again I think this capability could also be used to support range queries with bitmap indexes. * clean up the insertonpg/split code. I tried to make as little changes as possible to ease review, but it just doesn't look pretty anymore. And the WAL records have become very complex, * clean up the hacks to attach bitmaps to IndexTuples. Maybe it's time to re-introduce the BTItem struct we had at one point, but again I didn't want to change the existing code where it wasn't absolutely necessary. * move the parse_* functions out of guc.c to a separate module and make them non-static, so they could easily be used to parse the WITH (xxx) options in the indexam and elsewhere. There's also some additional features/optimizations we could do, but I haven't bothered with yet: * improve the cost model. I haven't done anything to that. I'm not sure if and how it needs to be changed, but it's certainly something to think about * add more control (or even better, automatic tuning) of the threshold of coalescing tuples to grouped ones. For example, the CPU cost of all the extra index key evaluations and comparisons might outweight the I/O savings if you have an expression index with a very expensive function, and it would be nice to take that into account. * When all the heap tuples represented by a grouped index tuple have the same key, the key on the index tuple could be evaluated against non-boundary scan quals, though that's not the case in general. The same applies to uniqueness checks. * I'm not sure what we should be counting in pg_stats. The patch adds a number of extra counters that have been useful in debugging, but what should be counted as a idx_tup_fetch for example? * and of course, there's no user documentation yet ;) I'm hoping we could hash out the above API changes and do some refactoring early in the 8.3 cycle, and commit this as a smaller patch after that. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Вложения
В списке pgsql-patches по дате отправления: