Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From: Peter Geoghegan
Subject: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Msg-id: CAH2-WzkLEHUEPfRU+u3_DcYL7mYgUgpR8z=o-3RdLvVWQoPr1A@mail.gmail.com
In reply to: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Thu, Jul 11, 2019 at 10:42 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I think unique indexes may benefit from deduplication not only because
> > of NULL values. Non-HOT updates produce duplicates of non-NULL values
> > in unique indexes. And those duplicates can take significant space.
>
> I agree that we should definitely have an open mind about unique
> indexes, even with non-NULL values. If we can prevent a page split by
> deduplicating the contents of a unique index page, then we'll probably
> win. Why not try? This will need to be tested.

I thought about this some more. I believe that the LP_DEAD bit setting within _bt_check_unique() is generally more important than the more complicated kill_prior_tuple mechanism for setting LP_DEAD bits, even though the _bt_check_unique() approach can only be used with unique indexes. I have also often thought that we don't do enough to take advantage of the special characteristics of unique indexes -- they really are quite different. I believe that other database systems exploit this in various ways. Maybe we should too.

Unique indexes are special because there can only ever be zero or one tuples with the same value that are visible to any possible MVCC snapshot. Within the index AM, there is little difference between an UPDATE by a transaction and a DELETE + INSERT of the same value by a transaction. If there are 3 or 5 duplicates within a unique index, then there is a strong chance that VACUUM could reclaim some of them, given the chance. It is worth going to a little effort to find out.

With a traditional serial/bigserial primary key, the key space that is typically "owned" by the left half of a rightmost page split covers a range of about ~366 items, with few or no gaps for other values that didn't exist at the time of the split (i.e. the two pivot tuples on either side bound a range equal to the number of items itself). If such a page ever splits again, the chance that the split is due to non-HOT updates is close to 100%.
Maybe VACUUM just didn't get around to the index in time, or maybe there is a long running xact, or whatever. If we can delay page splits in indexes like this, then we could easily prevent them from *ever* happening.

Our first line of defense against page splits within unique indexes will probably always be LP_DEAD bits set within _bt_check_unique(), because it costs so little -- same as today. We could also add a second line of defense: deduplication -- same as with non-unique indexes with the patch. But we can even add a third line of defense on top of those two: more aggressive reclamation of posting list space, by going to the heap to check the visibility status of earlier posting list entries. We can do this optimistically, based on heuristics, even when no LP_DEAD bit is set.

The high-level principle here is that we can justify going to a small amount of extra effort -- maybe even more than a small amount -- for the chance to avoid a page split. Our chances of reversing the split by merging pages later on are almost zero, and a page split is already really expensive: the two halves of the split will probably each get dirtied again and again in the future if we cannot avoid it, plus we have to dirty the parent page, and the old sibling page (to update its left link).

We could amortize the cost of accessing the heap a second time -- for tuples that we won't have considered setting the LP_DEAD bit on within _bt_check_unique() -- by trying the *same* heap page a *second* time where possible (distinct values are likely to be nearby on the same page). I think that an approach like this could work quite well for many workloads. You only pay a cost (visiting the heap an extra time) when it looks like you'll get a benefit (not splitting the page). As you know, Andres already changed nbtree to get an XID for conflict purposes on the primary by visiting the heap a second time (see commit 558a9165e08), when we need to actually reclaim LP_DEAD space.
I anticipated that we could extend this to do more clever/eager/lazy cleanup of additional items before that went in, which is a closely related idea. See:

https://www.postgresql.org/message-id/flat/CAH2-Wznx8ZEuXu7BMr6cVpJ26G8OSqdVo6Lx_e3HSOOAU86YoQ%40mail.gmail.com#46ffd6f32a60e086042a117f2bfd7df7

I know that this is a bit hand-wavy; the details certainly need to be worked out. However, it is not so different from the "ghost bit" design that other systems use with their non-unique indexes (though this idea applies specifically to unique indexes in our case). The main difference is that we're going to the heap rather than to UNDO, because that's where we store our visibility information. That doesn't seem like such a big difference -- we are also reasonably confident that we'll find that the TID is dead, even without LP_DEAD bits being set, because we only do the extra work with unique indexes. And, we do it lazily.

--
Peter Geoghegan