Обсуждение: pgsql: Add deduplication to nbtree.
Add deduplication to nbtree. Deduplication reduces the storage overhead of duplicates in indexes that use the standard nbtree index access method. The deduplication process is applied lazily, after the point where opportunistic deletion of LP_DEAD-marked index tuples occurs. Deduplication is only applied at the point where a leaf page split would otherwise be required. New posting list tuples are formed by merging together existing duplicate tuples. The physical representation of the items on an nbtree leaf page is made more space efficient by deduplication, but the logical contents of the page are not changed. Even unique indexes make use of deduplication as a way of controlling bloat from duplicates whose TIDs point to different versions of the same logical table row. The lazy approach taken by nbtree has significant advantages over a GIN style eager approach. Most individual inserts of index tuples have exactly the same overhead as before. The extra overhead of deduplication is amortized across insertions, just like the overhead of page splits. The key space of indexes works in the same way as it has since commit dd299df8 (the commit that made heap TID a tiebreaker column). Testing has shown that nbtree deduplication can generally make indexes with about 10 or 15 tuples for each distinct key value about 2.5X - 4X smaller, even with single column integer indexes (e.g., an index on a referencing column that accompanies a foreign key). The final size of single column nbtree indexes comes close to the final size of a similar contrib/btree_gin index, at least in cases where GIN's posting list compression isn't very effective. This can significantly improve transaction throughput, and significantly reduce the cost of vacuuming indexes. A new index storage parameter (deduplicate_items) controls the use of deduplication. The default setting is 'on', so all new B-Tree indexes automatically use deduplication where possible. This decision will be reviewed at the end of the Postgres 13 beta period. There is a regression of approximately 2% of transaction throughput with synthetic workloads that consist of append-only inserts into a table with several non-unique indexes, where all indexes have few or no repeated values. The underlying issue is that cycles are wasted on unsuccessful attempts at deduplicating items in non-unique indexes. There doesn't seem to be a way around it short of disabling deduplication entirely. Note that deduplication of items in unique indexes is fairly well targeted in general, which avoids the problem there (we can use a special heuristic to trigger deduplication passes in unique indexes, since we're specifically targeting "version bloat"). Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed. No bump in BTREE_VERSION, since the representation of posting list tuples works in a way that's backwards compatible with version 4 indexes (i.e. indexes built on PostgreSQL 12). However, users must still REINDEX a pg_upgrade'd index to use deduplication, regardless of the Postgres version they've upgraded from. This is the only way to set the new nbtree metapage flag indicating that deduplication is generally safe. Author: Anastasia Lubennikova, Peter Geoghegan Reviewed-By: Peter Geoghegan, Heikki Linnakangas Discussion: https://postgr.es/m/55E4051B.7020209@postgrespro.ru https://postgr.es/m/4ab6e2db-bcee-f4cf-0916-3a06e6ccbb55@postgrespro.ru Branch ------ master Details ------- https://git.postgresql.org/pg/commitdiff/0d861bbb702f8aa05c2a4e3f1650e7e8df8c8c27 Modified Files -------------- contrib/amcheck/verify_nbtree.c | 231 +++++++-- doc/src/sgml/btree.sgml | 201 +++++++- doc/src/sgml/charset.sgml | 9 +- doc/src/sgml/citext.sgml | 7 +- doc/src/sgml/func.sgml | 9 +- doc/src/sgml/ref/create_index.sgml | 44 +- src/backend/access/common/reloptions.c | 10 + src/backend/access/index/genam.c | 4 + src/backend/access/nbtree/Makefile | 1 + src/backend/access/nbtree/README | 133 ++++- src/backend/access/nbtree/nbtdedup.c | 830 ++++++++++++++++++++++++++++++ src/backend/access/nbtree/nbtinsert.c | 388 ++++++++++++-- src/backend/access/nbtree/nbtpage.c | 246 ++++++++- src/backend/access/nbtree/nbtree.c | 171 +++++- src/backend/access/nbtree/nbtsearch.c | 272 +++++++++- src/backend/access/nbtree/nbtsort.c | 193 ++++++- src/backend/access/nbtree/nbtsplitloc.c | 39 +- src/backend/access/nbtree/nbtutils.c | 196 +++++-- src/backend/access/nbtree/nbtxlog.c | 268 +++++++++- src/backend/access/rmgrdesc/nbtdesc.c | 22 +- src/backend/storage/page/bufpage.c | 11 +- src/bin/psql/tab-complete.c | 4 +- src/include/access/nbtree.h | 433 +++++++++++++--- src/include/access/nbtxlog.h | 117 ++++- src/include/access/rmgrlist.h | 2 +- src/include/access/xlog_internal.h | 2 +- src/test/regress/expected/btree_index.out | 20 +- src/test/regress/sql/btree_index.sql | 22 +- 28 files changed, 3553 insertions(+), 332 deletions(-)
On Wed, 2020-02-26 at 21:06 +0000, Peter Geoghegan wrote: > Add deduplication to nbtree. This is great! Thanks! Are no changes to the "pageinspect" contrib required? Yours, Laurenz Albe
On Thu, Feb 27, 2020 at 9:25 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > This is great! Thanks! Thanks! > Are no changes to the "pageinspect" contrib required? There are. Those will be pushed either today or tomorrow. -- Peter Geoghegan
On Thu, Feb 27, 2020 at 9:26 AM Peter Geoghegan <pg@bowt.ie> wrote: > > Are no changes to the "pageinspect" contrib required? > > There are. Those will be pushed either today or tomorrow. Attached is a draft patch for this, with updated documentation. I'd like to push this tomorrow (Saturday), but if you could take a look over it first, that would be great. Thanks -- Peter Geoghegan
Вложения
Peter Geoghegan <pg@bowt.ie> writes: > Add deduplication to nbtree. Coverity isn't very happy with the coding in _bt_update_posting(): *** CID 1460433: Memory - corruptions (ARRAY_VS_SINGLETON) /srv/coverity/git/pgsql-git/postgresql/src/backend/access/nbtree/nbtdedup.c: 723 in _bt_update_posting() 717 { 718 if (d < vacposting->ndeletedtids && vacposting->deletetids[d] == i) 719 { 720 d++; 721 continue; 722 } >>> CID 1460433: Memory - corruptions (ARRAY_VS_SINGLETON) >>> Using "htids" as an array. This might corrupt or misinterpret adjacent memory locations. 723 htids[ui++] = *BTreeTupleGetPostingN(origtuple, i); 724 } 725 Assert(ui == nhtids); 726 Assert(d == vacposting->ndeletedtids); 727 Assert(nhtids == 1 || _bt_posting_valid(itup)); I can see its point: asserting after the fact that you didn't clobber memory isn't a terribly safe coding method, especially in a production build where you won't even have the asserts. Not sure if there's a better way though. regards, tom lane
On Sun, Mar 1, 2020 at 10:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I can see its point: asserting after the fact that you didn't clobber > memory isn't a terribly safe coding method, especially in a production > build where you won't even have the asserts. Not sure if there's a > better way though. I found it slightly more elegant to treat itup->t_tid as a degenerate 1 element posting list here, but I'm not particularly attached to that approach. The loop is only truly necessary when dealing with a posting list tuple. Do you think that _bt_update_posting() should avoid this loop when itup is just a plain tuple, that lacks a posting list? I can do it that way if you prefer. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Sun, Mar 1, 2020 at 10:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I can see its point: asserting after the fact that you didn't clobber >> memory isn't a terribly safe coding method, especially in a production >> build where you won't even have the asserts. Not sure if there's a >> better way though. > I found it slightly more elegant to treat itup->t_tid as a degenerate > 1 element posting list here, but I'm not particularly attached to that > approach. The loop is only truly necessary when dealing with a posting > list tuple. > Do you think that _bt_update_posting() should avoid this loop when > itup is just a plain tuple, that lacks a posting list? I can do it > that way if you prefer. Hm. That would probably be enough to shut up Coverity, but I'm unsure whether it'd really be an improvement from the legibility and safety viewpoints. Do you want to try coding it that way and see what it comes out like? Note that we do have the ability to just dismiss the Coverity complaint, if we decide that there's no better way to write the function. But it's usually worth looking to see if it could be written more cleanly. regards, tom lane
On Sun, Mar 1, 2020 at 11:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hm. That would probably be enough to shut up Coverity, but I'm unsure > whether it'd really be an improvement from the legibility and safety > viewpoints. I noticed that _bt_update_posting() behaves as if the origtuple might not be a posting list tuple at the point that keysize is calculated, despite generally depending on it being a posting list tuple (which it asserts by way of its "_bt_posting_valid(origtuple)" assertion). The final tuple might not be a posting list, but the original one must be (if it isn't, then nbtree VACUUM should be deleting it outright in the traditional way, rather than updating it). I should fix that, either way. > Do you want to try coding it that way and see what it > comes out like? Sure. -- Peter Geoghegan
On Sun, Mar 1, 2020 at 11:42 AM Peter Geoghegan <pg@bowt.ie> wrote: > > Do you want to try coding it that way and see what it > > comes out like? > > Sure. Attached patch shows how this could work. I prefer my original approach, but I can see the argument for doing it this way. If we keep my original approach, we should still add a new "ItemPointerIsValid(&itup->t_tid)" assertion that covers the plain tupe case in a way that mirrors the current "_bt_posting_valid(itup)" assert. -- Peter Geoghegan
Вложения
Peter Geoghegan <pg@bowt.ie> writes: > Attached patch shows how this could work. I prefer my original > approach, but I can see the argument for doing it this way. This does seem a bit duplicative ... and shouldn't both code paths include a final "Assert(d == vacposting->ndeletedtids)"? So maybe we're better off just rejecting the Coverity complaint. > If we keep my original approach, we should still add a new > "ItemPointerIsValid(&itup->t_tid)" assertion that covers the plain > tupe case in a way that mirrors the current "_bt_posting_valid(itup)" > assert. Another thing that maybe bears closer scrutiny is the size calculation. It seems a bit confused as to whether the offset of the posting list within the tuple, or the total tuple size, or both, needs to be MAXALIGN'd. regards, tom lane
On Sun, Mar 1, 2020 at 2:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Attached patch shows how this could work. I prefer my original > > approach, but I can see the argument for doing it this way. > > This does seem a bit duplicative ... and shouldn't both code paths > include a final "Assert(d == vacposting->ndeletedtids)"? No, because the "nhtids == 1" loop has a "break" when the first and only TID that we need to keep around is hit. > Another thing that maybe bears closer scrutiny is the size calculation. > It seems a bit confused as to whether the offset of the posting list > within the tuple, or the total tuple size, or both, needs to be > MAXALIGN'd. That's not quite true in my opinion. BTreeTupleSetPosting() has an Assert() of its own about alignment. ISTM that it's reasonable for _bt_update_posting() to assume that BTreeTupleGetPostingOffset() will return a MAXALIGN()'d offset. Note also that _bt_form_posting() is explicit about what it expects around alignment. This is _bt_update_posting()'s sibling function. The _bt_update_posting() calculation is explicitly documented as being derived from the one in _bt_form_posting(). I am happy to add parallel-to-_bt_form_posting() assertions about alignment to _bt_form_posting(), to nail it down completely. Plus I'll add the assertion I suggested already. Once I commit a patch with these two new assertions, I think that we can consider the matter resolved. Does that seem reasonable to you? -- Peter Geoghegan
On Sun, Mar 1, 2020 at 3:01 PM Peter Geoghegan <pg@bowt.ie> wrote: > I am happy to add parallel-to-_bt_form_posting() assertions about > alignment to _bt_form_posting(), to nail it down completely. Plus I'll > add the assertion I suggested already. Once I commit a patch with > these two new assertions, I think that we can consider the matter > resolved. Does that seem reasonable to you? I was thinking of the approach taken in the attached patch. It more or less copies over the assertions from _bt_form_posting() that did not appear in _bt_update_posting(). -- Peter Geoghegan
Вложения
Peter Geoghegan <pg@bowt.ie> writes: > On Sun, Mar 1, 2020 at 3:01 PM Peter Geoghegan <pg@bowt.ie> wrote: >> I am happy to add parallel-to-_bt_form_posting() assertions about >> alignment to _bt_form_posting(), to nail it down completely. Plus I'll >> add the assertion I suggested already. Once I commit a patch with >> these two new assertions, I think that we can consider the matter >> resolved. Does that seem reasonable to you? > I was thinking of the approach taken in the attached patch. It more or > less copies over the assertions from _bt_form_posting() that did not > appear in _bt_update_posting(). OK by me. regards, tom lane
On Sun, Mar 1, 2020 at 8:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I was thinking of the approach taken in the attached patch. It more or > > less copies over the assertions from _bt_form_posting() that did not > > appear in _bt_update_posting(). > > OK by me. Pushed. Thanks. -- Peter Geoghegan
Another issue I just noticed is that src/tools/pginclude/cpluspluscheck complains thusly: ./src/include/access/nbtree.h: In function 'void BTreeTupleSetPosting(IndexTupleData*, int, int)': ./src/include/access/nbtree.h:384: warning: comparison between signed and unsigned integer expressions I suppose this can be silenced with an appropriate cast, and doing so would seem like a good idea. regards, tom lane
On Mon, Mar 2, 2020 at 9:47 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I suppose this can be silenced with an appropriate cast, and doing so > would seem like a good idea. I pushed a commit that silenced the cpluspluscheck warning just now. Thanks -- Peter Geoghegan
Hi, On 2020-03-01 16:09:37 -0800, Peter Geoghegan wrote: > On Sun, Mar 1, 2020 at 3:01 PM Peter Geoghegan <pg@bowt.ie> wrote: > > I am happy to add parallel-to-_bt_form_posting() assertions about > > alignment to _bt_form_posting(), to nail it down completely. Plus I'll > > add the assertion I suggested already. Once I commit a patch with > > these two new assertions, I think that we can consider the matter > > resolved. Does that seem reasonable to you? > > I was thinking of the approach taken in the attached patch. It more or > less copies over the assertions from _bt_form_posting() that did not > appear in _bt_update_posting(). It seems, based on discussion on this thread and private inquiry to PG that we're going to silence the warning, as there's not a good resolution? I've for now marked the issue as a false positive, and added a reference to this thread. But I think we should just mark it as ignored in that case? Is it perhaps possible to silence the warnign with somethign along the lines of Assert(nhtids + vacposting->ndeletedtids == BTreeTupleGetNPosting(origtuple)) I don't know this code, but it looks like that'd have to be true? Perhaps that'd be enough to silence coverity too? - Andres
On Sun, Mar 29, 2020 at 3:15 PM Andres Freund <andres@anarazel.de> wrote: > Is it perhaps possible to silence the warnign with somethign along the > lines of > Assert(nhtids + vacposting->ndeletedtids == BTreeTupleGetNPosting(origtuple)) > I don't know this code, but it looks like that'd have to be true? > Perhaps that'd be enough to silence coverity too? It would have to be true. It's a tautology. That is, the value of nhtids comes from "vacposting->ndeletedtids" and "BTreeTupleGetNPosting(origtuple)" anyway, and we don't mutate any of that state in _bt_update_posting(). Wouldn't it at least be necessary to Assert() something about the final tuple, and/or other work state? -- Peter Geoghegan
Hi, On 2020-03-29 15:19:50 -0700, Peter Geoghegan wrote: > On Sun, Mar 29, 2020 at 3:15 PM Andres Freund <andres@anarazel.de> wrote: > > Is it perhaps possible to silence the warnign with somethign along the > > lines of > > Assert(nhtids + vacposting->ndeletedtids == BTreeTupleGetNPosting(origtuple)) > > I don't know this code, but it looks like that'd have to be true? > > Perhaps that'd be enough to silence coverity too? > > It would have to be true. It's a tautology. That is, the value of > nhtids comes from "vacposting->ndeletedtids" and > "BTreeTupleGetNPosting(origtuple)" anyway, and we don't mutate any of > that state in _bt_update_posting(). Right. It doesn't look like coverity understands that currently though. > Wouldn't it at least be necessary to Assert() something about the > final tuple, and/or other work state? I don't see why? The assert might not help, but if it does, I don't think it'd need more than that check? It depends a bit on what coverity's precise logic here is. If its ARRAY_VS_SINGLETON checker allows index (i.e. ui) 0 but not 1, then I think the suggested assert could help it recognize that that's unreachable. After staring more at the coverity trace, it looks however like ARRAY_VS_SINGLETON might not allow any array like access, not even 0? Seems like it assumes that BTreeTupleGetNPosting(origtuple) is at least two, and that in the first loop d < vacposting->ndeletedtids and vacposting->deletetids[d] == i are true, but in the second iteration assumes d < vacposting->ndeletedtids is true but vacposting->deletetids[d] == i. Since it doesn't record a third iteration, it seems like it ought to see ui == 0 at that point. I'll mark it as ignored. Greetings, Andres Freund
On 2020-02-26 22:06, Peter Geoghegan wrote: > Add deduplication to nbtree. AddressSanitizer has a use-after-scope complaint related to this patch. The following change fixes it: diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c index a70b64d964..8246ab4f18 100644 --- a/src/backend/access/nbtree/nbtinsert.c +++ b/src/backend/access/nbtree/nbtinsert.c @@ -1289,6 +1289,7 @@ _bt_insertonpg(Relation rel, xl_btree_metadata xlmeta; uint8 xlinfo; XLogRecPtr recptr; + uint16 upostingoff; xlrec.offnum = newitemoff; @@ -1354,7 +1355,7 @@ _bt_insertonpg(Relation rel, * must reconstruct final itup (as well as nposting) using * _bt_swap_posting(). */ - uint16 upostingoff = postingoff; + upostingoff = postingoff; XLogRegisterBufData(0, (char *) &upostingoff, sizeof(uint16)); XLogRegisterBufData(0, (char *) origitup, Please check. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 30, 2020 at 11:33 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > AddressSanitizer has a use-after-scope complaint related to this patch. > > The following change fixes it: Your fix looks good to me. Please go ahead and commit it. Thanks! -- Peter Geoghegan
On Thu, Apr 30, 2020 at 11:37 AM Peter Geoghegan <pg@bowt.ie> wrote: > > The following change fixes it: > > Your fix looks good to me. Please go ahead and commit it. Actually, never mind. I just pushed your fix myself a moment ago. Thanks again -- Peter Geoghegan