Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

From: Peter Geoghegan
Subject: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Msg-id: CAH2-Wzmr=kobQx11s7UvqjYVx9Y49GhPpVxL7C8S36ayTUqVWQ@mail.gmail.com
In reply to: Re: Making all nbtree entries unique by having heap TIDs participate in comparisons (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Tue, Jan 8, 2019 at 4:47 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Attached is v10 of the patch series, which has many changes based on
> your feedback. However, I didn't end up refactoring _bt_findsplitloc()
> in the way you described, because it seemed hard to balance all of the
> concerns there. I still have an open mind on this question, and
> recognize the merit in what you suggested. Perhaps it's possible to
> reach a compromise here.
>
> * Addresses Heikki's concerns about locking the metapage more
> frequently in a general way. Comments are added to nbtpage.c, and
> updated in a number of places that already talk about the same risk.

Attached is v11 of the patch, which removes the comments mentioned here, and instead finds a way to not do new things with buffer locks.

Changes
=======

* We simply avoid holding buffer locks while accessing the metapage. (Of course, the old root page split stuff still does this -- it has worked that way forever.)

* We also avoid calling index_getprocinfo() with any buffer lock held. We'll still call support function 1 with a buffer lock held to truncate, but that's not new -- *any* insertion will do that.

For some reason I got stuck on the idea that we need to use a scankey's own values within _bt_truncate()/_bt_keep_natts() by generating a new insertion scankey every time. We now simply ignore those values, and call the comparator with pairs of tuples that each come from the page directly. Usually, we'll just reuse the insertion scankey that we were using for the insertion anyway (we no longer build our own scankey for truncation). Other times, we'll build an empty insertion scankey (one that has the function pointer and so on, but no values). The only downside is that I cannot have an assertion that calls _bt_compare() to make sure we truncated correctly there and then, since a dedicated insertion scankey is no longer conveniently available.
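To illustrate the truncation decision described here, the following is a hypothetical, much-simplified model, not the real _bt_keep_natts() code: plain ints stand in for datums, and `!=` stands in for a call to the B-Tree support function 1 comparator through the scankey's cached function pointer. The function name and signature are invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model of deciding how many attributes a new pivot tuple must
 * keep: compare the last tuple on the left half with the first tuple on
 * the right half, attribute by attribute, and stop at the first
 * distinguishing attribute.  Returning natts + 1 means even a fully
 * untruncated pivot cannot distinguish the halves (all-duplicates case).
 */
static int
keep_natts_model(const int *lastleft, const int *firstright, int natts)
{
    int keep;

    for (keep = 1; keep <= natts; keep++)
    {
        /* stand-in for a comparator call on attribute "keep" */
        if (lastleft[keep - 1] != firstright[keep - 1])
            break;
    }

    return keep;
}
```

Note that this model only needs the two page tuples and a comparator, which matches the point above: the scankey's own values are never consulted, so a valueless ("empty") insertion scankey suffices.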
I feel rather silly for not having gone this way from the beginning, because the new approach is quite obviously simpler and safer.

nbtsort.c now gets a reusable, valueless insertion scankey that it uses both for truncation and for setting up a merge of the two spools for unique index builds. This approach allows me to remove _bt_mkscankey_nodata() altogether -- callers build a "nodata" insertion scankey with empty values by passing _bt_mkscankey() a NULL tuple. That's equivalent to having an insertion scankey built from an all-attributes-truncated tuple. IOW, the patch now makes the "nodata" stuff a degenerate case of building a scankey from a truncated-attributes tuple.

tuplesort.c also uses the new BTScanInsert struct. There is no longer any case where there is an insertion scankey that isn't accessed through the BTScanInsert struct.

* No more pg_depend tie-breaker column commit. That was an ugly hack that I'm glad to be rid of -- many thanks to Tom for working through a number of test instability issues that affected the patch. I do still need to paper over one remaining regression test issue/bug that the patch happens to unmask, pending Tom fixing it directly [1]. This papering-over is broken out into its own commit ("v11-0002-Paper-over-DEPENDENCY_INTERNAL_AUTO-bug-failures.patch"). I expect that Tom will fix the bug before too long, at which point the temporary workaround can just be reverted from your local tree.

Outlook
=======

I feel that this version is pretty close to being committable, since everything about the design is settled. It completely avoids saying anything new about buffer locking protocols, LWLock deadlock safety, etc. VACUUM and crash recovery are also unchanged, so subtle bugs should at least not be too hard to reproduce once observed.
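A loose sketch of the "nodata" degenerate case described above -- these are invented stand-in types, not the real BTScanInsert/ScanKeyData definitions, and an int array stands in for per-attribute scankey values. The point is only that passing a NULL tuple yields a scankey that carries comparator state but zero usable key values:

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-in for BTScanInsert; real code caches FmgrInfo etc. */
typedef struct ModelScanInsert
{
    int keysz;      /* number of attributes with usable values */
    int values[4];  /* stand-in for per-attribute scankey values */
} ModelScanInsert;

/*
 * Model of _bt_mkscankey() accepting a NULL tuple: the "nodata" scankey
 * is just the degenerate case where every attribute is truncated away,
 * so keysz is 0 and no values are filled in.
 */
static ModelScanInsert
model_mkscankey(const int *tuple, int natts)
{
    ModelScanInsert key = {0};

    if (tuple == NULL)
        return key;     /* "nodata" case: comparators only, no values */

    key.keysz = natts;
    for (int i = 0; i < natts; i++)
        key.values[i] = tuple[i];
    return key;
}
```

Under this model, one constructor serves both insertion (tuple supplied) and the truncation/spool-merge callers (NULL tuple), which is why a separate _bt_mkscankey_nodata() becomes unnecessary.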
It's pretty complementary code: the new logic for picking a split point builds a list of candidate split points using the old technique, with a second pass to choose the best one for suffix truncation from the accumulated list. It's hard to see how that could introduce an invalid split point choice.

I take the risk of introducing new corruption bugs very seriously. contrib/amcheck now verifies all aspects of the new on-disk representation. The stricter Lehman & Yao style invariant ("the subtree S is described by Ki < v <= Ki + 1 ...") allows amcheck to be stricter about what it will accept (e.g., heap TID needs to be in order among logical duplicates, we always expect to see a representation of the number of pivot tuple attributes, and we expect the high key to be strictly greater than items on internal pages).

Review
======

It would be very helpful if a reviewer such as Heikki or Alexander could take a look at the patch once more. I suggest that they look at the following points in the patch:

* The minusinfkey stuff, which is explained within _bt_compare() and within the nbtree.h header comments. Page deletion by VACUUM is the only _bt_search() caller that sets minusinfkey to true (though older versions of btree and amcheck also set minusinfkey to true).

* Does the value of BTREE_SINGLEVAL_FILLFACTOR make sense? Am I being a little too aggressive there, possibly hurting workloads where HOT pruning occurs periodically? Sane duplicate handling is the most compelling improvement that the patch makes, but I may still have been a bit too aggressive in packing pages full of duplicates so tightly. I figured that that was the closest thing to the previous behavior that's still reasonable.

* Do the _bt_splitatnewitem() criteria for deciding whether we should split at the point where the new tuple is positioned miss some subtlety? It's important that splitting at the new item when we shouldn't doesn't happen, or hardly ever happens -- it should be *self-limiting*.
This was tested using BenchmarkSQL/TPC-C [2] -- TPC-C has a workload where this particular enhancement makes indexes a lot smaller.

* There was also testing of index bloat following bulk insertions, based on my own custom test suite. Data and indexes were taken from TPC-C tables, TPC-H tables, TPC-E tables, UK land registry data [3], and the Mouse Genome Database Project (Postgres schema + indexes) [4]. This takes almost an hour to run on my development machine, though the most important tests finish in less than 5 minutes. I can provide access to all or some of these tests if reviewers are interested and are willing to download several gigabytes of sample data that I'll provide privately.

[1] https://postgr.es/m/19220.1547767251@sss.pgh.pa.us
[2] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations
[3] https://wiki.postgresql.org/wiki/Sample_Databases
[4] http://www.informatics.jax.org/software.shtml
--
Peter Geoghegan
Attachments
- v11-0001-Refactor-nbtree-insertion-scankeys.patch
- v11-0002-Paper-over-DEPENDENCY_INTERNAL_AUTO-bug-failures.patch
- v11-0003-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch
- v11-0004-Pick-nbtree-split-points-discerningly.patch
- v11-0005-Add-split-at-new-tuple-page-split-optimization.patch
- v11-0006-Add-high-key-continuescan-optimization.patch
- v11-0007-DEBUG-Add-pageinspect-instrumentation.patch