Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
From | Peter Geoghegan
---|---
Subject | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
Msg-id | CAH2-Wz=Dhkbguk31kCgs62Uj6cqsQn3GA3SYBDp-FWidKncwYA@mail.gmail.com
In reply to | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses | Re: Making all nbtree entries unique by having heap TIDs participate in comparisons; Re: Making all nbtree entries unique by having heap TIDs participate in comparisons
List | pgsql-hackers
On Sun, Mar 3, 2019 at 5:41 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Great, looks much simpler now, indeed! Now I finally understand the
> algorithm.

Glad to hear it. Thanks for the additional review!

Attached is v14, which has changes based on your feedback. This includes
changes based on your more recent feedback on
v13-0002-make-heap-TID-a-tie-breaker-nbtree-index-column.patch, though
I'll respond to those points directly in a later email.

v14 also changes the logic that decides whether the alternative strategy
should be used: it now considers the leftmost and rightmost splits for
the entire page, rather than accessing the page directly. We always
handle the newitem-at-end edge case correctly now, since the new "top
down" approach has all legal splits close at hand. This is more elegant,
more obviously correct, and even more effective, at least in some cases
-- it's another example of why "top down" is the superior approach for
nbtsplitloc.c. This made my "UK land registry data" index have about
2.5% fewer leaf pages than with v13, which is small but significant.

Separately, I made most of the new nbtsplitloc.c functions take a
FindSplitData argument in v14, which simplifies their signatures quite a
bit.

> What would be the worst case scenario for this? Splitting a page that
> has as many tuples as possible, I guess, so maybe inserting into a table
> with a single-column index, with 32k BLCKSZ. Have you done performance
> testing on something like that?

I'll test that (added to my project TODO list), though it's not obvious
that that's the worst case. Page splits will be less frequent, and there
will be better choices about where to split.

> Rounding up the allocation to BLCKSZ seems like a premature
> optimization. Even if it saved some cycles, I don't think it's worth the
> trouble of having to explain all that in the comment.

Removed that optimization.

> I think you could change the curdelta, leftfree, and rightfree fields in
> SplitPoint to int16, to make the array smaller.
Added this alternative optimization to replace the BLCKSZ allocation
thing. I even found a way to get the array elements down to 8 bytes, but
that made the code noticeably slower with "many duplicates" splits, so I
didn't end up doing that (I used bitfields, plus the same pragmas that
we use to make sure that item pointers are packed).

> > * You seemed to refactor _bt_checksplitloc() in passing, making it not
> > do the newitemisfirstonright thing. I changed that back. Did I miss
> > something that you intended here?
>
> My patch treated the new item the same as all the old items, in
> _bt_checksplitloc(), so it didn't need newitemisonright. You still need
> it with your approach.

I would feel better about it if we stuck to the same method for
calculating if a split point is legal as before (the only difference
being that we pessimistically add heap TID overhead to the new high key
on the leaf level). That seems safer to me.

> > Changes to my own code since v12:
> >
> > * Simplified "Add "split after new tuple" optimization" commit, and
> > made it more consistent with associated code. This is something that
> > was made a lot easier by the new approach. It would be great to hear
> > what you think of this.
>
> I looked at it very briefly. Yeah, it's pretty simple now. Nice!

I can understand why it might be difficult to express an opinion on the
heuristics themselves. The specific cut-off points (e.g. details of what
"heap TID adjacency" actually means) are not that easy to defend with a
theoretical justification, though they have been carefully tested.

I think it's worth comparing the "split after new tuple" optimization to
the traditional leaf fillfactor of 90, which is a very similar
situation. Why should it be 90? Why not 85, or 95? Why is it okay to
assume that the rightmost page shouldn't be split 50/50?
The answers to all of these questions about the well-established idea of
a leaf fillfactor boil down to this: it's very likely to be correct on
average, and when it isn't correct the problem is self-limiting, and has
an infinitesimally small chance of continually recurring (unless you
imagine an *adversarial* case). Similarly, it doesn't matter if these
new heuristics get it wrong once every 1000 splits (a very pessimistic
estimate), because even then those cases will cancel each other out in
the long run.

It is necessary to take a holistic view of things. We're talking about
an optimization that makes the two largest TPC-C indexes over 40%
smaller -- I can hold my nose if I must in order to get that benefit.
There were also a couple of indexes in the real-world mouse genome
database that this made much smaller, so this will clearly help in the
real world.

Besides all this, the "split after new tuple" optimization fixes an
existing worst case, rather than being an optimization, at least in my
mind. It's not supposed to be possible to have leaf pages that are all
only 50% full without any deletes, and yet we allow it to happen in this
one weird case. Even completely random insertions result in 65% - 70%
average space utilization, so the existing worst case really is
exceptional. We are forced to take a holistic view, and to infer
something about the pattern of insertions over time, even though
"holistic" is a dirty word.

> > (New header comment block over _bt_findsplitloc())
>
> This is pretty good, but I think some copy-editing can make this even
> better

I've restored the old structure of the _bt_findsplitloc() header
comments.

> The explanation of how the high key for the left page is formed (5th
> paragraph), seems out-of-place here, because the high key is not formed
> here.

Moved that to _bt_split(), which seems like a good compromise.
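To make the fillfactor discussion above a bit more concrete, here is a
toy sketch of the choices involved. All names are invented for
illustration; this is not the actual nbtsplitloc.c logic. A rightmost
leaf split applies the traditional leaf fillfactor of 90, the "split
after new tuple" case splits at the exact point after the newly
inserted tuple (no multiplier at all), and everything else aims for an
even 50/50 split:

```c
#include <stdbool.h>

/* Invented names, for illustration only */
typedef struct ToyChoice
{
    bool    usemult;            /* apply fillfactormult at all? */
    double  fillfactormult;     /* target fraction of space on left page */
} ToyChoice;

static ToyChoice
toy_choose_fillfactor(bool is_rightmost, bool split_after_new_tuple)
{
    ToyChoice   c;

    if (split_after_new_tuple)
    {
        /* split exactly after the new tuple; no multiplier is used */
        c.usemult = false;
        c.fillfactormult = 0.0;
    }
    else if (is_rightmost)
    {
        /* traditional leaf fillfactor: leave the left page 90% full */
        c.usemult = true;
        c.fillfactormult = 0.90;
    }
    else
    {
        /* default: balance free space evenly between the halves */
        c.usemult = true;
        c.fillfactormult = 0.50;
    }
    return c;
}
```

(The real patch is more subtle -- as discussed further down, "split
after new tuple" can sometimes apply the leaf fillfactor too -- but
this captures the basic shape of the decision.)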
> Somewhere in the 1st or 2nd paragraph, perhaps we should mention that
> the function effectively uses a different fillfactor in some other
> scenarios too, not only when it's the rightmost page.

Done.

> >     state.maxsplits = maxoff + 2;
> >     state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits));
> >     state.nsplits = 0;
>
> I wouldn't be that paranoid. The code that populates the array is pretty
> straightforward.

Done that way. But are you sure? Some of the attempts to create a new
split point are bound to fail, because they try to put everything
(including the new item) on one side of the split. I'll leave the
assertion there.

> > * Still, it's typical for almost all calls to _bt_recordsplit to
> > * determine that the split is legal, and therefore enter it into the
> > * candidate split point array for later consideration.
> > */
>
> Suggestion: Remove the "Still" word. The observation that typically all
> split points are legal is valid, but it seems unrelated to the first
> paragraph. (Do we need to mention it at all?)

Removed the second paragraph entirely.

> > /*
> >  * If the new item goes as the last item, record the split point that
> >  * leaves all the old items on the left page, and the new item on the
> >  * right page. This is required because a split that leaves the new item
> >  * as the firstoldonright won't have been reached within the loop. We
> >  * always record every possible split point.
> >  */
>
> Suggestion: Remove the last sentence.

Agreed. Removed.

> ISTM that figuring out which "mode" we want to operate in is actually
> the *primary* purpose of _bt_perfect_penalty(). We only really use the
> penalty as an optimization that we pass on to _bt_bestsplitloc(). So I'd
> suggest changing the function name to something like _bt_choose_mode(),
> and have secondmode be the primary return value from it, with
> perfectpenalty being the secondary result through a pointer.
I renamed _bt_perfect_penalty() to _bt_strategy(), since I agree that
its primary purpose is to decide on a strategy (which is what I'm now
calling a mode, per your request a bit further down). It still returns
perfectpenalty, though, since that seemed more natural to me, probably
because its style matches the style of its caller, _bt_findsplitloc().
perfectpenalty isn't a mere optimization -- it is important to prevent
many duplicates mode from going overboard with suffix truncation. It
does more than just save _bt_bestsplitloc() cycles, which I've tried to
make clearer in v14.

> It doesn't really choose the mode, either, though. At least after the
> next "Add split after new tuple optimization" patch. The caller has a
> big part in choosing what to do. So maybe split _bt_perfect_penalty into
> two functions: _bt_perfect_penalty(), which just computes the perfect
> penalty, and _bt_analyze_split_interval(), which determines how many
> duplicates there are in the top-N split points.

Hmm. I didn't create a _bt_analyze_split_interval(), because now
_bt_perfect_penalty()/_bt_strategy() is responsible for setting the
perfect penalty in all cases. It was a mistake for me to move some
perfect penalty stuff for alternative modes/strategies out to the caller
in v13. In v14, we never make _bt_findsplitloc() change its perfect
penalty -- it only changes its split interval, based on the
strategy/mode, possibly after sorting. Let me know what you think of
this.

> BTW, I like the word "strategy", like you called it in the comment on
> SplitPoint struct, better than "mode".

I've adopted that terminology in v14 -- it's always "strategy", never
"mode".

> How about removing the "usemult" variable, and just check if
> fillfactormult == 0.5?

I need to use "usemult" to determine whether the "split after new tuple"
optimization should apply the leaf fillfactor, or should instead split
at the exact point after the newly inserted tuple -- it's very natural
to have a single bool flag for that.
It seems simpler to continue to use "usemult" for everything, and not
distinguish "split after new tuple" as a special case later on.
(Besides, the master branch already uses a bool for this, even though it
handles far fewer things.)

> > /*
> >  * There are a much smaller number of candidate split points when
> >  * splitting an internal page, so we can afford to be exhaustive. Only
> >  * give up when pivot that will be inserted into parent is as small as
> >  * possible.
> >  */
> > if (!state->is_leaf)
> >     return MAXALIGN(sizeof(IndexTupleData) + 1);
>
> Why are there fewer candidate split points on an internal page?

The comment should say that there is typically a much smaller split
interval (this used to be controlled by limiting the size of the array
initially -- I should have updated this for v13 of the patch). I believe
that you understand that, and are interested in why the split interval
itself is different on internal pages, or why we are more conservative
with internal pages in general. Assuming that's what you meant, here is
my answer:

The "Prefix B-Tree" paper establishes the idea that there are different
split intervals for leaf pages and internal pages (which it calls branch
pages). We care about different things in each case. With leaf pages,
our secondary goal is choosing the split point that allows us to create
the smallest possible pivot tuple (the primary goal is balancing space).
With internal pages, our secondary goal is choosing the smallest tuple
to insert into the parent of the internal page (often the root), but we
don't care about truncation, because _bt_split() won't truncate the new
pivot. That's why the definition of "penalty" varies according to
whether we're splitting an internal page or a leaf page.

Clearly the idea of having separate split intervals is well established,
and makes sense. It's fair to ask if I'm being too conservative (or not
conservative enough) with split interval in either case.
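The two penalty definitions can be restated as a trivial sketch
(invented names, not the actual patch code): for a leaf page split the
penalty is the size of the would-be pivot tuple after suffix
truncation, while for an internal page split it is the size of the
tuple that will be inserted into the parent page, since no truncation
happens there:

```c
#include <stdbool.h>

/*
 * Illustrative only: smaller penalty == better split, under either
 * definition.  Sizes are in bytes.
 */
static int
toy_split_penalty(bool is_leaf, int truncated_pivot_size,
                  int tuple_for_parent_size)
{
    if (is_leaf)
        return truncated_pivot_size;    /* pivot after suffix truncation */
    return tuple_for_parent_size;       /* no truncation on internal split */
}
```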
Unfortunately, the Prefix B-Tree paper never seems to give practical
advice about how to come up with an interval. They say:

"We have not analyzed the influence of sigma L [leaf interval] or sigma
B [branch/internal interval] on the performance of the trees. We expect
such an analysis to be quite involved and difficult. We are quite
confident, however, that small split intervals improve performance
considerably. Sets of keys that arise in practical applications are
often far from random, and clusters of similar keys differing only in
the last few letters (e.g. plural forms) are quite common."

I am aware of another, not-very-notable paper that tries to impose some
theory here, but it doesn't really help much [1].

Anyway, I've found that I was too conservative with split interval for
internal pages. It pays to make the internal interval higher than the
leaf interval, because internal pages cover a much bigger portion of the
key space than leaf pages, which will tend to get filled up one way or
another. Leaf pages cover a tight part of the key space, in contrast. In
v14, I've increased the internal page interval to 18, a big increase
from 3, and twice what it is for leaf splits (still 9 -- no change
there). This mostly isn't that different from 3, since the pivot tuples
usually are all the same size anyway. However, in cases where suffix
truncation makes pivot tuples a lot smaller (e.g. the UK land registry
test case), this makes the items in the root a lot smaller on average,
and even makes the whole index smaller. My entire test suite has a few
cases that are noticeably improved by this change, and no cases that are
hurt.

I'm going to have to revalidate the performance of long-running
benchmarks with this change, so it should be considered provisional. I
think that it will probably be kept, though. I'm not expecting it to
noticeably impact either BenchmarkSQL or pgbench benchmarks.
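For clarity, the v14 intervals described above can be summarized as a
small sketch (constant and function names are made up for this email,
not taken from the patch): we consider the 9 most balanced candidate
split points when splitting a leaf page, and 18 when splitting an
internal page, capped by the number of candidates that actually exist:

```c
#include <stdbool.h>

/* Invented names; the v14 values are 9 (leaf) and 18 (internal) */
#define TOY_LEAF_SPLIT_INTERVAL      9
#define TOY_INTERNAL_SPLIT_INTERVAL  18

static int
toy_split_interval(bool is_leaf, int ncandidates)
{
    int     interval = is_leaf ? TOY_LEAF_SPLIT_INTERVAL
                               : TOY_INTERNAL_SPLIT_INTERVAL;

    /* can't consider more split points than are actually legal */
    return (ncandidates < interval) ? ncandidates : interval;
}
```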
[1] https://shareok.org/bitstream/handle/11244/16442/Thesis-1983-T747e.pdf?sequence=1

--
Peter Geoghegan
Attachments
- v14-0001-Refactor-nbtree-insertion-scankeys.patch
- v14-0003-Consider-secondary-factors-during-nbtree-splits.patch
- v14-0004-Add-split-after-new-tuple-optimization.patch
- v14-0005-Add-high-key-continuescan-optimization.patch
- v14-0002-Make-heap-TID-a-tie-breaker-nbtree-index-column.patch
- v14-0006-Allow-tuples-to-be-relocated-from-root-by-amchec.patch
- v14-0007-DEBUG-Add-pageinspect-instrumentation.patch