Dynamic prefix truncation for nbtree, Lanin & Shasha design issue
From: Peter Geoghegan
Subject: Dynamic prefix truncation for nbtree, Lanin & Shasha design issue
Msg-id: CAH2-Wzn_NAyK4pR0HRWO0StwHmxjP5qyu+X8vppt030XpqrO6w@mail.gmail.com
List: pgsql-hackers
Attached is a prototype patch for dynamic prefix truncation that applies on top of v9 of the nbtree patch series [1] I've been working on. This results in considerable performance improvements in some cases, since it's (almost!) safe to skip attributes that we know must be redundant/equal based on our descent of the B-Tree so far -- we can reason about that without any extra comparison work. There are problems, though, which hint at an underlying issue with the broader nbtree design, which is what I really want to talk about. This thread may not go anywhere, but I feel an obligation to put my thoughts on how we do the Lanin & Shasha deletion stuff on the record, available from the archives. I think that there may be some wisdom in the L&S design that was overlooked.

First, I will say a bit more about the merits of the dynamic prefix truncation patch, to establish the practical relevance of what I'm about to go into. I'll show an improvement in bulk insertion performance for the UK land registry data [2]. There is one entry in the table for every property sale in the UK going back 2 or 3 decades -- that's about 21.5 million rows. I have an index defined against a 3GB table, where most cycles are spent during insertions:

pg@land:5432 [25978]=# \d composite
   Unlogged index "public.composite"
  Column  │ Type │ Key? │ Definition
 ─────────┼──────┼──────┼────────────
 county   │ text │ yes  │ county
 city     │ text │ yes  │ city
 locality │ text │ yes  │ locality
btree, for table "public.land2"

The total index size is 1101 MB after a CREATE INDEX. As you'd imagine, this is a medium to low cardinality index, because there are naturally quite a lot of individual property sales in each locality, especially those in and around London. The query "INSERT INTO land2 SELECT * FROM land_registry_price_paid_uk" does a bulk insertion into the unlogged table land2.
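To make the "skip attributes we know must be equal" idea concrete, here is a toy Python model of a prefix-skipping descent step. This is emphatically not the nbtree implementation -- all names and structures here are made up for illustration. The point is just that when the two separators bounding the chosen downlink agree with the search key on some leading prefix, every tuple in that child must agree on it too, so the next level's comparisons can begin past that prefix:

```python
def compare(key, sep, skip=0):
    """Compare two multi-attribute keys, starting past a known-equal prefix."""
    for i in range(skip, len(key)):
        if key[i] != sep[i]:
            return -1 if key[i] < sep[i] else 1
    return 0

def search_page(key, separators, skip=0):
    """Binary-search one (toy) internal page for the downlink to follow.

    Returns (slot, new_skip): new_skip is the number of leading
    attributes provably equal between the key and everything in the
    chosen downlink's key space, derived from its bounding separators.
    """
    lo, hi = 0, len(separators)
    while lo < hi:
        mid = (lo + hi) // 2
        if compare(key, separators[mid], skip) <= 0:
            hi = mid
        else:
            lo = mid + 1
    # Grow the skip hint using the separators that bound the downlink.
    new_skip = skip
    left = separators[lo - 1] if lo > 0 else None
    right = separators[lo] if lo < len(separators) else None
    while (left is not None and right is not None
           and new_skip < len(key)
           and left[new_skip] == key[new_skip] == right[new_skip]):
        new_skip += 1
    return lo, new_skip

# A low cardinality leading attribute, as in the land registry index:
SEPARATORS = [
    ("London", "Camden"),
    ("London", "Hackney"),
    ("London", "Soho"),
]
```

Here a search for ("London", "Islington") lands between the "Hackney" and "Soho" separators, and the descent learns it can skip the "London" attribute at the next level down -- which is exactly where the savings come from in a low cardinality index.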
Here is the impact on throughput/overall duration:

Duration on master branch (best of 3): 01:17.165
Duration with just v9 of my patch (best of 3): 01:12.736
Duration with v9 + dynamic prefix truncation (best of 3): 01:04.038

Clearly dynamic prefix truncation adds quite a lot here -- a serial REINDEX takes about 45 - 50 seconds on the same machine. Note that I'm not cherry-picking my test case; I've seen similar improvements while bulk loading a TPC-E database in a similar manner. Clearly this is a pretty good optimization, one that complements the suffix truncation stuff well. Also, the skip scan patch could make good use of this, since it repeatedly descends a tree with a low cardinality leading attribute. The optimization would be particularly helpful there.

Problem
=======

I don't think this latest experimental patch can go anywhere for the foreseeable future, even though the idea is workable. There is an _incredibly_ subtle race condition. I want to talk about why that is, since it comes down to a design decision that likely has broad implications:

"""
(Note: Lanin and Shasha prefer to make the key space move left, but
their argument for doing so hinges on not having left-links, which we
have anyway. So we simplify the algorithm by moving key space right.)
"""

(I'm quoting the nbtree README.)

IOW, nbtree page deletion is not the precise opposite of a page split, unlike an implementation that sticks closely to what Lanin & Shasha describe: a concurrent would-be insertion into a deleted page ends up inserting into the deletion target's right sibling, rather than the deletion target's left sibling. This nbtree README sentence seems misleading to me. While the README is technically correct to say that L&S don't have left links, they do have something called "outlinks", which seem like a special case of left link to me [3].
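The "key space moves right" behavior can be illustrated with a toy model (hypothetical Python over half-open integer key ranges -- none of this is PostgreSQL code, and real nbtree has restrictions this toy ignores). After a split, the new right sibling owns the upper half of the split page's range; after a deletion, the deleted page's whole range is absorbed by its right sibling, so an insertion bound for the deleted page lands to the right:

```python
def split(pages, pid, new_pid):
    """Split pid: move the upper half of its key range to a new right sibling."""
    lo, hi = pages[pid]
    mid = (lo + hi) // 2
    pages[pid] = (lo, mid)
    pages[new_pid] = (mid, hi)

def delete_move_right(pages, pid, right_pid):
    """Delete pid: its key space is absorbed by its right sibling."""
    lo, _ = pages[pid]
    _, rhi = pages[right_pid]
    pages[right_pid] = (lo, rhi)
    del pages[pid]

def owner(pages, key):
    """Which page owns `key`?  Ranges are half-open [lo, hi)."""
    for pid, (lo, hi) in pages.items():
        if lo <= key < hi:
            return pid
```

Deletion here is visibly not the inverse of the split that preceded it: undoing a split would move the upper range back left, whereas deletion moves the lower range right.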
Comparing their "procedure move-right()" pseudo-code on page 16 to our own _bt_moveright() routine is informative: you see that they distinguish between an ignorable/half-dead page and a page that just had a concurrent page split in a way that we clearly don't. We only ever 1) stay put, or 2) move right on a concurrent split or concurrent deletion. They either 1) stay put, 2) move right when the high key is less than the value being searched for, or 3) follow the "outlink" when the page was marked ignorable (half-dead or dead) by a concurrent deletion.

I think that the nbtree README means that nbtree does what you see in L&S's "Figure 3. (b)", despite the fact that Lanin & Shasha find it "inconvenient". L&S also say of Fig 3. (b):

"""
Let us consider various implementations of merging a node n with its
right neighbor n' in the B-link structure. If we move the data in n'
to n (Fig. 3a), there is no path that processes can follow from n' to
the data moved out. This may, for example, lead a process to conclude
that c is not in the structure. If we move the data in n to n' (Fig.
3b), we will need to access the left neighbor of n to remove n from
the linked list. Finding the left neighbor requires either much time
or a doubly linked list, which increases complexity and the locking
overhead (see [Sh84] for example).
"""

To be fair, whether the outlink is a special case of a left link, as I suggest, or a separate thing is perhaps debatable -- I'll need to think about backwards scans some more to develop an opinion on that. Either way, having an extra link to handle concurrent page deletions makes a page deletion truly the opposite of a page split. Following a downlink becomes much closer to following a right link, in the sense that we can reason about the next page down/to the right having no key space overlap with the current/about-to-be-previous page.
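A toy Python rendering of the difference between the two disciplines (hypothetical page structures; this is not the real _bt_moveright() code). The key property is that the L&S-style variant can tell its caller that a concurrent deletion was crossed, which is precisely the signal a prefix-skipping descent would need to invalidate its hint:

```python
def moveright_nbtree(page_id, key, pages):
    # nbtree-style: an ignorable page and a too-low high key are
    # handled identically -- just step right. The caller cannot tell
    # a concurrent deletion apart from a concurrent split.
    while True:
        page = pages[page_id]
        if page.get("ignorable") or (
                page["highkey"] is not None and key > page["highkey"]):
            page_id = page["rightlink"]
        else:
            return page_id

def moveright_lanin_shasha(page_id, key, pages):
    # L&S-style: an ignorable (half-dead or dead) page is recovered
    # from via its outlink, distinctly from the ordinary move-right on
    # a concurrent split; the caller learns that a deletion happened.
    saw_deletion = False
    while True:
        page = pages[page_id]
        if page.get("ignorable"):
            page_id = page["outlink"]
            saw_deletion = True
        elif page["highkey"] is not None and key > page["highkey"]:
            page_id = page["rightlink"]
        else:
            return page_id, saw_deletion

# Example: page 2 was deleted concurrently. Both searches end up on
# page 3, but only the L&S-style variant knows a deletion was crossed.
PAGES = {
    1: {"highkey": ("m",), "rightlink": 2},
    2: {"highkey": ("t",), "rightlink": 3, "ignorable": True, "outlink": 3},
    3: {"highkey": None, "rightlink": None},
}
```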
Advantages of making our design closer to Lanin & Shasha's in this way may include:

* More elegant design -- moving right to find something that should actually be to the left is arguably pretty ugly.

* Being able to reliably notice concurrent page deletions that could affect what our insertion scan key can expect to have to compare next would give me an easy way to make dynamic prefix truncation work correctly. We can safely invalidate the skip attribute hint when an ignorable (half-dead or dead) page is found, starting afresh from the page pointed to by the half-dead page's outlink -- a page deletion that we cannot possibly notice to the immediate left of the child causes no problems. In particular, no concurrent insertion that inserts into a part of the key space that dynamic prefix truncation reasoned it had already eliminated could actually occur. The race condition can be reliably noticed and recovered from as part of handling a concurrent page deletion/following the ignorable page's "outlink".

* Dynamic prefix truncation is not that different from actual prefix truncation, which is therefore probably also going to block on solving this problem. If we ever want a configurable leaf-level compression feature, we'll probably have to confront the same issue. We'd probably end up adding a "low key" at the leaf level when the DBA turned that feature on. How is that supposed to work with the current design?

* Merging not-fully-empty nodes seems much easier if we wanted to go that way. The L&S paper is very clear about the fact that it supports merging pages that aren't yet completely empty -- they have no restriction on the page being totally empty. I'm pretty sure that this is the only reason for nbtree's restriction on page deletion. I'm not actually interested in pursuing this one myself, but I include it to aid understanding of what I'm talking about. I remember Kevin Grittner (CC'd) expressing interest in this project.
* We could perhaps make amcheck's bt_index_parent_check() SQL-callable function only need an AccessShareLock, because concurrent VACUUM/page deletion becomes a problem it can reason about and recover from. The race that makes dynamic prefix truncation unworkable is exactly the same as the race I go on about at length within amcheck's bt_downlink_check() function -- I guess you could say that I noticed this problem a few years ago, but I'm only now connecting the dots. Concurrent VACUUMs could instead be dealt with there by making use of a trick that's similar to the existing cross-page invariant trick that's explained at length within bt_right_page_check_scankey(). The general idea, once again, is that seeing a half-dead child page becomes a reliable canary condition. See also: the enormously detailed comments I wrote within bt_target_page_check() about why it is okay to do this across siblings in the case of bt_index_check(), the SQL-callable verification function that already only needs an AccessShareLock.

[1] https://postgr.es/m/CAH2-Wz=apbKyaFhEfRN3UK_yXZ8DSE4Ybr0A3D87=4JWyy1QPA@mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Sample_Databases
[3] https://archive.org/stream/symmetricconcurr00lani

--
Peter Geoghegan