Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From: Peter Geoghegan
Msg-id: CAH2-WzkwDobJU9sgq8J2gaZk_Eo6Dsyny1-gWXhY-9VDZnHOTQ@mail.gmail.com
In reply to: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>)
Replies: Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On Sat, Aug 31, 2019 at 1:04 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I've found a fix for these Valgrind issues.

Attached is v10, which fixes the Valgrind issue. Other changes:

* The code now fully embraces the idea that posting list splits involve "changing the incoming item" in a way that "avoids" having the new/incoming item overlap with an existing posting list tuple. This allowed me to cut down considerably on the changes required within nbtinsert.c.

* Streamlined a lot of the code in nbtsearch.c. I was able to significantly simplify _bt_compare() and _bt_binsrch_insert().

* Removed the DEBUG4 traces. Many of them had to go when I refactored the nbtsearch.c code, so I thought I might as well remove the remaining ones. I hope you don't mind (go ahead and add them back where that makes sense).

* A backwards scan will now return "logical tuples" in descending order. We should do this on general principle, and also because of the possibility of future external code that expects and takes advantage of consistent heap TID order. This change might even have a small performance benefit today: index scans that visit multiple heap pages but only match on a single key will pin each heap page visited only once. Visiting the heap pages in descending order within a B-Tree page full of duplicates, but in ascending order within individual posting lists, could result in unnecessary extra pinning.

* Standardized terminology. We consistently call what the patch adds "deduplication" rather than "compression".

* Added a new section on the design to the nbtree README. This is fairly high level, and talks about dynamics that we can't really discuss anywhere else, such as how nbtsplitloc.c "cooperates" with deduplication, producing an effect that is greater than the sum of its parts.

* I also made some changes to the WAL logging for leaf page insertions and page splits. I didn't add the optimization that you anticipated in your nbtxlog.h comments (i.e. only WAL-log a rewritten posting list when it will go on the left half of the split, just like the new/incoming item handling we already have). I agree that that's a good idea, and it should be added soon.

Actually, I think the whole "new item vs. rewritten posting list item" distinction makes the WAL logging confusing, so this is not really about performance. Maybe the easiest way to do this is also the way that performs best. I'm thinking of this: maybe we could completely avoid WAL-logging the entire rewritten/split posting list. After all, the contents of the rewritten posting list are derived from the existing/original posting list, as well as the new/incoming item. We can make the WAL record much smaller on average by making standbys repeat a little bit of the work performed on the primary. Maybe we could WAL-log "in_posting_offset" itself, plus an ItemPointerData (obviously the new item offset number tells us the offset number of the posting list that must be replaced/memmove()'d). Then have the standby repeat some of the work performed on the primary -- at least the work of swapping a heap TID could be repeated on standbys, since it's very little extra work for standbys but could really reduce the WAL volume. This might actually be simpler.

The WAL logging that I didn't touch in v10 is the most important thing to improve. I am talking about the WAL logging that is performed as part of deduplicating all items on a page to avoid a page split (i.e. the WAL logging within _bt_dedup_one_page()). That still just does a log_newpage_buffer() in v10, which is pretty inefficient. Much like the posting list split WAL logging, the WAL logging in _bt_dedup_one_page() can probably be made more efficient by describing deduplication in terms of logical changes. For example, the WAL records could consist of metadata that a human could read as "merge the tuples from offset number 15 until offset number 27". Perhaps this could also share code with the posting list split stuff.
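To illustrate the "make standbys repeat the work" idea, here is a minimal, hypothetical sketch of what redo could look like when the WAL record carries only the split offset and the incoming heap TID. The `HeapTid` struct and `redo_posting_split` function are stand-ins invented for this sketch, not the actual patch code: the standby rebuilds the rewritten posting list by inserting the new TID at `in_posting_offset`, and the displaced largest TID becomes the item actually inserted on the page, mirroring the swap done on the primary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for PostgreSQL's ItemPointerData (a heap TID) */
typedef struct { uint32_t block; uint16_t offset; } HeapTid;

/*
 * Sketch: rebuild a posting list split from a minimal WAL record that
 * contains only in_posting_offset and the incoming TID, instead of the
 * whole rewritten posting list.  The rewritten list keeps the same
 * number of TIDs; the original list's last (largest) TID is displaced
 * and becomes the "new item" that gets inserted on the page.
 */
static void
redo_posting_split(const HeapTid *orig, int norig,
                   int in_posting_offset, HeapTid new_tid,
                   HeapTid *rewritten, HeapTid *final_new_item)
{
    /* TIDs before the split point are unchanged */
    memcpy(rewritten, orig, in_posting_offset * sizeof(HeapTid));
    /* incoming TID slots into its sorted position */
    rewritten[in_posting_offset] = new_tid;
    /* shift the remainder right by one; the last original TID drops out */
    memcpy(rewritten + in_posting_offset + 1,
           orig + in_posting_offset,
           (norig - in_posting_offset - 1) * sizeof(HeapTid));
    /* the displaced TID is what actually gets inserted as a new tuple */
    *final_new_item = orig[norig - 1];
}
```

The point of the sketch is only that everything the standby needs is derivable from the original posting list plus two small logged values, which is why the WAL record can shrink so much.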
What do you think?

Once we make the WAL logging within _bt_dedup_one_page() more efficient, it also becomes fairly easy to make the deduplication it performs occur incrementally, maybe even very incrementally. I can imagine the _bt_dedup_one_page() caller specifying "my new tuple is 32 bytes, and I'd really like to not have to split the page, so please at least do enough deduplication to make it fit". Delaying deduplication increases the amount of time that we have to set the LP_DEAD bit for remaining items on the page, which might be important. Also, spreading out the volume of WAL produced by deduplication over time might be important with certain workloads. We would still probably do somewhat more work than strictly necessary to avoid a page split if we made _bt_dedup_one_page() incremental like this, though not by a huge amount.

OTOH, maybe I am completely wrong about "incremental deduplication" being a good idea. It seems worth experimenting with, though. It's not that much more work on top of making the _bt_dedup_one_page() WAL logging efficient, which seems like the thing we should focus on now.

Thoughts?
--
Peter Geoghegan
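The "do just enough deduplication to make the new tuple fit" idea can be sketched in a few lines. This is a simplified model, not the patch: `LeafTuple`, the 16-byte header assumption, and `dedup_until_fits` are all invented for illustration. Merging a tuple into the preceding posting tuple of the same key frees that tuple's overhead while its TIDs (6 bytes each, as in nbtree) move into the surviving posting list; the loop stops merging as soon as the caller's space requirement is met.

```c
#include <assert.h>
#include <stddef.h>

#define TID_SIZE 6  /* bytes per heap TID, as in nbtree posting lists */

/* Hypothetical simplified leaf tuple: key, total size in bytes, TID count */
typedef struct { int key; size_t size; int ntids; } LeafTuple;

/*
 * Sketch of incremental deduplication: walk the (key-ordered) tuple
 * array left to right, folding each tuple into the preceding tuple
 * when the keys match -- but only until at least space_needed bytes
 * have been freed.  Remaining duplicates are left alone, preserving
 * the chance to set their LP_DEAD bits later.  Returns bytes freed.
 */
static size_t
dedup_until_fits(LeafTuple *tup, int *ntup, size_t space_needed)
{
    size_t freed = 0;
    int out = 0;

    for (int i = 1; i < *ntup; i++)
    {
        if (freed < space_needed && tup[i].key == tup[out].key)
        {
            /* fold tuple i into the posting tuple at 'out':
             * everything except its TIDs is saved */
            freed += tup[i].size - (size_t) tup[i].ntids * TID_SIZE;
            tup[out].size += (size_t) tup[i].ntids * TID_SIZE;
            tup[out].ntids += tup[i].ntids;
        }
        else
            tup[++out] = tup[i];  /* keep as a separate tuple */
    }
    *ntup = out + 1;
    return freed;
}
```

With, say, five single-TID tuples of 22 bytes each (a notional 16-byte header plus one TID) and a 20-byte requirement, two merges free 32 bytes and the loop stops, leaving the rest of the page untouched.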
Attachments