Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
From | Peter Geoghegan
Subject | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
Date |
Msg-id | CAH2-Wzme6NA70bkHK=Jrjx+bkHr9DAj9gX+y5OpmvEVNy0eqVQ@mail.gmail.com
In reply to | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>)
Responses | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>)
List | pgsql-hackers
On Tue, Jan 14, 2020 at 6:08 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Still no progress on these items, but I am now posting v30. A new
> version seems warranted, because I now want to revive a patch from a
> couple of years back as part of the deduplication project -- it would
> be good to get feedback on that sooner rather than later.

Actually, I decided that this wasn't necessary -- I won't be touching compactify_tuples() at all (at least not right now). Deduplication doesn't even need to use PageIndexMultiDelete() in the attached revision of the patch, v31, so speeding up compactify_tuples() is no longer relevant.

v31 simplifies everything quite a bit. This is something that I came up with more or less as a result of following Heikki's feedback. I found that reviving the v17 approach of using a temp page buffer in _bt_dedup_one_page() (much like _bt_split() always has) was a good idea. This approach was initially revived in order to make dedup WAL logging work on a whole-page basis -- Heikki suggested we do it that way, and so now we do. But this approach is also a lot faster in general, and has additional benefits besides that.

When we abandoned the temp buffer approach back in September of last year, the unique index stuff was totally immature and unsettled, and it looked like a very incremental approach might make sense for unique indexes. It doesn't seem like a good idea now, though. In fact, I no longer even believe that a custom checkingunique/unique index strategy in _bt_dedup_one_page() is useful. That is also removed in v31, which will also make Heikki happy -- he expressed a separate concern about the extra complexity there.

I've done a lot of optimization work since September, making these simplifications possible now. The problems that I saw that justified the complexity seem to have gone away. I'm pretty sure that the recent _bt_check_unique() posting list tuple _bt_compare() optimization is the biggest part of that.
The checkingunique/unique index strategy in _bt_dedup_one_page() always felt overfitted to my microbenchmarks, so I'm glad to be rid of it. Note that v31 changes nothing about how we think about deduplication in unique indexes in general, nor about how it is presented to users. There are still special criteria around how deduplication is *triggered* in unique indexes: we continue to trigger a deduplication pass based on seeing a duplicate within _bt_check_unique() + _bt_findinsertloc() -- otherwise we never attempt deduplication in a unique index (same as before). Plus the GUC still doesn't affect unique indexes, unique index deduplication still isn't really documented in the user docs (it just gets a passing mention in the B-Tree internals section), etc.

In my opinion, the patch is now pretty close to being committable. I do have two outstanding open items for the patch, though:

* We still need infrastructure that marks B-Tree opclasses as safe for deduplication, to avoid things like the numeric display scale problem, collations that are unsafe for deduplication because they're nondeterministic, etc. I talked to Anastasia about this over private e-mail recently. This is going well; I'm expecting a revision later this week. It will be based on all feedback to date over on the other thread [1] that we have for this part of the project.

* Make VACUUM's WAL record more space efficient when it contains one or more "updates" to an existing posting list tuple. Currently, when VACUUM must delete some but not all TIDs from a posting list, we generate a new posting list tuple and dump it into the WAL stream -- the REDO routine simply overwrites the existing item with a version lacking the TIDs that have to go. This can be space inefficient with certain workloads, such as those where only one or two TIDs are deleted from a very large posting list tuple again and again. Heikki suggested I do something about this.
I intend to at least research the problem, and can probably go ahead with implementing it without any trouble. What nbtree VACUUM does in the patch right now is roughly the same as what GIN's VACUUM does for posting lists within posting tree pages -- see ginVacuumPostingTreeLeaf(). (We're horribly inefficient about WAL logging when VACUUM'ing a GIN entry tree leaf page, which works differently and isn't what I'm talking about -- see ginVacuumEntryPage().) We might as well do better than GIN/ginVacuumPostingTreeLeaf() here if we can. The patch is pretty clever about minimizing the volume of WAL in all other contexts, managing to avoid any other case of what could be described as "WAL space amplification". Maybe we should do the same with the xl_btree_vacuum record just to be *consistent* about it.

[1] https://www.postgresql.org/message-id/flat/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ@mail.gmail.com

--
Peter Geoghegan
Attachments