Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
От | Peter Geoghegan |
---|---|
Тема | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. |
Дата | |
Msg-id | CAH2-Wz=2CUQZ+RhCrn9kdUrZri=5hvj5t=F2mM592nT1rdz4uQ@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Peter Geoghegan <pg@bowt.ie>) |
Ответы |
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.
(Peter Geoghegan <pg@bowt.ie>)
Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>) |
Список | pgsql-hackers |
On Fri, Jul 19, 2019 at 7:24 PM Peter Geoghegan <pg@bowt.ie> wrote: > Hmm. So, the attached test case fails amcheck verification for me with > the latest version of the patch: Attached is a revised version of your v2 that fixes this issue -- I'll call this v3. In general, my goal for the revision was to make sure that all of my old tests from the v12 work passed, and to make sure that amcheck can detect almost any possible problem. I tested the amcheck changes by corrupting random state in a test index using pg_hexedit, then making sure that amcheck actually complained in each case. I also fixed one or two bugs in passing, including the bug that caused an assertion failure in _bt_truncate(). That was down to a subtle off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't make that many changes to your v2. There are probably some things about the patch that I still don't understand, or things that I have misunderstood. Other changes: * We now support system catalog indexes. There is no reason not to support them. * Removed unnecessary code from _bt_buildadd(). * Added my own new DEBUG4 trace to _bt_insertonpg_in_posting(), which I used to fix that bug I mentioned. I agree that we should keep the DEBUG4 traces around until the overall design settles down. I found the ones that you added helpful, too. * Added quite a few new assertions. For example, we need to still support !heapkeyspace (pre Postgres 12) nbtree indexes, but we cannot let them use compression -- new defensive assertions were added to make this break loudly. * Changed the custom binary search code within _bt_compare_posting() to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know of any reason not to do it that way? * Added quite a few "FIXME"/"XXX" comments at various points, to indicate where I have general concerns that need more discussion. * Included my own pageinspect hack to visualize the minimum TIDs in posting lists. It's broken out into a separate patch file. The code is very rough, but it might help someone else, so I thought I'd include it. I also have some new concerns about the code in the patch that I will point out now (though only as something to think about a solution on -- I am unsure myself): * It's a bad sign that compression involves calls to PageAddItem() that are allowed to fail (we just give up on compression when that happens). For one thing, all existing calls to PageAddItem() in Postgres are never expected to fail -- if they do fail we get a "can't happen" error that suggests corruption. It was a good idea to take this approach to get the patch to work, and to prove the general idea, but we now need to fully work out all the details about the use of space. This includes complicated new questions around how alignment is supposed to work. Alignment in nbtree is already complicated today -- you're supposed to MAXALIGN() everything in nbtree, so that the MAXALIGN() within bufpage.c routines cannot be different to the lp_len/IndexTupleSize() length (note that heapam can have tuples whose lp_len isn't aligned, so nbtree could do it differently if it proved useful). Code within nbtsplitloc.c fully understands the space requirements for the bufpage.c routines, and is very careful about it. (The bufpage.c details are supposed to be totally hidden from code like nbtsplitloc.c, but I guess that that ideal isn't quite possible in reality. Code comments don't really explain the situation today.) I'm not sure what it would look like for this patch to be as precise about free space as nbtsplitloc.c already is, even though that seems desirable (I just know that it would mean you would trust PageAddItem() to work in all cases). The patch is different to what we already have today in that it tries to add *less than* a single MAXALIGN() quantum at a time in some places (when a posting list needs to grow by one item). The devil is in the details. * As you know, the current approach to WAL logging is very inefficient. It's okay for now, but we'll need a fine-grained approach for the patch to be commitable. I think that this is subtly related to the last item (i.e. the one about alignment). I have done basic performance tests using unlogged tables. The patch seems to either make big INSERT queries run as fast or faster than before when inserting into unlogged tables, which is a very good start. * Since we can now split a posting list in two, we may also have to reconsider BTMaxItemSize, or some similar mechanism that worries about extreme cases where it becomes impossible to split because even two pages are not enough to fit everything. Think of what happens when there is a tuple with a single large datum, that gets split in two (the tuple is split, not the page), with each half receiving its own copy of the datum. I haven't proven to myself that this is broken, but that may just be because I haven't spent any time on it. OTOH, maybe you already have it right, in which case it seems like it should be explained somewhere. Possibly in nbtree.h. This is tricky stuff. * I agree with all of your existing TODO items -- most of them seem very important to me. * Do we really need to keep BTreeTupleGetHeapTID(), now that we have BTreeTupleGetMinTID()? Can't we combine the two macros into one, so that callers don't need to think about the pivot vs posting list thing themselves? See the new code added to _bt_mkscankey() by v3, for example. It now handles both cases/macros at once, in order to keep its amcheck caller happy. amcheck's verify_nbtree.c received similar ugly code in v3. * We should at least experiment with applying compression when inserting into unique indexes. Like Alexander, I think that compression in unique indexes might work well, given how they must work in Postgres. My next steps will be to study the design of the _bt_insertonpg_in_posting() stuff some more. It seems like you already have the right general idea there, but I would like to come up with a way of making _bt_insertonpg_in_posting() understand how to work with space on the page with total certainty, much like nbtsplitloc.c does today. This should allow us to make WAL-logging more precise/incremental. -- Peter Geoghegan
Вложения
В списке pgsql-hackers по дате отправления: