Thread: On-disk compatibility for nbtree-unique-key enhancement

On-disk compatibility for nbtree-unique-key enhancement

From: Peter Geoghegan
Date:
I would like to get some feedback on dealing with on-disk
compatibility for my nbtree patch, which is in the current CF [1]. I
thought I'd start a new thread on these questions, since it's probably
the most sensitive aspect of the project, and can be discussed in a
fairly contained way.

My patch to make nbtree index tuple keys uniquely identifiable by
using heap TID as a differentiating last attribute/part of the key
space complicates space management within routines like
_bt_findsplitloc(). Though, ideally, not very much. In general, we may
need to make new "pivot" tuples [2] (all of which originate as the new
high key added on the left side of a leaf page split) have an
additional, explicit representation of heap TID. I'm MAXALIGN()'ing
ItemPointerData to fit the "extra" heap TID attribute once an actual
page split is underway, so that's an extra 8 bytes that I might have
to append to a non-pivot leaf tuple when a new high key/pivot is
generated during a leaf page split. Those 8 bytes need to be
accounted for.
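
To illustrate the accounting (a sketch, assuming a typical 64-bit
platform where MAXALIGN is 8 and sizeof(ItemPointerData) is 6):

/*
 * Sketch only: because index tuples are stored MAXALIGN()'d,
 * appending a heap TID to a new pivot costs one full alignment
 * quantum.  "firstright" is illustrative, standing for whichever
 * leaf tuple the new high key is formed from.
 */
Size        extrasize = MAXALIGN(sizeof(ItemPointerData));   /* 8 */
Size        newpivotsize = MAXALIGN(IndexTupleSize(firstright)) +
                           extrasize;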

I want to make a general pessimistic assumption that it's always
necessary to have an extra 8 bytes to fit a heap TID, stolen from the
"1/3 of a buffer" limit that we already enforce as the largest
possible nbtree tuple size. Prior to the current _bt_findsplitloc()
coding (circa 1999), there were bugs involving a failure to find a
split point, and I'm keen to avoid more problems along those lines.
Actually, I think that it's essential to keep the space management
logic as simple as possible, with everything ideally confined to a
single page (and not reliant on things that are true across pages). So
I feel reasonably confident that the best thing to do is impose
certain theoretical costs on users in the short term, rather than
doing something fancy and brittle that might be much harder to
stabilize in the long term.

More concretely:

* My patch changes the definition of BTMaxItemSize() for indexes on
the proposed new BTREE_VERSION, v4. We insist that new tuples must be
less than or equal to 2704 bytes in size, including all IndexTuple
overhead, rather than the current 2712 bytes (see the sketch after
this list).

* Under this scheme, the "internal" limit will remain 2712 bytes, so
that pivot tuples that are already at the proposed new user-visible
maximum of 2704 can safely have another 8 bytes added.

* This means that there is a compatibility issue for anyone that is
already right on the threshold -- we *really* don't want to see a
REINDEX fail, but that seems like a possibility that we need to talk
about now.
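
To make those numbers concrete, here is one plausible formulation of
the macro -- a sketch, not the patch's final coding -- assuming the
standard 8192-byte BLCKSZ and a MAXALIGN of 8:

/*
 * Sketch: reserve one ItemPointerData per tuple (MAXALIGN'd along
 * with the other per-item overhead) against a possible pivot heap
 * TID.  With BLCKSZ = 8192 this works out to 2704 bytes; dropping
 * the ItemPointerData term recovers the current 2712 byte limit.
 */
#define BTMaxItemSize(page) \
    MAXALIGN_DOWN((PageGetPageSize(page) - \
                   MAXALIGN(SizeOfPageHeaderData + \
                            3 * sizeof(ItemIdData) + \
                            3 * sizeof(ItemPointerData)) - \
                   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)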

We've documented that there is this 1/3 of a page restriction, but
specify that it's "approximate", so I don't think our documentation as
written would need to be changed (though a release note item would
certainly be in order). Also, since the vagaries of how well TOAST can
compress are very hard to understand, I have to imagine that almost
nobody relies on having the last 8 bytes -- if they do, then they're
playing with fire.

Not being able to REINDEX is still really awful, however unrealistic
the scenario may be. I can imagine ways of avoiding it, but I'd really
rather not go there, since the cure is worse than the disease. I'll
still throw out some ideas for a cure, though:

* Maybe we could rely on the fact that internal pages are the only
pages that can contain more than one maybe-extra-sized 2720-byte
pivot tuple, since in practice they also always have one "negative
infinity" item. Those are always 16 bytes, so it works out.

We've done something a bit like this before [3], though I can see big
downsides to going further with it. It seems pretty grotty to me. I
actually would like to *not* truncate outside of the leftmost page on
the level, so that I can use the not-actually-negative-infinity first
item in internal pages as a "low key". This would make low-overhead
prefix compression within internal pages work. (This is not to be
confused with leaf level prefix compression, which the DBA would have
to enable.)

* Alternatively, we could lift the general restriction on out-of-line
storage within indexes, but I think that that would probably break
certain things in a way that would go undetected for a long time.

It would break amcheck's heapallindexed check, for one thing, and it
might have implications for buffer locking protocols. In general,
out-of-line storage within indexes seems like it works against what an
index is supposed to do. And, it's a lot of effort for a small to
nonexistent reward.

Does it seem like I have the right general idea here?

[1] https://commitfest.postgresql.org/19/1787/
[2]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=3680e69b89a8458d58f6c3361eb612788fa33786;hb=df8b5f3eb8a7c477156d0ad9d83e7297912cfe79#l612
[3]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/itup.h;h=bd3a702380954bc69c9536876094249430cb7c72;hb=df8b5f3eb8a7c477156d0ad9d83e7297912cfe79#l139
-- 
Peter Geoghegan


Re: On-disk compatibility for nbtree-unique-key enhancement

From: Peter Eisentraut
Date:
On 21/09/2018 01:18, Peter Geoghegan wrote:
> * This means that there is a compatibility issue for anyone that is
> already right on the threshold -- we *really* don't want to see a
> REINDEX fail, but that seems like a possibility that we need to talk
> about now.

When would the REINDEX need to happen?  Will the code still be able to
read and write v3 btrees?  Could there perhaps be an amcheck or
pageinspect feature that tells you ahead of time if there are too large
items in an old index?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: On-disk compatibility for nbtree-unique-key enhancement

From: Peter Geoghegan
Date:
On Fri, Sep 28, 2018 at 8:00 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 21/09/2018 01:18, Peter Geoghegan wrote:
> > * This means that there is a compatibility issue for anyone that is
> > already right on the threshold -- we *really* don't want to see a
> > REINDEX fail, but that seems like a possibility that we need to talk
> > about now.
>
> When would the REINDEX need to happen?  Will the code still be able to
> read and write v3 btrees?

The patch doesn't do that at the moment, because I've been busy
refining it, and because there are a couple of outstanding questions
about how to go about it -- the questions I pose on this thread. I
accept that it's absolutely essential that nbtree be able to read
both v2 and v3 indexes as part of any new v4, without a measurable
performance penalty. That's the standard that this project should be
held to.
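
As a sketch of the kind of dispatch I have in mind (the
"heapkeyspace" name is illustrative, nothing is settled), per-index
behavior could simply key off the btree metapage version:

/*
 * Sketch: does this index use the v4 heap-TID key space?  A real
 * implementation would cache the result (e.g. in rd_amcache)
 * rather than visit the metapage every time.
 */
static bool
_bt_heapkeyspace(Relation rel)
{
    Buffer      metabuf = _bt_getbuf(rel, BTREE_METAPAGE, BT_READ);
    Page        metapg = BufferGetPage(metabuf);
    BTMetaPageData *metad = BTPageGetMeta(metapg);
    bool        result = (metad->btm_version >= 4);

    _bt_relbuf(rel, metabuf);
    return result;
}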

A REINDEX will never *need* to happen. v2 and v3 indexes will
gradually go extinct, without many users actually noticing.

The on-disk representation of my patch leaves several status bits in
INDEX_ALT_TID_MASK tuples free (3 of the 4 will remain, since I'm now
using 1 for BT_HEAP_TID_ATTR), so it should be easier to add
various further enhancements to a v5 or v6 of nbtree. This is similar
to how changes to GIN were managed in the past (it may be interesting
to look at a GIN leaf page with pg_hexedit, since it'll show you the
gory details in a relatively accessible way). I can imagine an
INDEX_ALT_TID_MASK bit being used for tuples that point to the heap
-- not just for pivot tuples. I have an eye on things like duplicate
lists on the leaf level, which would probably work like a GIN posting
list.
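
For illustration, the bit layout could look something like this (the
mask names and exact values here are mine, and aren't final):

/*
 * Sketch: in an INDEX_ALT_TID_MASK tuple, the item pointer's offset
 * number isn't needed for addressing, so its high bits can carry
 * per-tuple status flags.
 */
#define BT_RESERVED_OFFSET_MASK 0xF000  /* 4 status bits */
#define BT_N_KEYS_OFFSET_MASK   0x0FFF  /* number of key attributes */
#define BT_HEAP_TID_ATTR        0x1000  /* pivot has trailing heap TID */
/* 0x2000, 0x4000 and 0x8000 remain free for a future v5 or v6 */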

> Could there perhaps be an amcheck or
> pageinspect feature that tells you ahead of time if there are too large
> items in an old index?

That would be easy, but it might not be any better than just having
REINDEX or CREATE INDEX [CONCURRENTLY] throw an error. They're already
pretty fast. I could easily raise a WARNING when amcheck is run
against a pre-v4 index that has an index tuple that's too big to fit
under the lower limit. Actually, I could even write an SQL query that
uses pageinspect to notice affected tuples, without changing any C
code.
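
Something along these lines would do for the WARNING -- a sketch
only, written against the current definition of BTMaxItemSize(), with
the v4 limit expressed as the old limit minus the 8-byte heap TID
reservation discussed upthread:

/*
 * Sketch: warn about a tuple in a pre-v4 index that would not fit
 * under the tightened v4 limit.
 */
static void
check_tuple_fits_v4(Page page, IndexTuple itup)
{
    Size        v4limit = BTMaxItemSize(page) -
                          MAXALIGN(sizeof(ItemPointerData));

    if (IndexTupleSize(itup) > v4limit)
        ereport(WARNING,
                (errmsg("index tuple of %zu bytes exceeds v4 maximum of %zu bytes",
                        (Size) IndexTupleSize(itup), v4limit)));
}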

Bear in mind that TOAST compression accidentally plays a big role
here. It makes it very unlikely that indexes in the field are right at
the old 2712 byte threshold, without even 8 bytes of wiggle room,
because it's impossible to predict how well the pglz compression will
work with that kind of precision. Several highly improbable things
need to happen at the same time before REINDEX can break. I cannot see
how any app could have evolved to depend on having 2712 bytes, without
even a single MAXALIGN() quantum to spare.

I wrote a stress test around the new "1/3 of a page" restriction. It
involved a large text attribute with PLAIN storage, since I couldn't
sensibly test the restriction while using pglz compression in the
index. When all of your tuples are 2704 bytes, you end up with a
ridiculously tall B-Tree that performs horribly. I think that I saw
it reach 11 levels with the test case. The tallest B-Tree that
you'll ever see in the wild is probably one that's 5 levels deep,
which is very tall indeed. Because of the logarithmic nature of how a
new level is added to a B-Tree, 11 levels is just ludicrous. (Granted,
you only have to have one tuple that's precisely 2712 bytes in length
for REINDEX to break.)
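
The back-of-the-envelope arithmetic behind that observation, as a
standalone sketch with round illustrative numbers:

#include <math.h>
#include <stdio.h>

/*
 * Rough height estimate: with 2704-byte tuples, only ~3 fit on an
 * 8KB page, so fanout collapses to ~3; an index with modest-sized
 * keys has a fanout in the hundreds.
 */
int
main(void)
{
    double      leaf_pages = 1e6;

    printf("fanout   3: ~%.0f levels\n",
           ceil(log(leaf_pages) / log(3.0)));
    printf("fanout 300: ~%.0f levels\n",
           ceil(log(leaf_pages) / log(300.0)));
    return 0;
}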

-- 
Peter Geoghegan