Re: WIP: Covering + unique indexes.

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: WIP: Covering + unique indexes.
Дата
Msg-id CAPpHfdva7aS=DMGPXKe8fV5fe3w8tXzzREndZvgaW+gWvpsT7w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: WIP: Covering + unique indexes.  (Peter Geoghegan <pg@bowt.ie>)
Ответы Re: WIP: Covering + unique indexes.  (Peter Geoghegan <pg@bowt.ie>)
Список pgsql-hackers
On Mon, Apr 2, 2018 at 1:18 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Sun, Apr 1, 2018 at 10:09 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>> So? GIN doesn't have the same legacy at all. The GIN posting lists
>> *don't* have regular heap TID pointers at all. They started out
>> without them, and still don't have them.
>
>
> Yes, GIN never stored heap TID pointers in t_tid of index tuple.  But GIN
> assumes that heap TID pointer has at most 11 significant bits during
> posting list encoding.

I think that we should avoid assuming things, unless the cost of
representing them is too high, which I don't think applies here. The
more defensive general purpose code can be, the better.

I thought abut that another time and I decided that it would be safer
to use 13th bit in index tuple flags.  There are already attempt to
use whole 6 bytes of tid for not heap pointer information [1].  Thus, it
would be safe to use 13th bit for indicating alternative offset meaning
in pivot tuples, because it wouldn't block further work.  Revised patchset
in the attachment implements it.

> I don't think we should use assertions, because they are typically disabled
> on
> production PostgreSQL builds.  But we can have some explicit check in some
> common path.  In the attached patch I've such check to _bt_compare().
> Probably,
> together with amcheck, that would be sufficient.

Good idea -- a "can't happen" check in _bt_compare seems better, which
I see here:

> diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
> index 51dca64e13..fcf9832147 100644
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
>     if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
>         return 1;
>
> +   /*
> +    * Check tuple has correct number of attributes.
> +    */
> +   if (!_bt_check_natts(rel, page, offnum))
> +   {
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INTERNAL_ERROR),
> +                errmsg("tuple has wrong number of attributes in index \"%s\"",
> +                       RelationGetRelationName(rel))));
> +   }
> +
>     itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

It seems like it might be a good idea to make this accept an
IndexTuple, though, to possibly save some work.

I don't know.  We still need an offset number to check expected number
of attributes.  Passing index tuple as separate attribute would be
redundant and open door for extra possible errors.
 
lso, perhaps this
should be an unlikely() condition, if only because it makes the intent
clearer (might actually matter in a tight loop like this too, though).
 
OK, marked that check as unlikely().

Do you store an attribute number in the "minus infinity" item (the
leftmost one of internal pages)? I guess that that should be zero,
because it's totally truncated.

Yes, I store zero number of attributes in "minus infinity" item.  See this
part of the patch.

@@ -2081,7 +2081,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
  left_item_sz = sizeof(IndexTupleData);
  left_item = (IndexTuple) palloc(left_item_sz);
  left_item->t_info = left_item_sz;
- ItemPointerSet(&(left_item->t_tid), lbkno, P_HIKEY);
+ ItemPointerSetBlockNumber(&(left_item->t_tid), lbkno);
+ BTreeTupSetNAtts(left_item, 0);

However, note that I've to store (number_of_attributes + 1) in the offset
in order to correctly store zero number of attributes.  Otherwise, assertion
is faised in ItemPointerIsValid() macro.

/*
 * ItemPointerIsValid
 * True iff the disk item pointer is not NULL.
 */
#define ItemPointerIsValid(pointer) \
((bool) (PointerIsValid(pointer) && ((pointer)->ip_posid != 0)))

>> The fact is that that information can go out of date almost
>> immediately, whereas high keys usually last forever. The only reason
>> that there is a heap TID in the high key is because we'd have to add
>> special code to remove it; not because it has any real value. I find
>> it very hard to imagine it being used in a forensic situation. If you
>> actually wanted to do this, the key itself is probably enough -- you
>> probably wouldn't need the TID.
>
>
> I don't know,  When I wrote my own implementation of B-tree and debug
> it, I found saving hikeys "as is" to be very valuable for debugging.

I would like to see your implementation at some point. That sounds interesting.

It was in-memory B-tree [2].  Will be published one day.
 
> However, B-trees in PostgreSQL are quite mature, and probably
> don't need so much debug information.

Today, the highkey at the leaf level is an exact copy of the right
sibling's first item immediately after the split. The absence of a
usable heap TID offset (due to using it for number of attributes in
high keys) is unlikely to make it harder to locate that right
sibling's first item (to get a full heap TID), which could have moved
a lot further right after the split, or even have been removed
entirely. It could now be ambiguous where it wouldn't have been before
in the event of duplicates, but it's unlikely. And when it does
happen, it's unlikely to matter.

We can still include the heap block number, I suppose. I think of the
highkey as only having one simple job -- separating the keyspace
between siblings. We actually have a very neat choke point to check
that it does that one job -- when a high key is generated for a page
split at the leaf level. If we were doing generic suffix truncation,
we'd add a test that made sure that the high key was strictly greater
than the last item on the left, and strictly less than the first item
on the right. As I said yesterday, I don't like how we allow a highkey
to be equal to both sides of the split, which goes against L&Y, and I
think that we would at least be strict about < and > for suffix
truncation.

The highkey's actual value can be invented, provided it does this one
simple job, which needs to be assessed only once at our "neat choke
point". Everything else falls into place afterwards, since that's
where teh downlink actually comes from. You can check it during a leaf
page split while debugging (that's the neat choke point). That's why
the high key doesn't seem very interesting from a debuggability
perspective.

OK.  So, I mentioned that storing hikeys "as is" might be useful
for debug in some cases.  That doesn't mean it's irreplaceable.  For sure,
there are other ways to debug and obtain debugging information which
could be even better in certain situation.

>> Nobody asked you to write a suffix truncation patch. That has
>> complexity above and beyond what the covering index patch needs. I
>> just expect it to be compatible with an eventual suffix truncation
>> patch, which you've now shown is quite possible. It is clearly a
>> complimentary technique.
>
>
> OK, but change of on-disk tuple format also changes what people
> see in pageinspect.  Right now, they see "1" as offset for tuples in intenal
> page and hikeys.  After patch, they would see some large values
> (assuming we set some of hi bits) in offset.  I'm not sure it's OK.
> We probably should change display of index tuples in pageinspect.

This reminds me of a discussion I had with Robert Haas about
pageinspect + t_infomask bits. Robert thought that we should show the
composite bits as single constants, where we do that (with things like
HEAP_XMIN_FROZEN). I disagreed, saying I think that we should just
show "the bits that are on the page", while also documenting that this
situation exists in pageinspect directly.

I think something similar here. I think it's okay to just show offset,
provided it is documented. We have a number of odd things within
nbtree that I actually saw to it were documented, such as the "minus
infinity" item on internal pages, which looks odd and out of places. I
remember Tatsuo Ishii asked about it before this happened. It seems
helpful to show what's really there, and offer guidance on how to
interpret it. I actually thought carefully about many things like this
for pg_hexedit, which tries to be very consistent and logical, uses
color to suggest meaning, and so on.

Anyway, that's what I think about it, though I wouldn't really care if
I lost that particular argument and we did something special with
internal page offset in pageinspect. It seems like a matter of
opinion, or aesthetics.
 
I just thought that users might be confused when "1" offset in internal
pages became "32769" or something.  However, with attached patchset
it would be as least some more obvious value which is easier to interpret
in decimal form.

> I'd like to note that I really appreciate your attention to this patch
> as well as other patches.

Thanks. I would like to thank Anastasia and you for your patience and
perseverance, despite what I see as mistakes in how this project was
manged. I really want for it to be possible for there to be more
patches in the nbtree code, because they're really needed. That was a
big part of my motivation for writing amcheck, in fact. It's tedious
to link this patch to a bigger picture about what we need to do with
nbtree in the next 5 years, but I think that that's what it will take
to get this patch in. That's my opinion.

Yes.  But that depends on how difficulty to adopt patch to big picture
correlate with difficulty, which non-adopted patch makes to that big
picture.  My point was that second difficulty isn't high.  But we can be
satisfied with implementation in the attached patchset (probably some
small enhancements are still required), then the first difficulty isn't high too.


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Следующее
От: Craig Ringer
Дата:
Сообщение: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS