Re: WIP: Covering + unique indexes.

From: Alexander Korotkov
Subject: Re: WIP: Covering + unique indexes.
Date:
Msg-id: CAPpHfdt5oj0+h7GaNJtRf+kePN5V9uRr0fhFnOgghYQqrxMHfg@mail.gmail.com
In response to: Re: WIP: Covering + unique indexes.  (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: WIP: Covering + unique indexes.  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
Hi!

Thank you for the review.  A revised patchset is attached.

On Thu, Apr 5, 2018 at 5:40 AM, Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Apr 4, 2018 at 3:09 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> Thank you for review!  Revised patchset is attached.

Cool.

* btree_xlog_split() still has this code:

    /*
     * On leaf level, the high key of the left page is equal to the first key
     * on the right page.
     */
    if (isleaf)
    {
        ItemId      hiItemId = PageGetItemId(rpage, P_FIRSTDATAKEY(ropaque));

        left_hikey = (IndexTuple) PageGetItem(rpage, hiItemId);
        left_hikeysz = ItemIdGetLength(hiItemId);
    }

However, we never fail to store the high key now, even at the leaf
level, because of this change to the corresponding point in
_bt_split():

So should we remove the first block of code?

Right, I think there is absolutely no need for this code.  It's removed in
the attached patchset.
 
Note also that this
existing comment has been made obsolete:

/* don't release the buffer yet; we touch right page's first item below */

/* Now reconstruct left (original) sibling page */
if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)

Maybe we *should* release the right sibling buffer at the point of the
comment now?

Agreed.  We don't need to hold the right buffer to get the high key from it.
The only remaining concern is concurrency on a standby.  But the right page
is unreferenced at this point, and nobody should try to read it before we
finish the split.
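
To illustrate the resulting shape of the redo code (a sketch, assuming the
left page's high key is always taken from the block 0 data of the split
record; the attached patch is authoritative):

    if (XLogReadBufferForRedo(record, 0, &lbuf) == BLK_NEEDS_REDO)
    {
        Size        datalen;
        char       *datapos = XLogRecGetBlockData(record, 0, &datalen);

        /*
         * The high key of the left page now always comes from the WAL
         * record, even on the leaf level, so the right sibling buffer
         * could already be released before this point.
         */
        IndexTuple  left_hikey = (IndexTuple) datapos;
        Size        left_hikeysz = MAXALIGN(IndexTupleSize(left_hikey));

        datapos += left_hikeysz;
        datalen -= left_hikeysz;

        /* ... reconstruct the left (original) sibling page as before ... */
    }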
 
* _bt_mkscankey() should assert that the IndexTuple has the correct
number of attributes.

I don't expect you to change routines like _bt_mkscankey() so they
actually respect the number of attributes from BTreeTupGetNAtts(),
rather than just relying on IndexRelationGetNumberOfKeyAttributes().
However, an assertion seems well worthwhile. It's a big reason for
having BTreeTupGetNAtts().
 
OK, I've added an assertion that the number of tuple attributes should be
either natts or nkeyatts, because we call _bt_mkscankey() for
pivot index tuples too.
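
For reference, the assertion amounts to something like this (a sketch; the
exact placement inside _bt_mkscankey() may differ in the patch):

    /* _bt_mkscankey() is called for pivot tuples too, so accept either count */
    Assert(BTreeTupGetNAtts(itup, rel) == IndexRelationGetNumberOfAttributes(rel) ||
           BTreeTupGetNAtts(itup, rel) == IndexRelationGetNumberOfKeyAttributes(rel));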

This also lets you get rid of at least one assertion from
_bt_doinsert(), I think.

If you're talking about these assertions
 
Assert(IndexRelationGetNumberOfAttributes(rel) != 0);
indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
Assert(indnkeyatts != 0);

then I would rather leave them both.  Knowing that the index tuple's
attribute count is either natts or nkeyatts doesn't guarantee that
both natts and nkeyatts are non-zero.

* _bt_isequal() should assert that the IndexTuple was not truncated.

Agreed.  The assertion is added.  I had to change the signature of _bt_isequal()
to do that.  However, that shouldn't cause any problem: _bt_isequal()
is static anyway.
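
Roughly, at the top of _bt_isequal() it now does something like this (a
sketch; the relation is passed in so the attribute count can be checked):

    itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

    /* Unique checking runs on the leaf level only, so no pivot tuple here */
    Assert(BTreeTupGetNAtts(itup, rel) == IndexRelationGetNumberOfAttributes(rel));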

* The order could be switched here:

> @@ -443,6 +443,17 @@ _bt_compare(Relation rel,
>     if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
>         return 1;
>
> +   /*
> +    * Check tuple has correct number of attributes.
> +    */
> +   if (unlikely(!_bt_check_natts(rel, page, offnum)))
> +   {
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INTERNAL_ERROR),
> +                errmsg("tuple has wrong number of attributes in index \"%s\"",
> +                       RelationGetRelationName(rel))));
> +   }

In principle, we should also check _bt_check_natts() for "minus
infinity" items, just like you did within verify_nbtree.c. Also, there
is no need for parentheses here.

Right.  The order is switched: we now check the number of attributes
before checking for the "minus infinity" item.  The extra parentheses are removed.
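
So the beginning of _bt_compare() now reads roughly like this (a sketch of
the revised ordering):

    /* Complain if the tuple has an unexpected number of attributes */
    if (unlikely(!_bt_check_natts(rel, page, offnum)))
        ereport(ERROR,
                (errcode(ERRCODE_INTERNAL_ERROR),
                 errmsg("tuple has wrong number of attributes in index \"%s\"",
                        RelationGetRelationName(rel))));

    /*
     * Force result ">" if target item is first data item on an internal page
     * --- see NOTE above.
     */
    if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
        return 1;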
 
* Maybe _bt_truncate_tuple() should assert that the caller has not
tried to truncate a tuple that has already been truncated.

I'm not sure if our assertion should be quite that strong, but I think
that that might be good because in general we only need to truncate on
the leaf level -- truncating at any other level on the tree (e.g.
doing traditional suffix truncation) is always subtly wrong. What we
definitely should do, at a minimum, is make sure that attempting to
truncate a tuple to 2 attributes when it already has 0 attributes
fails with an assertion failure.

Can you try adding the strong assertion (truncate only once) to
_bt_truncate_tuple()? Maybe that's not possible, but it seems worth a
try.

I've done so.  Tests are passing for me.
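
The assertion itself is small; something along these lines in
_bt_truncate_tuple(), assuming its arguments are the index relation and the
source tuple (details may differ from the patch):

    /* Caller must not pass an already-truncated (pivot) tuple */
    Assert(BTreeTupGetNAtts(olditup, idxrel) ==
           IndexRelationGetNumberOfAttributes(idxrel));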

* I suggest we invent a new flag for 0x2000 within itup.h, to replace
"/* bit 0x2000 is reserved for index-AM specific usage */".

We can call it INDEX_AM_RESERVED_BIT. Then, we can change
INDEX_ALT_TID_MASK to use this rather than a raw 0x2000. We can do the
same for INDEX_MOVED_BY_SPLIT_MASK within hash.h, too. I find this
neater.
 
Good point, done.
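
For the archives, the defines come out looking about like this (a sketch;
exact comment wording differs):

    /* itup.h: bit 0x2000 of t_info is reserved for index-AM specific usage */
    #define INDEX_AM_RESERVED_BIT       0x2000

    /* nbtree.h: nbtree uses the reserved bit to mark tuples with alternative TIDs */
    #define INDEX_ALT_TID_MASK          INDEX_AM_RESERVED_BIT

    /* hash.h: hash uses the same reserved bit for tuples moved by a bucket split */
    #define INDEX_MOVED_BY_SPLIT_MASK   INDEX_AM_RESERVED_BIT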

* We should "use" one of the 4 new status bits that are available from
an offset (for INDEX_ALT_TID_MASK index tuples) for future use by leaf
index tuples. Perhaps call it BT_ALT_TID_NONPIVOT.
 
Hmm, we have four bits reserved.  But I'm not sure whether we would use
*all* of them for non-pivot tuples.  Probably we would use some of them for
pivot tuples.  I don't know that in advance.  Thus, I propose not to
rename this.  But I've added a comment that non-pivot tuples might also
use those bits.

I guess you could say that I want us to reserve one of our 4 reserve bits.

Sorry, I didn't get which particular further use of the reserved bits you mean.
Do you mean key normalization?
 
* I think that you could add to this:

> +++ b/src/backend/access/nbtree/README
> @@ -590,6 +590,10 @@ original search scankey is consulted as each index entry is sequentially
>  scanned to decide whether to return the entry and whether the scan can
>  stop (see _bt_checkkeys()).
>
> +We use term "pivot" index tuples to distinguish tuples which don't point
> +to heap tuples, but rather used for tree navigation.  Pivot tuples includes
> +all tuples on non-leaf pages and high keys on leaf pages.

I like what you came up with, and where you put it, but I would add
another few sentences: "Note that pivot index tuples are only used to
represent which part of the key space belongs on each page, and can
have attribute values copied from non-pivot tuples that were deleted
and killed by VACUUM some time ago. In principle, we could truncate
away attributes that are not needed for a page high key during a leaf
page split, provided that the remaining attributes distinguish the
last index tuple on the post-split left page as belonging on the left
page, and the first index tuple on the post-split right page as
belonging on the right page. This optimization is sometimes called
suffix truncation, and may appear in a future release. Since the high
key is subsequently reused as the downlink in the parent page for the
new right page, suffix truncation can increase index fan-out
considerably by keeping pivot tuples short. INCLUDE indexes similarly
truncate away non-key attributes at the time of a leaf page split,
increasing fan-out."

Thank you for writing that explanation.  Looks good.

> Good point.  Tests with "heapallindexed" were added.  I also find it useful to
> check both an index built by sorting and an index built by insertions, because
> there are different ways of forming tuples.

Right. It's a good cross-check for things like that. We'll have to
teach bt_tuple_present_callback() to normalize the representation in
some way for the BT_ALT_TID_NONPIVOT case in the future. But it
already talks about normalizing for reasons like this, so that's okay.

Ok. 

* I think you should add a note about BT_ALT_TID_NONPIVOT to
bt_tuple_present_callback(), though. If it cannot be sure that every
non-pivot tuple will have the same representation, amcheck will have
to normalize to the most flexible representation before hashing.

Ok.  I've added a relevant comment.
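
For context, what saves us here is that bt_tuple_present_callback()
fingerprints a freshly formed tuple rather than the on-page bytes, so an
alternative physical representation of a non-pivot tuple wouldn't change the
hash.  Roughly (a sketch of that logic; state->filter is the check's Bloom
filter):

    /* Build a normalized tuple from the heap values before hashing it */
    itup = index_form_tuple(RelationGetDescr(index), values, isnull);
    itup->t_tid = htup->t_self;
    bloom_add_element(state->filter, (unsigned char *) itup, IndexTupleSize(itup));
    pfree(itup);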

> Ok.  I've tried to remove both assertions and the "+1" hack.  That works
> for me.  However, I had to touch a lot of places; not sure if that's a
> problem.

Looks good to me. If it makes an assertion fail, that's probably a
good thing, because it would have been broken before anyway.

Ok.
 
* You missed this comment, which is now not accurate:

> + * It's possible that index tuple has zero attributes (leftmost item of
> + * iternal page).  And we have assertion that offset number is greater or equal
> + * to 1.  This is why we store (number_of_attributes + 1) in offset number.
> + */
 
Right.  This comment is no longer needed; it's removed.

I can see that it is actually 0 for a minus infinity item, which is good.

Ok.
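
To spell out the encoding: when INDEX_ALT_TID_MASK is set, the attribute
count sits in the low bits of the t_tid offset number, the high bits remain
reserved for future status flags, and the "minus infinity" item simply stores 0.
Roughly (mask names here are illustrative; see the patch for the exact
definitions):

    #define BT_N_KEYS_OFFSET_MASK   0x0FFF  /* attribute count */
    #define BT_RESERVED_OFFSET_MASK 0xF000  /* reserved status bits */

    #define BTreeTupGetNAtts(itup, rel) \
        ( \
            ((itup)->t_info & INDEX_ALT_TID_MASK) ? \
            (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK) : \
            IndexRelationGetNumberOfAttributes(rel) \
        )
    /* the "minus infinity" downlink on an internal page stores 0 here */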
 
> I wrote some comment there.  Please, check it.

The nbtsort.c comments could maybe do with some tweaks from a native
speaker, but look correct.

> Regarding !P_ISLEAF(), I think we should check every item on both
> leaf and non-leaf pages.  I think that is how the code now works, unless I'm
> missing something.

It does, and should. Thanks.

> Thanks for pointing that out.  Since there are now three cases, including handling
> of "minus infinity" items, the comment is now split into three.

That looks good. Thanks.

Ok.

Right now, it looks like every B-Tree index could use
INDEX_ALT_TID_MASK, regardless of whether or not it's an INCLUDE
index. I think that that's fine, but let's say so in the paragraph
that introduces INDEX_ALT_TID_MASK. This patch establishes that any
nbtree pivot tuple could have INDEX_ALT_TID_MASK set, and that's
something that can be expected. It's also something that might not be
set when pg_upgrade was used, but that's fine too.

I've added comment about that.

> I don't see a relation between this patchset and SSI.  We just
> change the representation of some index tuples in pages.  However,
> we didn't change the order of page modification, the order
> of page lookup and so on.  Yes, we change the size of some tuples,
> but B-tree already worked with tuples of variable sizes.  So, the fact
> that tuples now have a different size shouldn't affect SSI.  Right now,
> I'm not sure if CheckForSerializableConflictIn() just above
> _bt_doinsert() is a good idea.  But even if so, I think that should be
> the subject of a separate patch.

My point was that nothing changes, because we already use what
_bt_doinsert() calls the "first valid" page. Maybe just add: "(This
reasoning also applies to INCLUDE indexes, whose extra attributes are
not considered part of the key space.)".

Ok.  I've added this comment. 

This patchset also incorporates documentation enhancements by Erik Rijkers and
a sentence stating that indexes with included columns are also called
"covering indexes".

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

 