Re: Making all nbtree entries unique by having heap TIDs participatein comparisons

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
Дата
Msg-id 7061281c-74a9-558b-2b4c-8aeb26b9b894@iki.fi
обсуждение исходный текст
Ответ на Re: Making all nbtree entries unique by having heap TIDs participatein comparisons  (Peter Geoghegan <pg@bowt.ie>)
Ответы Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
Список pgsql-hackers
On 08/03/2019 12:22, Peter Geoghegan wrote:
> I would like to work through these other items with you
> (_bt_binsrch_insert() and so on), but I think that it would be helpful
> if you made an effort to understand the minusinfkey stuff first. I
> spent a lot of time improving the explanation of that within
> _bt_compare(). It's important.

Ok, after thinking about it for a while, I think I understand the minus 
infinity stuff now. Let me try to explain it in my own words:

Imagine that you have an index with two key columns, A and B. The index 
has two leaf pages, with the following items:

+--------+   +--------+
| Page 1 |   | Page 2 |
|        |   |        |
|    1 1 |   |    2 1 |
|    1 2 |   |    2 2 |
|    1 3 |   |    2 3 |
|    1 4 |   |    2 4 |
|    1 5 |   |    2 5 |
+--------+   +--------+

The key space is neatly split on the first key column - probably thanks 
to the new magic in the page split code.

Now, what do we have as the high key of page 1? Answer: "2 -inf". The 
"-inf" is not stored in the key itself, the second key column is just 
omitted, and the search code knows to treat it implicitly as a value 
that's lower than any real value. Hence, "minus infinity".

However, during page deletion, we need to perform a search to find the 
downlink pointing to a leaf page. We do that by using the leaf page's 
high key as the search key. But the search needs to treat it slightly 
differently in that case. Normally, searching with a single key value, 
"2", we would land on page 2, because any real value beginning with "2" 
would be on that page, but in the page deletion case, we want to find 
page 1. Setting the BTScanInsert.minusinfkey flag tells the search code 
to do that.

Question: Wouldn't it be more straightforward to use "1 +inf" as page 
1's high key? I.e treat any missing columns as positive infinity. That 
way, the search for page deletion wouldn't need to be treated 
differently. That's also how this used to work: all tuples on a page 
used to be <= high key, not strictly < high key. And it would also make 
the rightmost page less of a special case: you could pretend the 
rightmost page has a pivot tuple with all columns truncated away, which 
means positive infinity.

You have this comment _bt_split which touches the subject:

>     /*
>      * The "high key" for the new left page will be the first key that's going
>      * to go into the new right page, or possibly a truncated version if this
>      * is a leaf page split.  This might be either the existing data item at
>      * position firstright, or the incoming tuple.
>      *
>      * The high key for the left page is formed using the first item on the
>      * right page, which may seem to be contrary to Lehman & Yao's approach of
>      * using the left page's last item as its new high key when splitting on
>      * the leaf level.  It isn't, though: suffix truncation will leave the
>      * left page's high key fully equal to the last item on the left page when
>      * two tuples with equal key values (excluding heap TID) enclose the split
>      * point.  It isn't actually necessary for a new leaf high key to be equal
>      * to the last item on the left for the L&Y "subtree" invariant to hold.
>      * It's sufficient to make sure that the new leaf high key is strictly
>      * less than the first item on the right leaf page, and greater than or
>      * equal to (not necessarily equal to) the last item on the left leaf
>      * page.
>      *
>      * In other words, when suffix truncation isn't possible, L&Y's exact
>      * approach to leaf splits is taken.  (Actually, even that is slightly
>      * inaccurate.  A tuple with all the keys from firstright but the heap TID
>      * from lastleft will be used as the new high key, since the last left
>      * tuple could be physically larger despite being opclass-equal in respect
>      * of all attributes prior to the heap TID attribute.)
>      */

But it doesn't explain why it's done like that.

- Heikki


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Eisentraut
Дата:
Сообщение: Re: insensitive collations
Следующее
От: Shaoqi Bai
Дата:
Сообщение: Add tablespace tap test to pg_rewind