Re: Index Skip Scan

Поиск
Список
Период
Сортировка
От Peter Geoghegan
Тема Re: Index Skip Scan
Дата
Msg-id CAH2-Wz=4UfUU6hTFmb8HqotvCszdFzNCFMkx4=38JUJyLs-mqg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Index Skip Scan  (Peter Geoghegan <pg@bowt.ie>)
Список pgsql-hackers
On Wed, Jan 22, 2020 at 10:55 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 21, 2020 at 11:50 PM Floris Van Nee
> <florisvannee@optiver.com> wrote:
> > As far as I know, a page split could happen at any random element in the page. One of the situations we could be
leftwith is:
 
> > Leaf page 1 = [2,4]
> > Leaf page 2 = [5,6,8]
> > However, our scan is still pointing to leaf page 1. For non-skip scans this is not a problem, as we already read
allmatching elements in our local buffer and we'll return those. But the skip scan currently:
 
> > a) checks the lo-key of the page to see if the next prefix can be found on the leaf page 1
> > b) finds out that this is actually true
> > c) does a search on the page and returns value=4 (while it should have returned value=6)
> >
> > Peter, is my understanding about the btree internals correct so far?
>
> This is a good summary. This is the kind of scenario I had in mind
> when I expressed a general concern about "stopping between pages".
> Processing a whole page at a time is a crucial part of how
> _bt_readpage() currently deals with concurrent page splits.

I want to be clear about what it means that the page doesn't have a
"low key". Let us once again start with a very simple one leaf-page
btree containing four elements: leaf page 1 = [2,4,6,8] -- just like
in Floris' original page split scenario.

Let us also say that page 1 has a left sibling page -- page 0. Page 0
happens to have a high key with the integer value 0. So you could
*kind of* claim that the "low key" of page 1 is the integer value 0
(page 1 values must be > 0) -- *not* the integer value 2 (the
so-called "low key" here is neither > 2, nor >= 2). More formally, an
invariant exists that says that all values on page 1 must be
*strictly* greater than the integer value 0. However, this formal
invariant thing is hard or impossible to rely on when we actually
reach page 1 and want to know about its lower bound -- since there is
no "low key" pivot tuple on page 1 (we can only speak of a "low key"
as an abstract concept, or something that works transitively from the
parent -- there is only a physical high key pivot tuple on page 1
itself).

Suppose further that Page 0 is now empty, apart from its "all values
on page are <= 0" high key (page 0 must have had a few negative
integer values in its tuples at some point, but not anymore). VACUUM
will delete the page, *changing the effective low key* of Page 0 in
the process. The lower bound from the shared parent page will move
lower/left as a consequence of the deletion of page 0. nbtree page
deletion makes the "keyspace move right, not left". So the "conceptual
low key" of page 1 just went down from 0 to -5 (say), without there
being any practical way of a skip scan reading page 1 noticing the
change (the left sibling of page 0, page -1, has a high key of <= -5,
say).

Not only is it possible for somebody to insert the value 1 in page 1
-- now they can insert the value -3 or -4!

More concretely, the pivot tuple in the parent that originally pointed
to page 0 is still there -- all that page deletion changed about this
tuple is its downlink, which now points to page 1 instead or page 0.
Confusingly, page deletion removes the pivot tuple of the right
sibling page from the parent -- *not* the pivot tuple of the empty
page that gets deleted (in this case page 0) itself.

Note: this example ignores things like negative infinity values in
truncated pivot tuples, and the heap TID tiebreaker column -- in
reality this would look a bit different because of those factors.

See also: amcheck's bt_right_page_check_scankey() function, which has
a huge comment that reasons about a race involving page deletion. In
general, page deletion is by far the biggest source of complexity when
reasoning about the key space.
-- 
Peter Geoghegan



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Gierth
Дата:
Сообщение: Re: A rather hackish POC for alternative implementation of WITH TIES
Следующее
От: Alexander Korotkov
Дата:
Сообщение: Re: Improve search for missing parent downlinks in amcheck