Re: Index Skip Scan

Поиск
Список
Период
Сортировка
От Floris Van Nee
Тема Re: Index Skip Scan
Дата
Msg-id 1561287884806.46718@Optiver.com
обсуждение исходный текст
Ответ на Re: Index Skip Scan  (Peter Geoghegan <pg@bowt.ie>)
Список pgsql-hackers


> Try the attached patch -- it has DEBUG1 traces with the contents of
> index tuples from key points during index scans, allowing you to see
what's going on both as a B-Tree is descended, and as a range scan is
performed. It also shows details of the insertion scankey that is set
up within _bt_first(). This hasn't been adopted to the patch at all,
so you'll probably need to do that.

Thanks! That works quite nicely.

I've pinpointed the problem to within _bt_skip. I'll try to illustrate with my test case. The data in the table is (a,b)=(1,1), (1,2) ... (1,10000), (2, 1), (2,2), ... (2,10000) until (5,10000).
Running the query
SELECT DISTINCT ON (a) a,b FROM a WHERE b=2;
The flow is like this:
_bt_first is called first - it sees there are no suitable scan keys to start at a custom location in the tree, so it just starts from the beginning and searches until it finds the first tuple (1,2).
After the first tuple was yielded, _bt_skip kicks in. It constructs an insert scan key with a=1 and nextkey=true, so doing the _bt_search + _bt_binsrch on this, it finds the first tuple larger than this: (2,1). This is not the tuple that it stops at though, because after that it does this:

if (ScanDirectionIsForward(dir))
/* Move back for _bt_next */
offnum = OffsetNumberPrev(offnum);
....
/* Now read the data */
    if (!_bt_readpage(scan, dir, offnum))
{
/*
* There's no actually-matching data on this page.  Try to advance to
* the next page.  Return false if there's no matching data at all.
*/
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
if (!_bt_steppage(scan, dir))

First, it takes the previous tuple with OffsetNumberPrev (so tuple before (2,1), which is (1,10000)). This is done, because if this tuple were to be returned, there would be a call to _bt_next afterwards, which would then conveniently be on the tuple (2,1) that we want. However, _bt_readpage messes things up, because it only reads tuples that match all the provided keys (so where b=2). The only tuple it'll return is (2,2). This will be the tuple that is set, however, on the call to _bt_next, the tuple is first incremented, so we'll find (2,3) there which doesn't match our keys. This leads it to skip (2,2) in our result set.

I was wondering about something else: don't we also have another problem with updating this current index tuple by skipping before calling btgettuple/_bt_next? I see there's some code in btgettuple to kill dead tuples when scan->kill_prior_tuple is true. I'm not too familiar with the concept of killing dead tuples while doing index scans, but by looking at the code it seems to be possible that btgettuple returns a tuple, caller processes it and sets kill_prior_tuple to true in order to have it killed. However, then the skip scan kicks in, which sets the current tuple to a completely different tuple. Then, on the next call of btgettuple, the wrong tuple gets killed. Is my reasoning correct here or am I missing something?

-Floris


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: Code comment change
Следующее
От: Vik Fearing
Дата:
Сообщение: Re: Code comment change