Re: Index Skip Scan

Поиск
Список
Период
Сортировка
От Floris Van Nee
Тема Re: Index Skip Scan
Дата
Msg-id 1564999539204.96859@Optiver.com
обсуждение исходный текст
Ответ на Re: Index Skip Scan  (Jesper Pedersen <jesper.pedersen@redhat.com>)
Ответы Re: Index Skip Scan  (Dmitry Dolgov <9erthalion6@gmail.com>)
Список pgsql-hackers

Thanks for the new patch. I've reviewed the skip scan patch, but haven't taken a close look at the uniquekeys patch yet.


In my previous review I mentioned that queries of the form:

select distinct on(a) a,b from a where a=1;

do not lead to a skip scan with the patch, even though the skip scan would be much faster. It's about this piece of code in planner.c


/* Consider index skip scan as well */
if (enable_indexskipscan &&
IsA(path, IndexPath) &&
((IndexPath *) path)->indexinfo->amcanskip &&
root->distinct_pathkeys != NIL)

The root->distinct_pathkeys is already filtered for redundant keys, so column 'a' is not in there anymore. Still, it'd be much faster to use the skip scan here, because a regular scan will scan all entries with a=1, even though we're really only interested in the first one. In previous versions, this would be fixed by changing the check in planner.c to use root->uniq_distinct_pathkeys instead of root->distinct_pathkeys, but things change a bit now that the patch is rebased on the unique-keys patch. Would it be valid to change this check to root->query_uniquekeys != NIL to consider skip scans also for this query?

Second is about the use of _bt_skip and _bt_read_closest in nbtsearch.c. I don't think _bt_read_closest is correctly implemented, and I'm not sure if it can be used at all, due to concerns by Tom and Peter about such approach. I had a similar idea to only partially read items from a page for another use case, for which I submitted a patch last Friday. However, both Tom and Peter find this idea quite scary [1]. You could take a look at my patch on that thread to see the approach taken to correctly partially read a page (well, correct as far as I can see so far...), but perhaps we need to just use the regular _bt_readpage function which reads everything, although this is unfortunate from a performance point of view, since most of the time we're indeed just interested in the first tuple.

Eg. it looks like there's some mixups between index offsets and heap tid's in _bt_read_closest:
/*
* Save the current item and the previous, even if the
* latter does not pass scan key conditions
*/
ItemPointerData tid = prevItup->t_tid;
OffsetNumber prevOffnum = ItemPointerGetOffsetNumber(&tid);

_bt_saveitem(so, itemIndex, prevOffnum, prevItup);
itemIndex++;

_bt_saveitem(so, itemIndex, offnum, itup);
itemIndex++;

The 'prevOffnum' is the offset number for the heap tid, and not the offset number for the index offset, so it looks like just a random item is saved. Furthermore, index offsets may change due to insertions and vacuums, so if we, at any point, release the lock, these offsets are not necessarily valid anymore. However, currently, the patch just reads the closest and then doesn't consider this page at all anymore, if the first tuple skipped to turns out to be not visible. Consider the following sql to illustrate:

create table a (a int, b int, c int);
insert into a (select vs, ks, 10 from generate_series(1,5) vs, generate_series(1, 10000) ks);
create index on a (a,b);
analyze a;
select distinct on (a) a,b from a order by a,b;

 a | b 
---+---
 1 | 1
 2 | 1
 3 | 1
 4 | 1
 5 | 1
(5 rows)

delete from a where a=2 and b=1;
DELETE 1

select distinct on (a) a,b from a order by a,b;

 a |  b  
---+-----
 1 |   1
 2 | 249           ->> this should be b=2, because we deleted a=2 && b=1. however, it doesn't consider any tuples from that page anymore and gives us the first tuple from the next page.
 3 |   1
 4 |   1
 5 |   1
(5 rows)


-Floris


[1] https://www.postgresql.org/message-id/flat/26641.1564778586%40sss.pgh.pa.us#dd8f23e1704f45447185894e1c2a4f2a

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Langote
Дата:
Сообщение: Re: partition routing layering in nodeModifyTable.c
Следующее
От: Amit Kapila
Дата:
Сообщение: Re: POC: Cleaning up orphaned files using undo logs