Re: Index Skip Scan
От | Jesper Pedersen |
---|---|
Тема | Re: Index Skip Scan |
Дата | |
Msg-id | 9b36ad5c-53c2-47cc-2cc7-10588d24cb47@redhat.com обсуждение исходный текст |
Ответ на | Re: Index Skip Scan (David Rowley <david.rowley@2ndquadrant.com>) |
Список | pgsql-hackers |
Hi David, On 6/14/19 3:19 AM, David Rowley wrote: > I read over this thread a few weeks ago while travelling back from > PGCon. (I wish I'd read it on the outward trip instead since it would > have been good to talk about it in person.) > > First off. I think this is a pretty great feature. It certainly seems > worthwhile working on it. > > I've looked over the patch just to get a feel for how the planner part > works and I have a few ideas to share. > > The code in create_distinct_paths() I think should work a different > way. I think it would be much better to add a new field to Path and > allow a path to know what keys it is distinct for. This sort of goes > back to an idea I thought about when developing unique joins > (9c7f5229ad) about an easier way to detect fields that a relation is > unique for. I've been calling these "UniqueKeys" in a few emails [1]. > The idea was to tag these onto RelOptInfo to mention which columns or > exprs a relation is unique by so that we didn't continuously need to > look at unique indexes in all the places that call > relation_has_unique_index_for(). The idea there was that unique joins > would know when a join was unable to duplicate rows. If the outer side > of a join didn't duplicate the inner side, then the join RelOptInfo > could keep the UniqueKeys from the inner rel, and vice-versa. If both > didn't duplicate then the join rel would obtain the UniqueKeys from > both sides of the join. The idea here is that this would be a better > way to detect unique joins, and also when it came to the grouping > planner we'd also know if the distinct or group by should be a no-op. > DISTINCT could be skipped, and GROUP BY could do a group aggregate > without any sort. > > I think these UniqueKeys ties into this work, perhaps not adding > UniqueKeys to RelOptInfo, but just to Path so that we create paths > that have UniqueKeys during create_index_paths() based on some > uniquekeys that are stored in PlannerInfo, similar to how we create > index paths in build_index_paths() by checking if the index > has_useful_pathkeys(). Doing it this way would open up more > opportunities to use skip scans. For example, semi-joins and > anti-joins could make use of them if the uniquekeys covered the entire > join condition. With this idea, the code you've added in > create_distinct_paths() can just search for the cheapest path that has > the correct uniquekeys, or if none exist then just do the normal > sort->unique or hash agg. I'm not entirely certain how we'd instruct > a semi/anti joined relation to build such paths, but that seems like a > problem that could be dealt with when someone does the work to allow > skip scans to be used for those. > > Also, I'm not entirely sure that these UniqueKeys should make use of > PathKey since there's no real need to know about pk_opfamily, > pk_strategy, pk_nulls_first as those all just describe how the keys > are ordered. We just need to know if they're distinct or not. All > that's left after removing those fields is pk_eclass, so could > UniqueKeys just be a list of EquivalenceClass? or perhaps even a > Bitmapset with indexes into PlannerInfo->ec_classes (that might be > premature for not since we've not yet got > https://commitfest.postgresql.org/23/1984/ or > https://commitfest.postgresql.org/23/2019/ ) However, if we did use > PathKey, that does allow us to quickly check if the UniqueKeys are > contained within the PathKeys, since pathkeys are canonical which > allows us just to compare their memory address to know if two are > equal, however, if you're storing eclasses we could probably get the > same just by comparing the address of the eclass to the pathkey's > pk_eclass. > > Otherwise, I think how you're making use of paths in > create_distinct_paths() and create_skipscan_unique_path() kind of > contradicts how they're meant to be used. > Thank you very much for this feedback ! Will need to revise the patch based on this. > I also agree with James that this should not be limited to Index Only > Scans. From testing the patch, the following seems pretty strange to > me: > > # create table abc (a int, b int, c int); > CREATE TABLE > # insert into abc select a,b,1 from generate_Series(1,1000) a, > generate_Series(1,1000) b; > INSERT 0 1000000 > # create index on abc(a,b); > CREATE INDEX > # explain analyze select distinct on (a) a,b from abc order by a,b; -- > this is fast. > QUERY PLAN > ----------------------------------------------------------------------------------------------------------------------------- > Index Only Scan using abc_a_b_idx on abc (cost=0.42..85.00 rows=200 > width=8) (actual time=0.260..20.518 rows=1000 loops=1) > Scan mode: Skip scan > Heap Fetches: 1000 > Planning Time: 5.616 ms > Execution Time: 21.791 ms > (5 rows) > > > # explain analyze select distinct on (a) a,b,c from abc order by a,b; > -- Add one more column and it's slow. > QUERY PLAN > ------------------------------------------------------------------------------------------------------------------------------------------ > Unique (cost=0.42..50104.43 rows=200 width=12) (actual > time=1.201..555.280 rows=1000 loops=1) > -> Index Scan using abc_a_b_idx on abc (cost=0.42..47604.43 > rows=1000000 width=12) (actual time=1.197..447.683 rows=1000000 > loops=1) > Planning Time: 0.102 ms > Execution Time: 555.407 ms > (4 rows) > > [1] https://www.postgresql.org/search/?m=1&q=uniquekeys&l=1&d=-1&s=r > Ok, understood. I have put the CF entry into "Waiting on Author". Best regards, Jesper
В списке pgsql-hackers по дате отправления: