Query planner suggestion, for indexes with similar but not exact ordering.

Поиск
Список
Период
Сортировка
От Andrew Barnham
Тема Query planner suggestion, for indexes with similar but not exact ordering.
Дата
Msg-id CAC9_YpafQD5kEsG8XXyVK2iDd0aKgUF6m-DpN7yPMAhpZm3=7w@mail.gmail.com
обсуждение исходный текст
Ответы Re: Query planner suggestion, for indexes with similar but not exact ordering.
Список pgsql-performance
Hi all.  Been using postgres for years, and lurking on this newsgroup for a short while now to help me gain the benefit of your expertise and experience and learn how to get most out of postgresql possible.

I do a fair bit of work on tables using composite keys.  I have discovered a couple of things lately that may or may not be worth looking at in terms of query planner.

Consider my archetypal table. Based on real data. 

table master
    franchise  smallint,
    partnumber varchar

It is a spare parts list, combining part numbers from multiple suppliers (franchise). Part numbers are typically unique but sometimes there are duplicates.  I have use cases which concern both finding parts by a specific franchise or finding parts system wide. In my table I have follow stats:
* Number of records : 2,343,569
* Number of unique partnumber records : 2,130,379  (i.e. for a given partnumber there is on average, 1.1 records. i.e. a partnumber is used by 1.1 suppliers. The partnumber with the most number of records = 8 records.
* Number of unique suppliers : 35

Now consider following query: its purpose is to render next 20 rows at an aribtrary position. The position being after record matching franchise=10, partnumber='1' in partnumber then franchise order.

select * from master where partnum>='1' and (partnum>'1' or franchise>10) order by partnum,franchise limit 20;

Now if I have a composite index on partnum + franchise. This query performs the way you would expect and very quickly.

But if I have an index on partnum only the system seqscan's master. And yields poor performance. i.e.:
============
 Limit  (cost=143060.23..143060.28 rows=20 width=93) (actual time=2307.986..2307.998 rows=20 loops=1)
   ->  Sort  (cost=143060.23..148570.14 rows=2203967 width=93) (actual time=2307.982..2307.986 rows=20 loops=1)
         Sort Key: partnum, franchise
         Sort Method:  top-N heapsort  Memory: 19kB
         ->  Seq Scan on master  (cost=0.00..84413.46 rows=2203967 width=93) (actual time=0.019..1457.001 rows=2226792 loops=1)
               Filter: (((partnum)::text >= '1'::text) AND (((partnum)::text > '1'::text) OR (franchise > 10)))
 Total runtime: 2308.118 ms


I wonder, if it is possible and worthwhile, to setup the query planner to recognize that because of the stats I indicate above, that a sort by partnum is almost exactly the same as a sort by partnum+franchise.  And doing a Index scan on partnum index, and sorting results in memory will be dramatically faster.  The sort buffer only needs to be very small, will only grow to 8 records only at most in my above example.  The buffer will scan partnum index, and as long as partnum is the same, it will sort that small segment, as soon as the partnum increments when walking the index, the buffer zeros out again for next sort group.

Artificially simulating this in SQL (only works with foreknowledge of max count of records for a given part. i.e. +8 ) shows the dramatic theoretical performance gain over the above.
explain analyze select * from (select * from master where partnum>='1' order by partnum limit 20+8) x where partnum>'1' or franchise>10 order by partnum,franchise limit 20;
=====
 Limit  (cost=77.71..77.75 rows=16 width=230) (actual time=0.511..0.555 rows=20 loops=1)
   ->  Sort  (cost=77.71..77.75 rows=16 width=230) (actual time=0.507..0.524 rows=20 loops=1)
         Sort Key: x.partnum, x.franchise
         Sort Method:  quicksort  Memory: 21kB
         ->  Subquery Scan x  (cost=0.00..77.39 rows=16 width=230) (actual time=0.195..0.367 rows=28 loops=1)
               Filter: (((x.partnum)::text > '1'::text) OR (x.franchise > 10))
               ->  Limit  (cost=0.00..76.97 rows=28 width=93) (actual time=0.180..0.282 rows=28 loops=1)
                     ->  Index Scan using master_searchpartkey on master  (cost=0.00..6134000.35 rows=2231481 width=93) (actual time=0.178..0.240 rows=28 loops=1)
                           Index Cond: ((partnum)::text >= '1'::text)
 Total runtime: 0.695 ms

Of course I could just make sure I create indexes with match my order by fields perfectly; which is exactly what I am doing right now.  But I thought that maybe it might be worth while considering looking at allowing some sort of in memory sort to be overlaid on an index if the statistics indicate that the sorts are very nearly ordered.

Andrew

В списке pgsql-performance по дате отправления:

Предыдущее
От: Cody Caughlan
Дата:
Сообщение: Re: Slow queries / commits, mis-configuration or hardware issues?
Следующее
От: "Tomas Vondra"
Дата:
Сообщение: Re: Slow queries / commits, mis-configuration or hardware issues?