WIP: 2d-mapping based indexing for ranges

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема WIP: 2d-mapping based indexing for ranges
Дата
Msg-id CAPpHfdsCtD_5wfKW8AO4_jvroOt0eY_rqu0kAON-J_xEH25g7g@mail.gmail.com
обсуждение исходный текст
Ответы Re: WIP: 2d-mapping based indexing for ranges  (Jeff Davis <pgsql@j-davis.com>)
Список pgsql-hackers
Hackers,

attached patch implements another approach to indexing of ranges: mapping lower and upper bounds as 2d-coordinates and using same indexing approaches as for 2d points. Patch provides range_ops2 operator class which implements this approach.

I did some tests on real-life dataset (attached periods.sql.gz). This dataset contain date ranges from real-life data provided by Benedikt Grundmann. 

Index build is slightly slower.

test=# create index period_idx on period_test using gist (period);
CREATE INDEX
Time: 48493,241 ms

test=# create index period_idx2 on period_test2 using gist (period range_ops2);
CREATE INDEX
Time: 54351,251 ms

It seems to be inevitably, because more complicated calculations are required. For example, picksplit algorithm works along two axes instead of one axis.

"Overlaps" queries seems to be slightly faster.

test=# explain (analyze, buffers) select * from period_test where period && daterange('2011-08-10', '2011-08-11');
                                                            QUERY PLAN                              
                              
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test  (cost=471.65..13716.47 rows=11866 width=32) (actual time=107.260..147.541 rows=335190 loops=1)
   Recheck Cond: (period && '[2011-08-10,2011-08-11)'::daterange)
   Buffers: shared read=8563
   ->  Bitmap Index Scan on period_idx  (cost=0.00..468.68 rows=11866 width=0) (actual time=106.600..106.600 rows=335190 loops=1)
         Index Cond: (period && '[2011-08-10,2011-08-11)'::daterange)
         Buffers: shared read=3878
 Total runtime: 160.210 ms
(7 rows)

test=# explain (analyze, buffers) select * from period_test2 where period && daterange('2011-08-10', '2011-08-11');
                                                           QUERY PLAN                               
                             
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test2  (cost=553.17..16034.26 rows=11866 width=32) (actual time=76.386..122.193 rows=335190 loops=1)
   Recheck Cond: (period && '[2011-08-10,2011-08-11)'::daterange)
   Buffers: shared read=9837
   ->  Bitmap Index Scan on period_idx2  (cost=0.00..550.21 rows=11866 width=0) (actual time=75.448..75.448 rows=335190 loops=1)
         Index Cond: (period && '[2011-08-10,2011-08-11)'::daterange)
         Buffers: shared read=3495
 Total runtime: 134.942 ms
(7 rows)

However, difference is small, so more comprehensive testing required for conclusion.

"Contained by" queries with small selectivity are dramatically faster as expected.

test=# explain (analyze, buffers) select * from period_test where period <@ daterange('2011-08-10', '2011-08-11');
                                                         QUERY PLAN                                 
                        
----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test  (cost=102.08..6140.68 rows=2373 width=32) (actual time=88.899..89.
160 rows=619 loops=1)
   Recheck Cond: (period <@ '[2011-08-10,2011-08-11)'::daterange)
   Buffers: shared read=3957
   ->  Bitmap Index Scan on period_idx  (cost=0.00..101.49 rows=2373 width=0) (actual time=88.878..8
8.878 rows=619 loops=1)
         Index Cond: (period <@ '[2011-08-10,2011-08-11)'::daterange)
         Buffers: shared read=3878
 Total runtime: 89.246 ms
(7 rows)

test=# explain (analyze, buffers) select * from period_test2 where period <@ daterange('2011-08-10', '2011-08-11');
                                                        QUERY PLAN                                  
                       
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test2  (cost=119.60..6497.06 rows=2373 width=32) (actual time=3.696..4.7
09 rows=619 loops=1)
   Recheck Cond: (period <@ '[2011-08-10,2011-08-11)'::daterange)
   Buffers: shared read=165
   ->  Bitmap Index Scan on period_idx2  (cost=0.00..119.01 rows=2373 width=0) (actual time=3.644..3
.644 rows=619 loops=1)
         Index Cond: (period <@ '[2011-08-10,2011-08-11)'::daterange)
         Buffers: shared read=77
 Total runtime: 4.881 ms
(7 rows)

I failed to do "contains" queries with really low selectivity of this dataset, because of specific distribution. On "contains" quieries new indexing method is also faster.

test=# explain (analyze, buffers) select * from period_test where period @> daterange('2011-08-10', '2011-12-31');
                                                          QUERY PLAN                                
                           
-------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test  (cost=102.08..6140.68 rows=2373 width=32) (actual time=59.015..76.653 rows=118097 loops=1)
   Recheck Cond: (period @> '[2011-08-10,2011-12-31)'::daterange)
   Buffers: shared read=4533
   ->  Bitmap Index Scan on period_idx  (cost=0.00..101.49 rows=2373 width=0) (actual time=58.621..5
8.621 rows=118097 loops=1)
         Index Cond: (period @> '[2011-08-10,2011-12-31)'::daterange)
         Buffers: shared read=1799
 Total runtime: 81.238 ms
(7 rows)

test=# explain (analyze, buffers) select * from period_test2 where period @> daterange('2011-08-10', '2011-12-31');
                                                           QUERY PLAN                               
                            
--------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on period_test2  (cost=119.60..6497.06 rows=2373 width=32) (actual time=36.657..58
.656 rows=118097 loops=1)
   Recheck Cond: (period @> '[2011-08-10,2011-12-31)'::daterange)
   Buffers: shared read=5494
   ->  Bitmap Index Scan on period_idx2  (cost=0.00..119.01 rows=2373 width=0) (actual time=36.123..36.123 rows=118097 loops=1)
         Index Cond: (period @> '[2011-08-10,2011-12-31)'::daterange)
         Buffers: shared read=1450
 Total runtime: 63.400 ms
(7 rows)

Again superiority is not large enough for doing a conclusion by one dataset.

Now, let's talk about problems.

1) Handling infinities. While working on range_ops2 I liked to avoid quite bloated logic that I implemented in range_ops. I tried to handle infinities similar to just big values. However, this way appears to be much less effective. In this case insted of keeping infinities separately, index produce prolonged areas which are bad for performance. It seems that I have to implement similar key "classes" handling but even more complicated because of higher amount of "classes".

2) Limitations of "penalty" expressive power. GiST penalty function returns just one float value. It is a limitation even for geometrical datatypes. Some papers on R-trees recommends insert new entry into box which have minimal area when this boxes have same extension area (very possible if extension area is zero). We can't express it using single float value. 
For 2d-mapping ranges indexing situation is even harder.  Imagine, some set of ranges could have same lower or upper bound. On 2d space such ranges will be united into lines. But, we can't express increasing of line length in penalty, because area of line is always zero independently on it's length.  Actually, similar problem exists for geometrical datatypes, but it seems to me this problem is more urgent for ranges, because I believe in real life ranges, large set of ranges could really have same lower or upper bound.
Probably, we would like to rank key extension cases as following or similar:
a) extension of finite dimension to infinite
b) extension in dimension perpendicular to infinite
c) extension with finite area extension
d) extension with zero area increment (extension of line length)
So, it's impossible to fil all of these into one float. Ans it's hard to choose what to neglect and how.
I think there are at least two ways to make GiSTinferface sensitive enough to these things:
a) Make penalty function return more than one float. The question is how many floats we allow?
b) Create alternative "choose" interface function which would return set of most preferred keys for insert. Usage of such function could lead to slowdown because currently GiST stop scan when met zero penalty.

3) It appears that with this patch we have 3 distinct implementation of double sorting split algorithm. Seems that I have to generalize it with some interface.

------
With best regards,
Alexander Korotkov.
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Marti Raudsepp
Дата:
Сообщение: Re: Bogus nestloop rows estimate in 8.4.7
Следующее
От: Peter Eisentraut
Дата:
Сообщение: Re: libpq URL syntax vs SQLAlchemy