Statistics and selectivity estimation for ranges

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Statistics and selectivity estimation for ranges
Дата
Msg-id CAPpHfdtbdtM1_rhta+2cNdB4Pj7MH8EATQZHDPzK3s64qS61eg@mail.gmail.com
обсуждение исходный текст
Ответы Re: Statistics and selectivity estimation for ranges
Список pgsql-hackers
Hackers,

attached patch is for collecting statistics and selectivity estimation for ranges.

In order to make our estimations accurate for every distribution of ranges, we would collect 2d-distribution of lower and upper bounds of range into some kind of 2d-histogram. However, this patch use some simplification and assume distribution of lower bound and distribution of length to be independent. We can get distribution of lower bound from standard scalar statistics and thi patch additionally collect statistics for range length. This patch includes selectivity estimations for "&&", "@>" and "<@" operators on ranges. Linear interpolation is used in order to get more accurate results.

Some examples with test dataset where left bound of range is distributed by gaussian distribution and length of range is distributed by exponential distribution.

test=# CREATE TABLE range_test as (SELECT int4range(lb, lb + len) AS r FROM (SELECT (sqrt(-2*ln(random())) * sin(2*pi()*random()) * 1000000)::int as lb, (-10000*ln(1.0 - random()) + 1)::int as len FROM generate_series(1,1000000)) x);
SELECT 1000000

test=# ANALYZE range_test;
ANALYZE

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r && int4range(700000,710000);
                                                   QUERY PLAN                                       
             
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on range_test  (cost=0.00..17906.00 rows=7535 width=14) (actual time=0.119..403.494 rows=6138 loops=1)
   Filter: (r && '[700000,710000)'::int4range)
   Rows Removed by Filter: 993862
 Total runtime: 403.945 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r && int4range(200000,300000);
                                                    QUERY PLAN                                      
               
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on range_test  (cost=0.00..17906.00 rows=42427 width=14) (actual time=0.100..401.079 rows=42725 loops=1)
   Filter: (r && '[200000,300000)'::int4range)
   Rows Removed by Filter: 957275
 Total runtime: 403.055 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <@ int4range(100000,150000);
                                                    QUERY PLAN                                      
               
-------------------------------------------------------------------------------------------------------------------
 Seq Scan on range_test  (cost=0.00..17906.00 rows=15341 width=14) (actual time=0.129..382.114 rows=16014 loops=1)
   Filter: (r <@ '[100000,150000)'::int4range)
   Rows Removed by Filter: 983986
 Total runtime: 382.985 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r <@ int4range(600000,603000);
                                                  QUERY PLAN                                        
           
---------------------------------------------------------------------------------------------------------------
 Seq Scan on range_test  (cost=0.00..17906.00 rows=122 width=14) (actual time=1.527..383.511 rows=127 loops=1)
   Filter: (r <@ '[600000,603000)'::int4range)
   Rows Removed by Filter: 999873
 Total runtime: 383.586 ms
(4 rows)


test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r @> int4range(100000,100400);
                                                   QUERY PLAN                                       
             
-----------------------------------------------------------------------------------------------------------------
 Seq Scan on range_test  (cost=0.00..17906.00 rows=5166 width=14) (actual time=0.238..377.712 rows=3909 loops=1)
   Filter: (r @> '[100000,100400)'::int4range)
   Rows Removed by Filter: 996091
 Total runtime: 378.018 ms
(4 rows)

test=# EXPLAIN ANALYZE SELECT * FROM range_test WHERE r @> int4range(500000,530000);
                                                   QUERY PLAN                                       
            
----------------------------------------------------------------------------------------------------------------
 Seq Scan on range_test  (cost=0.00..17906.00 rows=342 width=14) (actual time=11.796..382.986 rows=171 loops=1)
   Filter: (r @> '[500000,530000)'::int4range)
   Rows Removed by Filter: 999829
 Total runtime: 383.066 ms
(4 rows)

------
With best regards,
Alexander Korotkov.
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: [WIP] Performance Improvement by reducing WAL for Update Operation
Следующее
От: Heikki Linnakangas
Дата:
Сообщение: Re: [WIP] Performance Improvement by reducing WAL for Update Operation