Re: Statistics and selectivity estimation for ranges

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: Statistics and selectivity estimation for ranges
Дата
Msg-id CAPpHfdsaGJ_u+1Bw0SsQsF4O=b90sMqQw0eDpbcQi7vTqwvw1g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Statistics and selectivity estimation for ranges  (Alexander Korotkov <aekorotkov@gmail.com>)
Ответы Re: Statistics and selectivity estimation for ranges
Re: Statistics and selectivity estimation for ranges
Список pgsql-hackers
For testing statistics accuracy I've used same datasets as for testing opclasses performance:
Script for testing and database schema is attached.
Dump with tests results can be downloaded here:

Following table shows statistics of accuracy when actual count of rows is somewhat large (>=10). Second column shows average ratio of estimate count of rows to actual count of rows. Third column shows average relative error of estimation. 

range_test=# select operator, avg(estimate_count::float8/actual_count::float8) as avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as avg_error from datasets d join test_results tr on tr.test_id = d.id where d.stat_target = 100 and actual_count >= 10 group by operator; 
 operator |    avg_ratio     |     avg_error     
----------+------------------+-------------------
 <@       | 1.27166784340153 | 0.498570654434906
 @>       | 1.35965412121763 | 0.384991198200582
 &&       | 1.08236985243139 | 0.105298599354035
(3 rows)

When result set is small (1-9 rows) then errors are more significant.

range_test=# select operator, avg(estimate_count::float8/actual_count::float8) as avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as avg_error from datasets d join test_results tr on tr.test_id = d.id where d.stat_target = 100 and actual_count between 1 and 9 group by operator;
 operator |    avg_ratio     |    avg_error     
----------+------------------+------------------
 <@       | 3.51371646596783 | 2.85624536756285
 @>       | 3.85482923324034 | 2.91433432363562
 &&       | 3.14281204906205 | 2.28899260461761
(3 rows)

Following table presents average estimate count of rows when actual count of rows is 0. This value is quite high for && operator, but it comes from only one tests, so it's not really representative.

range_test=# select operator, avg(estimate_count) as avg_estimate, count(*) as tests_count from datasets d join test_results tr on tr.test_id = d.id where d.stat_target = 100 and actual_count = 0 group by operator;
 operator |    avg_estimate     | tests_count 
----------+---------------------+-------------
 <@       |  1.1259887005649718 |        1770
 @>       |  1.0598670878194025 |       88329
 &&       | 28.0000000000000000 |           1
(3 rows)

Same tables for statistics target = 1000.

range_test=# select operator, avg(estimate_count::float8/actual_count::float8) as avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as avg_error from datasets d join test_results tr on tr.test_id = d.id where d.stat_target = 1000 and actual_count >= 10 group by operator;
 operator |    avg_ratio     |     avg_error      
----------+------------------+--------------------
 <@       | 1.17132962269887 |  0.394427785424827
 @>       | 1.35677772347908 |  0.376171286348914
 &&       | 1.06762781136499 | 0.0874012522386387
(3 rows)

range_test=# select operator, avg(estimate_count::float8/actual_count::float8) as avg_ratio, avg(exp(abs(ln(estimate_count::float8/actual_count::float8)))) - 1.0 as avg_error from datasets d join test_results tr on tr.test_id = d.id where d.stat_target = 1000 and actual_count between 1 and 9 group by operator;
 operator |    avg_ratio     |    avg_error     
----------+------------------+------------------
 <@       | 3.30836881177966 | 2.64459517657192
 @>       | 3.47535917820028 | 2.55199556747496
 &&       | 2.49181718664477 | 1.49181718664477
(3 rows)

range_test=# select operator, avg(estimate_count) as avg_estimate, count(*) as tests_count from datasets d join test_results tr on tr.test_id = d.id where d.stat_target = 1000 and actual_count = 0 group by operator;
 operator |    avg_estimate    | tests_count 
----------+--------------------+-------------
 <@       | 1.1650879566982409 |         739
 @>       | 1.0511811463771843 |       89447
(2 rows)

My conclusion is so, that current errors are probably ok for selectivity estimation. But taking into attention that generated datasets ideally fits assumptions of estimation, there could be room for improvement. Especially, it's unclear why estimate for "<@" and "@>" have much greater error than estimate for "&&". Possibly, it's caused by some bugs.

------
With best regards,
Alexander Korotkov.
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: -Wformat-zero-length
Следующее
От: Robert Haas
Дата:
Сообщение: Re: [PATCH] Make "psql -1 < file.sql" work as with "-f"