Case for Improved Costing for Index Only Scans?

Поиск
Список
Период
Сортировка
От Hamid Akhtar
Тема Case for Improved Costing for Index Only Scans?
Дата
Msg-id CANugjhsnh0OBMOYc7qKcC+ZsVvAXCeF7QiidLuFvg6zmHy1C7A@mail.gmail.com
обсуждение исходный текст
Список pgsql-hackers
While running a simple SELECT statement on a table where I expected indexonly scan to be preferred over sequential scan, the planner kept on selecting the sequential scan. Looking at the EXPLAIN output, it seemed obvious that the cost of indexonly was exceeding that of sequential scan. So the planner was right, however, my expectation was that indexonly should have been selected instead.

Following is an example table that I’ll be referring to.

-- Creating a very basic table, index and running vacuum analyze
CREATE TABLE index_test AS (SELECT GENERATE_SERIES(1, 10000000)::int id, 'hello world' AS name);
CREATE INDEX on index_test;
VACUUM ANALYZE index_test;

-- SELECT statement
SELECT id FROM index_test WHERE id < [SOME VALUE];

So as a starting point, I wanted to identify when the cost for indexonly exceeds that of a sequential scan. In my case, the number turned out to be 6,243,354 with the table containing 10,000,000 tuples.

When the cost exceeds, the planner should select the more optimum path. However, my concern was why is the indexonly scan cost greater than that of sequential one. Turning on the timing, the expectation was that at the tipping point, indexonly execution time should be greater than that of sequential one. However, I see that indexonly scan’s execution was much much faster than the sequential scan. In terms of timing, this is what I expected. So I turned on the timing in psql and ran EXPLAIN ANALYZE. Following are the outputs.

-- SEQUENTIAL SCAN
EXPLAIN ANALYZE SELECT id FROM index_test WHERE id < 6243354;

                                                      QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------
 Seq Scan on index_test  (cost=0.00..179053.25 rows=6227030 width=4) (actual time=0.175..993.291 rows=6243353 loops=1)
   Filter: (id < 6243354)
   Rows Removed by Filter: 3756647
 Planning Time: 0.235 ms
 Execution Time: 1149.590 ms
(5 rows)

Time: 1150.798 ms (00:01.151)
postgres=# explain analyze select id from index_test where id < 6243353;


-- INDEXONLY SCAN
EXPLAIN ANALYZE SELECT id FROM index_test WHERE id < 6243353;

                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using index_test_id_idx on index_test  (cost=0.43..177277.44 rows=6227029 width=4) (actual time=0.063..718.390 rows=6243352 loops=1)
   Index Cond: (id < 6243353)
   Heap Fetches: 0
 Planning Time: 0.174 ms
 Execution Time: 873.070 ms
(5 rows)

Time: 874.079 ms

Given that this is a very well packed table, still you can see that the execution time increases from 718ms for indexonly scan to 993ms only with an increase of a single tuple just because the cost of indexonly scan goes beyond that of sequential scan.

I’ve tried this in a docker, my laptop and on a server without virtualization. All have shown similar gains.

Reviewing the costestimate function in cost_size.c, I see that the cost of indexonly differs for min_IO_cost and max_IO_cost. The indexTotalCost cost remains the same whether it’s an indexonly scan or not. I believe that total index cost in case of indexonly scan should also be updated.

I don't think there is a need to modify all amcostestimate functions for indexes, however, perhaps we can pull in another factor that helps make cost for indexonly more realistic.

ISTM, it should be reduced by a factor somewhere around (index size)/(relation size), even perhaps putting together expected index size, actual index size and relation size in some equation.

--
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
ADDR: 10318 WHALLEY BLVD, Surrey, BC
CELL:+923335449950  EMAIL: mailto:hamid.akhtar@highgo.ca
SKYPE: engineeredvirus

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Ashutosh Bapat
Дата:
Сообщение: Re: Ideas about a better API for postgres_fdw remote estimates
Следующее
От: Bharath Rupireddy
Дата:
Сообщение: Re: Issue with cancel_before_shmem_exit while searching to remove a particular registered exit callbacks