Обсуждение: Q: will GROUP BY make use of an index to return tuples early?
Hi pgsql optimizer gurus, If I have a table CREATE TABLE Foo( id OID, time TIMESTAMP, value INTEGER ); CREATE INDEX Foo_id_idx ON Foo(id); and I have a query for SELECT id, MIN(foo.time) FROM Foo foo GROUP BY foo.id; can one tell the query executor to do an index scan on Foo_id_idx such as to be sure the Foo tuples are being considered ordered by foo.id in order to produce tuples before having worked through the whole table? If we don't use that assumption of the index order and would instead to a full table scan, the select could not return anything until the full table scan is completed. I am asking because if I wanted to stream the tuples of the first query into anothe system to do a distributed semijoin, I would like data to flow at all times while queries are still being executed. If the answer is yes, there is a way, then how about if we do this: CREATE INDEX Foo_id_time_idx ON Foo(id, time); now, considering the same query, it could be executed even faster, because we could do an index scan on Foo_id_time_idx and only need to consider the first data tuple of every Foo.id group (because the ordering of Foo_id_time_idx guarantees that the MIN(time) is in the first tuple. thank you, -Gunther -- Gunther Schadow, M.D., Ph.D. gschadow@regenstrief.org Medical Information Scientist Regenstrief Institute for Health Care Adjunct Assistant Professor Indiana University School of Medicine tel:1(317)630-7960 http://aurora.regenstrief.org
Gunther Schadow <gunther@aurora.regenstrief.org> writes: > SELECT id, MIN(foo.time) > FROM Foo foo > GROUP BY foo.id; > can one tell the query executor to do an index scan on Foo_id_idx That is a considered plan type. For example, using the regression test database: regression=# set enable_seqscan TO 0; SET regression=# explain analyze select hundred,min(unique1) from tenk1 group by hundred; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------Aggregate (cost=0.00..1121.89 rows=1000 width=8) (actual time=19.12..1619.71 rows=100 loops=1) -> Group (cost=0.00..1096.89 rows=10000width=8) (actual time=0.56..1484.14 rows=10000 loops=1) -> Index Scan using tenk1_hundred on tenk1 (cost=0.00..1071.89rows=10000 width=8) (actual time=0.53..1194.64 rows=10000 loops=1)Total runtime: 1621.69 msec (4 rows) regression=# set enable_seqscan TO 1; SET regression=# explain analyze select hundred,min(unique1) from tenk1 group by hundred; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------Aggregate (cost=997.39..1072.39 rows=1000 width=8) (actual time=1162.52..1677.46 rows=100 loops=1) -> Group (cost=997.39..1047.39rows=10000 width=8) (actual time=1157.33..1565.75 rows=10000 loops=1) -> Sort (cost=997.39..1022.39rows=10000 width=8) (actual time=1157.29..1298.10 rows=10000 loops=1) Sort Key: hundred -> Seq Scan on tenk1 (cost=0.00..333.00 rows=10000 width=8) (actual time=0.16..277.69 rows=10000 loops=1)Totalruntime: 1695.83 msec (6 rows) The sort-based plan usually comes out ahead in estimated total cost, but if you were using a cursor to read it then I believe the indexscan plan would get chosen due to its much lower startup cost. > If the answer is yes, there is a way, then how about if we do this: > CREATE INDEX Foo_id_time_idx ON Foo(id, time); > now, considering the same query, it could be executed even faster, > because we could do an index scan on Foo_id_time_idx and only need > to consider the first data tuple of every Foo.id group (because > the ordering of Foo_id_time_idx guarantees that the MIN(time) is > in the first tuple. This won't happen because (a) the optimizer treats aggregates as black boxes; there is no notion that some aggregates have a connection to the sort ordering of some indexes; (b) the indexscan would still have to pass over the other time values in each group. regards, tom lane