Обсуждение: Q: will GROUP BY make use of an index to return tuples early?

Поиск
Список
Период
Сортировка

Q: will GROUP BY make use of an index to return tuples early?

От
Gunther Schadow
Дата:
Hi pgsql optimizer gurus,

If I have a table

CREATE TABLE Foo(  id    OID,  time  TIMESTAMP,  value INTEGER
);

CREATE INDEX Foo_id_idx ON Foo(id);

and I have a query for

SELECT id, MIN(foo.time)  FROM Foo foo GROUP BY foo.id;

can one tell the query executor to do an index scan on Foo_id_idx
such as to be sure the Foo tuples are being considered ordered by
foo.id in order to produce tuples before having worked through the
whole table? If we don't use that assumption of the index order and
would instead to a full table scan, the select could not return
anything until the full table scan is completed.

I am asking because if I wanted to stream the tuples of the first
query into anothe system to do a distributed semijoin, I would
like data to flow at all times while queries are still being executed.

If the answer is yes, there is a way, then how about if we do this:

CREATE INDEX Foo_id_time_idx ON Foo(id, time);

now, considering the same query, it could be executed even faster,
because we could do an index scan on Foo_id_time_idx and only need
to consider the first data tuple of every Foo.id group (because
the ordering of Foo_id_time_idx guarantees that the MIN(time) is
in the first tuple.

thank you,
-Gunther


-- 
Gunther Schadow, M.D., Ph.D.                    gschadow@regenstrief.org
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org




Re: Q: will GROUP BY make use of an index to return tuples early?

От
Tom Lane
Дата:
Gunther Schadow <gunther@aurora.regenstrief.org> writes:
> SELECT id, MIN(foo.time)
>    FROM Foo foo
>   GROUP BY foo.id;
> can one tell the query executor to do an index scan on Foo_id_idx

That is a considered plan type.  For example, using the regression
test database:

regression=# set enable_seqscan TO 0;
SET
regression=# explain analyze select hundred,min(unique1) from tenk1 group by hundred;
                            QUERY PLAN                                                                  
 

---------------------------------------------------------------------------------------------------------------------------------------------Aggregate
(cost=0.00..1121.89 rows=1000 width=8) (actual time=19.12..1619.71 rows=100 loops=1)  ->  Group  (cost=0.00..1096.89
rows=10000width=8) (actual time=0.56..1484.14 rows=10000 loops=1)        ->  Index Scan using tenk1_hundred on tenk1
(cost=0.00..1071.89rows=10000 width=8) (actual time=0.53..1194.64 rows=10000 loops=1)Total runtime: 1621.69 msec
 
(4 rows)
regression=# set enable_seqscan TO 1;
SET
regression=# explain analyze select hundred,min(unique1) from tenk1 group by hundred;
                   QUERY PLAN                                                         
 

---------------------------------------------------------------------------------------------------------------------------Aggregate
(cost=997.39..1072.39 rows=1000 width=8) (actual time=1162.52..1677.46 rows=100 loops=1)  ->  Group
(cost=997.39..1047.39rows=10000 width=8) (actual time=1157.33..1565.75 rows=10000 loops=1)        ->  Sort
(cost=997.39..1022.39rows=10000 width=8) (actual time=1157.29..1298.10 rows=10000 loops=1)              Sort Key:
hundred             ->  Seq Scan on tenk1  (cost=0.00..333.00 rows=10000 width=8) (actual time=0.16..277.69 rows=10000
loops=1)Totalruntime: 1695.83 msec
 
(6 rows)

The sort-based plan usually comes out ahead in estimated total cost, but
if you were using a cursor to read it then I believe the indexscan plan
would get chosen due to its much lower startup cost.

> If the answer is yes, there is a way, then how about if we do this:
> CREATE INDEX Foo_id_time_idx ON Foo(id, time);
> now, considering the same query, it could be executed even faster,
> because we could do an index scan on Foo_id_time_idx and only need
> to consider the first data tuple of every Foo.id group (because
> the ordering of Foo_id_time_idx guarantees that the MIN(time) is
> in the first tuple.

This won't happen because (a) the optimizer treats aggregates as black
boxes; there is no notion that some aggregates have a connection to the
sort ordering of some indexes; (b) the indexscan would still have to
pass over the other time values in each group.
        regards, tom lane