Re: Optimizer on sort aggregate

Поиск
Список
Период
Сортировка
От David Rowley
Тема Re: Optimizer on sort aggregate
Дата
Msg-id CAApHDvp3oQE+vg_SSvw84EmhYTxcXfKZ8Ytv02RS=SNx=hbBZQ@mail.gmail.com
обсуждение исходный текст
Ответ на Optimizer on sort aggregate  (Feng Tian <fengttt@gmail.com>)
Ответы Re: Optimizer on sort aggregate
Список pgsql-hackers
On Sat, Oct 18, 2014 at 5:10 AM, Feng Tian <fengttt@gmail.com> wrote:
Hi,

Consider the following queries.

create table t(i int, j int, k int, t text);
insert into t select i, i % 100, i % 1000, 'AABBCCDD' || i from generate_series(1, 1000000) i;

ftian=# explain select count(distinct j) from t group by t, i;
                               QUERY PLAN                              
------------------------------------------------------------------------
 GroupAggregate  (cost=158029.84..178029.84 rows=1000000 width=22)
   ->  Sort  (cost=158029.84..160529.84 rows=1000000 width=22)
         Sort Key: t, i
         ->  Seq Scan on t  (cost=0.00..17352.00 rows=1000000 width=22)
(4 rows)


The query,
select count(distinct j) from t group by t, i;

runs for 35 seconds.  However, if I change the query to,
select count(distinct j) from t group by i, t;  -- note switching the ordering
select count(distinct j) from t group by decode(t, 'escape'), i; -- convert t to bytea

Run times are just about 5 and 6.5 seconds.  The reason is clear, compare a string with collation is slow, which is well understood by pg hackers.   However, here, the sorting order is forced by the planner, not user.   Planner can do the following optimizations,

1. for the sort we generated for sort agg, planner can switch column ordering, put int before string,
2. for the sort we generated for sort agg, use bytea compare instead of string compare.

They will bring big improvement to this common query.   Is this something reasonable? 


It seems like it might be worth looking into, but I think it's likely more complex than sorting the group by order into narrowest column first.

If the query was:

select count(distinct j) from t group by t, i order by t, i;

Then if that was  rewritten to make the group by i,t then the planner would then need to perform an additional sort on t,i to get the correct order by for the final result, which may or may not be faster, it would depend on how many groups there were to sort. Though I guess you could possibly just skip the optimisation if the next node up didn't need any specific ordering.

I also wonder if taking into account the column's width is not quite enough. For example if the 'i' column only had 1 distinct value, then performing the group by on 'i' first wouldn't help at all. Keep in mind that the columns could be much closer in width than in your example, e.g int and bigint. Perhaps you'd need to include some sort of heuristics to look at the number of distinct values and the width, and form some sort of weighted estimates about which column to put first.

Regards

David Rowley

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tatsuo Ishii
Дата:
Сообщение: Re: Optimizer on sort aggregate
Следующее
От: David Rowley
Дата:
Сообщение: Re: Optimizer on sort aggregate