Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From: Andy Fan
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Date:
Msg-id: CAKU4AWoymsjbm5KDYbsko13GUfG57pX1QyC3Y8sDHyrfoQeyQQ@mail.gmail.com
In response to: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey  (David Rowley <dgrowleyml@gmail.com>)
Responses: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
List: pgsql-hackers


On Wed, Apr 15, 2020 at 11:00 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 15 Apr 2020 at 12:19, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>> 2. I think 0002 is overly restrictive in its demands that
>> parse->hasAggs must be false. We should be able to just use a Group
>> Aggregate with unsorted input when the input_rel is unique on the
>> GROUP BY clause.  This will save on hashing and sorting.  Basically
>> similar to what we do for when a query contains aggregates without any
>> GROUP BY.
>>
>
> Yes, this would be a perfect result.  The difficulty is that the current
> aggregate function execution is highly coupled with the Agg node
> (ExecInitAgg), which is removed in the unique case.

This case here would be slightly different. It would be handled by
still creating a Group Aggregate path, but without considering Hash
Aggregate and without sorting the input to the Group Aggregate path.
Perhaps that's best done by creating a new flag bit and using it in
create_grouping_paths() in the location where we set the flags
variable.  If you determine that the input_rel is unique for the GROUP
BY clause, and that there are aggregate functions, then set a flag,
e.g. GROUPING_INPUT_UNIQUE. Likely there will be a few other flags
that you can skip setting in that function; for example, there's no
need to check whether the input can be sorted, so no need for
GROUPING_CAN_USE_SORT, since you won't need to sort, and likewise for
GROUPING_CAN_USE_HASH. I'd say there's also no need to check whether
we can set GROUPING_CAN_PARTIAL_AGG (what would be the point in doing
partial aggregation when there's 1 row per group?).  Then down in
add_paths_to_grouping_rel(), just add a special case before any other
code, such as:

if ((extra->flags & GROUPING_INPUT_UNIQUE) != 0 && parse->groupClause != NIL)
{
    Path       *path = input_rel->cheapest_total_path;

    add_path(grouped_rel, (Path *)
             create_agg_path(root,
                             grouped_rel,
                             path,
                             grouped_rel->reltarget,
                             AGG_SORTED,
                             AGGSPLIT_SIMPLE,
                             parse->groupClause,
                             havingQual,
                             agg_costs,
                             dNumGroups));
    return;
}
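
For the flag-setting step described above, here is a minimal sketch of
what the logic in create_grouping_paths() might look like.  It is only
an illustration of the suggestion: GROUPING_INPUT_UNIQUE is the proposed
new flag, and relation_has_uniquekeys_for() is assumed to be the patch's
uniqueness test.

/*
 * Hypothetical sketch, inside create_grouping_paths() where the flags
 * variable is populated.
 */
if (parse->groupClause != NIL &&
    relation_has_uniquekeys_for(root, input_rel,
                                get_sortgrouplist_exprs(parse->groupClause,
                                                        parse->targetList)))
{
    /* One input row per group: no sort, no hash, no partial agg needed. */
    flags |= GROUPING_INPUT_UNIQUE;
}
else
{
    /* ... existing logic setting GROUPING_CAN_USE_SORT etc. ... */
}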

You may also want to consider the cheapest startup path there too so
that the LIMIT processing can do something smarter later in planning
(assuming cheapest_total_path != cheapest_startup_path (which you'd
need to check for)).
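
In code form, that extra step could be sketched (under the same
assumptions as the snippet above) as:

if (input_rel->cheapest_startup_path != NULL &&
    input_rel->cheapest_startup_path != input_rel->cheapest_total_path)
    add_path(grouped_rel, (Path *)
             create_agg_path(root,
                             grouped_rel,
                             input_rel->cheapest_startup_path,
                             grouped_rel->reltarget,
                             AGG_SORTED,
                             AGGSPLIT_SIMPLE,
                             parse->groupClause,
                             havingQual,
                             agg_costs,
                             dNumGroups));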

Perhaps it would be better to only set the GROUPING_INPUT_UNIQUE if
there is a groupClause, then just Assert(parse->groupClause != NIL)
inside that if.
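
That variant would reduce the special case to a sketch like:

if ((extra->flags & GROUPING_INPUT_UNIQUE) != 0)
{
    /* The flag is only set when a groupClause exists. */
    Assert(parse->groupClause != NIL);
    /* ... create the AGG_SORTED path as above ... */
}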


Thank you for your detailed explanation.  The attached v6 includes this
feature.  Here is the data to show the improvement.

Test cases:
create table grp2 (a int primary key, b char(200), c int);
insert into grp2 select i, 'x', i from generate_series(1, 10000000)i;
analyze grp2;
explain analyze select a, sum(c) from grp2 group by a;

Without this feature:

postgres=# explain analyze select a, sum(c) from grp2 group by a;
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.43..712718.44 rows=10000000 width=12) (actual time=0.088..15491.027 rows=10000000 loops=1)
   Group Key: a
   ->  Index Scan using grp2_pkey on grp2  (cost=0.43..562718.44 rows=10000000 width=8) (actual time=0.068..6503.459 rows=10000000 loops=1)
 Planning Time: 0.916 ms
 Execution Time: 16252.397 ms
(5 rows)

Since the order of my data in the heap and in the index is exactly the
same, the index scan is much faster here.  The following tests the cost
of the *hash* aggregation:

postgres=# set enable_indexscan to off;
SET
postgres=# explain analyze select a, sum(c) from grp2 group by a;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=765531.00..943656.00 rows=10000000 width=12) (actual time=14424.379..30133.171 rows=10000000 loops=1)
   Group Key: a
   Planned Partitions: 128
   Peak Memory Usage: 4153 kB
   Disk Usage: 2265608 kB
   HashAgg Batches: 640
   ->  Seq Scan on grp2  (cost=0.00..403031.00 rows=10000000 width=8) (actual time=0.042..2808.281 rows=10000000 loops=1)
 Planning Time: 0.159 ms
 Execution Time: 31098.804 ms
(9 rows)

With this feature:
explain analyze select a, sum(c) from grp2 group by a;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..553031.57 rows=10000023 width=12) (actual time=0.044..13209.485 rows=10000000 loops=1)
   Group Key: a
   ->  Seq Scan on grp2  (cost=0.00..403031.23 rows=10000023 width=8) (actual time=0.023..4938.171 rows=10000000 loops=1)
 Planning Time: 0.400 ms
 Execution Time: 13749.121 ms
(5 rows)

During the implementation, I also added an AGG_UNIQUE AggStrategy to
record this information in the Agg plan node; this is a simple way to do
it and should be semantically correct.
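
As a sketch, the new strategy simply extends the existing AggStrategy
enum (the existing members below are from nodes/nodes.h; AGG_UNIQUE is
the addition proposed here):

typedef enum AggStrategy
{
    AGG_PLAIN,                  /* simple agg across all input rows */
    AGG_SORTED,                 /* grouped agg, input must be sorted */
    AGG_HASHED,                 /* grouped agg, use internal hashtable */
    AGG_MIXED,                  /* grouped agg, hash and sort both used */
    AGG_UNIQUE                  /* grouped agg, one input row per group */
} AggStrategy;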

V6 also includes:
1. Fixed the misleading comment you mentioned above.
2. Fixed a corner case in the `relation_has_uniquekeys_for` function:

+       /*
+        * For the UniqueKey->onerow case, the uniquekey->exprs is empty as
+        * well, so we can't rely on list_is_subset to handle this special
+        * case.
+        */
+       if (exprs == NIL)
+               return false;


Best Regards
Andy Fan
Attachments
