Re: [PATCH] Erase the distinctClause if the result is unique by definition

Поиск
Список
Период
Сортировка
От Ashutosh Bapat
Тема Re: [PATCH] Erase the distinctClause if the result is unique by definition
Дата
Msg-id CAExHW5tWqOEjhy8yOjPSk6=MYOGEtVaDsWmm+zyf69rUTHwL1g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [PATCH] Erase the distinctClause if the result is unique by definition  (David Rowley <dgrowleyml@gmail.com>)
Список pgsql-hackers


On Tue, Mar 10, 2020 at 3:51 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Sat, 7 Mar 2020 at 00:47, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Upload the newest patch so that the cfbot can pass.  The last patch failed
> because some explain without the (cost off).

I've only really glanced at this patch, but I think we need to do this
in a completely different way.

I've been mentioning UniqueKeys around this mailing list for quite a
while now [1]. To summarise the idea:

1. Add a new List field to RelOptInfo named unique_keys
2. During get_relation_info() process the base relation's unique
indexes and add to the RelOptInfo's unique_keys list the indexed
expressions from each unique index (this may need to be delayed until
check_index_predicates() since predOK is only set there)
3. Perhaps in add_paths_to_joinrel(), or maybe when creating the join
rel itself (I've not looked for the best location in detail),

build_*_join_rel() will be a good place for this. The paths created might take advantage of this information for costing.
 
determine if the join can cause rows to be duplicated. If it can't,
then add the UniqueKeys from that rel.  For example: SELECT * FROM t1
INNER JOIN t2 ON t1.unique = t2.not_unique; would have the joinrel for
{t1,t2} only take the unique keys from t2 (t1 can't duplicate t2 rows
since it's an eqijoin and t1.unique has a unique index).

this is interesting.
 
If the
condition was t1.unique = t2.unique then we could take the unique keys
from both sides of the join, and with t1.non_unique = t2.non_unique,
we can take neither.
4. When creating the GROUP BY paths (when there are no aggregates),
don't bother doing anything if the input rel's unique keys are a
subset of the GROUP BY clause. Otherwise, create the group by paths
and tag the new unique keys onto the GROUP BY rel.
5. When creating the DISTINCT paths, don't bother if the input rel has
unique keys are a subset of the distinct clause.

Thanks for laying this out in more details. Two more cases can be added to this
6. When creating RelOptInfo for a grouped/aggregated result, if all the columns of a group by clause are part of the result i.e. targetlist, the columns in group by clause server as the unique keys of the result. So the corresponding RelOptInfo can be marked as such.
7. The result of DISTINCT is unique for the columns contained in the DISTINCT clause. Hence we can add those columns to the unique_key of the RelOptInfo representing the result of the distinct clause.
8. If each partition of a partitioned table has a unique key with the same columns in it and that happens to be superset of the partition key, then the whole partitioned table gets that unique key as well.

With this we could actually pass the uniqueness information through Subquery scans as well and the overall query will benefit with that.
 

4 and 5 will mean that: SELECT DISTINCT non_unique FROM t1 GROUP BY
non_unique will just uniquify once for the GROUP BY and not for the
distinct.  SELECT DISTINCT unique FROM t1 GROUP BY unique; won't do
anything to uniquify the results.

Because both 4 and 5 require that the uniquekeys are a subset of the
distinct/group by clause, an empty uniquekey set would mean that the
RelOptInfo returns no more than 1 row.  That would allow your:

SELECT DISTINCT max(non_unique) FROM t1; to skip doing the DISTINCT part.

There's a separate effort in
https://commitfest.postgresql.org/27/1741/ to implement some parts of
the uniquekeys idea.  However the implementation currently only covers
adding the unique keys to Paths, not to RelOptInfos.

I haven't looked at that patch, but as discussed upthread, in this case we want the uniqueness associated with the RelOptInfo and not the path.
 

I also believe that the existing code in analyzejoins.c should be
edited to make use of unique keys rather than how it looks at unique
indexes and group by / distinct clauses.

+1.
--
Best Wishes,
Ashutosh Bapat

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Peter Eisentraut
Дата:
Сообщение: Re: Remove utils/acl.h from catalog/objectaddress.h
Следующее
От: Alexander Korotkov
Дата:
Сообщение: Re: [Patch] pg_rewind: options to use restore_command fromrecovery.conf or command line