Обсуждение: Remove useless GROUP BY columns considering unique index

Поиск
Список
Период
Сортировка

Remove useless GROUP BY columns considering unique index

От
Zhang Mingli
Дата:
Hi, 

This idea first came from remove_useless_groupby_columns does not need to record constraint dependencie[0] which points out that
unique index whose columns all have NOT NULL constraints  could also take the work with primary key when removing useless GROUP BY columns.
I study it and implement the idea.

Ex:

create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(c));

 explain (costs off) select * from t2 group by a,b,c;
 QUERY PLAN
 ----------------------
 HashAggregate
 Group Key: c
 -> Seq Scan on t2

The plan drop column a, b as I did a little more.

For the query, as t2 has primary key (a, b), before this patch, we could drop column c because {a, b} are PK.
And  we have an unique  index(c) with NOT NULL constraint now, we could drop column {a, b}, just keep {c}.

While we have multiple choices, group by a, b (c is removed  by PK) and group by c (a, b are removed by unique not null index)
And  I implement it to choose the one with less columns so that we can drop as more columns as possible. 
I think it’s good for planner to save some cost like Sort less columns. 
There may be better one for some reason like: try to keep PK for planner?
I’m not sure about that and it seems not worth much complex.

The NOT NULL constraint may also be computed from primary keys, ex: 
create temp table t2 (a int, b int, c int not null, primary key (a, b), unique(a));
Primary key(a, b) ensure a is NOT NULL and we have a unique index(a), but it will introduce more complex to check if a unique index could be used.
I also doubt it worths doing that..
So my patch make it easy: check unique index’s columns, it’s a valid candidate if all of that have NOT NULL constraint.
And we choose a best one who has the least column numbers in get_min_unique_not_null_attnos(), as the reason: less columns mean that more group by columns could be removed.

create temp table t3 (a int, b int, c int not null, d int not null, primary key (a, b), unique(c, d));
-- Test primary key beats unique not null index.
explain (costs off) select * from t3 group by a,b,c,d;
 QUERY PLAN 
----------------------
 HashAggregate
 Group Key: a, b
 -> Seq Scan on t3
(3 rows)

create temp table t4 (a int, b int not null, c int not null, d int not null, primary key (a, b), unique(b, c), unique(d));
-- Test unique not null index with less columns wins.
explain (costs off) select * from t4 group by a,b,c,d;
 QUERY PLAN 
----------------------
 HashAggregate
 Group Key: d
 -> Seq Scan on t4
(3 rows)

The unique Indices could have overlaps with primary keys and indices themselves. 

create temp table t5 (a int not null, b int not null, c int not null, d int not null, unique (a, b), unique(b, c), unique(a, c, d));
-- Test unique not null indices have overlap.
explain (costs off) select * from t5 group by a,b,c,d;
 QUERY PLAN 
----------------------
 HashAggregate
 Group Key: a, b
 -> Seq Scan on t5
(3 rows)

Вложения

Re: Remove useless GROUP BY columns considering unique index

От
jian he
Дата:
On Fri, Dec 29, 2023 at 11:05 PM Zhang Mingli <zmlpostgres@gmail.com> wrote:
>
> Hi,
>
> This idea first came from remove_useless_groupby_columns does not need to record constraint dependencie[0] which
pointsout that 
> unique index whose columns all have NOT NULL constraints  could also take the work with primary key when removing
uselessGROUP BY columns. 
> I study it and implement the idea.
>
>
[0]https://www.postgresql.org/message-id/flat/CAApHDvrdYa%3DVhOoMe4ZZjZ-G4ALnD-xuAeUNCRTL%2BPYMVN8OnQ%40mail.gmail.com
>

cosmetic issues:
+--
+-- Test removal of redundant GROUP BY columns using unique not null index.
+-- materialized view
+--
+create temp table t1 (a int, b int, c int, primary key (a, b), unique(c));
+create temp table t2 (a int, b int, c int not null, primary key (a,
b), unique(c));
+create temp table t3 (a int, b int, c int not null, d int not null,
primary key (a, b), unique(c, d));
+create temp table t4 (a int, b int not null, c int not null, d int
not null, primary key (a, b), unique(b, c), unique(d));
+create temp table t5 (a int not null, b int not null, c int not null,
d int not null, unique (a, b), unique(b, c), unique(a, c, d));
+
+-- Test unique index without not null constraint should not be used.
+explain (costs off) select * from t1 group by a,b,c;
+
+-- Test unique not null index beats primary key.
+explain (costs off) select * from t2 group by a,b,c;
+
+-- Test primary key beats unique not null index.
+explain (costs off) select * from t3 group by a,b,c,d;
+
+-- Test unique not null index with less columns wins.
+explain (costs off) select * from t4 group by a,b,c,d;
+
+-- Test unique not null indices have overlap.
+explain (costs off) select * from t5 group by a,b,c,d;
+
+drop table t1;
+drop table t2;
+drop table t3;
+drop table t4;
+

line `materailzed view` is unnecessary?
you forgot to drop table t5.

create temp table p_t2 (
  a int not null,
  b int not null,
  c int not null,
  d int,
  unique(a), unique(a,b),unique(a,b,c)
) partition by list(a);
create temp table p_t2_1 partition of p_t2 for values in(1);
create temp table p_t2_2 partition of p_t2 for values in(2);
explain (costs off, verbose) select * from p_t2 group by a,b,c,d;
your patch output:
                          QUERY PLAN
--------------------------------------------------------------
 HashAggregate
   Output: p_t2.a, p_t2.b, p_t2.c, p_t2.d
   Group Key: p_t2.a
   ->  Append
         ->  Seq Scan on pg_temp.p_t2_1
               Output: p_t2_1.a, p_t2_1.b, p_t2_1.c, p_t2_1.d
         ->  Seq Scan on pg_temp.p_t2_2
               Output: p_t2_2.a, p_t2_2.b, p_t2_2.c, p_t2_2.d
(8 rows)

so it seems to work with partitioned tables, maybe you should add some
test cases with partition tables.


- * XXX This is only used to create derived tables, so NO INHERIT constraints
- * are always skipped.
+ * XXX When used to create derived tables, set skip_no_inherit tp true,
+ * so that NO INHERIT constraints will be skipped.
  */
 List *
-RelationGetNotNullConstraints(Oid relid, bool cooked)
+RelationGetNotNullConstraints(Oid relid, bool cooked, bool skip_no_inherit)
 {
  List   *notnulls = NIL;
  Relation constrRel;
@@ -815,7 +815,7 @@ RelationGetNotNullConstraints(Oid relid, bool cooked)

  if (conForm->contype != CONSTRAINT_NOTNULL)
  continue;
- if (conForm->connoinherit)
+ if (skip_no_inherit && conForm->connoinherit)
  continue;

I don't think you need to refactor RelationGetNotNullConstraints.
however i found it hard to explain it, (which one is parent, which one
is children is very confusing for me).
so i use the following script to test it:
DROP TABLE ATACC1, ATACC2;
CREATE TABLE ATACC1 (a int);
CREATE TABLE ATACC2 (b int,c int, unique(c)) INHERITS (ATACC1);
ALTER TABLE ATACC1 ADD NOT NULL a;
-- ALTER TABLE ATACC1 ADD NOT NULL a NO INHERIT;
ALTER TABLE ATACC2 ADD unique(a);
explain (costs off, verbose) select * from ATACC2 group by a,b,c;
-------------------------

create table t_0_3 (a int not null, b int not null, c int not null, d
int not null, unique (a, b), unique(b, c) DEFERRABLE, unique(d));
explain (costs off, verbose) select * from t_0_3 group by a,b,c,d;
           QUERY PLAN
--------------------------------
 HashAggregate
   Output: a, b, c, d
   Group Key: t_0_3.a, t_0_3.b
   ->  Seq Scan on public.t_0_3
         Output: a, b, c, d
(5 rows)

the above part seems not right, it should be `Group Key: t_0_3.d`
so i changed to:

/* Skip constraints that are not UNIQUE */
+ if (con->contype != CONSTRAINT_UNIQUE)
+ continue;
+
+ /* Skip unique constraints that are condeferred */
+ if (con->condeferrable && con->condeferred)
+ continue;

I guess you probably have noticed, in the function
remove_useless_groupby_columns,
get_primary_key_attnos followed by get_min_unique_not_null_attnos.
Also, get_min_unique_not_null_attnos main code is very similar to
get_primary_key_attnos.

They both do `pg_constraint = table_open(ConstraintRelationId,
AccessShareLock);`
you might need to explain why we must open pg_constraint twice.
either it's cheap to do it or another reason.

If scanning twice is expensive, we may need to refactor get_primary_key_attnos.
get_primary_key_attnos only called in check_functional_grouping,
remove_useless_groupby_columns.

I added the patch to commitfest:
https://commitfest.postgresql.org/46/4751/