Thread: Removing Functionally Dependent GROUP BY Columns


Removing Functionally Dependent GROUP BY Columns

From:
David Rowley
Date:
We already allow a SELECT's target list to contain non-aggregated columns in a GROUP BY query in cases where the non-aggregated column is functionally dependent on the GROUP BY clause.

For example a query such as;

SELECT p.product_id,p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;

is perfectly fine in PostgreSQL, as p.description is functionally dependent on p.product_id (assuming product_id is the PRIMARY KEY of product).
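
For anyone wanting to try the example, a minimal schema along these lines makes the query above valid (the table definitions here are an illustrative assumption, not something from the original report):

-- Assumed, illustrative schema for the example query above.
CREATE TABLE product (
    product_id  int PRIMARY KEY,
    description text
);

CREATE TABLE sale (
    sale_id    int PRIMARY KEY,
    product_id int REFERENCES product (product_id),
    quantity   int
);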

It seems that there's no shortage of relational databases in existence today which don't support this. These databases would require the GROUP BY clause to include the p.description column too. 

It seems rather unfortunate that people who migrate applications to PostgreSQL may not be aware that we support this, as currently, if we needlessly include the p.description column, PostgreSQL naively includes this column while grouping. These people could well be incurring a performance penalty due to our planner not removing the useless items from the list: if the primary key is present, then including any other columns won't cause the groups to be split any further, so all other columns from the *same relation* can simply be removed from the GROUP BY clause.
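
To make the proposed transformation concrete, the idea is that a query written the "portable" way could be planned as though the dependent column had never been listed (a sketch of the intent, not output from the patch):

-- What a migrated application might write (using the illustrative schema above):
SELECT p.product_id, p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id, p.description;

-- What the planner could safely treat it as, since product_id is product's
-- primary key and therefore determines description:
SELECT p.product_id, p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;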

There are in fact also two queries in TPC-H (Q10 and Q18) which are written to include all of the non-aggregated columns in the GROUP BY list. During a recent test I witnessed a 50% gain in performance in Q10 by removing the unneeded columns from the GROUP BY clause.

I've attached a patch which implements this in PostgreSQL.

The patch may need a little more work in order to adjust the targetlist's tleSortGroupRefs to remove invalid ones and perhaps also remove the gaps.

I'm posting this now so that I can gauge the community interest in this.

Is it something that we'd like to have in PostgreSQL?

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: Removing Functionally Dependent GROUP BY Columns

From:
Marko Tiikkaja
Date:
On 2015-12-01 05:00, David Rowley wrote:
> We already allow a SELECT's target list to contain non-aggregated columns
> in a GROUP BY query in cases where the non-aggregated column is
> functionally dependent on the GROUP BY clause.
>
> For example a query such as;
>
> SELECT p.product_id,p.description, SUM(s.quantity)
> FROM product p
> INNER JOIN sale s ON p.product_id = s.product_id
> GROUP BY p.product_id;
>
> is perfectly fine in PostgreSQL, as p.description is functionally dependent
> on p.product_id (assuming product_id is the PRIMARY KEY of product).

This has come up before (on other forums, at least), and my main concern 
has been that unlike the case where we go from throwing an error to 
allowing a query, this has a chance to make the planning of currently 
legal queries slower.  Have you tried to measure the impact of this on 
queries where there's no runtime gains to be had?


.m



Re: Removing Functionally Dependent GROUP BY Columns

From:
David Rowley
Date:
On 1 December 2015 at 17:09, Marko Tiikkaja <marko@joh.to> wrote:
On 2015-12-01 05:00, David Rowley wrote:
We already allow a SELECT's target list to contain non-aggregated columns
in a GROUP BY query in cases where the non-aggregated column is
functionally dependent on the GROUP BY clause.

For example a query such as;

SELECT p.product_id,p.description, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id;

is perfectly fine in PostgreSQL, as p.description is functionally dependent
on p.product_id (assuming product_id is the PRIMARY KEY of product).

This has come up before (on other forums, at least), and my main concern has been that unlike the case where we go from throwing an error to allowing a query, this has a chance to make the planning of currently legal queries slower.  Have you tried to measure the impact of this on queries where there's no runtime gains to be had?

I've performed a series of benchmarks on the following queries:

Test1: explain select id1,id2 from t1 group by id1,id2;
Test2: explain select id from t2 group by id;
Test3: explain select t1.id1,t1.id2 from t2 inner join t1 on t1.id1=t2.id group by t1.id1,t1.id2;
 
I ran each of these with pgbench for 60 seconds, 3 runs per query. In each case below I've converted the TPS into seconds using the average TPS over the 3 runs.
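
The table definitions weren't included in the mail; a plausible setup consistent with the description of the tests (Test1 being a case where the search runs but finds nothing removable) might look like this:

-- Assumed benchmark schema, not from the original mail.  With a composite
-- primary key on t1, Test1's GROUP BY covers the whole key, so the patched
-- code does the lookup but removes nothing (pure planning overhead), while
-- Test2's single grouping column takes the fast path.
CREATE TABLE t1 (
    id1 int NOT NULL,
    id2 int NOT NULL,
    PRIMARY KEY (id1, id2)
);

CREATE TABLE t2 (
    id int PRIMARY KEY
);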

In summary:

Test1 is the worst case test. It's a very simple query, so the planning overhead of join searching is non-existent. The fact that there are 2 columns in the GROUP BY means that the fast path cannot be used. I added this test because, if there's only 1 column in the GROUP BY, there's no point in searching for something to remove.

  Master (average, sec)               0.0001043117
  Patched (average, sec)              0.0001118961
  Performance                         93.22%
  Microseconds of planning overhead   7.5844326722

Test2 is a simple query with a GROUP BY which can fast path due to there being only 1 GROUP BY column.

  Master (average, sec)               0.000099374448
  Patched (average, sec)              0.000099670124
  Performance                         99.70%
  Microseconds of planning overhead   0.2956763193

Test3 is slightly more complex and is aimed at showing that the percentage of planning overhead is smaller when joins exist and the overall planning cost becomes higher.

  Master (average, sec)               0.0001797165
  Patched (average, sec)              0.0001798406
  Performance                         99.93%
  Microseconds of planning overhead   0.1240776236

Test3's results seem a bit strange; I would have expected more of a slowdown. I ran the test again to make sure, and it came back with the same results the 2nd time.

I've attached the spreadsheet that I used to collect the results, and also the raw pgbench output.

It seems that the worst case test adds about 7.6 microseconds onto planning time. To get this worst case result I had to add two GROUP BY columns, as having only 1 triggers a fast path because the code knows it can't remove any columns when there's only 1. A similar fast path also exists which will only look up the PRIMARY KEY details if there's more than 1 column per relation in the GROUP BY, so for example GROUP BY rel1.col1, rel2.col1 won't look up any PRIMARY KEY constraint.

Given that the extra code really only does anything if the GROUP BY has 2 or more expressions, are you worried that this will negatively affect too many short, fast-to-execute queries?


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: Removing Functionally Dependent GROUP BY Columns

From:
Robert Haas
Date:
On Mon, Nov 30, 2015 at 11:00 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> There are in fact also two queries in TPC-H (Q10 and Q18) which are written
> to include all of the non-aggregated column in the GROUP BY list. During a
> recent test I witnessed a 50% gain in performance in Q10 by removing the
> unneeded columns from the GROUP BY clause.

Wow, that's pretty impressive.  +1 for trying to do something about this.

(Full disclosure: I have not read your patch.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Removing Functionally Dependent GROUP BY Columns

From:
Peter Eisentraut
Date:
On 11/30/15 11:00 PM, David Rowley wrote:
> It seems that there's no shortage of relational databases in existence
> today which don't support this. These databases would require the GROUP
> BY clause to include the p.description column too. 

Well, actually, we implemented this because other databases had it and
users kept complaining to us.



Re: Removing Functionally Dependent GROUP BY Columns

From:
David Rowley
Date:
On 3 December 2015 at 14:28, Peter Eisentraut <peter_e@gmx.net> wrote:
On 11/30/15 11:00 PM, David Rowley wrote:
> It seems that there's no shortage of relational databases in existence
> today which don't support this. These databases would require the GROUP
> BY clause to include the p.description column too.

Well, actually, we implemented this because other databases had it and
users kept complaining to us.

Most likely you mean MySQL? I believe there was a bit of an influx of people migrating away from that database in the not-too-distant past. It's not too surprising we picked up some of those people, and some of the features that they were used to, at the same time... IF NOT EXISTS is perhaps a good example to back that up.

However, I think your comment makes it quite clear that we're not talking about the same database management systems.

For me, I tested the two most popular commercial ones and found that neither of them supported columns in the SELECT list which were functionally dependent on the GROUP BY clause, unless the columns themselves were in the GROUP BY clause. I admit my "no shortage" comment was based on these two only. I also admit that I don't possess any statistics showing which RDBMSs users are migrating from. I merely thought that testing the two most popular commercial ones was enough justification to write what I wrote.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Removing Functionally Dependent GROUP BY Columns

From:
Julien Rouhaud
Date:
On 01/12/2015 12:07, David Rowley wrote:
> On 1 December 2015 at 17:09, Marko Tiikkaja <marko@joh.to> wrote:
> 
>     On 2015-12-01 05:00, David Rowley wrote:
> 
>         We already allow a SELECT's target list to contain
>         non-aggregated columns
>         in a GROUP BY query in cases where the non-aggregated column is
>         functionally dependent on the GROUP BY clause.
> 
>         For example a query such as;
> 
>         SELECT p.product_id,p.description, SUM(s.quantity)
>         FROM product p
>         INNER JOIN sale s ON p.product_id = s.product_id
>         GROUP BY p.product_id;
> 
>         is perfectly fine in PostgreSQL, as p.description is
>         functionally dependent
>         on p.product_id (assuming product_id is the PRIMARY KEY of product).
> 
> 
>     This has come up before (on other forums, at least), and my main
>     concern has been that unlike the case where we go from throwing an
>     error to allowing a query, this has a chance to make the planning of
>     currently legal queries slower.  Have you tried to measure the
>     impact of this on queries where there's no runtime gains to be had?
> 
> 
> I've performed a series of benchmarks on the following queries:
> 
> Test1: explain select id1,id2 from t1 group by id1,id2;
> Test2: explain select id from t2 group by id;
> Test3: explain select t1.id1,t1.id2 from t2 inner join t1 on
> t1.id1=t2.id group by t1.id1,t1.id2;
>  
> I ran each of these with pgbench for 60 seconds, 3 runs per query. In
> each case below I've converted the TPS into seconds using the average
> TPS over the 3 runs.
> 
> In summary:
> 
> Test1 is the worst case test. It's a very simple query so planning
> overhead of join searching is non-existent. The fact that there's 2
> columns in the GROUP BY means that the fast path cannot be used. I added
> this as if there's only 1 column in the GROUP BY then there's no point
> in searching for something to remove.
> 
> [...]
> 
> It seems that the worst case test adds about 7.6 microseconds onto
> planning time. To get this worse case result I had to add two GROUP BY
> columns, as having only 1 triggers a fast path as the code knows it
> can't remove any columns, since there's only 1. A similar fast path also
> exists which will only lookup the PRIMARY KEY details if there's more
> than 1 column per relation in the GROUP BY, so for example GROUP BY
> rel1.col1, rel2.col1 won't lookup any PRIMARY KEY constraint.
> 
> Given that the extra code really only does anything if the GROUP BY has
> 2 or more expressions, are you worried that this will affect too many
> short and fast to execute queries negatively?
> 
> 

In identify_key_vars()

+       /* Only PK constraints are of interest for now, see comment above */
+       if (con->contype != CONSTRAINT_PRIMARY)
+           continue;

I think the comment should better refer to
remove_useless_groupby_columns() (don't optimize further than what
check_functional_grouping() does).


+       /*
+        * Exit if the constraint is deferrable, there's no point in looking
+        * for another PK constriant, as there can only be one.
+        */

small typo on "constriant"



+               {
+                   varlistmatches = bms_add_member(varlistmatches,
varlistidx);
+                   found_col = true;
+                   /* don't break, there might be dupicates */
+               }

small typo on "dupicates". Also, I thought transformGroupClauseExpr()
called in the parse analysis make sure that there isn't any duplicate in
the GROUP BY clause. Do I miss something?
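
(An illustrative aside: the duplicate-collapsing behaviour being referred to is easy to confirm; write the same column twice in the GROUP BY and only one grouping key remains after parse analysis.)

-- Illustration only; any table works, "sale" reuses the example from the
-- first mail.  The duplicated grouping item is collapsed during parse
-- analysis, so the planner only ever sees a single grouping key here.
EXPLAIN (COSTS OFF)
SELECT product_id, SUM(quantity)
FROM sale
GROUP BY product_id, product_id;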


in remove_useless_groupby_columns()

+       /*
+        * Skip if there were no Vars found in the GROUP BY clause which
belong
+        * to this relation. We can also skip if there's only 1 Var, as
there
+        * needs to be more than one Var for there to be any columns
which are
+        * functionally dependent on another column.
+        */


This comment doesn't sound very clear. I'm not a native English speaker,
so maybe it's just me.

+   /*
+    * If we found any surplus Vars in the GROUP BY clause, then we'll build
+    * a new GROUP BY clause without these surplus Vars.
+    */
+   if (anysurplus)
+   {
+       List *new_groupby = NIL;
+
+       foreach (lc, root->parse->groupClause)
+       {
+           SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+           TargetEntry *tle;
+           Var *var;
+
+           tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+           var = (Var *) tle->expr;
+
+           if (!IsA(var, Var))
+               continue;
+ [...]

if var isn't a Var, it needs to be added in new_groupby.


+           /* XXX do we to alter tleSortGroupRef to remove gaps? */

no idea on that :/

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org



Re: Removing Functionally Dependent GROUP BY Columns

From:
David Rowley
Date:
Many thanks for the thorough review!

On 12 January 2016 at 23:37, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:

In identify_key_vars()

+       /* Only PK constraints are of interest for now, see comment above */
+       if (con->contype != CONSTRAINT_PRIMARY)
+           continue;

I think the comment should better refer to
remove_useless_groupby_columns() (don't optimize further than what
check_functional_grouping() does).


I've improved this by explaining it more clearly.
 

+       /*
+        * Exit if the constraint is deferrable, there's no point in looking
+        * for another PK constriant, as there can only be one.
+        */

small typo on "constriant"


Fixed.
 


+               {
+                   varlistmatches = bms_add_member(varlistmatches,
varlistidx);
+                   found_col = true;
+                   /* don't break, there might be dupicates */
+               }

small typo on "dupicates". Also, I thought transformGroupClauseExpr()
called in the parse analysis make sure that there isn't any duplicate in
the GROUP BY clause. Do I miss something?


No I missed something :) You are right. I've put a break; here and a comment to explain why.
 

in remove_useless_groupby_columns()

+       /*
+        * Skip if there were no Vars found in the GROUP BY clause which
belong
+        * to this relation. We can also skip if there's only 1 Var, as
there
+        * needs to be more than one Var for there to be any columns
which are
+        * functionally dependent on another column.
+        */


This comment doesn't sound very clear. I'm not a native english speaker,
so maybe it's just me.


Yes it was badly worded. I've fixed in the attached.
 
+   /*
+    * If we found any surplus Vars in the GROUP BY clause, then we'll build
+    * a new GROUP BY clause without these surplus Vars.
+    */
+   if (anysurplus)
+   {
+       List *new_groupby = NIL;
+
+       foreach (lc, root->parse->groupClause)
+       {
+           SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
+           TargetEntry *tle;
+           Var *var;
+
+           tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
+           var = (Var *) tle->expr;
+
+           if (!IsA(var, Var))
+               continue;
+ [...]

if var isn't a Var, it needs to be added in new_groupby.


Yes you're right, well, at least I've written enough code to ensure that it's not needed.
Technically we could look inside non-Vars and check if the Expr is just made up of Vars from a single relation, and then we could also make that surplus if we find other Vars which make up the table's primary key.  I didn't make these changes as I thought it was a less likely scenario. It wouldn't be too much extra code to add, however. I've gone and added an XXX comment to explain that there might be further optimisation opportunities around this in the future.
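
To illustrate the kind of case that XXX comment is about (this query is just an illustration, reusing the product/sale example from the first mail): the second grouping item below is an expression rather than a plain Var, so the patch keeps it, even though it only depends on columns of product and product_id is already grouped on.

-- Illustrative only: lower(p.description) is functionally dependent on the
-- grouped primary key too, but being an Expr rather than a plain Var it is
-- not considered for removal by the current patch.
SELECT p.product_id, SUM(s.quantity)
FROM product p
INNER JOIN sale s ON p.product_id = s.product_id
GROUP BY p.product_id, lower(p.description);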
 
I've attached an updated patch.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: Removing Functionally Dependent GROUP BY Columns

From:
Julien Rouhaud
Date:
On 14/01/2016 01:30, David Rowley wrote:
> Many thanks for the thorough review!
> 

And thanks for the patch and fast answer!

> On 12 January 2016 at 23:37, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
> 
> 
>     In identify_key_vars()
> 
>     +       /* Only PK constraints are of interest for now, see comment
>     above */
>     +       if (con->contype != CONSTRAINT_PRIMARY)
>     +           continue;
> 
>     I think the comment should better refer to
>     remove_useless_groupby_columns() (don't optimize further than what
>     check_functional_grouping() does).
> 
> 
> I've improved this by explaining it more clearly.
>  

Very nice.

Small typo though:


+         * Technically we could look at UNIQUE indexes too, however we'd also
+         * have to ensure that each column of the unique index had a NOT NULL


s/had/has/


+         * constraint, however since NOT NULL constraints currently are don't


s/are //

> 
> [...]
> 
> 
>     +               {
>     +                   varlistmatches = bms_add_member(varlistmatches,
>     varlistidx);
>     +                   found_col = true;
>     +                   /* don't break, there might be dupicates */
>     +               }
> 
>     small typo on "dupicates". Also, I thought transformGroupClauseExpr()
>     called in the parse analysis make sure that there isn't any duplicate in
>     the GROUP BY clause. Do I miss something?
> 
> 
> No I missed something :) You are right. I've put a break; here and a
> comment to explain why.
> 

ok :)


> [...]
> 
> 
>     +   /*
>     +    * If we found any surplus Vars in the GROUP BY clause, then
>     we'll build
>     +    * a new GROUP BY clause without these surplus Vars.
>     +    */
>     +   if (anysurplus)
>     +   {
>     +       List *new_groupby = NIL;
>     +
>     +       foreach (lc, root->parse->groupClause)
>     +       {
>     +           SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
>     +           TargetEntry *tle;
>     +           Var *var;
>     +
>     +           tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
>     +           var = (Var *) tle->expr;
>     +
>     +           if (!IsA(var, Var))
>     +               continue;
>     + [...]
> 
>     if var isn't a Var, it needs to be added in new_groupby.
> 
> 
> Yes you're right, well, at least I've written enough code to ensure that
> it's not needed.

Oh yes, I realize that now.

> Technically we could look inside non-Vars and check if the Expr is just
> made up of Vars from a single relation, and then we could also make that
> surplus if we find other Vars which make up the table's primary key.  I
> didn't make these changes as I thought it was a less likely scenario. It
> wouldn't be too much extra code to add however. I've went and added an
> XXX comment to explain that there might be future optimisation
> opportunities in the future around this. 
>  

Agreed.

> I've attached an updated patch.
> 

+        /* don't try anything unless there's two Vars */
+        if (varlist == NULL || list_length(varlist) < 2)
+            continue;

To be perfectly correct, the comment should say "at least two Vars".

Except the small remaining typos, this patch looks very fine to me. I
flag it as ready for committer.

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org



Re: Removing Functionally Dependent GROUP BY Columns

From:
Geoff Winkless
Date:
On 14 January 2016 at 11:19, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
> +               /* don't try anything unless there's two Vars */
> +               if (varlist == NULL || list_length(varlist) < 2)
> +                       continue;
>
> To be perfectly correct, the comment should say "at least two Vars".

Apologies for butting in and I appreciate I don't have any ownership
over this codebase or right to suggest any changes, but this just
caught my eye before I could hit "delete".

My mantra tends to be "why, not what" for inline comments; in this
case you can get the same information from the next line of code as
you get from the comment.

Perhaps something like

/* it's clearly impossible to remove duplicates if there are fewer
than two GROUPBY columns */

might be more helpful?

(also sorry if I've misunderstood what it _actually_ does, I just made
an assumption based on reading this thread!)

Geoff



Re: Removing Functionally Dependent GROUP BY Columns

From:
Julien Rouhaud
Date:
On 14/01/2016 14:04, Geoff Winkless wrote:
> On 14 January 2016 at 11:19, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
>> +               /* don't try anything unless there's two Vars */
>> +               if (varlist == NULL || list_length(varlist) < 2)
>> +                       continue;
>>
>> To be perfectly correct, the comment should say "at least two Vars".
> 
> Apologies for butting in and I appreciate I don't have any ownership
> over this codebase or right to suggest any changes, but this just
> caught my eye before I could hit "delete".
> 
> My mantra tends to be "why, not what" for inline comments; in this
> case you can get the same information from the next line of code as
> you get from the comment.
> 

You're absolutely right, but in this case the comment is more like a
reminder of a bigger comment few lines before that wasn't quoted in my mail:

+     *[...] If there are no Vars then
+     * nothing need be done for this rel. If there's less than two Vars then
+     * there's no room to remove any, as the PRIMARY KEY must have at
least one
+     * Var, so we can safely also do nothing if there's less than two Vars.


so I assume it's ok to keep it this way.


> Perhaps something like
> 
> /* it's clearly impossible to remove duplicates if there are fewer
> than two GROUPBY columns */
> 
> might be more helpful?
> 
> (also sorry if I've misunderstood what it _actually_ does, I just made
> an assumption based on reading this thread!)
> 


-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org



Re: Removing Functionally Dependent GROUP BY Columns

From:
Geoff Winkless
Date:
On 14 January 2016 at 13:16, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
> You're absolutely right, but in this case the comment is more like a
> reminder of a bigger comment few lines before that wasn't quoted in my mail

Fair enough, although I have two niggles with that:

a) the second comment could become physically separated from the first
by later additions of extra code, or by refactoring;
b) if you don't need the comment because the explanation for it is
local anyway and the comment tells you nothing that the code doesn't,
why have it at all?

> so I assume it's ok to keep it this way.

Of course it's ok to do whatever you decide is best: as I said
previously, I fully appreciate that I have no ownership over any of
the code.

Geoff



Re: Removing Functionally Dependent GROUP BY Columns

From:
Julien Rouhaud
Date:
On 14/01/2016 14:29, Geoff Winkless wrote:
> On 14 January 2016 at 13:16, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
>> You're absolutely right, but in this case the comment is more like a
>> reminder of a bigger comment few lines before that wasn't quoted in my mail
> 
> Fair enough, although I have two niggles with that:
> 
> a) the second comment could become physically separated from the first
> by later additions of extra code, or by refactoring;
> b) if you don't need the comment because the explanation for it is
> local anyway and the comment tells you nothing that the code doesn't,
> why have it at all?
> 

I agree. If I had to choose, I'd vote for removing it.

>> so I assume it's ok to keep it this way.
> 
> Of course it's ok to do whatever you decide is best: as I said
> previously, I fully appreciate that I have no ownership over any of
> the code.
> 

Neither do I, I'm just a reviewer here, just like you :)

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org



Re: Removing Functionally Dependent GROUP BY Columns

From:
David Rowley
Date:
On 15 January 2016 at 00:19, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:

+                * Technically we could look at UNIQUE indexes too, however we'd also
+                * have to ensure that each column of the unique index had a NOT NULL


s/had/has/


+                * constraint, however since NOT NULL constraints currently are don't


s/are //

Both fixed. Thanks.
 
>     +   /*
>     +    * If we found any surplus Vars in the GROUP BY clause, then
>     we'll build
>     +    * a new GROUP BY clause without these surplus Vars.
>     +    */
>     +   if (anysurplus)
>     +   {
>     +       List *new_groupby = NIL;
>     +
>     +       foreach (lc, root->parse->groupClause)
>     +       {
>     +           SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
>     +           TargetEntry *tle;
>     +           Var *var;
>     +
>     +           tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
>     +           var = (Var *) tle->expr;
>     +
>     +           if (!IsA(var, Var))
>     +               continue;
>     + [...]
>
>     if var isn't a Var, it needs to be added in new_groupby.
>
>
> Yes you're right, well, at least I've written enough code to ensure that
> it's not needed.

Oh yes, I realize that now.


I meant to say "I've not written enough code" ...
 
> Technically we could look inside non-Vars and check if the Expr is just
> made up of Vars from a single relation, and then we could also make that
> surplus if we find other Vars which make up the table's primary key.  I
> didn't make these changes as I thought it was a less likely scenario. It
> wouldn't be too much extra code to add however. I've went and added an
> XXX comment to explain that there might be future optimisation
> opportunities in the future around this.
>

Agreed.

> I've attached an updated patch.
>

+               /* don't try anything unless there's two Vars */
+               if (varlist == NULL || list_length(varlist) < 2)
+                       continue;

To be perfectly correct, the comment should say "at least two Vars".


Changed per discussion from you and Geoff
 
Except the small remaining typos, this patch looks very fine to me. I
flag it as ready for committer.

Many thanks for your followup review.

I've attached an updated patch to address what you highlighted above.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments

Re: Removing Functionally Dependent GROUP BY Columns

From:
Julien Rouhaud
Date:

On 14/01/2016 23:05, David Rowley wrote:
> On 15 January 2016 at 00:19, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
> 
> 
> +                * Technically we could look at UNIQUE indexes too, however we'd also
> +                * have to ensure that each column of the unique index had a NOT NULL
> 
> 
> s/had/has/
> 
> 
> +                * constraint, however since NOT NULL constraints currently are don't
> 
> 
> s/are //
> 
> 
> Both fixed. Thanks.
> 
> 
>> +   /*
>> +    * If we found any surplus Vars in the GROUP BY clause, then we'll build
>> +    * a new GROUP BY clause without these surplus Vars.
>> +    */
>> +   if (anysurplus)
>> +   {
>> +       List *new_groupby = NIL;
>> +
>> +       foreach (lc, root->parse->groupClause)
>> +       {
>> +           SortGroupClause *sgc = (SortGroupClause *) lfirst(lc);
>> +           TargetEntry *tle;
>> +           Var *var;
>> +
>> +           tle = get_sortgroupclause_tle(sgc, root->parse->targetList);
>> +           var = (Var *) tle->expr;
>> +
>> +           if (!IsA(var, Var))
>> +               continue;
>> + [...]
>> 
>> if var isn't a Var, it needs to be added in new_groupby.
>> 
>> 
>> Yes you're right, well, at least I've written enough code to
>> ensure that it's not needed.
> 
> Oh yes, I realize that now.
> 
> 
> I meant to say "I've not written enough code" ...
> 

Yes, that makes sense with the explanation you wrote just after :)

> 
>> Technically we could look inside non-Vars and check if the Expr
>> is just made up of Vars from a single relation, and then we could
>> also make that surplus if we find other Vars which make up the
>> table's primary key.  I didn't make these changes as I thought it
>> was a less likely scenario. It wouldn't be too much extra code to
>> add however. I've went and added an XXX comment to explain that
>> there might be future optimisation opportunities in the future
>> around this.
>> 
> 
> Agreed.
> 
>> I've attached an updated patch.
>> 
> 
> +               /* don't try anything unless there's two Vars */
> +               if (varlist == NULL || list_length(varlist) < 2)
> +                       continue;
> 
> To be perfectly correct, the comment should say "at least two
> Vars".
> 
> 
> Changed per discussion from you and Geoff
> 
> 
> Except the small remaining typos, this patch looks very fine to me.
> I flag it as ready for committer.
> 
> 
> Many thanks for your followup review.
> 
> I've attached an updated patch to address what you highlighted
> above.
> 

Thanks!

> 
> --
>  David Rowley                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
> 
> 
> 


-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org



Re: Removing Functionally Dependent GROUP BY Columns

From:
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> [ prune_group_by_clause_59f15cf_2016-01-15.patch ]

I spent some time looking through this but decided that it needs some work
to be committable.

The main thing I didn't like was that identify_key_vars seems to have a
rather poorly chosen API: it mixes up identifying a rel's pkey with
sorting through the GROUP BY list.  I think it would be better to create
a function that, given a relid, hands back the OID of the pkey constraint
and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
no pkey or it's deferrable).  The column numbers should be offset by
FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
represented correctly --- compare RelationGetIndexAttrBitmap().

The reason this seems like a more attractive solution is that the output
of the pg_constraint lookup becomes independent of the particular query
and thus is potentially cacheable (in the relcache, say).  I do not think
we need to cache it right now but I'd like to have the option to do so.

(I wonder BTW whether check_functional_grouping couldn't be refactored
to use the output of such a function, too.)

Some lesser points:

* I did not like where you plugged in the call to
remove_useless_groupby_columns; there are weird interactions with grouping
sets as to whether it will get called or not, and also that whole chunk
of code is due for refactoring.  I'd be inclined to call it from
subquery_planner instead, maybe near where the HAVING clause preprocessing
happens.

* You've got it iterating over every RTE, even those that aren't
RTE_RELATION RTEs.  It's pure luck that identify_key_vars doesn't fail
outright when passed a bogus relid, and it's certainly wasteful.

* Both of the loops iterating over the groupClause neglect to check
varlevelsup, thus leading to assert failures or worse if an outer-level
Var is present in the GROUP BY list.  (I'm pretty sure outer Vars can
still be present at this point, though I might be wrong.)

* I'd be inclined to see if the relvarlists couldn't be converted into
bitmapsets too.  Then the test to see if the pkey is a subset of the
grouping columns becomes a simple and cheap bms_is_subset test.  You don't
really need surplusvars either, because you can instead test Vars for
membership in the pkey bitmapsets that pg_constraint.c will be handing
back.

* What you did to join.sql/join.out seems a bit bizarre.  The existing
test case that you modified was meant to test something else, and
conflating this behavior with the pre-existing one (and not documenting
it) is confusing as can be.  A bit more work on regression test cases
seems indicated.

I'm going to set this back to "Waiting on Author".  I think the basic
idea is good but it needs a rewrite.
        regards, tom lane



Re: Removing Functionally Dependent GROUP BY Columns

From:
David Rowley
Date:
On 23 January 2016 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I spent some time looking through this but decided that it needs some work
> to be committable.
>
> The main thing I didn't like was that identify_key_vars seems to have a
> rather poorly chosen API: it mixes up identifying a rel's pkey with
> sorting through the GROUP BY list.  I think it would be better to create
> a function that, given a relid, hands back the OID of the pkey constraint
> and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
> no pkey or it's deferrable).  The column numbers should be offset by
> FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
> represented correctly --- compare RelationGetIndexAttrBitmap().

That seems like a much better design to me too. I've made changes and
attached the updated patch.

> The reason this seems like a more attractive solution is that the output
> of the pg_constraint lookup becomes independent of the particular query
> and thus is potentially cacheable (in the relcache, say).  I do not think
> we need to cache it right now but I'd like to have the option to do so.

I've not touched that yet, but good idea.

> (I wonder BTW whether check_functional_grouping couldn't be refactored
> to use the output of such a function, too.)

I've attached a separate patch for that too. It applies after the
prunegroupby patch.

> Some lesser points:
>
> * I did not like where you plugged in the call to
> remove_useless_groupby_columns; there are weird interactions with grouping
> sets as to whether it will get called or not, and also that whole chunk
> of code is due for refactoring.  I'd be inclined to call it from
> subquery_planner instead, maybe near where the HAVING clause preprocessing
> happens.

I've moved the call to subquery_planner()

> * You've got it iterating over every RTE, even those that aren't
> RTE_RELATION RTEs.  It's pure luck that identify_key_vars doesn't fail
> outright when passed a bogus relid, and it's certainly wasteful.

:-( I've fixed that now.

> * Both of the loops iterating over the groupClause neglect to check
> varlevelsup, thus leading to assert failures or worse if an outer-level
> Var is present in the GROUP BY list.  (I'm pretty sure outer Vars can
> still be present at this point, though I might be wrong.)

Fixed in the first loop, and the way I've rewritten the code to use
bms_difference, there's no need to check again in the 2nd loop.

> * I'd be inclined to see if the relvarlists couldn't be converted into
> bitmapsets too.  Then the test to see if the pkey is a subset of the
> grouping columns becomes a simple and cheap bms_is_subset test.  You don't
> really need surplusvars either, because you can instead test Vars for
> membership in the pkey bitmapsets that pg_constraint.c will be handing
> back.

I've changed to use a bitmapset now, but I've kept surplusvars, having
this allows a single pass over the group by clause to remove the
surplus Vars, rather than doing it again and again for each relation.
I've also found a better way so that array is only allocated if some
surplus Vars are found. I understand what you mean, and yes, it is
possible to get rid of it, but I'd need to still add something else to
mark that this rel's extra Vars are okay to be removed. I could do
that by adding another bitmapset and marking the relid in that, but I
quite like how I've changed it now as it saves having to check
varlevelsup again on the Vars in the GROUP BY clause, as I just use
bms_difference() on the originally found Vars subtracting off the PK
vars, and assume all of those are surplus.

> * What you did to join.sql/join.out seems a bit bizarre.  The existing
> test case that you modified was meant to test something else, and
> conflating this behavior with the pre-existing one (and not documenting
> it) is confusing as can be.  A bit more work on regression test cases
> seems indicated.

The history behind that is that at one point during developing the
patch that test had started failing due to the group by item being
removed therefore allowing the join removal conditions to be met. On
testing again with the old test query I see this no longer happens, so
I've removed the change, although the expected output still differs
due to the group by item being removed.
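
To sketch the interaction being described here (a hypothetical example in the spirit of that regression test, not necessarily the exact query): once the grouped subquery's GROUP BY is reduced to just the primary key, its output becomes provably unique on the join column, which is the uniqueness condition left-join removal needs when no columns of the subquery are referenced above the join.

-- Hypothetical illustration.  b.id is b's primary key; if "GROUP BY b.id,
-- b.c_id" is reduced to "GROUP BY b.id", the subquery is unique on id and
-- the LEFT JOIN becomes a candidate for removal, since no column of s is
-- used in the outer query.
CREATE TABLE b (id int PRIMARY KEY, c_id int);
CREATE TABLE d (a int);

SELECT d.a
FROM d
LEFT JOIN (SELECT b.id FROM b GROUP BY b.id, b.c_id) s ON d.a = s.id;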

> I'm going to set this back to "Waiting on Author".  I think the basic
> idea is good but it needs a rewrite.

Thanks for the review and the good ideas.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: Removing Functionally Dependent GROUP BY Columns

From:
Julien Rouhaud
Date:
On 23/01/2016 10:44, David Rowley wrote:
> On 23 January 2016 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I spent some time looking through this but decided that it needs some work
>> to be committable.
>>
>> The main thing I didn't like was that identify_key_vars seems to have a
>> rather poorly chosen API: it mixes up identifying a rel's pkey with
>> sorting through the GROUP BY list.  I think it would be better to create
>> a function that, given a relid, hands back the OID of the pkey constraint
>> and a Bitmapset of the column numbers in the pkey (or 0/NULL if there's
>> no pkey or it's deferrable).  The column numbers should be offset by
>> FirstLowInvalidHeapAttributeNumber so that a pkey on OID can be
>> represented correctly --- compare RelationGetIndexAttrBitmap().
> 
> That seems like a much better design to me too. I've made changes and
> attached the updated patch.
> 
>> The reason this seems like a more attractive solution is that the output
>> of the pg_constraint lookup becomes independent of the particular query
>> and thus is potentially cacheable (in the relcache, say).  I do not think
>> we need to cache it right now but I'd like to have the option to do so.
> 
> I've not touched that yet, but good idea.
> 
>> (I wonder BTW whether check_functional_grouping couldn't be refactored
>> to use the output of such a function, too.)
> 
> I've attached a separate patch for that too. It applies after the
> prunegroupby patch.
> 

This refactoring makes the code much better, +1 from me.

I also reviewed it, it looks fine. However, the following removed
comment could be kept for clarity:

-           /* The PK is a subset of grouping_columns, so we win */



>> Some lesser points:
>>
>> * I did not like where you plugged in the call to
>> remove_useless_groupby_columns; there are weird interactions with grouping
>> sets as to whether it will get called or not, and also that whole chunk
>> of code is due for refactoring.  I'd be inclined to call it from
>> subquery_planner instead, maybe near where the HAVING clause preprocessing
>> happens.
> 
> I've moved the call to subquery_planner()
> 
>> * You've got it iterating over every RTE, even those that aren't
>> RTE_RELATION RTEs.  It's pure luck that identify_key_vars doesn't fail
>> outright when passed a bogus relid, and it's certainly wasteful.
> 
> :-( I've fixed that now.
> 
>> * Both of the loops iterating over the groupClause neglect to check
>> varlevelsup, thus leading to assert failures or worse if an outer-level
>> Var is present in the GROUP BY list.  (I'm pretty sure outer Vars can
>> still be present at this point, though I might be wrong.)
> 
> Fixed in the first loop, and the way I've rewritten the code to use
> bms_difference, there's no need to check again in the 2nd loop.
> 
>> * I'd be inclined to see if the relvarlists couldn't be converted into
>> bitmapsets too.  Then the test to see if the pkey is a subset of the
>> grouping columns becomes a simple and cheap bms_is_subset test.  You don't
>> really need surplusvars either, because you can instead test Vars for
>> membership in the pkey bitmapsets that pg_constraint.c will be handing
>> back.
> 
> I've changed to use a bitmapset now, but I've kept surplusvars, having
> this allows a single pass over the group by clause to remove the
> surplus Vars, rather than doing it again and again for each relation.
> I've also found a better way so that array is only allocated if some
> surplus Vars are found. I understand what you mean, and yes, it is
> possible to get rid of it, but I'd need to still add something else to
> mark that this rel's extra Vars are okay to be removed. I could do
> that by adding another bitmapset and marking the relid in that, but I
> quite like how I've changed it now as it saves having to check
> varlevelsup again on the Vars in the GROUP BY clause, as I just use
> bms_difference() on the originally found Vars subtracting off the PK
> vars, and assume all of those are surplus.
> 

I wonder if in remove_useless_groupby_columns(), in the foreach loop you
could change the

+       if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+       {

by something like


+       if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
+           continue;
+
+       if (bms_is_subset(pkattnos, relattnos))
+       {


which may be cheaper.

Otherwise, I couldn't find any issue with this new version of the patch.

>> * What you did to join.sql/join.out seems a bit bizarre.  The existing
>> test case that you modified was meant to test something else, and
>> conflating this behavior with the pre-existing one (and not documenting
>> it) is confusing as can be.  A bit more work on regression test cases
>> seems indicated.
> 
> The history behind that is that at one point during developing the
> patch that test had started failing due to the group by item being
> removed therefore allowing the join removal conditions to be met. On
> testing again with the old test query I see this no longer happens, so
> I've removed the change, although the expected output still differs
> due to the group by item being removed.
> 
>> I'm going to set this back to "Waiting on Author".  I think the basic
>> idea is good but it needs a rewrite.
> 
> Thanks for the review and the good ideas.
> 


-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org



Re: Removing Functionally Dependent GROUP BY Columns

From:
David Rowley
Date:
On 24 January 2016 at 00:56, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
> I wonder if in remove_useless_groupby_columns(), in the foreach loop you
> could change the
>
> +       if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
> +       {
>
> by something like
>
>
> +       if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
> +           continue;
> +
> +       if (bms_is_subset(pkattnos, relattnos))
> +       {
>
>
> which may be cheaper.

Thanks for looking over this again.

I actually had that part written quite a few different ways before I
finally decided to use bms_subset_compare. I didn't benchmark, but I
thought 1 function call was better than 2, as I had it as
bms_is_subset(pkattnos, relattnos) && !bms_is_subset(relattnos,
pkattnos), and again with !bms_equal() instead of the 2nd subset test.
I figured 1 function call was better than 2, so finally settled on
bms_subset_compare(). I'm thinking that 3 function calls likely won't
make things better.  I can't imagine it's going to cost much either
way, so I doubt it's worth trying to check whats faster. Although the
thing about bms_num_members() is that it's going to loop over each
word in the bitmap no matter what, whereas a subset check can abort
early in some cases.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing Functionally Dependent GROUP BY Columns

From:
Julien Rouhaud
Date:
On 23/01/2016 14:51, David Rowley wrote:
> On 24 January 2016 at 00:56, Julien Rouhaud <julien.rouhaud@dalibo.com> wrote:
>> I wonder if in remove_useless_groupby_columns(), in the foreach loop you
>> could change the
>>
>> +       if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
>> +       {
>>
>> by something like
>>
>>
>> +       if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
>> +           continue;
>> +
>> +       if (bms_is_subset(pkattnos, relattnos))
>> +       {
>>
>>
>> which may be cheaper.
> 
> Thanks for looking over this again.
> 
> I actually had that part written quite a few different ways before I
> finally decided to use bms_subset_compare. I didn't benchmark, but I
> thought 1 function call was better than 2, as I had it as
> bms_is_subset(pkattnos, relattnos) && !bms_is_subset(relattnos,
> pkattnos), and again with !bms_equal() instead of the 2nd subset test.
> I figured 1 function call was better than 2, so finally settled on
> bms_subset_compare(). I'm thinking that 3 function calls likely won't
> make things better.  I can't imagine it's going to cost much either
> way, so I doubt it's worth trying to check whats faster. Although the
> thing about bms_num_members() is that it's going to loop over each
> word in the bitmap no matter what, whereas a subset check can abort
> early in some cases.
> 
> 

FWIW, this code was a simplified example.  bms_num_members(relattnos) is
already called a few lines before, so it'd be 1 function call against 2
function calls (and a var assignment).

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org



Re: Removing Functionally Dependent GROUP BY Columns

From:
Tom Lane
Date:
Julien Rouhaud <julien.rouhaud@dalibo.com> writes:
> I wonder if in remove_useless_groupby_columns(), in the foreach loop you
> could change the

> +       if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
> +       {

> by something like

> +       if (bms_num_members(relattnos) <= bms_num_members(pkattnos))
> +           continue;
> +
> +       if (bms_is_subset(pkattnos, relattnos))
> +       {

> which may be cheaper.

FWIW, I really doubt that would be cheaper.  The various flavors of
subset comparison are word-at-a-time bitmasking operations, but
bms_num_members has to grovel over individual bits; it's certain to
be more expensive than a subset test.
        regards, tom lane



Re: Removing Functionally Dependent GROUP BY Columns

From:
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 23 January 2016 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * What you did to join.sql/join.out seems a bit bizarre.  The existing
>> test case that you modified was meant to test something else, and
>> conflating this behavior with the pre-existing one (and not documenting
>> it) is confusing as can be.  A bit more work on regression test cases
>> seems indicated.

> The history behind that is that at one point during developing the
> patch that test had started failing due to the group by item being
> removed therefore allowing the join removal conditions to be met. On
> testing again with the old test query I see this no longer happens, so
> I've removed the change, although the expected output still differs
> due to the group by item being removed.

Hmm ... but ... it seems to me that the test as it stands *should* fail
after this patch, because once the non-pkey grouping column is removed
the join removal optimization should apply.  I think we should look a bit
more closely at what's happening there.

(IOW, I wasn't so much unhappy with the change to that test case as
that it was being used as the only test case for this new behavior.
I see you added some new, separate test cases, so that's good; but
there's something fishy if the existing case doesn't change behavior.)
        regards, tom lane



Re: Removing Functionally Dependent GROUP BY Columns

From:
David Rowley
Date:
On 24 January 2016 at 08:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> On 23 January 2016 at 12:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> * What you did to join.sql/join.out seems a bit bizarre.  The existing
>>> test case that you modified was meant to test something else, and
>>> conflating this behavior with the pre-existing one (and not documenting
>>> it) is confusing as can be.  A bit more work on regression test cases
>>> seems indicated.
>
>> The history behind that is that at one point during developing the
>> patch that test had started failing due to the group by item being
>> removed therefore allowing the join removal conditions to be met. On
>> testing again with the old test query I see this no longer happens, so
>> I've removed the change, although the expected output still differs
>> due to the group by item being removed.
>
> Hmm ... but ... it seems to me that the test as it stands *should* fail
> after this patch, because once the non-pkey grouping column is removed
> the join removal optimization should apply.  I think we should look a bit
> more closely at what's happening there.
>
> (IOW, I wasn't so much unhappy with the change to that test case as
> that it was being used as the only test case for this new behavior.
> I see you added some new, separate test cases, so that's good; but
> there's something fishy if the existing case doesn't change behavior.)

Thanks for looking at this again.

I've looked into why the join is not removed; since the redundant
GROUP BY columns are removed during planning, and since the outer
query is planned before the subquery, when the join removal code
checks if the subquery can be removed, the subquery is yet to be
planned, so it still contains the 2 GROUP BY items.

Perhaps the useless columns can be removed a bit earlier, perhaps in
parse analysis. I will look into that now.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing Functionally Dependent GROUP BY Columns

From:
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> I've looked into why the join is not removed; since the redundant
> GROUP BY columns are removed during planning, and since the outer
> query is planned before the sub query, then when the join removal code
> checks if the subquery can been removed, the subquery is yet to be
> planned, so still contains the 2 GROUP BY items.

Hmm ... but why did it get removed in the earlier patch version, then?

> Perhaps the useless columns can be removed a bit earlier, perhaps in
> parse analysis. I will look into that now.

No; doing this in parse analysis will be sufficient reason to reject the
patch.  That would mean adding a not-semantically-necessary dependency on
the pkey to a query when it is stored as a view.  It has to be done at
planning time and no sooner.

It's possible that you could integrate it into some earlier phase of
planning, like prepjointree, but I think that would be messy and likely
not worth it.  I don't see any existing query-tree traversal this could
piggyback on, and I doubt we want to add a new pass just for this.
        regards, tom lane



Re: Removing Functionally Dependent GROUP BY Columns

From:
David Rowley
Date:
On 25 January 2016 at 10:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> I've looked into why the join is not removed; since the redundant
>> GROUP BY columns are removed during planning, and since the outer
>> query is planned before the sub query, then when the join removal code
>> checks if the subquery can been removed, the subquery is yet to be
>> planned, so still contains the 2 GROUP BY items.
>
> Hmm ... but why did it get removed in the earlier patch version, then?

I'm not sure now, it was months ago. Perhaps I misremembered and only
altered the test because I mistakenly anticipated it would break.

>> Perhaps the useless columns can be removed a bit earlier, perhaps in
>> parse analysis. I will look into that now.
>
> No; doing this in parse analysis will be sufficient reason to reject the
> patch.  That would mean adding a not-semantically-necessary dependency on
> the pkey to a query when it is stored as a view.  It has to be done at
> planning time and no sooner.
>
> It's possible that you could integrate it into some earlier phase of
> planning, like prepjointree, but I think that would be messy and likely
> not worth it.  I don't see any existing query-tree traversal this could
> piggyback on, and I doubt we want to add a new pass just for this.

It seems like a bit of a corner case anyway. Maybe it's fine as is.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Removing Functionally Dependent GROUP BY Columns

From:
Alvaro Herrera
Date:
This patch got a good share of review, so it's fair to move it to the
next commitfest.  I marked it as ready-for-committer there which seems
to be the right status.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Removing Functionally Dependent GROUP BY Columns

From:
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> [ prune_group_by_clause_ab4f491_2016-01-23.patch ]
> [ check_functional_grouping_refactor.patch ]

I've committed this with mostly-cosmetic revisions (principally, rewriting
a lot of the comments, which seemed on the sloppy side).

>> * Both of the loops iterating over the groupClause neglect to check
>> varlevelsup, thus leading to assert failures or worse if an outer-level
>> Var is present in the GROUP BY list.  (I'm pretty sure outer Vars can
>> still be present at this point, though I might be wrong.)

> Fixed in the first loop, and the way I've rewritten the code to use
> bms_difference, there's no need to check again in the 2nd loop.

Um, AFAICS, you *do* need to check again in the second loop, else you'll
be accessing a surplusvars[] entry that might not exist at all, and in
any case might falsely tell you that you can exclude the outer var from
the new GROUP BY list.

That was the only actual bug I found, though I changed some other stuff.
        regards, tom lane



Re: Removing Functionally Dependent GROUP BY Columns

From:
David Rowley
Date:
<p dir="ltr">On 12/02/2016 12:01 am, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>>
wrote:<br/> ><br /> > David Rowley <<a
href="mailto:david.rowley@2ndquadrant.com">david.rowley@2ndquadrant.com</a>>writes:<br /> > > [
prune_group_by_clause_ab4f491_2016-01-23.patch]<br /> > > [ check_functional_grouping_refactor.patch ]<br />
><br/> > I've committed this with mostly-cosmetic revisions (principally, rewriting<br /> > a lot of the
comments,which seemed on the sloppy side).<p dir="ltr">Many thanks for committing this and improving the comments.<p
dir="ltr">>>> * Both of the loops iterating over the groupClause neglect to check<br /> > >>
varlevelsup,thus leading to assert failures or worse if an outer-level<br /> > >> Var is present in the GROUP
BYlist.  (I'm pretty sure outer Vars can<br /> > >> still be present at this point, though I might be
wrong.)<br/> ><br /> > > Fixed in the first loop, and the way I've rewritten the code to use<br /> > >
bms_difference,there's no need to check again in the 2nd loop.<br /> ><br /> > Um, AFAICS, you *do* need to check
againin the second loop, else you'll<br /> > be accessing a surplusvars[] entry that might not exist at all, and
in<br/> > any case might falsely tell you that you can exclude the outer var from<br /> > the new GROUP BY
list.<br/> ><p dir="ltr">I can't quite understand what you're seeing here. As far as I can tell from reading the
codeagain (on my phone ) the varlevelsup check in the 2nd loop is not required. Any var with varlevelsup above zero
won'tmake it into the surplus bitmap for the release as the bms_difference call just removes the pk vars from the
varlevelsup=0 vars. So the bms_ismember should be fine. I can't see what I'm missing. In either case it test does no
harm.<p dir="ltr">> That was the only actual bug I found, though I changed some other stuff. 

Re: Removing Functionally Dependent GROUP BY Columns

From:
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 12/02/2016 12:01 am, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> Um, AFAICS, you *do* need to check again in the second loop, else you'll
>> be accessing a surplusvars[] entry that might not exist at all, and in
>> any case might falsely tell you that you can exclude the outer var from
>> the new GROUP BY list.

> I can't quite understand what you're seeing here.

The second loop is iterating through the original GROUP BY list, so it
will see again any outer Vars that were excluded by the first loop.
It needs to exclude them exactly the same, because they are outside
the scope of its data structures.  Consider something like (admittedly
pretty silly, but legal SQL)

create table up (u1 int, u2 int, u3 int);
create table down (f1 int primary key, f2 int);

select * from othertable, up
where u1 in (select f2 from down group by f1, f2, up.u3);

up.u3 would have varlevelsup = 1, varno = 2, varattno = 3.
If you don't skip it then the surplusvars[var->varno] access
will be trying to fetch off the end of the surplusvars[] array,
because there is only one RTE in the subquery's rangetable
though there are two in the outer query's rangetable.

When I trace through this example, it manages not to crash because
in point of fact the outer Var has already been replaced by a Param.
But I don't think this code should be assuming that; it's an artifact
of the detailed order of processing in subquery_planner(), and could
easily change in the future, for example if we tried to move the
whole affair to the jointree preprocessing stage as we discussed earlier.
        regards, tom lane



Re: Removing Functionally Dependent GROUP BY Columns

From:
David Rowley
Date:
<p dir="ltr">On 14/02/2016 5:11 pm, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>>
wrote:<br/> ><br /> > David Rowley <<a
href="mailto:david.rowley@2ndquadrant.com">david.rowley@2ndquadrant.com</a>>writes:<br /> > > On 12/02/2016
12:01am, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>> wrote:<br /> > > I can't
quiteunderstand what you're seeing here.<br /> ><br /> > The second loop is iterating through the original GROUP
BYlist, so it<br /> > will see again any outer Vars that were excluded by the first loop.<br /> > It needs to
excludethem exactly the same, because they are outside<br /> > the scope of its data structures.  Consider something
like(admittedly<br /> > pretty silly, but legal SQL)<br /> ><br /> > create table up (u1 int, u2 int, u3
int);<br/> > create table down (f1 int primary key, f2 int);<br /> ><br /> > select * from othertable, up<br
/>> where u1 in (select f2 from down group by f1, f2, up.u3);<br /> ><br /> > up.u3 would have varlevelsup =
1,varno = 2, varattno = 3.<br /> > If you don't skip it then the surplusvars[var->varno] access<br /> > will
betrying to fetch off the end of the surplusvars[] array,<br /> > because there is only one RTE in the subquery's
rangetable<br/> > though there are two in the outer query's rangetable.<p dir="ltr">Thanks for explaining this.
ClearlyI missed the case of the varno pointing off the end of the array. Thanks for noticing and fixing. <br />