Обсуждение: Issue with query_is_distinct_for() and grouping sets
While working on the fix for the push-down-HAVING optimization, I
noticed $subject.
query_is_distinct_for() is intended to determine whether a query never
returns duplicates of the specified columns. For a query using
grouping sets, if there are no grouping expressions, we know there are
no non-empty grouping sets and potentially one or more empty grouping
sets. The idea is to check whether there is only one empty grouping
set, in which case the query would return only one row and thus
unique. Currently the check being performed is:
if (list_length(query->groupingSets) == 1 &&
((GroupingSet *) linitial(query->groupingSets))->kind ==
GROUPING_SET_EMPTY)
return true;
else
return false;
It seems to me that this check is not thorough enough. First, if the
DISTINCT clause is used on the GROUP BY, duplicate empty grouping sets
will be removed, leaving only one. However, the current code does not
check groupDistinct at all.
Second, if the GROUP BY clause is written in the form of "grouping
sets(())", the GroupingSet is actually a GROUPING_SET_SETS node
containing a single GROUPING_SET_EMPTY node, which still represents
only one empty grouping set. The current check does not account for
this structure.
Third, the check "list_length(query->groupingSets) == 1" is also
insufficient. It's possible to have multiple GroupingSet nodes and
still end up with a single empty grouping set, as long as each node
contains only one GROUPING_SET_EMPTY element, for example, in a query
like "group by grouping sets (()), grouping sets (())".
To fix this issue, I suggest first checking the groupDistinct flag,
which is straightforward and does not cost much. After that, there
seem to be two possible approaches to verify whether there is only one
empty grouping set.
1) In addition to the current check, we can iterate through
query->groupingSets and verify that each GroupingSet node contains
exactly one element. This should cover all possible structures that
represent a single empty grouping set.
2) Alternatively, we can perform expand_grouping_sets() and then check
whether the resulting expanded flat list contains only one element.
I'm somewhat inclined toward option #2, as it seems to be a more
principled approach. It could be argued that calling
expand_grouping_sets() can introduce planning overhead, but I don't
think this is a problem, because 1) we're dealing with a simple
grouping sets structure that contains only empty grouping sets, and 2)
since we have already checked groupDistinct, there's no need to
perform deduplication of grouping sets within expand_grouping_sets().
Therefore, I expect the overhead from calling expand_grouping_sets()
to be minimal.
Attached is a patch along the lines of option #2. The LCOV report
indicates that there is currently no test coverage for the "else if
(query->groupingSets)" branch in query_is_distinct_for(). This patch
also adds test cases to cover that branch.
Thoughts?
- Richard
Вложения
On Wed, Oct 22, 2025 at 6:25 PM Richard Guo <guofenglinux@gmail.com> wrote: > Attached is a patch along the lines of option #2. The LCOV report > indicates that there is currently no test coverage for the "else if > (query->groupingSets)" branch in query_is_distinct_for(). This patch > also adds test cases to cover that branch. Here is an updated patch that includes a commit message and adds a new test case involving DISTINCT clause used with GROUP BY. - Richard
Вложения
On Thu, 23 Oct 2025 at 15:45, Richard Guo <guofenglinux@gmail.com> wrote: > > On Wed, Oct 22, 2025 at 6:25 PM Richard Guo <guofenglinux@gmail.com> wrote: > > Attached is a patch along the lines of option #2. The LCOV report > > indicates that there is currently no test coverage for the "else if > > (query->groupingSets)" branch in query_is_distinct_for(). This patch > > also adds test cases to cover that branch. > > Here is an updated patch that includes a commit message and adds a new > test case involving DISTINCT clause used with GROUP BY. I've not reviewed the patch, so I'm assuming this is broken and your fix is correct, but I did see your commit message says: > No backpatch as this could result in plan changes. If this is broken then it'll need to be backpatched as if that function returns true when it should return false, then you could have LEFT JOINs being removed when they shouldn't or joins being marked as "Inner Unique" when they shouldn't, which could result in incorrect query results. David
On Thu, 23 Oct 2025 at 15:59, David Rowley <dgrowleyml@gmail.com> wrote: > > No backpatch as this could result in plan changes. > > If this is broken then it'll need to be backpatched as if that > function returns true when it should return false, then you could have > LEFT JOINs being removed when they shouldn't or joins being marked as > "Inner Unique" when they shouldn't, which could result in incorrect > query results. Or if it's a case of it returning false when it could have returned true, then maybe the commit message should make that clear. I'm unable to tell from reading it. Something like; "The previous logic in query_is_distinct_for() was incomplete [as it failed to detect that a query was distinct when ...]". David
On Thu, Oct 23, 2025 at 12:07 PM David Rowley <dgrowleyml@gmail.com> wrote: > Or if it's a case of it returning false when it could have returned > true, then maybe the commit message should make that clear. I'm unable > to tell from reading it. Something like; "The previous logic in > query_is_distinct_for() was incomplete [as it failed to detect that a > query was distinct when ...]". It's the case of failing to recognize distinctness when it actually holds (i.e., a false negative). Therefore, this issue does not cause incorrect results, but rather leads to missed optimization opportunities. How about using the following wording in the commit message? " The previous logic in query_is_distinct_for() was incomplete because the check was insufficiently thorough and could return false when it should have returned true. " - Richard
On Thu, 23 Oct 2025 at 16:43, Richard Guo <guofenglinux@gmail.com> wrote: > How about using the following wording in the commit message? > > " > The previous logic in query_is_distinct_for() was incomplete because > the check was insufficiently thorough and could return false when it > should have returned true. > " "it could" might be more accurate. The function is under no obligation to return true for every possible case. David
On Thu, Oct 23, 2025 at 1:33 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Thu, 23 Oct 2025 at 16:43, Richard Guo <guofenglinux@gmail.com> wrote: > > How about using the following wording in the commit message? > > > > " > > The previous logic in query_is_distinct_for() was incomplete because > > the check was insufficiently thorough and could return false when it > > should have returned true. > > " > "it could" might be more accurate. The function is under no obligation > to return true for every possible case. Fair point. Patch updated with a revised commit message. - Richard