Обсуждение: [PATCH] GROUP BY ALL
I see that there'd been some chatter but not a lot of discussion about a GROUP BY ALL feature/functionality. There certainly is utility in such a construct IMHO. The grammar is unambiguous, so can support this construct in lieu of the traditional GROUP BY clause. Enclosed is a patch which adds this via just scanning the TargetEntry list and adding anything that is not an aggregate function call to the groupList. Still need some docs; just throwing this out there and getting some feedback. Thanks, David
Вложения
On Mon, Jul 22, 2024 at 1:55 PM David Christensen <david@pgguru.net> wrote:
I see that there'd been some chatter but not a lot of discussion about
a GROUP BY ALL feature/functionality. There certainly is utility in
such a construct IMHO.
Still need some docs; just throwing this out there and getting some feedback.
I strongly dislike adding this feature. I'd only consider supporting it if it was part of the SQL standard.
Code is written once and read many times. This feature caters to the writer, not the reader. And furthermore usage of this is prone to be to the writer's detriment as well.
David J.
On Mon, 22 Jul 2024 at 17:34, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Mon, Jul 22, 2024 at 1:55 PM David Christensen <david@pgguru.net> wrote:I see that there'd been some chatter but not a lot of discussion about
a GROUP BY ALL feature/functionality. There certainly is utility in
such a construct IMHO.
Still need some docs; just throwing this out there and getting some feedback.I strongly dislike adding this feature. I'd only consider supporting it if it was part of the SQL standard.Code is written once and read many times. This feature caters to the writer, not the reader. And furthermore usage of this is prone to be to the writer's detriment as well.
And for when this might be useful, the syntax for it already exists, although a spurious error message is generated:
odyssey=> select (uw_term).*, count(*) from uw_term group by uw_term;
ERROR: column "uw_term.term_id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select (uw_term).*, count(*) from uw_term group by uw_term;
^
ERROR: column "uw_term.term_id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select (uw_term).*, count(*) from uw_term group by uw_term;
^
I'm not sure exactly what's going on here — it's like it's still seeing the table name in the field list as only a table name and not the value corresponding to the whole table as a row value (But in general I'm not happy with the system's ability to figure out that a column's value has only one possibility given the grouping columns). You can work around:
odyssey=> with t as (select uw_term, count(*) from uw_term group by uw_term) select (uw_term).*, count from t;
This query works.
"David G. Johnston" <david.g.johnston@gmail.com> writes: > On Mon, Jul 22, 2024 at 1:55 PM David Christensen <david@pgguru.net> wrote: >> I see that there'd been some chatter but not a lot of discussion about >> a GROUP BY ALL feature/functionality. There certainly is utility in >> such a construct IMHO. > I strongly dislike adding this feature. I'd only consider supporting it if > it was part of the SQL standard. Yeah ... my recollection is that we already rejected this idea. If you want to re-litigate that, "throwing this out there" is not a sufficient argument. (Personally, I'd wonder exactly what ALL is quantified over: the whole output of the FROM clause, or only columns mentioned in the SELECT tlist, or what? And why that choice rather than another?) regards, tom lane
Isaac Morland <isaac.morland@gmail.com> writes: > And for when this might be useful, the syntax for it already exists, > although a spurious error message is generated: > odyssey=> select (uw_term).*, count(*) from uw_term group by uw_term; > ERROR: column "uw_term.term_id" must appear in the GROUP BY clause or be > used in an aggregate function > LINE 1: select (uw_term).*, count(*) from uw_term group by uw_term; > ^ > I'm not sure exactly what's going on here The SELECT entry is expanded into "uw_term.col1, uw_term.col2, uw_term.col3, ...", and those single-column Vars don't match the whole-row Var appearing in the GROUP BY list. I guess if we think this is important, we could add a proof rule saying that a per-column Var is functionally dependent on a whole-row Var of the same relation. Odd that the point hasn't come up before (though I guess that suggests that few people try this). regards, tom lane
On Mon, Jul 22, 2024 at 4:34 PM David G. Johnston <david.g.johnston@gmail.com> wrote: > > On Mon, Jul 22, 2024 at 1:55 PM David Christensen <david@pgguru.net> wrote: >> >> I see that there'd been some chatter but not a lot of discussion about >> a GROUP BY ALL feature/functionality. There certainly is utility in >> such a construct IMHO. >> >> Still need some docs; just throwing this out there and getting some feedback. >> > > I strongly dislike adding this feature. I'd only consider supporting it if it was part of the SQL standard. > > Code is written once and read many times. This feature caters to the writer, not the reader. And furthermore usage ofthis is prone to be to the writer's detriment as well. I'd say this feature (at least for me) caters to the investigator; someone who is interactively looking at data hence why it would cater to the writer. Consider acquainting yourself with a large table that has a large number of annoying-named fields where you want to look at how different data is correlated or broken-down. Something along the lines of: SELECT last_name, substring(first_name,1,1) as first_initial, income_range, count(*) FROM census_data GROUP BY ALL; If you are iteratively looking at things, adding or removing fields from your breakdown, you only need to change it in a single place, the tlist. Additionally, expressions can be used transparently without needing to repeat them. (Yes, in practice, I'd often use GROUP BY 1, 2, say, but if you add more fields to this you need to edit in multiple places.) David
On Mon, Jul 22, 2024 at 5:29 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > "David G. Johnston" <david.g.johnston@gmail.com> writes: > > On Mon, Jul 22, 2024 at 1:55 PM David Christensen <david@pgguru.net> wrote: > >> I see that there'd been some chatter but not a lot of discussion about > >> a GROUP BY ALL feature/functionality. There certainly is utility in > >> such a construct IMHO. > > > I strongly dislike adding this feature. I'd only consider supporting it if > > it was part of the SQL standard. > > Yeah ... my recollection is that we already rejected this idea. > If you want to re-litigate that, "throwing this out there" is > not a sufficient argument. Heh, fair enough. I actually wrote the patch after encountering the syntax in DuckDB and it seemed easy enough to add to Postgres while providing some utility, then ended up seeing a thread about it later. I did not get the sense that this had been globally rejected; though there were definitely voices against it, it seemed to trail off rather than coming to a conclusion. > (Personally, I'd wonder exactly what ALL is quantified over: the > whole output of the FROM clause, or only columns mentioned in the > SELECT tlist, or what? And why that choice rather than another?) My intention here was to basically be a shorthand for "group by specified non-aggregate fields in the select list". Perhaps I'm not being creative enough, but what is the interpretation/use case for anything else? :-) While there are other ways to accomplish these things, making an easy way to GROUP BY with aggregate queries would be useful in the field, particularly when doing iterative discovery work would save a lot of time with a situation that is both detectable and hits users with errors all the time. I'm not married to the exact syntax of this feature; anything else short and consistent could work if `ALL` is considered to potentially gain a different interpretation in the future. David
On Tue, Jul 23, 2024 at 8:21 AM Andrei Borodin <x4mmm@yandex-team.ru> wrote: > > On 23 Jul 2024, at 00:40, Isaac Morland <isaac.morland@gmail.com> wrote: > > odyssey=> select (uw_term).*, count(*) from uw_term group by uw_term; > ERROR: column "uw_term.term_id" must appear in the GROUP BY clause or be used in an aggregate function > LINE 1: select (uw_term).*, count(*) from uw_term group by uw_term; > > > AFAIR this problem was solved in my implementation [0] > > On 23 Jul 2024, at 01:29, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > (Personally, I'd wonder exactly what ALL is quantified over: the > whole output of the FROM clause, or only columns mentioned in the > SELECT tlist, or what? And why that choice rather than another?) > > > I'd like to have GROUP BY AUTO (I also proposed version GROUP BY SURPRISE ME). But I wouldn't like to open pandora boxof syntax sugar extensions which may will be incompatible with future standards. > If we could have extensible grammar - I'd be happy to have a lot of such enhancements. My top 2 are FROM table SELECT columnand better GROUP BY. GROUP BY AUTO also seems fine here to me; I understand the desire to avoid major incompatible syntax changes; GROUP BY ALL does exist in multiple products so it's not unprecedented. I wrote my patch before seeing your thread, sorry I missed that. :-) David
On Mon, Jul 22, 2024 at 4:41 PM Isaac Morland <isaac.morland@gmail.com> wrote: > And for when this might be useful, the syntax for it already exists, although a spurious error message is generated: > > odyssey=> select (uw_term).*, count(*) from uw_term group by uw_term; > ERROR: column "uw_term.term_id" must appear in the GROUP BY clause or be used in an aggregate function > LINE 1: select (uw_term).*, count(*) from uw_term group by uw_term; > ^ This is with my patch, or existing postgres? Grouping by record is not actually what this patch is trying to do, so perhaps there is some ambiguity; this is intended to GROUP BY any select item that is not an aggregate function.
On Tue, 2024-07-23 at 08:37 -0500, David Christensen wrote: > My intention here was to basically be a shorthand for "group by > specified non-aggregate fields in the select list". Perhaps I'm not > being creative enough, but what is the interpretation/use case for > anything else? :-) I am somewhat against this feature. It is too much magic for my taste. It might be handy for interactive use, but I would frown at an application that uses code like that, much like I'd frown at "SELECT *" in application code. Yours, Laurenz Albe
On Tue, Jul 23, 2024 at 10:57 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote: > > On Tue, 2024-07-23 at 08:37 -0500, David Christensen wrote: > > My intention here was to basically be a shorthand for "group by > > specified non-aggregate fields in the select list". Perhaps I'm not > > being creative enough, but what is the interpretation/use case for > > anything else? :-) > > I am somewhat against this feature. > It is too much magic for my taste. > > It might be handy for interactive use, but I would frown at an application > that uses code like that, much like I'd frown at "SELECT *" in application code. Sure, not everything that makes things easier is strictly necessary; we could require `CAST(field AS text)` instead of `::text`, make subqueries required for transforming oids into specific system tables instead of `::regfoo` casts, any number of other choices, remove `SELECT *` as a parse option, but making it easier to do common things interactively as a DBA has value as well. David
On Tue, Jul 23, 2024 at 9:48 AM David Christensen <david@pgguru.net> wrote:
Sure, not everything that makes things easier is strictly necessary;
we could require `CAST(field AS text)` instead of `::text`,
Probably should have...being standard and all. Though syntactic sugar is quite different from new behavior - transforming :: to CAST is straight-forward.
make
subqueries required for transforming oids into specific system tables
instead of `::regfoo` casts,
Since OID is non-standard this falls within our purview.
any number of other choices, remove
`SELECT *` as a parse option,
Again, standard dictated.
but making it easier to do common things
interactively as a DBA has value as well.
Agreed, but this isn't a clear-cut win, and doesn't have standard conformance to tip the scale over fully.
Also, there are so many better tools for data exploration. Removing this quirk only marginally closes that gap.
David J.
On 7/22/24 15:43, Tom Lane wrote: > Isaac Morland <isaac.morland@gmail.com> writes: >> And for when this might be useful, the syntax for it already exists, >> although a spurious error message is generated: > >> odyssey=> select (uw_term).*, count(*) from uw_term group by uw_term; >> ERROR: column "uw_term.term_id" must appear in the GROUP BY clause or be >> used in an aggregate function >> LINE 1: select (uw_term).*, count(*) from uw_term group by uw_term; >> ^ > >> I'm not sure exactly what's going on here > > The SELECT entry is expanded into "uw_term.col1, uw_term.col2, > uw_term.col3, ...", and those single-column Vars don't match the > whole-row Var appearing in the GROUP BY list. I guess if we > think this is important, we could add a proof rule saying that > a per-column Var is functionally dependent on a whole-row Var > of the same relation. Odd that the point hasn't come up before > (though I guess that suggests that few people try this). I was just using this group-by-row feature last week to implement a temporal outer join in a way that would work for arbitrary tables. Here is some example SQL: https://github.com/pjungwir/temporal_ops/blob/b10d65323749faa6c47956db2e8f95441e508fce/sql/outer_join.sql#L48-L66 That does `GROUP BY a` then `SELECT (x.a).*`.[1] It is very useful for writing queries that don't want to know about the structure of the row. I noticed the same error as Isaac. I worked around the problem by wrapping it in a subquery and decomposing the row outside. It's already an obscure feature, and an easy workaround might be why you haven't heard complaints before. I wouldn't mind writing a patch for that rule when I get a chance (if no one else gets to it first.) [1] Actually I see it does `GROUP BY a, a.valid_at`, but that is surely more than I need. I think that `a.valid_at` is leftover from a previous version of the query. Yours, -- Paul ~{:-) pj@illuminatedcomputing.com
On 23.07.24 00:29, Tom Lane wrote: > (Personally, I'd wonder exactly what ALL is quantified over: the > whole output of the FROM clause, or only columns mentioned in the > SELECT tlist, or what? And why that choice rather than another?) Looks like the main existing implementations take it to mean all entries in the SELECT list that are not aggregate functions. https://duckdb.org/docs/sql/query_syntax/groupby.html#group-by-all https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-qry-select-groupby.html#parameters https://docs.snowflake.com/en/sql-reference/constructs/group-by#parameters
On Mon, 22 Jul 2024 at 22:55, David Christensen <david@pgguru.net> wrote: > I see that there'd been some chatter but not a lot of discussion about > a GROUP BY ALL feature/functionality. There certainly is utility in > such a construct IMHO. +1 from me. When exploring data, this is extremely useful because you don't have to update the GROUP BY clause every time Regarding the arguments against this: 1. I don't think this is any more unreadable than being able to GROUP BY 1, 2, 3. Or being able to use column aliases from the SELECT in the GROUP BY clause. Again this is already allowed. Personally I actually think it's more readable than specifying e.g. 5 columns in the group by, because then I have to cross-reference with columns in the SELECT clause to find out if they are the same. With ALL I instantly know it's grouped by all 2. This is indeed not part of the standard. But we have many things that are not part of the standard. I think as long as we use the same syntax as snowflake, databricks and duckdb I personally don't see a big problem. Then we could try and make this be part of the standard in the next version of the standard.
On Tue, 23 Jul 2024 at 15:22, Andrei Borodin <x4mmm@yandex-team.ru> wrote: > I'd like to have GROUP BY AUTO (I also proposed version GROUP BY SURPRISE ME). But I wouldn't like to open pandora boxof syntax sugar extensions which may will be incompatible with future standards. > If we could have extensible grammar - I'd be happy to have a lot of such enhancements. My top 2 are FROM table SELECT columnand better GROUP BY. Personally my number one enhancement would be allowing a trailing comma after the last column in the SELECT clause.
Hi
st 24. 7. 2024 v 10:57 odesílatel Jelte Fennema-Nio <postgres@jeltef.nl> napsal:
On Mon, 22 Jul 2024 at 22:55, David Christensen <david@pgguru.net> wrote:
> I see that there'd been some chatter but not a lot of discussion about
> a GROUP BY ALL feature/functionality. There certainly is utility in
> such a construct IMHO.
+1 from me. When exploring data, this is extremely useful because you
don't have to update the GROUP BY clause every time
Regarding the arguments against this:
1. I don't think this is any more unreadable than being able to GROUP
BY 1, 2, 3. Or being able to use column aliases from the SELECT in the
GROUP BY clause. Again this is already allowed. Personally I actually
think it's more readable than specifying e.g. 5 columns in the group
by, because then I have to cross-reference with columns in the SELECT
clause to find out if they are the same. With ALL I instantly know
it's grouped by all
2. This is indeed not part of the standard. But we have many things
that are not part of the standard. I think as long as we use the same
syntax as snowflake, databricks and duckdb I personally don't see a
big problem. Then we could try and make this be part of the standard
in the next version of the standard.
Aggregation against more columns is pretty slow and memory expensive in Postgres.
DuckDB is an analytic database with different storage, different executors. I like it very much, but I am not sure if I want to see these features in Postgres.
Lot of developers are not very smart, and with proposed feature, then instead to write correct and effective query, then they use just GROUP BY ALL. Slow query should look like a slow query :-)
Regards
Pavel
> On 24 Jul 2024, at 13:58, Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > > On Tue, 23 Jul 2024 at 15:22, Andrei Borodin <x4mmm@yandex-team.ru> wrote: >> I'd like to have GROUP BY AUTO (I also proposed version GROUP BY SURPRISE ME). But I wouldn't like to open pandora boxof syntax sugar extensions which may will be incompatible with future standards. >> If we could have extensible grammar - I'd be happy to have a lot of such enhancements. My top 2 are FROM table SELECTcolumn and better GROUP BY. > > Personally my number one enhancement would be allowing a trailing > comma after the last column in the SELECT clause. Yes, trailing comma sounds great too. One more similar syntax sugar I can think of. I see lots of queries like SELECT somtheing FROM table1 WHERE 1=1 and id = x --and col1 = val1 and col2 = val2 I was wondering where does that "1=1" comes from. It's because developer comment condition one by one like "--, col1 = val1".And they do not want to cope with and\or continuation. Best regards, Andrey Borodin. PS. Seems like Mail.App mangled my previous message despite using plain text. It's totally broken in archives...
On Tue, Jul 23, 2024 at 6:53 PM David Christensen <david@pgguru.net> wrote: > > On Mon, Jul 22, 2024 at 4:34 PM David G. Johnston > <david.g.johnston@gmail.com> wrote: > > > > On Mon, Jul 22, 2024 at 1:55 PM David Christensen <david@pgguru.net> wrote: > >> > >> I see that there'd been some chatter but not a lot of discussion about > >> a GROUP BY ALL feature/functionality. There certainly is utility in > >> such a construct IMHO. > >> > >> Still need some docs; just throwing this out there and getting some feedback. > >> > > > > I strongly dislike adding this feature. I'd only consider supporting it if it was part of the SQL standard. > > > > Code is written once and read many times. This feature caters to the writer, not the reader. And furthermore usageof this is prone to be to the writer's detriment as well. > > I'd say this feature (at least for me) caters to the investigator; > someone who is interactively looking at data hence why it would cater > to the writer. Consider acquainting yourself with a large table that > has a large number of annoying-named fields where you want to look at > how different data is correlated or broken-down. Something along the > lines of: To me this looks like a feature that a data exploration tool may implement instead of being part of the server. It would then provide more statistics about each correlation/column set etc. -- Best Wishes, Ashutosh Bapat