Обсуждение: Why does GROUP BY reduce number of rows?
This question is an attempt to understand/question implementation of SQL in PostgreSQL.
While reading PostgreSQL v10 docs (PDF version), I noticed a footnote at end of **2.5. Querying a Table** stating:
> the implementation of DISTINCT automatically orders the rows and so ORDER BY is unnecessary. But this is not required by the SQL standard, and current PostgreSQL does not guarantee that DISTINCT causes the rows to be ordered
Let's call this error/mistake a "UDFS" (Unintended Deviation From Standards). It is possible there are other such errors in PostgreSQL's SQL implementation. (I'm glad PostgreSQL rectified it).
I think: reduction in number of rows in output when `GROUP BY` clause is used is another UDFS.
Does the SQL Standard (can't buy, and I probably won't understand it) explicitly state that **only the final row of an aggregate should be displayed**? Why not first row of an aggregate group, why not n-th row?
While reading PostgreSQL v10 docs (PDF version), I noticed a footnote at end of **2.5. Querying a Table** stating:
> the implementation of DISTINCT automatically orders the rows and so ORDER BY is unnecessary. But this is not required by the SQL standard, and current PostgreSQL does not guarantee that DISTINCT causes the rows to be ordered
Let's call this error/mistake a "UDFS" (Unintended Deviation From Standards). It is possible there are other such errors in PostgreSQL's SQL implementation. (I'm glad PostgreSQL rectified it).
I think: reduction in number of rows in output when `GROUP BY` clause is used is another UDFS.
Does the SQL Standard (can't buy, and I probably won't understand it) explicitly state that **only the final row of an aggregate should be displayed**? Why not first row of an aggregate group, why not n-th row?
What if important (i.e. "uniquely identifying" for "human eyes") data was contained in first row of a group. Why does the underlying program (psql server, I suppose) **decide** that preceding rows are not worth displaying?
I don't mean to question the developers but perhaps there are bigger reasons why things are the way they are.
Cheers.
P.S. I am a SQL novice, this thread lead me to think like this. In it's context: Why do we need to calculate aggregate for each row? Isn't it *wrong*?
>>>>> "Rocky" == Rocky Ji <rockyji3438@gmail.com> writes: Rocky> Let's call this error/mistake a "UDFS" (Unintended Deviation Rocky> From Standards). What deviation? If the standard leaves something unspecified, the PostgreSQL can choose to do as it pleases. (We can also choose to just violate the standard...) Rocky> I think: reduction in number of rows in output when `GROUP BY` Rocky> clause is used is another UDFS. Reducing the number of rows of output is the SOLE PURPOSE AND REASON TO EXIST of the GROUP BY clause. By definition, the clause produces only one output row for each matching group in the input. The rows produced by GROUP BY are also not "the final row" or even "the first row" of the group. They are _new_ rows computed from column expressions which must reference the input in only two ways: - aggregate functions computed over the input group - values which can be proved by the implementation to be guaranteed to be the same in all rows of the input group (that is, the grouping columns themselves and anything the implementation can prove is a functional dependency of the grouping columns). Some other databases may be lax about enforcing these restrictions and allow non-dependent columns to be included in the result, taking their values from some arbitrary row of the input group. PG if anything is _too_ strict about this, since for implementation reasons we can prove functional dependency only in very limited cases and we reject anything else. Rocky> What if important (i.e. "uniquely identifying" for "human eyes") Rocky> data was contained in first row of a group. Why does the Rocky> underlying program (psql server, I suppose) **decide** that Rocky> preceding rows are not worth displaying? The person who wrote GROUP BY in the query made that decision. BTW, if you want to calculate aggregated values across rows without reducing the number of rows of output, there's an entirely separate mechanism for that: window functions. -- Andrew (irc:RhodiumToad)