Обсуждение: Warts with SELECT DISTINCT
Normally Postgres extends SQL to allow ORDER BY to include arbitrary expressions not in the select list. However this doesn't seem to work with SELECT DISTINCT. stark=> \d test Table "public.test" Column | Type | Modifiers --------+------+----------- col1 | text | stark=> select distinct col1 from test order by upper(col1); ERROR: for SELECT DISTINCT, ORDER BY expressions must appearin select list It seems like as long as the expressions involve only columns or expressions present in the SELECT DISTINCT list and as long as those functions are stable or immutable then this shouldn't be a problem. Just prepend those expressions to the select list to use as the sort key. In fact the equivalent GROUP BY query does work as expected: stark=> select col1 from test group by col1 order by upper(col1);col1 ------acx (3 rows) Though it's optimized poorly and does a superfluous sort step: stark=> explain select col1 from test group by col1 order by upper(col1); QUERY PLAN ---------------------------------------------------------------------------Sort (cost=99.72..100.22 rows=200 width=32) Sort Key: upper(col1) -> Group (cost=85.43..92.08 rows=200 width=32) -> Sort (cost=85.43..88.50 rows=1230 width=32) Sort Key: col1 -> Seq Scan on test (cost=0.00..22.30 rows=1230 width=32) (6 rows) Whereas it shouldn't be hard to prove that this is equivalent: stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1); QUERY PLAN ---------------------------------------------------------------------Group (cost=88.50..98.23 rows=200 width=32) -> Sort (cost=88.50..91.58 rows=1230 width=32) Sort Key: upper(col1), col1 -> Seq Scan on test (cost=0.00..25.38rows=1230 width=32) (4 rows) My understanding is that the DISTINCT and DISTINCT ON code path is old and grotty. Perhaps it's time to remove those code paths, and replace them with a transformation that creates the equivalent GROUP BY query and then optimize that path until it can produce plans as good as DISTINCT and DISTINCT ON ever did. -- greg
On Wed, May 03, 2006 at 17:58:07 -0400, Greg Stark <gsstark@mit.edu> wrote: > > Though it's optimized poorly and does a superfluous sort step: > > stark=> explain select col1 from test group by col1 order by upper(col1); > QUERY PLAN > --------------------------------------------------------------------------- > Sort (cost=99.72..100.22 rows=200 width=32) > Sort Key: upper(col1) > -> Group (cost=85.43..92.08 rows=200 width=32) > -> Sort (cost=85.43..88.50 rows=1230 width=32) > Sort Key: col1 > -> Seq Scan on test (cost=0.00..22.30 rows=1230 width=32) > (6 rows) > > > Whereas it shouldn't be hard to prove that this is equivalent: > > stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1); > QUERY PLAN > --------------------------------------------------------------------- > Group (cost=88.50..98.23 rows=200 width=32) > -> Sort (cost=88.50..91.58 rows=1230 width=32) > Sort Key: upper(col1), col1 > -> Seq Scan on test (cost=0.00..25.38 rows=1230 width=32) > (4 rows) I don't think you can assume that that will be true for any locale. If there are two different characters that both have the same uppercase version, this will break things. And while you would expect that x = y => upper(x) = upper(y) I am not sure that is guarenteed for locales. I can imagine having two different characters that are treated the same for ordering purposes, but have uppercase versions that are considered different for ordering purposes.
Bruno Wolff III <bruno@wolff.to> writes: > > Whereas it shouldn't be hard to prove that this is equivalent: > > > > stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1); > > QUERY PLAN > > --------------------------------------------------------------------- > > Group (cost=88.50..98.23 rows=200 width=32) > > -> Sort (cost=88.50..91.58 rows=1230 width=32) > > Sort Key: upper(col1), col1 > > -> Seq Scan on test (cost=0.00..25.38 rows=1230 width=32) > > (4 rows) > > I don't think you can assume that that will be true for any locale. If there > are two different characters that both have the same uppercase version, this > will break things. No it won't. -- greg
On Thu, May 04, 2006 at 00:05:16 -0400, Greg Stark <gsstark@mit.edu> wrote: > Bruno Wolff III <bruno@wolff.to> writes: > > > > Whereas it shouldn't be hard to prove that this is equivalent: > > > > > > stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1); > > > QUERY PLAN > > > --------------------------------------------------------------------- > > > Group (cost=88.50..98.23 rows=200 width=32) > > > -> Sort (cost=88.50..91.58 rows=1230 width=32) > > > Sort Key: upper(col1), col1 > > > -> Seq Scan on test (cost=0.00..25.38 rows=1230 width=32) > > > (4 rows) > > > > I don't think you can assume that that will be true for any locale. If there > > are two different characters that both have the same uppercase version, this > > will break things. > > No it won't. Sure it will, because when you do the group by you will get a different number of groups. When grouping by the original characters you will get separate groups for characters that have the same uppercase character, where as when grouing by the uppercased characters you won't.
Bruno Wolff III <bruno@wolff.to> writes: > Greg Stark <gsstark@mit.edu> wrote [ baldly summarized ] >> [ x > y implies upper(x) > upper(y) ] > I don't think you can assume that that will be true for any locale. Whether or not that may actually be true for upper() (I share Bruno's skepticism, but maybe it's so), it really does not matter because the planner doesn't have enough knowledge about the behavior of upper() to make such an inference. I think it's a fair point that we could allow "SELECT DISTINCT x ORDER BY foo(x)" if foo() is stable, but that does not imply that sorting by x is interchangeable with sorting by foo(x). foo = abs is a trivial counterexample. As far as the original point goes: feel free to reimplement DISTINCT, but don't break the documented behavior of DISTINCT ON + ORDER BY, or you'll have a lot of unhappy villagers appear on your doorstep bearing torches and pitchforks. It might be mostly an implementation artifact, but it's an awfully useful one ... regards, tom lane
Bruno Wolff III <bruno@wolff.to> writes: > On Thu, May 04, 2006 at 00:05:16 -0400, > Greg Stark <gsstark@mit.edu> wrote: > > Bruno Wolff III <bruno@wolff.to> writes: > > > > > > Whereas it shouldn't be hard to prove that this is equivalent: > > > > > > > > stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1); > > > > QUERY PLAN > > > > --------------------------------------------------------------------- > > > > Group (cost=88.50..98.23 rows=200 width=32) > > > > -> Sort (cost=88.50..91.58 rows=1230 width=32) > > > > Sort Key: upper(col1), col1 > > > > -> Seq Scan on test (cost=0.00..25.38 rows=1230 width=32) > > > > (4 rows) > > > > > > I don't think you can assume that that will be true for any locale. If there > > > are two different characters that both have the same uppercase version, this > > > will break things. > > > > No it won't. > > Sure it will, because when you do the group by you will get a different > number of groups. When grouping by the original characters you will get > separate groups for characters that have the same uppercase character, where > as when grouing by the uppercased characters you won't. But grouping on *both* will produce the same groups as grouping on the original characters alone. -- greg
On Thu, May 04, 2006 at 01:32:45 -0400, Greg Stark <gsstark@mit.edu> wrote: > Bruno Wolff III <bruno@wolff.to> writes: > > > On Thu, May 04, 2006 at 00:05:16 -0400, > > Greg Stark <gsstark@mit.edu> wrote: > > > Bruno Wolff III <bruno@wolff.to> writes: > > > > > > > > Whereas it shouldn't be hard to prove that this is equivalent: > > > > > > > > > > stark=> explain select col1 from test group by upper(col1),col1 order by upper(col1); > > > > > QUERY PLAN > > > > > --------------------------------------------------------------------- > > > > > Group (cost=88.50..98.23 rows=200 width=32) > > > > > -> Sort (cost=88.50..91.58 rows=1230 width=32) > > > > > Sort Key: upper(col1), col1 > > > > > -> Seq Scan on test (cost=0.00..25.38 rows=1230 width=32) > > > > > (4 rows) > > > > > > > > I don't think you can assume that that will be true for any locale. If there > > > > are two different characters that both have the same uppercase version, this > > > > will break things. > > > > > > No it won't. > > > > Sure it will, because when you do the group by you will get a different > > number of groups. When grouping by the original characters you will get > > separate groups for characters that have the same uppercase character, where > > as when grouing by the uppercased characters you won't. > > But grouping on *both* will produce the same groups as grouping on the > original characters alone. OK, I misssed that. My brain only saw upper(col) and not the immediately following ,col1. I aggree that grouping by col1 and upper(col1), col1 will give you the same groups. And hence the queries should be equivalent.
On Thu, May 04, 2006 at 01:13:20 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I think it's a fair point that we could allow "SELECT DISTINCT x ORDER BY > foo(x)" if foo() is stable, but that does not imply that sorting by x is > interchangeable with sorting by foo(x). foo = abs is a trivial > counterexample. I misunderstood Greg's example. Sorting by (foo(x), x) is a suitable replacement for sorting by foo(x). So that it would be OK to rewrite SELECT DISTINCT x ORDER BY foo(x) as SELECT DISTINCT ON (foo(x), x) x ORDER BY foo(x) Whether or not this is worthwhile to automate, I am not in a good position to judge.
Bruno Wolff III <bruno@wolff.to> writes: > ... it would be OK to rewrite > SELECT DISTINCT x ORDER BY foo(x) > as > SELECT DISTINCT ON (foo(x), x) x ORDER BY foo(x) This assumes that x = y implies foo(x) = foo(y), which is something that's not necessarily the case, mainly because a datatype's "=" function need not have a lot to do with the behavior of arbitrary functions foo(), especially if foo() yields a different datatype. The citext datatype is an easy counterexample: it thinks "foo" = "Foo", but md5() of those values will not yield the same answers. The bottom line here is that this sort of deduction requires more understanding of the properties of datatypes and functions than our existing catalogs allow the planner to obtain. regards, tom lane
On Thu, May 04, 2006 at 02:39:33 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Bruno Wolff III <bruno@wolff.to> writes: > > ... it would be OK to rewrite > > SELECT DISTINCT x ORDER BY foo(x) > > as > > SELECT DISTINCT ON (foo(x), x) x ORDER BY foo(x) > > This assumes that x = y implies foo(x) = foo(y), which is something > that's not necessarily the case, mainly because a datatype's "=" > function need not have a lot to do with the behavior of arbitrary > functions foo(), especially if foo() yields a different datatype. > The citext datatype is an easy counterexample: it thinks "foo" = "Foo", > but md5() of those values will not yield the same answers. > > The bottom line here is that this sort of deduction requires more > understanding of the properties of datatypes and functions than > our existing catalogs allow the planner to obtain. Thanks for pointing that out. I should have realized that this was the same (or at least close to) issue I was thinking would be a problem initially, but then I started thinking that '=' promised more than it did and assumed that x = y implies foo(x) = foo(y), which as you point out isn't always true.
Bruno Wolff III <bruno@wolff.to> writes: > Thanks for pointing that out. I should have realized that this was the same > (or at least close to) issue I was thinking would be a problem initially, but > then I started thinking that '=' promised more than it did and assumed that > x = y implies foo(x) = foo(y), which as you point out isn't always true. Hm. This goes back to the earlier conversation about whether = should ever be true for two objects that aren't, well, equal. I thought there was some consensus at the time that sorting should impose a superficial ordering on items that compare equal but aren't in fact the same. -- greg
Greg Stark <gsstark@mit.edu> writes: > Hm. This goes back to the earlier conversation about whether = should ever be > true for two objects that aren't, well, equal. I thought there was some > consensus at the time that sorting should impose a superficial ordering on > items that compare equal but aren't in fact the same. We forced that recently for text strings (overriding locale-specific cases in strcoll()), but there certainly was not any intent to decree that every datatype must do likewise. regards, tom lane