Обсуждение: SELECT DISTINCT ON ... ORDER BY ...
I have the following problem: Assuming the table TEST as follows: ID NAME ----------------- 1 Alex 2 Oliver 1 Thomas 2 Fenella SELECT DISTINCT ON id id, name FROM test; produces: ID NAME ----------------- 1 Alex 2 Oliver SELECT DISTINCT ON id, name FROM test ORDER BY name; produces: ID NAME ----------------- 1 Alex 2 Fenella 1 Thomas I would have expected only two rows in both queries. I don't care which names actually appear in the output as long as they are sorted, but there should no longer be duplicate id's. If the table is as follows: ID NAME ----------------- 2 Oliver 2 Alex 1 Thomas 1 Fenella SELECT DISTINCT ON id id, name FROM test; produces: ID NAME ----------------- 1 Thomas 2 Oliver SELECT DISTINCT ON id, name FROM test ORDER BY name; produces: ID NAME ----------------- 2 Alex 1 Fenella 2 Oliver 1 Thomas What seems to happen is that from the sorted table, duplicate id's are eliminated only if they are grouped. If there is no explicit sorting I assume the DISTINCT performs an implicit sorting on id and then eliminates records correctly. Is that the correct behaviour? Is there another (simple) way of getting the results I want? I am still using version 6.3 tm -- Thomas Metz GSF - National Research Center for Environment and Health Institute of Mammalian Genetics
I have the same on 6.4.2 Looks like a bug Sincerely, Sergei On Thu, 28 Jan 1999, Thomas Metz wrote: > I have the following problem: > > Assuming the table TEST as follows: > > ID NAME > ----------------- > 1 Alex > 2 Oliver > 1 Thomas > 2 Fenella > > > SELECT DISTINCT ON id id, name FROM test; > produces: > > ID NAME > ----------------- > 1 Alex > 2 Oliver > > > SELECT DISTINCT ON id, name FROM test ORDER BY name; > produces: > > ID NAME > ----------------- > 1 Alex > 2 Fenella > 1 Thomas > > I would have expected only two rows in both queries. I don't care which > names actually appear in the output as long as they are sorted, but > there should no longer be duplicate id's. > > If the table is as follows: > > ID NAME > ----------------- > 2 Oliver > 2 Alex > 1 Thomas > 1 Fenella > > > SELECT DISTINCT ON id id, name FROM test; > produces: > > ID NAME > ----------------- > 1 Thomas > 2 Oliver > > > SELECT DISTINCT ON id, name FROM test ORDER BY name; > produces: > > ID NAME > ----------------- > 2 Alex > 1 Fenella > 2 Oliver > 1 Thomas > > What seems to happen is that from the sorted table, duplicate id's are > eliminated only if they are grouped. If there is no explicit sorting I > assume the DISTINCT performs an implicit sorting on id and then > eliminates records correctly. Is that the correct behaviour? Is there > another (simple) way of getting the results I want? > > I am still using version 6.3 > > tm > -- > Thomas Metz > GSF - National Research Center for Environment and Health > Institute of Mammalian Genetics > >
Thomas Metz wrote: >I have the following problem: > >Assuming the table TEST as follows: > >ID NAME >----------------- >1 Alex >2 Oliver >1 Thomas >2 Fenella ... >SELECT DISTINCT ON id, name FROM test ORDER BY name; >produces: > >ID NAME >----------------- >1 Alex >2 Fenella >1 Thomas > >I would have expected only two rows in both queries. I don't care which >names actually appear in the output as long as they are sorted, but >there should no longer be duplicate id's. I looked at the documentation for SELECT and found that DISTINCT ON is mentioned but not explained; there are also some other parts of that documentation that need expanding or correcting, so I'm doing that at the moment. Is there any intended difference between DISTINCT ON and GROUP BY? (I realise from this thread that there is an actual difference!) Am I correct in saying that DISTINCT ON is not part of SQL92? -- Oliver Elphick Oliver.Elphick@lfix.co.uk Isle of Wight http://www.lfix.co.uk/oliver PGP key from public servers; key ID 32B8FAA1 ======================================== "Many are the afflictions of the righteous; but the LORD delivereth him out of them all." Psalm 34:19
Thomas Metz <tmetz@gsf.de> writes:
> SELECT DISTINCT ON id id, name FROM test ORDER BY name;
> [doesn't work as expected]
There have been related discussions before on pg-hackers mail list;
you might care to check the list archives.  The conclusion I recall
is that it's not real clear how the combination of SELECT DISTINCT
on one column and ORDER BY on another *should* work.  Postgres'
current behavior is clearly wrong IMHO, but there isn't a unique
definition of right behavior, because it's not clear which tuples
should get selected for the sort.
This "SELECT DISTINCT ON attribute" option strikes me as even more
bogus.  Where did we get that from --- is it in the SQL92 standard?
If you SELECT DISTINCT on a subset of the attributes to be returned,
then there's no unique definition of which values get returned in the
other columns.  In Thomas' example:
> Assuming the table TEST as follows:
> ID     NAME
> - -----------------
> 1      Alex
> 2      Oliver
> 1      Thomas
> 2      Fenella
> SELECT DISTINCT ON id id, name FROM test;
> produces:
> ID     NAME
> - -----------------
> 1      Alex
> 2      Oliver
There's no justifiable reason for preferring this output over
    1      Thomas
    2      Oliver
or
    1      Alex
    2      Fenella
or
    1      Thomas
    2      Fenella
Any of these are "DISTINCT ON id", but it's purely a matter of
happenstance table order and unspecified implementation choices which
one will appear.  Do we really have (or want) a statement with
inherently undefined behavior?
Anyway, to answer Thomas' question, the way SELECT DISTINCT is
implemented is that first there's a sort on the DISTINCT columns,
then there's a pass that eliminates adjacent duplicates (like the Unix
uniq(1) program).  In the current backend, doing an ORDER BY on another
column overrides the sorting on the DISTINCT columns, so when the
duplicate-eliminator runs it will fail to get rid of duplicates that
don't happen to appear consecutively in its input.  That's pretty
broken, but then the entire concept of combining these two options
doesn't seem well defined; the SELECT DISTINCT doesn't make any promises
about which tuples (with the same DISTINCT columns) it's going to pick,
therefore the result of ordering by some other column isn't clear.
If you're willing to live with poorly defined behavior, the fix
is fairly obvious: run the sort and uniq passes for the DISTINCT
columns, *then* run the sort on the ORDER BY columns --- which
will use whichever tuple the DISTINCT phase selected at random
out of each set with the same DISTINCT value.
I think the issue got put on the back burner last time in hopes that
some definition with consistent behavior would come up, but I haven't
seen any hope that there is one.
            regards, tom lane
			
		Tom Lane wrote: >Thomas Metz <tmetz@gsf.de> writes: >> SELECT DISTINCT ON id id, name FROM test ORDER BY name; >> [doesn't work as expected] > >There have been related discussions before on pg-hackers mail list; >you might care to check the list archives. The conclusion I recall >is that it's not real clear how the combination of SELECT DISTINCT >on one column and ORDER BY on another *should* work. Postgres' >current behavior is clearly wrong IMHO, but there isn't a unique >definition of right behavior, because it's not clear which tuples >should get selected for the sort. > >This "SELECT DISTINCT ON attribute" option strikes me as even more >bogus. Where did we get that from --- is it in the SQL92 standard? I looked through the standard yesterday and couldn't find it. It doesn't seem to be a useful extension, since it does nothing that you can't do with GROUP BY and seems much less well defined. For the moment I have added a brief description to the reference documentation for SELECT. >If you SELECT DISTINCT on a subset of the attributes to be returned, >then there's no unique definition of which values get returned in the >other columns. In Thomas' example: ... >Any of these are "DISTINCT ON id", but it's purely a matter of >happenstance table order and unspecified implementation choices which >one will appear. Do we really have (or want) a statement with >inherently undefined behavior? We have it; I suggest we don't want it! -- Oliver Elphick Oliver.Elphick@lfix.co.uk Isle of Wight http://www.lfix.co.uk/oliver PGP key from public servers; key ID 32B8FAA1 ======================================== "My son, if sinners entice thee, consent thou not." Proverbs 1:10
Oliver Elphick ha scritto:
Tom Lane wrote:Yes, seems that SELECT DISTINCT ON is not part of SQL92 but it is very interesting and I think it is something missing to Standard.
>Thomas Metz <tmetz@gsf.de> writes:
>> SELECT DISTINCT ON id id, name FROM test ORDER BY name;
>> [doesn't work as expected]
>
>There have been related discussions before on pg-hackers mail list;
>you might care to check the list archives. The conclusion I recall
>is that it's not real clear how the combination of SELECT DISTINCT
>on one column and ORDER BY on another *should* work. Postgres'
>current behavior is clearly wrong IMHO, but there isn't a unique
>definition of right behavior, because it's not clear which tuples
>should get selected for the sort.
>
>This "SELECT DISTINCT ON attribute" option strikes me as even more
>bogus. Where did we get that from --- is it in the SQL92 standard?I looked through the standard yesterday and couldn't find it. It doesn't
seem to be a useful extension, since it does nothing that you can't do
with GROUP BY and seems much less well defined. For the moment I have
added a brief description to the reference documentation for SELECT.>If you SELECT DISTINCT on a subset of the attributes to be returned,
>then there's no unique definition of which values get returned in the
>other columns. In Thomas' example:
...
>Any of these are "DISTINCT ON id", but it's purely a matter of
>happenstance table order and unspecified implementation choices which
>one will appear. Do we really have (or want) a statement with
>inherently undefined behavior?We have it; I suggest we don't want it!
I don't know how to do the following, if we take off DISTINCT ON from PostgreSQL:
db=> select distinct cognome, nome,via from membri where cap = '41010';
cognome|nome            |via
-------+----------------+--------------------------
FIORANI|ELISABETTA      |VIA PRETI PARTIGIANI, 63
FIORANI|GASTONE         |VIA PRETI PARTIGIANI, 63
FIORANI|MATTIA          |VIA PRETI PARTIGIANI, 63
FIORANI|SIMONE          |VIA PRETI PARTIGIANI, 63
GOZZI  |LILIANA         |VIA MAGNAGHI, 39
GOZZI  |MATTEO          |VIA MAGNAGHI, 39
RUSSO  |DAVIDE          |STRADA CORLETTO SUD, 194/1
RUSSO  |ELENA TERESA    |STRADA CORLETTO SUD, 194/1
RUSSO  |FORTUNATO       |STRADA CORLETTO SUD, 194/1
RUSSO  |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
(10 rows)
db=> select distinct on cognome cognome, nome,via from membri where cap = '41010';
cognome|nome            |via
-------+----------------+--------------------------
FIORANI|GASTONE         |VIA PRETI PARTIGIANI, 63
GOZZI  |LILIANA         |VIA MAGNAGHI, 39
RUSSO  |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
(3 rows)
 
			
		
"jose' soares" wrote: >Yes, seems that SELECT DISTINCT ON is not part of SQL92 but it is very >interesting and I think it is something missing to Standard. >I don't know how to do the following, if we take off DISTINCT ON from >PostgreSQL: > >db=> select distinct cognome, nome,via from membri where cap = '41010'; >cognome|nome |via >-------+----------------+-------------------------- >FIORANI|ELISABETTA |VIA PRETI PARTIGIANI, 63 >FIORANI|GASTONE |VIA PRETI PARTIGIANI, 63 >FIORANI|MATTIA |VIA PRETI PARTIGIANI, 63 >FIORANI|SIMONE |VIA PRETI PARTIGIANI, 63 >GOZZI |LILIANA |VIA MAGNAGHI, 39 >GOZZI |MATTEO |VIA MAGNAGHI, 39 >RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1 >RUSSO |ELENA TERESA |STRADA CORLETTO SUD, 194/1 >RUSSO |FORTUNATO |STRADA CORLETTO SUD, 194/1 >RUSSO |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1 >(10 rows) > >db=> select distinct on cognome cognome, nome,via from membri where cap = >'41010'; >cognome|nome |via >-------+----------------+-------------------------- >FIORANI|GASTONE |VIA PRETI PARTIGIANI, 63 >GOZZI |LILIANA |VIA MAGNAGHI, 39 >RUSSO |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1 >(3 rows) This gives the same results: junk=> select cognome, nome, via from membri where cap = '41010' group by cognome; cognome|nome |via -------+----------+-------------------------- FIORANI|ELISABETTA|VIA PRETI PARTIGIANI, 63 GOZZI |LILIANA |VIA MAGNAGHI, 39 RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1 (3 rows) The particular values returned for nome and via are different from yours but the same as I get using DISTINCT ON. Since nome and via are not aggregated, the value returned for those columns is unpredictable and therefore not useful. I think that it is actually a bug that you are able to name them at all. In fact, if you add an aggregate column to the column list, GROUP BY does not then allow columns that are neither grouped nor aggregated: junk=> select cognome, nome,via, max(age) from membri where cap = '41010' group by cognome; ERROR: parser: illegal use of aggregates or non-group column in target list junk=> select cognome, max(age) from membri where cap = '41010' group by cognome; cognome|max -------+--- FIORANI| 54 GOZZI | 76 RUSSO | 45 (3 rows) which definitely suggests that it is a bug to allow such fields when no aggregate is specified. DISTINCT ON fails with an aggregate, even if no other columns are named: junk=> select distinct on cognome cognome, max(age) from membri where cap = '41010'; ERROR: parser: illegal use of aggregates or non-group column in target list which makes it even less useful! -- Oliver Elphick Oliver.Elphick@lfix.co.uk Isle of Wight http://www.lfix.co.uk/oliver PGP key from public servers; key ID 32B8FAA1 ======================================== "And be not conformed to this world; but be ye transformed by the renewing of your mind, that ye may prove what is that good, and acceptable, and perfect, will of God." Romans 12:2
Oliver Elphick ha scritto:
"jose' soares" wrote:This is very interesting and useful, I thought it wasn't possible. Seems that standard allows only the "order by" column(s)
>Yes, seems that SELECT DISTINCT ON is not part of SQL92 but it is very
>interesting and I think it is something missing to Standard.
>I don't know how to do the following, if we take off DISTINCT ON from
>PostgreSQL:
>
>db=> select distinct cognome, nome,via from membri where cap = '41010';
>cognome|nome |via
>-------+----------------+--------------------------
>FIORANI|ELISABETTA |VIA PRETI PARTIGIANI, 63
>FIORANI|GASTONE |VIA PRETI PARTIGIANI, 63
>FIORANI|MATTIA |VIA PRETI PARTIGIANI, 63
>FIORANI|SIMONE |VIA PRETI PARTIGIANI, 63
>GOZZI |LILIANA |VIA MAGNAGHI, 39
>GOZZI |MATTEO |VIA MAGNAGHI, 39
>RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1
>RUSSO |ELENA TERESA |STRADA CORLETTO SUD, 194/1
>RUSSO |FORTUNATO |STRADA CORLETTO SUD, 194/1
>RUSSO |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
>(10 rows)
>
>db=> select distinct on cognome cognome, nome,via from membri where cap =
>'41010';
>cognome|nome |via
>-------+----------------+--------------------------
>FIORANI|GASTONE |VIA PRETI PARTIGIANI, 63
>GOZZI |LILIANA |VIA MAGNAGHI, 39
>RUSSO |MAURIZIO ANTONIO|STRADA CORLETTO SUD, 194/1
>(3 rows)This gives the same results:
junk=> select cognome, nome, via from membri where cap = '41010' group by cognome;
cognome|nome |via
-------+----------+--------------------------
FIORANI|ELISABETTA|VIA PRETI PARTIGIANI, 63
GOZZI |LILIANA |VIA MAGNAGHI, 39
RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1
and the aggregate function(s) on target list.
I tried the same query on Informix, also on Ocelot but it gives me an error.
On Informix and Ocelot 
queries like:
   select  cognome, max(age) from membri where cap = '41010' group by cognome;
are allowed. 
but 
queries like: 
   select cognome, nome, via from membri where cap = '41010' group by cognome;
aren't allowed.
-Jose'-
 
			
		
At 17:25 +0200 on 02/02/1999, jose' soares wrote: > This gives the same results: > > junk=> select cognome, nome, via from membri where cap = '41010' > group by cognome; > cognome|nome |via > -------+----------+-------------------------- > FIORANI|ELISABETTA|VIA PRETI PARTIGIANI, 63 > GOZZI |LILIANA |VIA MAGNAGHI, 39 > RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1 > > This is very interesting and useful, I thought it wasn't possible. Seems >that standard allows only the "order by" column(s) > and the aggregate function(s) on target list. > I tried the same query on Informix, also on Ocelot but it gives me an error. And with good reason, too. The above query has the same drawback as the "select distinct on", which is: it does not fully specify which value should be selected for the "nome" and "via" fields. Thus, running this same query on a table that has the same data but was, for example, filled in a different order, gives a different result. That's bad, because order should not make a difference for output. Tables are taken to be unordered sets. If you want to have a representative of the "nome" and "via" fields, and it doesn't matter which representative, then min(nome) or max(nome) should do the trick. And this query (select cognome, min(nome), min(via)... group by cognome) should give you the same result on all databases, no matter which rows were inserted first. If it was up to me, I wouldn't use the above form, and frankly, I am surprised the Postgres allows this. Herouth -- Herouth Maoz, Internet developer. Open University of Israel - Telem project http://telem.openu.ac.il/~herutma
At 17:25 +0200 on 02/02/1999, jose' soares wrote: > This gives the same results: > > junk=> select cognome, nome, via from membri where cap = '41010' > group by cognome; > cognome|nome |via > -------+----------+-------------------------- > FIORANI|ELISABETTA|VIA PRETI PARTIGIANI, 63 > GOZZI |LILIANA |VIA MAGNAGHI, 39 > RUSSO |DAVIDE |STRADA CORLETTO SUD, 194/1 > > This is very interesting and useful, I thought it wasn't possible. Seems >that standard allows only the "order by" column(s) > and the aggregate function(s) on target list. > I tried the same query on Informix, also on Ocelot but it gives me an error. And with good reason, too. The above query has the same drawback as the "select distinct on", which is: it does not fully specify which value should be selected for the "nome" and "via" fields. Thus, running this same query on a table that has the same data but was, for example, filled in a different order, gives a different result. That's bad, because order should not make a difference for output. Tables are taken to be unordered sets. If you want to have a representative of the "nome" and "via" fields, and it doesn't matter which representative, then min(nome) or max(nome) should do the trick. And this query (select cognome, min(nome), min(via)... group by cognome) should give you the same result on all databases, no matter which rows were inserted first. If it was up to me, I wouldn't use the above form, and frankly, I am surprised the Postgres allows this. Herouth -- Herouth Maoz, Internet developer. Open University of Israel - Telem project http://telem.openu.ac.il/~herutma