Обсуждение: Re: select random order by random

Поиск
Список
Период
Сортировка

Re: select random order by random

От
Lee Keel
Дата:
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of piotr_sobolewski
> Sent: Thursday, November 01, 2007 9:25 AM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] select random order by random
>
> Dear sirs,
>
> I was very surprised when I executed such SQL query (under PostgreSQL
> 8.2):
> select random() from generate_series(1, 10) order by random();
>
> I thought I would receive ten random numbers in random order. But I
> received
> ten random numbers sorted numerically:
>       random
> -------------------
>  0.102324520237744
>   0.17704638838768
>  0.533014383167028
>   0.60182224214077
>  0.644065519794822
>  0.750732169486582
>  0.821376844774932
>   0.88221683120355
>  0.889879426918924
>  0.924697323236614
> (10 rows)
>
> I don't understand - why the result is like that? It seems like in each
> row
> both random()s were giving the same result. Why is it like that? What
> caused
> it?
>
> --
> Piotr Sobolewski
> http://www.piotrsobolewski.w.pl
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match
[Lee Keel]

Would this not have to do with the 'order by' you added to the end of the
statement?  If you remove the order by clause, then it works for me...

-LK


This email and any files transmitted with it are confidential and intended solely for the use of the individual or
entityto whom they are addressed. If you have received this email in error please notify the sender. This message
containsconfidential information and is intended only for the individual named. If you are not the named addressee you
shouldnot disseminate, distribute or copy this e-mail. 

Re: select random order by random

От
"Scott Marlowe"
Дата:
On 11/1/07, Lee Keel <lee.keel@uai.com> wrote:
> > Dear sirs,
> >
> > I was very surprised when I executed such SQL query (under PostgreSQL
> > 8.2):
> > select random() from generate_series(1, 10) order by random();
> >
> > I thought I would receive ten random numbers in random order. But I
> > received
> > ten random numbers sorted numerically:
> >       random
> > -------------------
> >  0.102324520237744
> >   0.17704638838768
> >  0.533014383167028
> >   0.60182224214077
> >  0.644065519794822
> >  0.750732169486582
> >  0.821376844774932
> >   0.88221683120355
> >  0.889879426918924
> >  0.924697323236614
> > (10 rows)
> >
> > I don't understand - why the result is like that? It seems like in each
> > row
> > both random()s were giving the same result. Why is it like that? What
> > caused
> > it?
>
> Would this not have to do with the 'order by' you added to the end of the
> statement?  If you remove the order by clause, then it works for me...

I think that Piotr expected the random() to be evaluated in both
places separately.

My guess is that it was recognized by the planner as the same function
and evaluated once per row only.

If you try this:

select random() from generate_series(1, 10) order by random()*1;

then you'll get random ordering.

Re: select random order by random

От
Gregory Stark
Дата:
"Scott Marlowe" <scott.marlowe@gmail.com> writes:

> I think that Piotr expected the random() to be evaluated in both
> places separately.
>
> My guess is that it was recognized by the planner as the same function
> and evaluated once per row only.
>
> If you try this:
>
> select random() from generate_series(1, 10) order by random()*1;
>
> then you'll get random ordering.

This does strike me as wrong. random() is marked volatile and the planner
ought not collapse multiple calls into one. Note that it affects other
volatile functions too:

postgres=#  select nextval('s') from generate_series(1, 10) order by nextval('s');
 nextval
---------
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
(10 rows)

postgres=#  select nextval('s') from generate_series(1, 10) order by nextval('s');
 nextval
---------
      11
      12
      13
      14
      15
      16
      17
      18
      19
      20
(10 rows)

That's certainly not how I remembered it working but I'm not sure I ever
tested it before.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: select random order by random

От
Richard Huxton
Дата:
Gregory Stark wrote:
> "Scott Marlowe" <scott.marlowe@gmail.com> writes:
>
>> I think that Piotr expected the random() to be evaluated in both
>> places separately.
>>
>> My guess is that it was recognized by the planner as the same function
>> and evaluated once per row only.
>>
>> If you try this:
>>
>> select random() from generate_series(1, 10) order by random()*1;
>>
>> then you'll get random ordering.
>
> This does strike me as wrong. random() is marked volatile and the planner
> ought not collapse multiple calls into one.

I think I agree with the earlier poster. Surely these two queries should
be equivalent?

SELECT random()        FROM generate_series(1, 10) ORDER BY random();
SELECT random() AS foo FROM generate_series(1, 10) ORDER BY foo;


--
   Richard Huxton
   Archonet Ltd

Re: select random order by random

От
Sam Mason
Дата:
On Thu, Nov 01, 2007 at 04:49:16PM +0000, Richard Huxton wrote:
> Gregory Stark wrote:
> >This does strike me as wrong. random() is marked volatile and the planner
> >ought not collapse multiple calls into one.
>
> I think I agree with the earlier poster. Surely these two queries should
> be equivalent?
>
> SELECT random()        FROM generate_series(1, 10) ORDER BY random();
> SELECT random() AS foo FROM generate_series(1, 10) ORDER BY foo;

If they were pure (in the functional programming sense) then this would
be a correct optimisation.  However, if they're marked as volatile then
they should be called independently---they're not pure anymore and
you're calling the code for its side-effects and optimising out the
either call changes the semantics.  Try playing around with monads in
Haskell or uniqueness types in Clean, they help to clarify what's going
on when you call a "function" in an impure language.


  Sam

Re: select random order by random

От
Tom Lane
Дата:
Richard Huxton <dev@archonet.com> writes:
> Gregory Stark wrote:
>> This does strike me as wrong. random() is marked volatile and the planner
>> ought not collapse multiple calls into one.

> I think I agree with the earlier poster. Surely these two queries should
> be equivalent?

> SELECT random()        FROM generate_series(1, 10) ORDER BY random();
> SELECT random() AS foo FROM generate_series(1, 10) ORDER BY foo;

Well, the latter case is why it acts that way, but Greg has a point that
when a volatile function is involved maybe they shouldn't be the same.
OTOH it's always been like that, and in the absence of a clear reason
to change it I'm inclined to leave it alone.

(BTW, this is not the planner's fault; the collapsing of the two
targetlist entries into one happens in the parser.)

            regards, tom lane

Re: select random order by random

От
Martijn van Oosterhout
Дата:
On Thu, Nov 01, 2007 at 02:22:58PM -0400, Tom Lane wrote:
> > SELECT random()        FROM generate_series(1, 10) ORDER BY random();
> > SELECT random() AS foo FROM generate_series(1, 10) ORDER BY foo;
>
> (BTW, this is not the planner's fault; the collapsing of the two
> targetlist entries into one happens in the parser.)

Something twigged telling me that in fact the latter expression is not
in standard SQL but a (very common) extension. A <sort key> is clearly
indicated to be a <value expression> with no indication anywhere that
column aliases are allowed here (though that may be in the common rules
somewhere).

Then again, I may be remembering all wrong...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Вложения

Re: select random order by random

От
"Scott Marlowe"
Дата:
On 11/1/07, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Thu, Nov 01, 2007 at 02:22:58PM -0400, Tom Lane wrote:
> > > SELECT random()        FROM generate_series(1, 10) ORDER BY random();
> > > SELECT random() AS foo FROM generate_series(1, 10) ORDER BY foo;
> >
> > (BTW, this is not the planner's fault; the collapsing of the two
> > targetlist entries into one happens in the parser.)
>
> Something twigged telling me that in fact the latter expression is not
> in standard SQL but a (very common) extension. A <sort key> is clearly
> indicated to be a <value expression> with no indication anywhere that
> column aliases are allowed here (though that may be in the common rules
> somewhere).

Well, the standard way I know if is to use column numbers.  i.e.:

select random() from generate_series(1,10) order by 1

That I'm pretty sure IS in the standard.  Don't see why column aliases
would be disallowed.  It's not like the where clause where the select
field doesn't exist when it fires.  The select field list does exist
when order by fires, so referring to it makes sense.

Re: select random order by random

От
Tom Lane
Дата:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Something twigged telling me that in fact the latter expression is not
> in standard SQL but a (very common) extension. A <sort key> is clearly
> indicated to be a <value expression> with no indication anywhere that
> column aliases are allowed here (though that may be in the common rules
> somewhere).

SQL92 says differently.  The committee basically redefined ORDER BY
entirely between SQL92 and SQL99.

What we actually try to support is both SQL92 and SQL99 interpretations,
which is a pretty unholy mess, but enough people (and programs) are used
to the SQL92 way that I don't foresee being able to drop it.

            regards, tom lane

Re: select random order by random

От
Gregory Stark
Дата:
"Scott Marlowe" <scott.marlowe@gmail.com> writes:

> On 11/1/07, Martijn van Oosterhout <kleptog@svana.org> wrote:
>> On Thu, Nov 01, 2007 at 02:22:58PM -0400, Tom Lane wrote:
>> > > SELECT random()        FROM generate_series(1, 10) ORDER BY random();
>> > > SELECT random() AS foo FROM generate_series(1, 10) ORDER BY foo;
>> >
>> > (BTW, this is not the planner's fault; the collapsing of the two
>> > targetlist entries into one happens in the parser.)
>>
>> Something twigged telling me that in fact the latter expression is not
>> in standard SQL but a (very common) extension. A <sort key> is clearly
>> indicated to be a <value expression> with no indication anywhere that
>> column aliases are allowed here (though that may be in the common rules
>> somewhere).
>
> Well, the standard way I know if is to use column numbers.  i.e.:
>
> select random() from generate_series(1,10) order by 1
>
> That I'm pretty sure IS in the standard.  Don't see why column aliases
> would be disallowed.  It's not like the where clause where the select
> field doesn't exist when it fires.  The select field list does exist
> when order by fires, so referring to it makes sense.

Well IIRC the standard requires the sort keys to be columns from the select
list. You can't put any old expression there, only copies of the expressions
used in the select list.

So in the spec "random()" can't really be considered a second call to
random(), it's just a retyped instance of the "random()" in the select list.
That is, it's just a longwinded way of saying "order by 1" (meaning column 1).

So I guess having the parser do this substitution kind of makes sense if
you're thinking about things the way the spec does. It doesn't make much sense
if you're thinking the way Postgres does of having arbitrary expressions there
independent of what's in the select list.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

Re: select random order by random

От
Tom Lane
Дата:
Gregory Stark <stark@enterprisedb.com> writes:
> So I guess having the parser do this substitution kind of makes sense
> if you're thinking about things the way the spec does. It doesn't make
> much sense if you're thinking the way Postgres does of having
> arbitrary expressions there independent of what's in the select list.

Again: this is not "Postgres vs the spec", it is "SQL92 vs SQL99".
I draw your attention to the relevant text...

SQL92:

         <order by clause> ::=
              ORDER BY <sort specification list>

         <sort specification list> ::=
              <sort specification> [ { <comma> <sort specification> }... ]

         <sort specification> ::=
              <sort key> [ <collate clause > ] [ <ordering specification> ]

         <sort key> ::=
                <column name>
              | <unsigned integer>

         <ordering specification> ::= ASC | DESC

...

         10)If ORDER BY is specified, then each <sort specification> in the
            <order by clause> shall identify a column of T.

            Case:

            a) If a <sort specification> contains a <column name>, then T
              shall contain exactly one column with that <column name> and
              the <sort specification> identifies that column.

            b) If a <sort specification> contains an <unsigned integer>,
              then the <unsigned integer> shall be greater than 0 and not
              greater than the degree of T. The <sort specification> iden-
              tifies the column of T with the ordinal position specified by
              the <unsigned integer>.

(T is the table emitted by the SELECT.)


SQL99:

         <order by clause> ::=
              ORDER BY <sort specification list>

         <sort specification list> ::=
              <sort specification> [ { <comma> <sort specification> }... ]

         <sort specification> ::=
              <sort key> [ <collate clause> ] [ <ordering specification> ]

         <sort key> ::=
              <value expression>

         <ordering specification> ::= ASC | DESC


        18) If an <order by clause> is specified, then:

            a) Let K(i) be the <sort key> contained in the i-th <sort
              specification>.

            b) Let DT be the declared type of K(i).

            c) If DT is a user-defined type, then the comparison form of DT
              shall be FULL.

            d) K(i) shall not be a <literal>.

            e) If QE is a <query expression body> that is a <non-join query
              expression> that is a <non-join query term> that is a <non-
              join query primary> that is a <simple table> that is a <query
              specification>, then the <cursor specification> is said to be
              a simple table query.

            f) Case:

              i) If <sort specification list> contains any <sort key> K(i)
                 that contains a column reference to a column that is not a
                 column of T, then:

                 1) The <cursor specification> shall be a simple table
                   query.

                 2) Case:

                   A) If K(i) is not equivalent to a <value expression>
                      immediately contained in any <derived column> in the
                      <select list> SL of <query specification> QS contained
                      in QE, then:

                      I) T shall not be a grouped table.

                     II) QS shall not specify the <set quantifier> DISTINCT
                        or directly contain one or more <set function
                        specification>s.

                    III) Let C(j) be a column that is not a column of T and
                        whose column reference is contained in some K(i).

                     IV) Let SKL be the list of <derived column>s that are
                        <column name>s of column references to every C(j).
                        The columns C(j) are said to be extended sort key
                        columns.

                      V) Let TE be the <table expression> immediately
                        contained in QS.

                     VI) Let ST be the result of evaluating the <query
                        specification>:

                           SELECT SL, SKL FROM TE

                   B) Otherwise:

                      I) Let ST be T.

                     II) For every <derived column> DC(e) of SL that is
                        equivalent to K(i), if DC(e) has a <column name>,
                        then let CN(e) be that <column name>; otherwise:

                        1) Let CN(e) be an implementation-defined <column
                           name> that is not equal to any <column name> of
                           any column of ST.

                        2) DC(e) is effectively replaced by DE(e) AS CN(e)
                           in the <select list> of ST, where DE(e) is the
                           <derived element> of DC(e).

                    III) K(i) is effectively replaced by CN(e).

             ii) Otherwise, let ST be T.

            g) ST is said to be a sort table.

            h) K(i) is a <value expression>. The <value expression> shall
              not contain a <subquery> or a <set function specification>,
              but shall contain a <column reference>.

              i) Let X be any <column reference> directly contained in K(i).

             ii) If X does not contain an explicit <table or query name> or
                 <correlation name>, then K(i) shall be a <column name> that
                 shall be equivalent to the name of exactly one column of
                 ST.

              NOTE 287 - A previous version of ISO/IEC 9075 allows <sort
              specification> to be a <signed integer> to denote a column
              reference of a column of T. That facility no longer exists.
              See Annex E, "Incompatibilities with ISO/IEC 9075:1992 and
              ISO/IEC 9075-4:1996".


In the usual tradition of SQL99, the spec text is enormously less
readable than SQL92 was, but I *think* this says nearly the same thing
as what we do: a plain column reference in ORDER BY is first sought as
an output column name, and failing that sought as a column name of one
of the input tables.  They are more restrictive than we are but that's
OK.

For the particular issue at hand here, it seems to me that 18.f.i.2.B
dictates that a <sort key> matching an output column be treated as a
reference to the column, not as an independently evaluated expression.
Admittedly they are not talking about volatile functions per se, but
I think there's some defense here for the way our parser does it.

            regards, tom lane

Re: select random order by random

От
"John D. Burger"
Дата:
On Nov 1, 2007, at 18:57, Tom Lane wrote:

> In the usual tradition of SQL99, the spec text is enormously less
> readable than SQL92 was, but I *think* this says nearly the same thing
> as what we do: a plain column reference in ORDER BY is first sought as
> an output column name, and failing that sought as a column name of one
> of the input tables.  They are more restrictive than we are but that's
> OK.
>
> For the particular issue at hand here, it seems to me that 18.f.i.2.B
> dictates that a <sort key> matching an output column be treated as a
> reference to the column, not as an independently evaluated expression.
> Admittedly they are not talking about volatile functions per se, but
> I think there's some defense here for the way our parser does it.

But the described behavior (or rather its "obvious" extension to
Postgres) does not seem to match the OP's later example:

 > select random() as xxx, random() from generate_series(1, 10) order
by random();

         xxx        |       random
-------------------+--------------------
0.117905601913997 |  0.587338728172397
0.167445760298262 |  0.183822357647038
0.212947336590359 |  0.726537112484936
0.215260865732683 |   0.57848364467662
0.503411483719671 |   0.51932220557673
0.783855747796528 |  0.366456756538924
0.803222402838628 | 0.0357640516179446
0.917076966221015 |  0.918215564414028
0.937211547017662 |  0.146829404470897
0.987405426328725 |  0.308503020232778

Clearly the <sort key> is matched to the first output column, despite
its renaming.  Contrast this with

   ... order by random;    // plain column reference

This substantially breaks the principle of least surprise for me.

Caveat - this is on 7.4 (sigh), perhaps more modern versions have
different behavior.

- John D. Burger
   MITRE