Обсуждение: ORDER BY random() LIMIT 1 slowness

Поиск
Список
Период
Сортировка

ORDER BY random() LIMIT 1 slowness

От
"Gavin M. Roy"
Дата:
I have a query where i just want to randomly pick out one row of the
table.  The query as I am running it looks like:

SELECT * FROM poetry ORDER BY random() LIMIT 1;

There are only roughly 35,000 rows of data and there is no way that I
have found to specify what is randomly being ordered, I assume it's
picking the primary key.  The explain output looks like:

QUERY PLAN:
 Limit  (cost=49279.75..49279.75 rows=1 width=1062) (actual
time=8503.83..8503.84 rows=1 loops=1)
   ->  Sort  (cost=49279.75..49365.68 rows=34375 width=1062) (actual
time=8503.82..8503.83 rows=2 loops=1)
         Sort Key: random()
         ->  Seq Scan on poetry  (cost=0.00..5029.75 rows=34375
width=1062) (actual time=0.12..2503.35 rows=34376 loops=1)
 Total runtime: 8526.31 msec

I have different variations on the runtime, all between 5000 and 15000
msec.  Anyone have any ideas on how to speed this up?  I've thought
about doing a count query to get the total count and then use limit 1
offset #, where offset # is a number supplied by my program, but it
would be preferred to do this in query.

Thanks,

Gavin



Re: ORDER BY random() LIMIT 1 slowness

От
Tom Lane
Дата:
"Gavin M. Roy" <gmr@justsportsusa.com> writes:
> SELECT * FROM poetry ORDER BY random() LIMIT 1;
> [ is slow for 35000 rows ]

Yeah.  Basically this query is implemented as
  (a) select all 35000 rows of "poetry";
  (b) compute a random() value for each row;
  (c) sort by the random() values;
  (d) take the first row, discard the rest.

The good news: this gives you a pretty-durn-random selection.  The bad
news: you didn't really care about choosing a random ordering of the
other 34999 rows, but it computed one anyway.

This problem's been discussed before, but I've not seen any really
ideal answer.  SQL is not designed to offer unpredictable results ;-)

If you have a reasonably-uniformly-distributed index key on the rows,
you can try something like

    select * from poetry where key > (scale * random() + min)
    order by key limit 1;

(where scale and min are chosen to make the result cover the range of
index keys).  Given an index on "key" this should pick a result quickly
via an indexscan+limit plan.  However, if the distribution of keys isn't
real uniform then you won't get real random results --- keys just above
larger gaps will be selected with unfairly high probability.  You also
have to know the min and max keys accurately.

I recall some folks have suggested generating an actual random column
(ie, something initialized with "default random()" and suitably indexed)
but this seems like a lot of overhead ... you'd have to be mighty
concerned with the exactness of your random selection to put up with
that, I think.

There are some other suggestions in the archives, IIRC.

            regards, tom lane

Re: ORDER BY random() LIMIT 1 slowness

От
Jean-Luc Lachance
Дата:
Gavin,

Assuming that you have a serial column rand on poetry and you did not
delete any row,
here is my suggestion:

CREATE TABLE poetry ( rand SERIAL, ... );

SELECT * FROM poetry WHERE rand = (
  SELECT int8( curval( 'poetry_rand_seq') * random()));


JLL


Tom Lane wrote:
>
> "Gavin M. Roy" <gmr@justsportsusa.com> writes:
> > SELECT * FROM poetry ORDER BY random() LIMIT 1;
> > [ is slow for 35000 rows ]
>
> Yeah.  Basically this query is implemented as
>   (a) select all 35000 rows of "poetry";
>   (b) compute a random() value for each row;
>   (c) sort by the random() values;
>   (d) take the first row, discard the rest.
>
> The good news: this gives you a pretty-durn-random selection.  The bad
> news: you didn't really care about choosing a random ordering of the
> other 34999 rows, but it computed one anyway.
>
> This problem's been discussed before, but I've not seen any really
> ideal answer.  SQL is not designed to offer unpredictable results ;-)
>
> If you have a reasonably-uniformly-distributed index key on the rows,
> you can try something like
>
>         select * from poetry where key > (scale * random() + min)
>         order by key limit 1;
>
> (where scale and min are chosen to make the result cover the range of
> index keys).  Given an index on "key" this should pick a result quickly
> via an indexscan+limit plan.  However, if the distribution of keys isn't
> real uniform then you won't get real random results --- keys just above
> larger gaps will be selected with unfairly high probability.  You also
> have to know the min and max keys accurately.
>
> I recall some folks have suggested generating an actual random column
> (ie, something initialized with "default random()" and suitably indexed)
> but this seems like a lot of overhead ... you'd have to be mighty
> concerned with the exactness of your random selection to put up with
> that, I think.
>
> There are some other suggestions in the archives, IIRC.
>
>                         regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Re: ORDER BY random() LIMIT 1 slowness

От
Greg Stark
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> "Gavin M. Roy" <gmr@justsportsusa.com> writes:
> > SELECT * FROM poetry ORDER BY random() LIMIT 1;
> > [ is slow for 35000 rows ]
>
> Yeah.  Basically this query is implemented as
>   (a) select all 35000 rows of "poetry";
>   (b) compute a random() value for each row;
>   (c) sort by the random() values;
>   (d) take the first row, discard the rest.

If you can generate a random value from your application layer you could do

select * from poetry LIMIT 1 OFFSET <random value>

Can offset values be placeholders in prepared queries? If not then this has
that disadvantage.

--
greg

Re: ORDER BY random() LIMIT 1 slowness

От
SZUCS Gábor
Дата:
----- Original Message -----
From: "Jean-Luc Lachance" <jllachan@nsd.ca>
Sent: Tuesday, December 17, 2002 5:04 PM


> Gavin,
>
> Assuming that you have a serial column rand on poetry and you did not
> delete any row,
> here is my suggestion:
>
> CREATE TABLE poetry ( rand SERIAL, ... );
>
> SELECT * FROM poetry WHERE rand = (
>   SELECT int8( curval( 'poetry_rand_seq') * random()));


Mmmm... It usually doesn't work for me. Isn't currval (NOTE: with two r's)
bound to session and has no meaning before the first call to nextval()?
7.2.1 says the following; has it changed in 7.3(.*)?

---------------------------- cut here ------------------------------
tir=> create sequence test_seq;
CREATE
tir=> select currval('test_seq');
ERROR:  test_seq.currval is not yet defined in this session
tir=> select nextval('test_seq');
 nextval
---------
       1
(1 row)

tir=> select currval('test_seq');
 currval
---------
       1
(1 row)
---------------------------- cut here ------------------------------

G.
--
while (!asleep()) sheep++;

---------------------------- cut here ------------------------------


Re: ORDER BY random() LIMIT 1 slowness

От
Tom Lane
Дата:
=?iso-8859-1?Q?SZUCS_G=E1bor?= <surrano@mailbox.hu> writes:
>> CREATE TABLE poetry ( rand SERIAL, ... );
>>
>> SELECT * FROM poetry WHERE rand = (
>> SELECT int8( curval( 'poetry_rand_seq') * random()));

> Mmmm... It usually doesn't work for me.

Yeah ... better would be

>> SELECT * FROM poetry WHERE rand = (
>> SELECT int8( (select last_value from poetry_rand_seq) * random()));

Personally though, I'd skip the sequence entirely and do

    create table poetry (...,
                 rand float8 default random());
    create index on poetry.rand

    select * from poetry where rand > random() order by rand limit 1;

A difficulty with either of these approaches is that the system won't
optimize comparisons involving random() into indexscans.  To get around
that, you'd have to hide the random() call inside a user-defined
function that is (bogusly) marked cachable (or in 7.3, "stable" would be
the best choice).  At the moment I think it'd also work to stick the
random() call inside a subselect, but the UDF approach is less likely to
get broken by future changes.

            regards, tom lane

Re: ORDER BY random() LIMIT 1 slowness

От
Jean-Luc Lachance
Дата:
Gabor,

You are right about the missing 'r', but I think you missed my point.
You should modify your table so that it has a serial field and reload
it.

JLL

P.S. I run 7.2 so ALTER TABLE ADD rand SERIAL; does not work, but it may
work under 7.3


SZUCS Gábor wrote:
>
> ----- Original Message -----
> From: "Jean-Luc Lachance" <jllachan@nsd.ca>
> Sent: Tuesday, December 17, 2002 5:04 PM
>
> > Gavin,
> >
> > Assuming that you have a serial column rand on poetry and you did not
> > delete any row,
> > here is my suggestion:
> >
> > CREATE TABLE poetry ( rand SERIAL, ... );
> >
> > SELECT * FROM poetry WHERE rand = (
> >   SELECT int8( curval( 'poetry_rand_seq') * random()));
>
> Mmmm... It usually doesn't work for me. Isn't currval (NOTE: with two r's)
> bound to session and has no meaning before the first call to nextval()?
> 7.2.1 says the following; has it changed in 7.3(.*)?
>
> ---------------------------- cut here ------------------------------
> tir=> create sequence test_seq;
> CREATE
> tir=> select currval('test_seq');
> ERROR:  test_seq.currval is not yet defined in this session
> tir=> select nextval('test_seq');
>  nextval
> ---------
>        1
> (1 row)
>
> tir=> select currval('test_seq');
>  currval
> ---------
>        1
> (1 row)
> ---------------------------- cut here ------------------------------
>
> G.
> --
> while (!asleep()) sheep++;
>
> ---------------------------- cut here ------------------------------
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Re: ORDER BY random() LIMIT 1 slowness

От
SZUCS Gábor
Дата:
Dear Jean-Luc,

I don't think my simplified example missed any of your solution's features.
The essence, in my eyes, is that it has nothing to do with tables. It's only
related to sequences.

In short, you _cannot_ use currval() in any single _session_ until you use
nextval() in the same session, even if you created the sequence in the very
same session. Using a serial field in a table or using the sequence directly
is indifferent.

Or I'm missing something here.

As for Tom's solution:

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
Sent: Wednesday, December 18, 2002 4:56 PM


> Personally though, I'd skip the sequence entirely and do
>
> create table poetry (...,
>      rand float8 default random());
> create index on poetry.rand
>
> select * from poetry where rand > random() order by rand limit 1;

I'm not sure it's as flat as a random number should be. I have some relation
to mathematics but can't see it clearly right now. I fear it's more likely a
normal distribution, not linear (or whatsits called). But if I needed
something like this, I'd be happy with this solution anyway.

G.
--
while (!asleep()) sheep++;

---------------------------- cut here ------------------------------
----- Original Message -----
From: "Jean-Luc Lachance" <jllachan@nsd.ca>
Sent: Wednesday, December 18, 2002 5:55 PM


Gabor,

You are right about the missing 'r', but I think you missed my point.
You should modify your table so that it has a serial field and reload
it.

JLL

P.S. I run 7.2 so ALTER TABLE ADD rand SERIAL; does not work, but it may
work under 7.3



Re: ORDER BY random() LIMIT 1 slowness

От
Jean-Luc Lachance
Дата:
OK Gabor,

I'm the one who misunderstood.

To me, it seem to be a bug (or at least a mis-feature) that one cannot
call currval() before calling nextval().

Does anyone know why it should be like this?

JLL


SZUCS Gábor wrote:
>
> Dear Jean-Luc,
>
> I don't think my simplified example missed any of your solution's features.
> The essence, in my eyes, is that it has nothing to do with tables. It's only
> related to sequences.
>
> In short, you _cannot_ use currval() in any single _session_ until you use
> nextval() in the same session, even if you created the sequence in the very
> same session. Using a serial field in a table or using the sequence directly
> is indifferent.
>
> Or I'm missing something here.
>
> As for Tom's solution:
>
> ----- Original Message -----
> From: "Tom Lane" <tgl@sss.pgh.pa.us>
> Sent: Wednesday, December 18, 2002 4:56 PM
>
> > Personally though, I'd skip the sequence entirely and do
> >
> > create table poetry (...,
> >      rand float8 default random());
> > create index on poetry.rand
> >
> > select * from poetry where rand > random() order by rand limit 1;
>
> I'm not sure it's as flat as a random number should be. I have some relation
> to mathematics but can't see it clearly right now. I fear it's more likely a
> normal distribution, not linear (or whatsits called). But if I needed
> something like this, I'd be happy with this solution anyway.
>
> G.
> --
> while (!asleep()) sheep++;
>
> ---------------------------- cut here ------------------------------
> ----- Original Message -----
> From: "Jean-Luc Lachance" <jllachan@nsd.ca>
> Sent: Wednesday, December 18, 2002 5:55 PM
>
> Gabor,
>
> You are right about the missing 'r', but I think you missed my point.
> You should modify your table so that it has a serial field and reload
> it.
>
> JLL
>
> P.S. I run 7.2 so ALTER TABLE ADD rand SERIAL; does not work, but it may
> work under 7.3
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
> subscribe-nomail command to majordomo@postgresql.org so that your
> message can get through to the mailing list cleanly

Re: ORDER BY random() LIMIT 1 slowness

От
Alvaro Herrera
Дата:
On Wed, Dec 18, 2002 at 02:09:42PM -0500, Jean-Luc Lachance wrote:
> OK Gabor,
>
> I'm the one who misunderstood.
>
> To me, it seem to be a bug (or at least a mis-feature) that one cannot
> call currval() before calling nextval().
>
> Does anyone know why it should be like this?

It doesn't make sense to call currval() if you haven't called nextval()
before.  The meaning of currval() is "the value that was last assigned
to you".  If you haven't called nextval(), there isn't a value assigned
to you.

If you want to know what was the last value the sequence gave to anyway,
SELECT last_value FROM sequence.  But be aware that this is
non-transaction safe, non-isolatable, non-anything.

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Entristecido, Wutra
echa a Freyr a rodar
y a nosotros al mar" (cancion de Las Barreras)

Re: ORDER BY random() LIMIT 1 slowness

От
Jean-Luc Lachance
Дата:
Alvara,

But instead of returning an error, currval() should return last_value if
nextval() was not called (with all the caveat of couse).  I think it
would be more usefull that way.

JLL


Alvaro Herrera wrote:
>
> On Wed, Dec 18, 2002 at 02:09:42PM -0500, Jean-Luc Lachance wrote:
> > OK Gabor,
> >
> > I'm the one who misunderstood.
> >
> > To me, it seem to be a bug (or at least a mis-feature) that one cannot
> > call currval() before calling nextval().
> >
> > Does anyone know why it should be like this?
>
> It doesn't make sense to call currval() if you haven't called nextval()
> before.  The meaning of currval() is "the value that was last assigned
> to you".  If you haven't called nextval(), there isn't a value assigned
> to you.
>
> If you want to know what was the last value the sequence gave to anyway,
> SELECT last_value FROM sequence.  But be aware that this is
> non-transaction safe, non-isolatable, non-anything.
>
> --
> Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
> "Entristecido, Wutra
> echa a Freyr a rodar
> y a nosotros al mar" (cancion de Las Barreras)

Re: ORDER BY random() LIMIT 1 slowness

От
"scott.marlowe"
Дата:
On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

> OK Gabor,
>
> I'm the one who misunderstood.
>
> To me, it seem to be a bug (or at least a mis-feature) that one cannot
> call currval() before calling nextval().
>
> Does anyone know why it should be like this?

First, read this page:

http://developer.postgresql.org/docs/postgres/functions-sequence.html

which answers a bit of that question.

the real issue with sequences is that in order to be transactionally safe,
they have to live outside of all transactions.

The purpose of the sequence manipulation functions is to interact with
sequence's in ways that ensure that the same sequence number is never used
by two different transactions.  Let me illustrate with a pair of
concurrent transactions, A and B:

A: begin;
B: begin;
A: select currval('seq'); <- client stores this value
B: select currval('seq'); <- ditto
A: insert into table (name, id) values ('john',idnum);
B: insert into table (name, id) values ('sue',idnum);
A: commit;
B: commit;

See the problem with the above?  It's why you can't use currval to get the
sequence number if you haven't called nextval, setval, or some other
fuction that has changed the sequence, and why using it will cause an
error.  Let's fix the above queries:

(seq=20)

A: begin;
B: begin;
A: select nextval('seq'); <- client doesn't store this (but could)
B: select nextval('seq'); <- client stores 22
A: insert into table (name, id) values ('john',currval('seq'));
B: insert into table (name, id) values ('sue',idnum);
A: commit;
B: commit;

All is well.

Note that if A: were to roll back, B would still complete, but we would
have a hole in our sequence for number 21.  this is normal.  The price we
pay for having sequences be safe in transactions is that they live outside
of transactions, and the functions that provide the interface are what are
transactionally aware.


Re: ORDER BY random() LIMIT 1 slowness

От
"scott.marlowe"
Дата:
On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

> Alvara,
>
> But instead of returning an error, currval() should return last_value if
> nextval() was not called (with all the caveat of couse).  I think it
> would be more usefull that way.

no, that would be like walking around with a gun pointed at your foot, to
quote Tom Lane.

See my post on transactions and such. Remember that everything in
Postgresql is designed to make transactions safe.  currval working without
a nextval or setval before it is dangerous in the exterme to transactions.


Re: ORDER BY random() LIMIT 1 slowness

От
Jean-Luc Lachance
Дата:
Scott,

I understand it all.

If a programmer understand that currval() return the last_(used)_value
and did not himself call nextval() he should be aware of the caveat.

I did not want to make a big fuss of it.  I will just use select
last_value myself since I am already aware of the caveat.  :)

JLL


"scott.marlowe" wrote:
>
> On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:
>
> > Alvara,
> >
> > But instead of returning an error, currval() should return last_value if
> > nextval() was not called (with all the caveat of couse).  I think it
> > would be more usefull that way.
>
> no, that would be like walking around with a gun pointed at your foot, to
> quote Tom Lane.
>
> See my post on transactions and such. Remember that everything in
> Postgresql is designed to make transactions safe.  currval working without
> a nextval or setval before it is dangerous in the exterme to transactions.

Re: ORDER BY random() LIMIT 1 slowness

От
"scott.marlowe"
Дата:
But the problem is that we'd immediately be innundated on the
pgsql-general list by people who were using currval() in an unsafe way but
didn't know it until they went into production and watched their
application develop serious issues.  I.e. they have a right to expect the
sequence functions to operate in a transaction safe manner.  It's not
something that is likely to ever change, which is, imnsho, a good thing.

On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:

> Scott,
>
> I understand it all.
>
> If a programmer understand that currval() return the last_(used)_value
> and did not himself call nextval() he should be aware of the caveat.
>
> I did not want to make a big fuss of it.  I will just use select
> last_value myself since I am already aware of the caveat.  :)
>
> JLL
>
>
> "scott.marlowe" wrote:
> >
> > On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:
> >
> > > Alvara,
> > >
> > > But instead of returning an error, currval() should return last_value if
> > > nextval() was not called (with all the caveat of couse).  I think it
> > > would be more usefull that way.
> >
> > no, that would be like walking around with a gun pointed at your foot, to
> > quote Tom Lane.
> >
> > See my post on transactions and such. Remember that everything in
> > Postgresql is designed to make transactions safe.  currval working without
> > a nextval or setval before it is dangerous in the exterme to transactions.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>


Re: ORDER BY random() LIMIT 1 slowness

От
Simon Mitchell
Дата:
Hi,
         If you need the last gobal value of the sequence just query the
sequence directly.


select last_value, increment_by  from xyz_seq;
last_value | increment_by
------------+--------------
       355 |            1
(1 row)
          or just

select last_value from xyz_seq;
last_value
------------
       355
(1 row)

I think that sequences are great in postgresql after going from oracle
to mysql to postgresql.


Regards,
Simon

scott.marlowe wrote:

> But the problem is that we'd immediately be innundated on the
> pgsql-general list by people who were using currval() in an unsafe way
> but didn't know it until they went into production and watched their
> application develop serious issues.  I.e. they have a right to expect
> the sequence functions to operate in a transaction safe manner.  It's
> not something that is likely to ever change, which is, imnsho, a good
> thing.
>
> On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:
>
>
>
>> Scott,
>>
>> I understand it all.
>>
>> If a programmer understand that currval() return the last_(used)_value
>> and did not himself call nextval() he should be aware of the caveat.
>>
>> I did not want to make a big fuss of it.  I will just use select
>> last_value myself since I am already aware of the caveat.  :)
>>
>> JLL
>>
>>
>> "scott.marlowe" wrote:
>>
>>
>>> On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:
>>>
>>>
>>>
>>>> Alvara,
>>>>
>>>> But instead of returning an error, currval() should return
>>>> last_value if
>>>> nextval() was not called (with all the caveat of couse).  I think it
>>>> would be more usefull that way.
>>>>
>>>
>>> no, that would be like walking around with a gun pointed at your
>>> foot, to
>>> quote Tom Lane.
>>>
>>> See my post on transactions and such. Remember that everything in
>>> Postgresql is designed to make transactions safe.  currval working
>>> without
>>> a nextval or setval before it is dangerous in the exterme to
>>> transactions.
>>>
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
>>
>>
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/users-lounge/docs/faq.html
>
>
>




Re: ORDER BY random() LIMIT 1 slowness

От
Jean-Luc Lachance
Дата:
Have another suggestion then -- add lastval().

We would have:

nextval() transaction safe
currval() transaction safe (nextval() must be called first)
lastval() transaction unsafe

JLL


"scott.marlowe" wrote:
>
> But the problem is that we'd immediately be innundated on the
> pgsql-general list by people who were using currval() in an unsafe way but
> didn't know it until they went into production and watched their
> application develop serious issues.  I.e. they have a right to expect the
> sequence functions to operate in a transaction safe manner.  It's not
> something that is likely to ever change, which is, imnsho, a good thing.
>
> On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:
>
> > Scott,
> >
> > I understand it all.
> >
> > If a programmer understand that currval() return the last_(used)_value
> > and did not himself call nextval() he should be aware of the caveat.
> >
> > I did not want to make a big fuss of it.  I will just use select
> > last_value myself since I am already aware of the caveat.  :)
> >
> > JLL
> >
> >
> > "scott.marlowe" wrote:
> > >
> > > On Wed, 18 Dec 2002, Jean-Luc Lachance wrote:
> > >
> > > > Alvara,
> > > >
> > > > But instead of returning an error, currval() should return last_value if
> > > > nextval() was not called (with all the caveat of couse).  I think it
> > > > would be more usefull that way.
> > >
> > > no, that would be like walking around with a gun pointed at your foot, to
> > > quote Tom Lane.
> > >
> > > See my post on transactions and such. Remember that everything in
> > > Postgresql is designed to make transactions safe.  currval working without
> > > a nextval or setval before it is dangerous in the exterme to transactions.
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org
> >