Discussion: Obtaining random rows from a result set
Hello,

I've recently been busy improving a query that yields a fixed number of random records matching certain conditions. I have tried all the usual approaches, and although they do work, they're all limited in some way and don't translate really well to what you "want". They're kludges, IMHO.

The methods I've tried are explained quite well on
http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-random-rows-from-a-database-table.html

All these methods involve calculating a random number for every record in the result set at some point in time, which is really not what I'm trying to model. I think the database should provide some means to get those records, so...

Dear Santa,

I'd like my database to have functionality analogous to how LIMIT works, but for other - non-sequential - algorithms.

I was thinking along the lines of:

    SELECT *
    FROM table
    WHERE condition = true
    RANDOM 5;

Which would return (up to) 5 random rows from the result set, just as LIMIT 5 returns (up to) the first 5 records in the result set.

Or maybe even with a custom function, so that you could get non-linear distributions:

    SELECT *
    FROM table
    WHERE condition = true
    LIMIT 5 USING my_func();

Where my_func() could be a user-definable function accepting a number that should be (an estimate of?) the number of results being returned, so that it can provide pointers to which rows in the result set will be returned from the query.

Examples:
* random(maxrows) would return random rows from the result set.
* median() would return the rows in the middle of the result set (this would require ordering to be meaningful).

What do people think: is this feasible? Desirable? Necessary?

If I had time I'd volunteer for at least looking into this, but I'm working on three projects simultaneously already. Alas...

Regards,
Alban Hertroys.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
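The usual workaround being argued against here (and in the linked article) is to order the whole result set by a per-row random value and keep the first few rows; a minimal sketch, with a placeholder table name since "table" itself is a reserved word:

    SELECT *
    FROM some_table
    WHERE condition = true
    ORDER BY random()   -- computes random() once per matching row
    LIMIT 5;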
On 8/31/07, Alban Hertroys <alban@magproductions.nl> wrote:
> All these methods involve calculating a random number for every record
> in the result set at some point in time, which is really not what I'm
> trying to model. I think the database should provide some means to get
> those records, so...
>
> I'd like my database to have functionality analogous to how LIMIT works,
> but for other - non-sequential - algorithms.
>
> [...]
>
> What do people think: is this feasible? Desirable? Necessary?

It seems to me that anything that wants to return a random set of rows will need to calculate a random number for all the rows it processes, unless you change how the database scans rows in indexes or tables, which, if it is possible at all, will probably make things *really* slow. If it's a given that the database will always sequentially scan whatever the query plan tells it to scan, you're pretty much stuck with the rows in your result set being in the same order unless you start picking random numbers.

One possible alternative not mentioned on the site you linked to is as follows:

    select [whatever]
    from [table]
    where random() < [some number between 0 and 1]
    limit [limit value]

That doesn't require assigning a random number to *every* row in the table, nor does it require sorting everything. It does mean that rows encountered earlier in the query processing have a higher likelihood of being returned, and it also means that there's some chance you won't actually get as many as [limit value] rows returned.
jtolley=# create table a (i integer);
CREATE TABLE
jtolley=# insert into a (i) select * from generate_series(1, 100);
INSERT 0 100
jtolley=# select * from a where random() < .1 limit 3;
  i
----
 22
 23
 25
(3 rows)

Hope this helps...

-Josh
Alban Hertroys wrote:
> I've recently been busy improving a query that yields a fixed
> number of random records matching certain conditions.

> Dear Santa,
>
> I'd like my database to have functionality analogous to how LIMIT works,
> but for other - non-sequential - algorithms.
>
> I was thinking along the lines of:
>
>     SELECT *
>     FROM table
>     WHERE condition = true
>     RANDOM 5;

Ho, ho, ho.

    SELECT * FROM table
    WHERE condition = true
    ORDER BY hashfloat8(random())
    LIMIT 5;

Yours,
Laurenz Albe
> Dear Santa,
>
> I'd like my database to have functionality analogous to how LIMIT works,
> but for other - non-sequential - algorithms.

There was some discussion before about possibly reusing the algorithm ANALYZE uses for sampling some given percentage of the table data, and providing that for some kind of "SELECT SAMPLE x%" style of functionality. This would be the fastest way to get a sample big enough to be statistically significant, but it would not be repeatable. I'm not sure if this is the same as what you were asking for, though; I would like something like this for statistical stuff, not for randomly selecting rows.

Cheers,
Csaba.
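(As a note for anyone reading this on a newer server: sampling syntax along these lines did later land in PostgreSQL as TABLESAMPLE, available from 9.5 on. It samples a percentage of a base table rather than returning a fixed number of rows, and the sampling happens before the WHERE clause filters anything, so it still doesn't do what the original request asks for. A minimal example with a placeholder table name:

    -- roughly 1% of the table, chosen block-wise (fast, but clumpy)
    SELECT * FROM some_table TABLESAMPLE SYSTEM (1);

    -- roughly 1% of the table, chosen row-wise (slower, better spread)
    SELECT * FROM some_table TABLESAMPLE BERNOULLI (1);
)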
On Fri, Aug 31, 2007 at 02:42:18PM +0200, Alban Hertroys wrote:
> Examples:
> * random(maxrows) would return random rows from the resultset.
> * median() would return the rows in the middle of the result set (this
>   would require ordering to be meaningful).

It would be possible to write an aggregate that returns a single random value from a set. The algorithm is something like:

    n = 1
    v = null
    for each row
        if random() < 1/n:
            v = value of row
        n = n + 1
    return v

It does require a seqscan though. If you're asking for 5 random rows you probably mean 5 random but distinct rows, which is different from just running the above five times in parallel.

I don't know if there's a similar method for median...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
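This "keep the current row with probability 1/n" idea can be written as a custom aggregate. Below is a minimal sketch for text values; the type, function, and aggregate names (random_pick_state, random_pick_step, random_pick_final, random_pick) are invented for illustration, not anything that ships with PostgreSQL:

----
CREATE TYPE random_pick_state AS (val text, n bigint);

-- Transition function: called once per input row.
CREATE FUNCTION random_pick_step(s random_pick_state, v text)
RETURNS random_pick_state AS $$
DECLARE
    result random_pick_state;
BEGIN
    IF s IS NULL THEN
        -- first row is always kept
        result.val := v;
        result.n   := 1;
    ELSE
        result.n := s.n + 1;
        IF random() < 1.0::float8 / result.n THEN
            result.val := v;        -- replace the kept value with probability 1/n
        ELSE
            result.val := s.val;    -- keep the old value
        END IF;
    END IF;
    RETURN result;
END;
$$ LANGUAGE plpgsql VOLATILE;

-- Final function: return only the kept value, not the row counter.
CREATE FUNCTION random_pick_final(s random_pick_state)
RETURNS text AS $$ SELECT ($1).val $$ LANGUAGE sql;

CREATE AGGREGATE random_pick(text) (
    SFUNC     = random_pick_step,
    STYPE     = random_pick_state,
    FINALFUNC = random_pick_final
);

-- Usage: one uniformly chosen value per group, at the cost of a full scan.
-- SELECT random_pick(some_column) FROM some_table WHERE condition = true;
----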
Hi,

Why not generate a random number in your application and then:

    SELECT *
    FROM table_x
    WHERE condition = true
    OFFSET generated_random_number
    LIMIT xx

Kaloyan Iliev

Alban Hertroys wrote:
> All these methods involve calculating a random number for every record
> in the result set at some point in time, which is really not what I'm
> trying to model. I think the database should provide some means to get
> those records, so...
>
> [...]
On Aug 31, 2007, at 8:34 AM, Kaloyan Iliev wrote:
> Hi,
> Why not generate a random number in your application and then:
>
>     SELECT *
>     FROM table_x
>     WHERE condition = true
>     OFFSET generated_random_number
>     LIMIT xx
>
> Kaloyan Iliev

That won't work without some kind of a priori knowledge of how many rows the query would return without the offset and limit.

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
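In practice the OFFSET approach therefore takes two statements, roughly like the sketch below (table and column names are placeholders, and the count can of course change between the two queries):

    -- step 1: find out how many rows match
    SELECT count(*) FROM table_x WHERE condition = true;   -- say this returns 1000

    -- step 2: the application draws generated_random_number = floor(random() * 1000),
    -- e.g. 987, and skips that many rows
    SELECT * FROM table_x WHERE condition = true OFFSET 987 LIMIT 1;

Also note that without an ORDER BY the row order, and therefore which row a given offset lands on, is not guaranteed to be stable between executions.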
On Aug 31, 2007, at 15:54, Martijn van Oosterhout wrote:
> It would be possible to write an aggregate that returns a single random
> value from a set. The algorithm is something like:
>
>     n = 1
>     v = null
>     for each row
>         if random() < 1/n:
>             v = value of row
>         n = n + 1
>     return v

Doesn't this always return the first record, since random() is always less than 1/1? I don't think this method has a linear distribution, but then again I don't understand what 'value of row' refers to...

> It does require a seqscan though.

I doubt that a seqscan can be entirely avoided to fetch random rows from a set, at least not until the last random result has been returned, _unless_ the number of matching records is known before starting to take random samples.

> If you're asking for 5 random rows you probably mean 5 random but
> distinct rows, which is different from just running the above five
> times in parallel.

Indeed, that is one of the distinctions that needs some thought for my original proposition. I left it out, as it's an implementation detail (an important one, admittedly).

> I don't know if there's a similar method for median...

I'm not entirely sure, but I think your method is the only one suggested that doesn't involve calculating random() a million times (for a million records) to return 5 (random) records. My suggestion involved a way to calculate random() only when retrieving records from the result set (only 5 times for a million records in this case). For a linearly distributed random set it does require knowing the number of records in the set though; an estimate would make it non-linear (although only a little bit, if accurate enough).

OTOH, I'm starting to think that the last sort step of an ORDER BY could be postponed to the result-set fetching cycle under the conditions that:
- the ordering expression is unrelated to the records involved, and
- only a fraction of the total number of records will be returned.
(Which is somewhat similar to the condition for an index being more efficient than a seqscan, btw.)

Comparing records with each other on something unrelated seems a waste of effort while the result set has already been determined (just not ordered in any particular way), am I right?

With that change (postponing sorting) my original ORDER BY random() LIMIT 5 would perform quite adequately, I think - it'd only involve calculating random() at least 5 times, not as often as the number of records in the result set.

Or is ORDER BY random() acting as some kind of shuffling method? Is that a requirement to get a linearly distributed set to randomly draw from? I can see how it wouldn't be linear if you'd start randomly comparing records from the beginning of the result set...
(Which would be the logical method if you don't know the size of the set beforehand.)

I thought of another solution (with only a few calculations of random()) that can be deployed in existing versions of PG, using a set-returning function with a scrolling cursor that accepts the query string as input, like this (in pseudoish-code):

----
create function random(text _query, integer _limit)
returns set
volatile
as $$
DECLARE
    _cur     cursor;
    _cnt     bigint;
    _idx     integer;
    _rowpos  bigint;
    _rec     record;
BEGIN
    open _cur for execute _query;
    fetch forward all into _rec;
    -- select total nr of records into _cnt

    for _idx in 1.._limit loop
        _rowpos := random() * _cnt;

        fetch absolute _rowpos into _rec;
        return next _rec;
    end loop;

    return;
END;
$$
language 'plpgsql';
----

This method could return the same record twice though; I'll need to build in some accounting for used-up rowpos'es. Would it be more efficient than the usual methods?

Sorry for the brain dump, I tried to get everything into this single message. I hope it is at least comprehensible and useful, or interesting or at least mildly amusing if not.

Regards,
Alban Hertroys

magproductions b.v.
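The pseudocode above can be turned into something that actually runs, assuming a PostgreSQL version whose PL/pgSQL supports MOVE and FETCH with a direction clause (9.0 or later, so not the 8.1 mentioned further down); the function name random_rows is invented here, and duplicate rows are still possible:

----
CREATE OR REPLACE FUNCTION random_rows(_query text, _limit integer)
RETURNS SETOF record
VOLATILE
AS $$
DECLARE
    _cur     refcursor;
    _cnt     bigint;
    _rowpos  bigint;
    _rec     record;
BEGIN
    OPEN _cur SCROLL FOR EXECUTE _query;

    -- Walk to the end once to learn how many rows there are;
    -- ROW_COUNT after MOVE reflects the number of rows skipped.
    MOVE FORWARD ALL FROM _cur;
    GET DIAGNOSTICS _cnt = ROW_COUNT;

    IF _cnt = 0 THEN
        CLOSE _cur;
        RETURN;                          -- empty result set, nothing to pick
    END IF;

    FOR _idx IN 1.._limit LOOP
        _rowpos := 1 + floor(random() * _cnt);   -- 1 .. _cnt
        FETCH ABSOLUTE _rowpos FROM _cur INTO _rec;
        RETURN NEXT _rec;
    END LOOP;

    CLOSE _cur;
    RETURN;
END;
$$ LANGUAGE plpgsql;

-- Because it returns SETOF record, callers supply a column definition list:
-- SELECT * FROM random_rows('SELECT i FROM a', 5) AS t(i integer);
----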
On Sep 1, 2007, at 12:44, Alban Hertroys wrote:
>> It would be possible to write an aggregate that returns a single random
>> value from a set. The algorithm is something like:
>>
>>     n = 1
>>     v = null
>>     for each row
>>         if random() < 1/n:
>>             v = value of row
>>         n = n + 1
>>     return v
>
> Doesn't this always return the first record, since random() is always
> less than 1/1? I don't think this method has a linear distribution, but
> then again I don't understand what 'value of row' refers to...

Oh, now I see... The first time guarantees that v has a value (as random() < 1/1), and after that there is a decreasing chance that a new row gets re-assigned to v. That means the last row has a chance of 1/n, which would be its normal chance if the distribution were linear, but doesn't the first row have a chance of 1/(n!) of being returned?

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
On Sat, Sep 01, 2007 at 02:24:25PM +0200, Alban Hertroys wrote:
> Oh, now I see... The first time guarantees that v has a value (as
> random() < 1/1), and after that there is a decreasing chance that a
> new row gets re-assigned to v. That means the last row has a chance
> of 1/n, which would be its normal chance if the distribution were
> linear, but doesn't the first row have a chance of 1/(n!) of being
> returned?

No. Consider: at the first row it has a chance of 1 of being selected. At the second row it has a chance of 1/2 of being *kept*. At the third row it has a chance of 2/3 of being kept. At row four it's 3/4. As you see, the numerators and denominators cancel, leaving 1/n at the end...

Neat huh?
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
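Spelling out the cancellation for an arbitrary row: row i becomes the kept value at step i with probability 1/i, and then survives each later step k with probability (k-1)/k, so

    P(row i is the final value) = (1/i) * (i/(i+1)) * ((i+1)/(i+2)) * ... * ((n-1)/n) = 1/n

the same 1/n for every row, which is exactly a uniform draw.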
On Sep 1, 2007, at 14:44, Martijn van Oosterhout wrote:
> No. Consider: at the first row it has a chance of 1 of being selected.
> At the second row it has a chance of 1/2 of being *kept*. At the third
> row it has a chance of 2/3 of being kept. At row four it's 3/4. As you
> see, the numerators and denominators cancel, leaving 1/n at the end...

Ah, now I see where I went wrong. If the first row got through to any next iteration, of course there's no chance anymore that it didn't.

> Neat huh?

Neat from an algorithmic point of view, yes. But it also means that it's calculating random() for every record, just like the rest of the suggested solutions :( I'm still convinced that isn't the right approach to the problem. I think I'll do some experimenting with the set-returning function on Monday to see how it performs compared to ordering by random().

The problem with the approaches that use pre-calculated random values is that I need my result to be truly random each time. If I'd return a number of records starting at a record that has a certain random value, I'd end up returning the records directly after it in the same order every time, because their order is pre-determined. I realise the chances of that happening are slim provided there are enough records to choose from, and chance dictates that it could happen anyway, but you couldn't conscientiously sell that as an equal chance... My boss thinks otherwise though, maybe I'll have to settle for almost fair :P

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
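The pre-calculated approach being dismissed here is, roughly, to store a random value per row once, index it, and then read the rows following a freshly drawn threshold; a sketch with placeholder table and column names:

    -- done once (and kept up to date by the application or a trigger)
    ALTER TABLE some_table ADD COLUMN rand_val double precision DEFAULT random();
    CREATE INDEX some_table_rand_idx ON some_table (rand_val);

    -- per request; :threshold is drawn once by the application, because a bare
    -- random() in the WHERE clause would be re-evaluated for every row
    SELECT *
    FROM some_table
    WHERE condition = true
      AND rand_val >= :threshold
    ORDER BY rand_val
    LIMIT 5;

It is fast, but the rows following a given threshold always come back together and in the same order, which is exactly the objection above.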
To follow up on my own post, I came up with a workable solution based on scrolling cursors. The SP approach didn't work out for me; I didn't manage to declare a cursor in PL/pgSQL that could be positioned absolutely (maybe that's due to us still using PG 8.1.something?). A solution to that would be appreciated.

Anyway, I solved the problem in our application (PHP). I even got a workable solution to prevent returning the same record more than once. Here goes:

function randomSet($query, $limit, $uniqueColumn)
{
    // queries; depends on your DB connector
    DECLARE _cur SCROLL CURSOR WITHOUT HOLD FOR $query;
    MOVE FORWARD ALL IN _cur;

    //GET DIAGNOSTICS _count := ROW_COUNT;
    $count = pg_affected_rows();

    $uniques = array();
    $resultSet = array();

    while ($limit > 0 && count($uniques) < $count) {
        $idx = random(1, $count);

        // query
        $record = FETCH ABSOLUTE $idx FROM _cur;

        // Skip records with a column value we want to be unique
        if (in_array($record[$uniqueColumn], $uniques))
            continue;

        $uniques[] = $record[$uniqueColumn];
        $resultSet[] = $record;
        $limit--;
    }

    // query
    CLOSE _cur;

    return $resultSet;
}

I hope this is useful to anyone. It worked for us; it is definitely faster than ORDER BY random(), and more random than precalculated column values. Plus it translates directly to what we are requesting :)

Alban Hertroys wrote:
> I thought of another solution (with only a few calculations of random())
> that can be deployed in existing versions of PG, using a set-returning
> function with a scrolling cursor that accepts the query string as input,
> like this (in pseudoish-code):
>
> ----
> create function random(text _query, integer _limit)
> returns set
> volatile
> as $$
> DECLARE
>     _cur     cursor;
>     _cnt     bigint;
>     _idx     integer;
>     _rowpos  bigint;
>     _rec     record;
> BEGIN
>     open _cur for execute _query;
>     fetch forward all into _rec;
>     -- select total nr of records into _cnt
>
>     for _idx in 1.._limit loop
>         _rowpos := random() * _cnt;
>
>         fetch absolute _rowpos into _rec;
>         return next _rec;
>     end loop;
>
>     return;
> END;
> $$
> language 'plpgsql';
> ----

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
Alban Hertroys wrote:
> To follow up on my own post, I came up with a workable solution based on
> scrolling cursors. The SP approach didn't work out for me; I didn't
> manage to declare a cursor in PL/pgSQL that could be positioned
> absolutely (maybe that's due to us still using PG 8.1.something?).

Doh! I meant that I couldn't use MOVE FORWARD ALL IN _cur for some reason; it kept saying "Syntax error".

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //