Discussion: Obtaining random rows from a result set
Hello,

I've recently been busy improving a query that yields a fixed number of random records matching certain conditions. I have tried all the usual approaches, and although they do work, they're all limited in some way and don't translate really well to what you "want". They're kludges, IMHO.

The methods I've tried are explained quite well on
http://people.planetpostgresql.org/greg/index.php?/archives/40-Getting-random-rows-from-a-database-table.html

All these methods involve calculating a random number for every record in the result set at some point in time, which is really not what I'm trying to model. I think the database should provide some means to get those records, so...

Dear Santa,

I'd like my database to have functionality analogous to how LIMIT works, but for other - non-sequential - algorithms.

I was thinking along the lines of:

    SELECT *
    FROM table
    WHERE condition = true
    RANDOM 5;

Which would return (up to) 5 random rows from the result set, just as LIMIT 5 returns (up to) the first 5 records in the result set.

Or maybe even with a custom function, so that you could get non-linear distributions:

    SELECT *
    FROM table
    WHERE condition = true
    LIMIT 5 USING my_func();

Where my_func() could be a user-definable function accepting a number that should be (an estimate of?) the number of results being returned, so that it can provide pointers to which rows in the result set will be returned from the query.

Examples:
* random(maxrows) would return random rows from the result set.
* median() would return the rows in the middle of the result set (this would require ordering to be meaningful).

What do people think: is this feasible? Desirable? Necessary?

If I had time I'd volunteer for at least looking into this, but I'm working on three projects simultaneously already. Alas...

Regards,
Alban Hertroys.

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
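The usual workaround being argued against here (and in the linked article) is to order the whole result set by a per-row random value and keep the first few rows; a minimal sketch, with a placeholder table name since "table" itself is a reserved word:

    SELECT *
    FROM some_table
    WHERE condition = true
    ORDER BY random()   -- computes random() once per matching row
    LIMIT 5;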
On 8/31/07, Alban Hertroys <alban@magproductions.nl> wrote:
> All these methods involve calculating a random number for every record
> in the result set at some point in time, which is really not what I'm
> trying to model. I think the database should provide some means to get
> those records, so...
>
> I'd like my database to have functionality analogous to how LIMIT works,
> but for other - non-sequential - algorithms.
>
> [...]
>
> What do people think: is this feasible? Desirable? Necessary?

It seems to me that anything that wants to return a random set of rows will need to calculate a random number for all the rows it processes, unless you change how the database scans rows in indexes or tables, which, if it is possible at all, will probably make things *really* slow. If it's a given that the database will always sequentially scan whatever the query plan tells it to scan, you're pretty much stuck with the rows in your result set being in the same order unless you start picking random numbers.

One possible alternative not mentioned on the site you linked to is as follows:

    select [whatever]
    from [table]
    where random() < [some number between 0 and 1]
    limit [limit value]

That doesn't require assigning a random number to *every* row in the table, nor does it require sorting everything. It does mean that rows encountered earlier in the query processing have a higher likelihood of being returned, and it also means that there's some chance you won't actually get as many as [limit value] rows returned.
jtolley=# create table a (i integer);
CREATE TABLE
jtolley=# insert into a (i) select * from generate_series(1, 100);
INSERT 0 100
jtolley=# select * from a where random() < .1 limit 3;
  i
----
 22
 23
 25
(3 rows)

Hope this helps...

-Josh
Alban Hertroys wrote:
> I've recently been busy improving a query that yields a fixed
> number of random records matching certain conditions.

> Dear Santa,
>
> I'd like my database to have functionality analogous to how LIMIT works,
> but for other - non-sequential - algorithms.
>
> I was thinking along the lines of:
>
>     SELECT *
>     FROM table
>     WHERE condition = true
>     RANDOM 5;

Ho, ho, ho.

    SELECT * FROM table
    WHERE condition = true
    ORDER BY hashfloat8(random())
    LIMIT 5;

Yours,
Laurenz Albe
> Dear Santa,
>
> I'd like my database to have functionality analogous to how LIMIT works,
> but for other - non-sequential - algorithms.

There was some discussion before about possibly reusing the algorithm ANALYZE uses for sampling some given percentage of the table data, and providing that for some kind of "SELECT SAMPLE x%" style of functionality. This would be the fastest way to get a sample big enough to be statistically significant, but it would not be repeatable. I'm not sure if this is the same as what you were asking for, though; I would like something like this for statistical stuff, not for randomly selecting rows.

Cheers,
Csaba.
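(As a note for anyone reading this on a newer server: sampling syntax along these lines did later land in PostgreSQL as TABLESAMPLE, available from 9.5 on. It samples a percentage of a base table rather than returning a fixed number of rows, and the sampling happens before the WHERE clause filters anything, so it still doesn't do what the original request asks for. A minimal example with a placeholder table name:

    -- roughly 1% of the table, chosen block-wise (fast, but clumpy)
    SELECT * FROM some_table TABLESAMPLE SYSTEM (1);

    -- roughly 1% of the table, chosen row-wise (slower, better spread)
    SELECT * FROM some_table TABLESAMPLE BERNOULLI (1);
)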
On Fri, Aug 31, 2007 at 02:42:18PM +0200, Alban Hertroys wrote:
> Examples:
> * random(maxrows) would return random rows from the resultset.
> * median() would return the rows in the middle of the result set (this
>   would require ordering to be meaningful).

It would be possible to write an aggregate that returns a single random value from a set. The algorithm is something like:

    n = 1
    v = null
    for each row
        if random() < 1/n:
            v = value of row
        n = n + 1
    return v

It does require a seqscan though. If you're asking for 5 random rows you probably mean 5 random but distinct rows, which is different from just running the above five times in parallel.

I don't know if there's a similar method for median...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
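This "keep the current row with probability 1/n" idea can be written as a custom aggregate. Below is a minimal sketch for text values; the type, function, and aggregate names (random_pick_state, random_pick_step, random_pick_final, random_pick) are invented for illustration, not anything that ships with PostgreSQL:

----
CREATE TYPE random_pick_state AS (val text, n bigint);

-- Transition function: called once per input row.
CREATE FUNCTION random_pick_step(s random_pick_state, v text)
RETURNS random_pick_state AS $$
DECLARE
    result random_pick_state;
BEGIN
    IF s IS NULL THEN
        -- first row is always kept
        result.val := v;
        result.n   := 1;
    ELSE
        result.n := s.n + 1;
        IF random() < 1.0::float8 / result.n THEN
            result.val := v;        -- replace the kept value with probability 1/n
        ELSE
            result.val := s.val;    -- keep the old value
        END IF;
    END IF;
    RETURN result;
END;
$$ LANGUAGE plpgsql VOLATILE;

-- Final function: return only the kept value, not the row counter.
CREATE FUNCTION random_pick_final(s random_pick_state)
RETURNS text AS $$ SELECT ($1).val $$ LANGUAGE sql;

CREATE AGGREGATE random_pick(text) (
    SFUNC     = random_pick_step,
    STYPE     = random_pick_state,
    FINALFUNC = random_pick_final
);

-- Usage: one uniformly chosen value per group, at the cost of a full scan.
-- SELECT random_pick(some_column) FROM some_table WHERE condition = true;
----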
Hi,

Why not generate a random number in your application and then:

    SELECT *
    FROM table_x
    WHERE condition = true
    OFFSET generated_random_number
    LIMIT xx

Kaloyan Iliev

Alban Hertroys wrote:
> All these methods involve calculating a random number for every record
> in the result set at some point in time, which is really not what I'm
> trying to model. I think the database should provide some means to get
> those records, so...
>
> [...]
On Aug 31, 2007, at 8:34 AM, Kaloyan Iliev wrote:
> Hi,
> Why not generate a random number in your application and then:
>
>     SELECT *
>     FROM table_x
>     WHERE condition = true
>     OFFSET generated_random_number
>     LIMIT xx
>
> Kaloyan Iliev

That won't work without some kind of a priori knowledge of how many rows the query would return without the offset and limit.

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
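In practice the OFFSET approach therefore takes two statements, roughly like the sketch below (table and column names are placeholders, and the count can of course change between the two queries):

    -- step 1: find out how many rows match
    SELECT count(*) FROM table_x WHERE condition = true;   -- say this returns 1000

    -- step 2: the application draws generated_random_number = floor(random() * 1000),
    -- e.g. 987, and skips that many rows
    SELECT * FROM table_x WHERE condition = true OFFSET 987 LIMIT 1;

Also note that without an ORDER BY the row order, and therefore which row a given offset lands on, is not guaranteed to be stable between executions.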
On Aug 31, 2007, at 15:54, Martijn van Oosterhout wrote:
> It would be possible to write an aggregate that returns a single random
> value from a set. The algorithm is something like:
>
>     n = 1
>     v = null
>     for each row
>         if random() < 1/n:
>             v = value of row
>         n = n + 1
>     return v

Doesn't this always return the first record, since random() is always less than 1/1? I don't think this method has a linear distribution, but then again I don't understand what 'value of row' refers to...

> It does require a seqscan though.

I doubt that a seqscan can be entirely avoided to fetch random rows from a set, at least not until the last random result has been returned, _unless_ the number of matching records is known before starting to take random samples.

> If you're asking for 5 random rows you probably mean 5 random but
> distinct rows, which is different from just running the above five
> times in parallel.

Indeed, that is one of the distinctions that needs some thought for my original proposition. I left it out, as it's an implementation detail (an important one, admittedly).

> I don't know if there's a similar method for median...

I'm not entirely sure, but I think your method is the only one suggested that doesn't involve calculating random() a million times (for a million records) to return 5 (random) records. My suggestion involved a way to calculate random() only when retrieving records from the result set (only 5 times for a million records in this case). For a linearly distributed random set it does require knowing the number of records in the set though; an estimate would make it non-linear (although only a little bit, if accurate enough).

OTOH, I'm starting to think that the last sort step of an ORDER BY could be postponed to the result-set fetching cycle under the conditions that:
- the ordering expression is unrelated to the records involved, and
- only a fraction of the total number of records will be returned.
(Which is somewhat similar to the condition for an index being more efficient than a seqscan, btw.)

Comparing records with each other on something unrelated seems a waste of effort while the result set has already been determined (just not ordered in any particular way), am I right?

With that change (postponing sorting) my original ORDER BY random() LIMIT 5 would perform quite adequately, I think - it'd only involve calculating random() at least 5 times, not as often as the number of records in the result set.

Or is ORDER BY random() acting as some kind of shuffling method? Is that a requirement to get a linearly distributed set to randomly draw from? I can see how it wouldn't be linear if you'd start randomly comparing records from the beginning of the result set...
(Which would be the logical method if you don't know the size of the set beforehand.)

I thought of another solution (with only a few calculations of random()) that can be deployed in existing versions of PG, using a set-returning function with a scrolling cursor that accepts the query string as input, like this (in pseudoish-code):

----
create function random(text _query, integer _limit)
returns set
volatile
as $$
DECLARE
    _cur     cursor;
    _cnt     bigint;
    _idx     integer;
    _rowpos  bigint;
    _rec     record;
BEGIN
    open _cur for execute _query;
    fetch forward all into _rec;
    -- select total nr of records into _cnt

    for _idx in 1.._limit loop
        _rowpos := random() * _cnt;

        fetch absolute _rowpos into _rec;
        return next _rec;
    end loop;

    return;
END;
$$
language 'plpgsql';
----

This method could return the same record twice though; I'll need to build in some accounting for used-up rowpos'es. Would it be more efficient than the usual methods?

Sorry for the brain dump, I tried to get everything into this single message. I hope it is at least comprehensible and useful, or interesting or at least mildly amusing if not.

Regards,
Alban Hertroys

magproductions b.v.
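The pseudocode above can be turned into something that actually runs, assuming a PostgreSQL version whose PL/pgSQL supports MOVE and FETCH with a direction clause (9.0 or later, so not the 8.1 mentioned further down); the function name random_rows is invented here, and duplicate rows are still possible:

----
CREATE OR REPLACE FUNCTION random_rows(_query text, _limit integer)
RETURNS SETOF record
VOLATILE
AS $$
DECLARE
    _cur     refcursor;
    _cnt     bigint;
    _rowpos  bigint;
    _rec     record;
BEGIN
    OPEN _cur SCROLL FOR EXECUTE _query;

    -- Walk to the end once to learn how many rows there are;
    -- ROW_COUNT after MOVE reflects the number of rows skipped.
    MOVE FORWARD ALL FROM _cur;
    GET DIAGNOSTICS _cnt = ROW_COUNT;

    IF _cnt = 0 THEN
        CLOSE _cur;
        RETURN;                          -- empty result set, nothing to pick
    END IF;

    FOR _idx IN 1.._limit LOOP
        _rowpos := 1 + floor(random() * _cnt);   -- 1 .. _cnt
        FETCH ABSOLUTE _rowpos FROM _cur INTO _rec;
        RETURN NEXT _rec;
    END LOOP;

    CLOSE _cur;
    RETURN;
END;
$$ LANGUAGE plpgsql;

-- Because it returns SETOF record, callers supply a column definition list:
-- SELECT * FROM random_rows('SELECT i FROM a', 5) AS t(i integer);
----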
On Sep 1, 2007, at 12:44, Alban Hertroys wrote:
>> It would be possible to write an aggregate that returns a single random
>> value from a set. The algorithm is something like:
>>
>>     n = 1
>>     v = null
>>     for each row
>>         if random() < 1/n:
>>             v = value of row
>>         n = n + 1
>>     return v
>
> Doesn't this always return the first record, since random() is always
> less than 1/1? I don't think this method has a linear distribution, but
> then again I don't understand what 'value of row' refers to...

Oh, now I see... The first time guarantees that v has a value (as random() < 1/1), and after that there is a decreasing chance that a new row gets re-assigned to v. That means the last row has a chance of 1/n, which would be its normal chance if the distribution were linear, but doesn't the first row have a chance of 1/(n!) of being returned?

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
On Sat, Sep 01, 2007 at 02:24:25PM +0200, Alban Hertroys wrote:
> Oh, now I see... The first time guarantees that v has a value (as
> random() < 1/1), and after that there is a decreasing chance that a
> new row gets re-assigned to v. That means the last row has a chance
> of 1/n, which would be its normal chance if the distribution were
> linear, but doesn't the first row have a chance of 1/(n!) of being
> returned?

No. Consider: at the first row it has a chance of 1 of being selected. At the second row it has a chance of 1/2 of being *kept*. At the third row it has a chance of 2/3 of being kept. At row four it's 3/4. As you see, the numerators and denominators cancel, leaving 1/n at the end...

Neat huh?
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
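Spelling out the cancellation for an arbitrary row: row i becomes the kept value at step i with probability 1/i, and then survives each later step k with probability (k-1)/k, so

    P(row i is the final value) = (1/i) * (i/(i+1)) * ((i+1)/(i+2)) * ... * ((n-1)/n) = 1/n

the same 1/n for every row, which is exactly a uniform draw.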
On Sep 1, 2007, at 14:44, Martijn van Oosterhout wrote:
> No. Consider: at the first row it has a chance of 1 of being selected.
> At the second row it has a chance of 1/2 of being *kept*. At the third
> row it has a chance of 2/3 of being kept. At row four it's 3/4. As you
> see, the numerators and denominators cancel, leaving 1/n at the end...

Ah, now I see where I went wrong. If the first row got through to any next iteration, of course there's no chance anymore that it didn't.

> Neat huh?

Neat from an algorithmic point of view, yes. But it also means that it's calculating random() for every record, just like the rest of the suggested solutions :( I'm still convinced that isn't the right approach to the problem. I think I'll do some experimenting with the set-returning function on Monday to see how it performs compared to ordering by random().

The problem with the approaches that use pre-calculated random values is that I need my result to be truly random each time. If I'd return a number of records starting at a record that has a certain random value, I'd end up returning the records directly after it in the same order every time, because their order is pre-determined. I realise the chances of that happening are slim provided there are enough records to choose from, and chance dictates that it could happen anyway, but you couldn't conscientiously sell that as an equal chance... My boss thinks otherwise though, maybe I'll have to settle for almost fair :P

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
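The pre-calculated approach being dismissed here is, roughly, to store a random value per row once, index it, and then read the rows following a freshly drawn threshold; a sketch with placeholder table and column names:

    -- done once (and kept up to date by the application or a trigger)
    ALTER TABLE some_table ADD COLUMN rand_val double precision DEFAULT random();
    CREATE INDEX some_table_rand_idx ON some_table (rand_val);

    -- per request; :threshold is drawn once by the application, because a bare
    -- random() in the WHERE clause would be re-evaluated for every row
    SELECT *
    FROM some_table
    WHERE condition = true
      AND rand_val >= :threshold
    ORDER BY rand_val
    LIMIT 5;

It is fast, but the rows following a given threshold always come back together and in the same order, which is exactly the objection above.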
To follow up on my own post, I came up with a workable solution based on scrolling cursors. The SP approach didn't work out for me; I didn't manage to declare a cursor in PL/pgSQL that could be positioned absolutely (maybe that's due to us still using PG 8.1.something?). A solution to that would be appreciated.

Anyway, I solved the problem in our application (PHP). I even got a workable solution to prevent returning the same record more than once. Here goes:

function randomSet($query, $limit, $uniqueColumn)
{
    // queries; depends on your DB connector
    DECLARE _cur SCROLL CURSOR WITHOUT HOLD FOR $query;
    MOVE FORWARD ALL IN _cur;

    //GET DIAGNOSTICS _count := ROW_COUNT;
    $count = pg_affected_rows();

    $uniques = array();
    $resultSet = array();

    while ($limit > 0 && count($uniques) < $count) {
        $idx = random(1, $count);

        // query
        $record = FETCH ABSOLUTE $idx FROM _cur;

        // Skip records with a column value we want to be unique
        if (in_array($record[$uniqueColumn], $uniques))
            continue;

        $uniques[] = $record[$uniqueColumn];
        $resultSet[] = $record;
        $limit--;
    }

    // query
    CLOSE _cur;

    return $resultSet;
}

I hope this is useful to anyone. It worked for us; it is definitely faster than ORDER BY random(), and more random than precalculated column values. Plus it translates directly to what we are requesting :)

Alban Hertroys wrote:
> I thought of another solution (with only a few calculations of random())
> that can be deployed in existing versions of PG, using a set-returning
> function with a scrolling cursor that accepts the query string as input,
> like this (in pseudoish-code):
>
> ----
> create function random(text _query, integer _limit)
> returns set
> volatile
> as $$
> DECLARE
>     _cur     cursor;
>     _cnt     bigint;
>     _idx     integer;
>     _rowpos  bigint;
>     _rec     record;
> BEGIN
>     open _cur for execute _query;
>     fetch forward all into _rec;
>     -- select total nr of records into _cnt
>
>     for _idx in 1.._limit loop
>         _rowpos := random() * _cnt;
>
>         fetch absolute _rowpos into _rec;
>         return next _rec;
>     end loop;
>
>     return;
> END;
> $$
> language 'plpgsql';
> ----

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //
Alban Hertroys wrote:
> To follow up on my own post, I came up with a workable solution based on
> scrolling cursors. The SP approach didn't work out for me; I didn't
> manage to declare a cursor in PL/pgSQL that could be positioned
> absolutely (maybe that's due to us still using PG 8.1.something?).

Doh! I meant that I couldn't use MOVE FORWARD ALL IN _cur for some reason; it kept saying "Syntax error".

--
Alban Hertroys
alban@magproductions.nl

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //