Discussion: random record from small set
I am trying to retrieve a random record (according to a chance
attribute) from a small set of records, each with a "chance" attribute.
This may eventually be somewhat of a performance concern, so I'd like to
make sure I'm doing this right.
Here's what I have so far:
create table r1 (
    i int,
    chance numeric
);

create or replace function randrec() returns int as $$
    my $res = spi_exec_query('select i, chance from r1');
    my $r = rand;
    my $accum = 0;
    my $i = 0;
    while ($accum < $r) {
        $accum += $res->{rows}[$i++]->{chance};
    }
    return $res->{rows}[$i-1]->{i};
$$ language plperl;
test=# select * from r1;
i | chance
---+--------
1 | 0.25
2 | 0.20
3 | 0.15
4 | 0.10
5 | 0.30
That seems to work, in that out of 10k times, I got the following
numbers of each:
1 2479
2 1959
3 1522
4 950
5 3090
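For readers following along outside the database, the cumulative-sum scan above can be sketched as a standalone Python function. This is illustrative only; the row layout and the round-off fallback at the end are my additions, not part of the original PL/Perl:

```python
import random

def randrec(rows):
    """Weighted pick: walk the rows, accumulating 'chance' until the
    running total reaches a uniform random draw in [0, 1)."""
    r = random.random()
    accum = 0.0
    for i, chance in rows:
        accum += chance
        if accum >= r:
            return i
    # Guard against float round-off leaving accum just below r:
    # fall back to the last row instead of running off the end.
    return rows[-1][0]

rows = [(1, 0.25), (2, 0.20), (3, 0.15), (4, 0.10), (5, 0.30)]
```

Unlike the `while` loop in the PL/Perl version, the bounded `for` loop cannot index past the end of the result set, which bears on the infinite-loop question below.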
But I have a few questions:
* Am I right to use NUMERIC for the chance attribute?
* Does perl's arithmetic leave me with the chance that those numeric
values don't add up to 1.00 (and in this case that could mean an
infinite loop)?
* In my design I'll need a constraint trigger making sure that the
numbers add up to 1.00. Will that be a performance problem for
operations on the table that don't modify the chance attribute?
* Is there a better way?
* Does spi_exec_query pull the entire result set into memory at once? Is
there a point at which performance could be a serious problem if there
are a large number of items to select among?
Regards,
Jeff Davis
On Mon, Feb 14, 2005 at 06:15:56PM -0800, Jeff Davis wrote:
>
> * Am I right to use NUMERIC for the chance attribute?
I ran tests with numeric, real, and double precision; double precision
was consistently about 10% faster than the others. I used the
sample data you posted and the PL/pgSQL function shown later in
this message.
> * Does perl's arithmetic leave me with the chance that those numeric
> values don't add up to 1.00 (and in this case that could mean an
> infinite loop)?
I'd suggest looping through the records so you can't possibly end
up in an infinite loop.
> * In my design I'll need a constraint trigger making sure that the
> numbers add up to 1.00.
If the sum must be exactly 1.00 then be careful if you use double
precision -- if you test with the equality operator then the check
might fail because the sum is 0.9999999987.
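A quick standalone illustration of that pitfall (Python floats are the same IEEE 754 doubles as PostgreSQL's double precision; the tolerance value here is my choice, not something from the thread):

```python
# Ten values of 0.10 "obviously" sum to 1.00 -- but not in binary floating point:
total = sum([0.1] * 10)
print(total == 1.0)        # False: total is 0.9999999999999999

# A constraint check should therefore compare against a tolerance,
# not with the equality operator:
def sums_to_one(chances, eps=1e-9):
    return abs(sum(chances) - 1.0) <= eps

print(sums_to_one([0.1] * 10))                        # True
print(sums_to_one([0.25, 0.20, 0.15, 0.10, 0.30]))    # True
```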
> Will that be a performance problem for operations on the table that
> don't modify the chance attribute?
Any trigger that you didn't otherwise need will cause a performance
hit. I'd expect a statement-level AFTER trigger to have the lowest
impact since it would run only once per statement, whereas a row-level
trigger might run multiple times per statement. On the other hand,
if you make a lot of updates that don't modify the chance attribute,
then you might want to try a row-level trigger that skips the check
when NEW.chance = OLD.chance. I'd suggest testing different methods
under expected conditions to see which has the lowest impact.
> * Is there a better way?
> * Does spi_exec_query pull the entire result set into memory at once?
I think it does. I ran some tests with the following PL/pgSQL
function and got significantly faster times than with PL/Perl,
especially as the data set grew:
CREATE FUNCTION randrec() RETURNS integer AS $$
DECLARE
r double precision := random();
accum double precision := 0.0;
row record;
BEGIN
FOR row IN SELECT * FROM r1 LOOP
accum := accum + row.chance;
IF accum >= r THEN
EXIT;
END IF;
END LOOP;
RETURN row.i;
END;
$$ LANGUAGE plpgsql VOLATILE;
SELECT * FROM r1;
i | chance
---+--------
1 | 0.25
2 | 0.20
3 | 0.15
4 | 0.10
5 | 0.30
SELECT i, count(*)
FROM (SELECT randrec() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER by i;
i | count
---+-------
1 | 2467
2 | 1939
3 | 1536
4 | 1016
5 | 3042
(5 rows)
Time: 3300.710 ms
Here are the results using the PL/Perl function you posted:
SELECT i, count(*)
FROM (SELECT randrec_perl() AS i FROM generate_series(1, 10000)) AS s
GROUP BY i
ORDER by i;
i | count
---+-------
1 | 2501
2 | 2040
3 | 1463
4 | 994
5 | 3002
(5 rows)
Time: 8765.584 ms
I ran each query several times and those times were typical of both.
With a data set of 100 records, the PL/pgSQL function ran in about
14 seconds, while the PL/Perl function took around 65 seconds.
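As a sanity check on counts like these: with 10,000 draws, row i should land near 10000 × chance_i (for row 1, the standard deviation is sqrt(10000 × 0.25 × 0.75) ≈ 43, so a few hundred either way is unremarkable). A standalone simulation of the same loop, in Python for illustration (the seed is arbitrary):

```python
import random

random.seed(1)
chances = {1: 0.25, 2: 0.20, 3: 0.15, 4: 0.10, 5: 0.30}
counts = dict.fromkeys(chances, 0)

for _ in range(10000):
    r = random.random()
    accum = 0.0
    for i, p in chances.items():
        accum += p
        if accum >= r:
            break
    counts[i] += 1   # 'i' is the chosen row (last row if round-off leaves accum < r)

print(counts)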
--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
Thanks very much for the information. I had very similar results on my
machine. I will take your advice and use the double-precision values,
since it doesn't affect the probability significantly anyway. As far as
the constraint trigger, I will see if it becomes a problem before I
worry about its performance. As far as whether those values add up to
1.0, I'll just check to make sure it's fairly close to 1.0 :)
One difference from your results: I didn't notice much change whether the
underlying table's chance attribute was double precision or numeric. I used
your generate_series()-based query. Perhaps that was because I was using
ALTER TABLE to modify r1's "chance" attribute. One thing I noticed there
was that I had to CREATE OR REPLACE the plpgsql function for that to work.
Perhaps it was a bug?
Here's a test case:
test=# create table t1(i int);
CREATE TABLE
test=# insert into t1 values(1);
INSERT 17775 1
test=# create or replace function err() returns int as $$
DECLARE
    accum double precision := 0.0;
    row record;
BEGIN
    FOR row IN SELECT i FROM t1 LOOP
        accum := accum + row.i;
    END LOOP;
    RETURN row.i;
END;
$$ language plpgsql;
CREATE FUNCTION
test=# insert into t1 values(2);
INSERT 17778 1
test=# select err();
err
-----
2
(1 row)
test=# alter table t1 alter column i type numeric;
ALTER TABLE
test=# select err();
err
-----
10
(1 row)
And if you keep playing around with the type and values you can get
other errors like:
ERROR: type "double precision" value out of range: underflow
CONTEXT: PL/pgSQL function "err" line 1 at assignment
Or:
ERROR: invalid memory alloc request size 4294967290
CONTEXT: PL/pgSQL function "randrec" line 7 at assignment
If any more information would be helpful, let me know. It looks a
little like a bug; perhaps we should throw an error when dependent
functions are called after the underlying types have changed? Or, if
we can recognize that situation, we might as well recompile the
function and proceed without error.
Regards,
Jeff Davis
And what about another data representation, like:

create table r1 (
    i int,
    chance_from numeric,
    chance_to numeric
);

Then you can select one random row with a single SELECT, for instance:

select * from r1 where chance_from <= $rnd and chance_to > $rnd;
I see these advantages:
- Only one SELECT.
- Indexes can improve performance if r1 has many rows.
and this disadvantage:
- Tricky updates.
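The range lookup can be sketched outside the database too; a sorted chance_from column plus binary search plays the role the index would. A Python sketch (illustrative only; the precomputed ranges correspond to the sample chances posted earlier):

```python
import bisect

# (i, chance_from, chance_to) rows, precomputed from the sample chances.
rows = [(1, 0.00, 0.25), (2, 0.25, 0.45), (3, 0.45, 0.60),
        (4, 0.60, 0.70), (5, 0.70, 1.00)]
froms = [r[1] for r in rows]   # sorted, so bisect can act like an index

def pick(rnd):
    """Row whose [chance_from, chance_to) range contains rnd --
    the one-SELECT lookup described above."""
    return rows[bisect.bisect_right(froms, rnd) - 1][0]
```

The "tricky update" is visible here as well: changing any one chance means recomputing every later range.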
Or:

create table r1 (
    i int,
    chance_from numeric
);

and

select * from r1 where chance_from <= $rnd
order by chance_from desc limit 1;

which can be updated more easily. Just ideas; I have never tested them...
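This single-bound variant drops the upper bound and takes the largest chance_from at or below the draw. A minimal sketch of the same lookup (Python, illustrative only):

```python
rows = [(1, 0.00), (2, 0.25), (3, 0.45), (4, 0.60), (5, 0.70)]

def pick(rnd):
    # Equivalent of: WHERE chance_from <= rnd
    #                ORDER BY chance_from DESC LIMIT 1
    return max((row for row in rows if row[1] <= rnd),
               key=lambda row: row[1])[0]
```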
> Here's what I have so far:
If you go that route, make sure you check for edge cases, such
as reaching the end of the rows without hitting your number:
while($accum < $r) {
die qq{Ran out of rows!\n} if ! defined $res->{rows}[$i];
Also, your query should be "select i,chance from r1 ORDER BY random()",
or else you'll get the rows back in the same order each time (until a
row is changed), which certainly reduces the randomness.
Anyway, here's another solution, which shifts as much work as possible
off of the actual random row call, and uses a trigger to keep things
in sync. I switched the 'chance' from 0.25 to 25 (numeric to int) to make
things easier to read.
UPDATE r1 SET chance = chance*100;
ALTER TABLE r1 ALTER COLUMN chance TYPE INTEGER;
CREATE TABLE r2(id integer);
CREATE OR REPLACE FUNCTION r1_cleanup() RETURNS trigger LANGUAGE plpgsql AS
$$
DECLARE
mychance integer;
BEGIN
IF TG_OP = 'DELETE' THEN
DELETE FROM r2 WHERE id = OLD.i;
ELSE
IF TG_OP = 'UPDATE' THEN
DELETE FROM r2 WHERE id = OLD.i or id = NEW.i;
END IF;
SELECT chance FROM r1 WHERE i=NEW.i INTO mychance;
LOOP
mychance := mychance - 1;
EXIT WHEN mychance < 0;
INSERT INTO r2 VALUES (NEW.i);
END LOOP;
END IF;
RETURN NULL;
END;
$$;
CREATE TRIGGER r1_trigger AFTER INSERT or UPDATE or DELETE ON r1
FOR EACH ROW EXECUTE PROCEDURE r1_cleanup();
UPDATE r1 SET i=i; -- To initially populate r2
SELECT id FROM r2 ORDER BY random() LIMIT 1; -- repeat as needed
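The expansion idea is easy to mirror outside SQL: each id appears `chance` times in the helper collection, so a uniform pick from it is a weighted pick from the original table. A Python sketch (illustrative; the names r1/r2 just echo the tables above):

```python
import random

# Integer chances out of 100, as after the ALTER TABLE step above.
r1 = {1: 25, 2: 20, 3: 15, 4: 10, 5: 30}

# r2 stands in for the helper table: id i repeated r1[i] times.
r2 = [i for i, chance in r1.items() for _ in range(chance)]

# The "ORDER BY random() LIMIT 1" step is a uniform choice from r2.
winner = random.choice(r2)
```

The trigger machinery exists only to keep r2 in step with r1; the random pick itself becomes a single uniform draw.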
--
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200502152252
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8