Обсуждение: Problem with volatile function

Поиск
Список
Период
Сортировка

Problem with volatile function

От
Artacus
Дата:
So my understanding of volatile functions is that volatile functions can
return different results given the same input.

I have a function random(int, int) that returns a random value between
$1 and $2. I want to use it in a query to generate values. But it only
evaluates once per query and not once per row like I need it to.

-- This always returns the same value
SELECT ts.sis_id, bldg_id, f_name.name, l_name.name
FROM tmp_students ts
JOIN names AS f_name ON
   ts.gender = f_name.gender
WHERE f_name.counter = random(1,300)

--As does this
SELECT ts.sis_id, bldg_id, f_name.name, l_name.name
FROM tmp_students ts
JOIN names AS f_name ON
   ts.gender = f_name.gender
   AND ts.counter = random(1,100)

-- This generates different numbers
SELECT random(1,100), s.*
FROM usr_students s



Re: Problem with volatile function

От
Tom Lane
Дата:
Artacus <artacus@comcast.net> writes:
> I have a function random(int, int) that returns a random value between
> $1 and $2. I want to use it in a query to generate values. But it only
> evaluates once per query and not once per row like I need it to.

> -- This always returns the same value
> SELECT ts.sis_id, bldg_id, f_name.name, l_name.name
> FROM tmp_students ts
> JOIN names AS f_name ON
>    ts.gender = f_name.gender
> WHERE f_name.counter = random(1,300)

You haven't really explained what your problem is, but offhand I'd
expect random() to be evaluated once per f_name row here.  What are
you looking for?

            regards, tom lane

Re: Problem with volatile function

От
Craig Ringer
Дата:
Artacus wrote:
> So my understanding of volatile functions is that volatile functions can
> return different results given the same input.
>
> I have a function random(int, int) that returns a random value between
> $1 and $2. I want to use it in a query to generate values. But it only
> evaluates once per query and not once per row like I need it to.
>
> -- This always returns the same value
> SELECT ts.sis_id, bldg_id, f_name.name, l_name.name
> FROM tmp_students ts
> JOIN names AS f_name ON
>   ts.gender = f_name.gender
> WHERE f_name.counter = random(1,300)

Personally I'd expect that to only evaluate once. It's saying "where
f_name.counter in this row is equal to some single random value
generated at the start of the query". The parameters of the random()
function do not depend on the input, so Pg evaluates it once rather than
with each row. That's not to do with volatility; rather, it's the
semantics of your query.

It's like the difference between correlated and uncorrelated subqueries.
A subquery that doesn't reference outside context is only evaluated
once, ie it's uncorrelated. Only if the subquery contains references to
values in rows in the outside query is it evaluated once for each row.

If Pg re-evaluated your random() function for every record, how would
you then write "fetch all records that have f_name.counter equal to the
same randomly selected value" ?

You can force Pg to re-evaluate random() by adding a dummy parameter
that depends on the input record, or (probably better) by writing a
variant of it that tests the input against a randomly generated value
and returns a boolean. Eg:

SELECT ts.sis_id, bldg_id, f_name.name, l_name.name
FROM tmp_students ts
JOIN names AS f_name ON
  ts.gender = f_name.gender
WHERE random_equals(1,300,f_name.counter)


random_equals would be called once per input record. It would generate a
random number between 1 and 300, and test the result to see if it was
equal to the argument value f_name.counter, then return the result of
the test.

Alternately, you could pair each result up with a random value and
re-check the results with a subquery (slower):

SELECT sis_id, bldg_id, fname, lname
FROM
  (SELECT
      ts.sis_id,
      bldg_id,
      f_name.name AS fname,
      l_name.name AS lname,
      f_name.counter AS counter,
      random(1,300) AS r
  FROM tmp_students ts
  JOIN names AS f_name ON
    ts.gender = f_name.gender
  )
AS x
WHERE r = counter;

... but that's pretty ugly.

> --As does this
> SELECT ts.sis_id, bldg_id, f_name.name, l_name.name
> FROM tmp_students ts
> JOIN names AS f_name ON
>   ts.gender = f_name.gender
>   AND ts.counter = random(1,100)

Same deal as above.

>
> -- This generates different numbers
> SELECT random(1,100), s.*
> FROM usr_students s

... because you've used random() in the result list. That's where
VOLATILE and STABLE etc will make a difference.

--
Craig Ringer

Re: Problem with volatile function

От
Klint Gore
Дата:
Artacus wrote:
> So my understanding of volatile functions is that volatile functions can
> return different results given the same input.
>
> I have a function random(int, int) that returns a random value between
> $1 and $2. I want to use it in a query to generate values. But it only
> evaluates once per query and not once per row like I need it to.
>
> -- This always returns the same value
> SELECT ts.sis_id, bldg_id, f_name.name, l_name.name
> FROM tmp_students ts
> JOIN names AS f_name ON
>    ts.gender = f_name.gender
> WHERE f_name.counter = random(1,300)
>
> --As does this
> SELECT ts.sis_id, bldg_id, f_name.name, l_name.name
> FROM tmp_students ts
> JOIN names AS f_name ON
>    ts.gender = f_name.gender
>    AND ts.counter = random(1,100)
>
> -- This generates different numbers
> SELECT random(1,100), s.*
> FROM usr_students s
>
Are you sure it's volatile? (as opposed to stable)

postgres=# create or replace function random(int,int) returns int as
$$select round($2*random()*$1)::int;$$ language sql volatile;
CREATE FUNCTION
postgres=# select generate_series from generate_series(1,10,1) where
random(1,100) > 50;
 generate_series
-----------------
               1
               3
               4
               6
               7
               9
(6 rows)

postgres=# select generate_series from generate_series(1,10,1) where
random(1,100) > 50;
 generate_series
-----------------
               3
               7
               8
               9
(4 rows)

postgres=# select random(1,100), generate_series from
generate_series(1,10,1);
 random | generate_series
--------+-----------------
     56 |               1
     23 |               2
     80 |               3
     57 |               4
     16 |               5
     99 |               6
      9 |               7
     41 |               8
     90 |               9
     88 |              10
(10 rows)

postgres=# create or replace function random(int,int) returns int as
$$select round($2*random()*$1)::int;$$ language sql stable;
CREATE FUNCTION
postgres=# select generate_series from generate_series(1,10,1) where
random(1,100) > 50;
 generate_series
-----------------
               1
               2
               3
               4
               5
               6
               7
               8
               9
              10
(10 rows)

postgres=# select generate_series from generate_series(1,10,1) where
random(1,100) > 50;
 generate_series
-----------------
(0 rows)

postgres=# select random(1,100), generate_series from
generate_series(1,10,1);
 random | generate_series
--------+-----------------
     72 |               1
     90 |               2
     53 |               3
     47 |               4
     53 |               5
     33 |               6
     10 |               7
     56 |               8
     78 |               9
     87 |              10
(10 rows)

postgres=#

--
Klint Gore
Database Manager
Sheep CRC
A.G.B.U.
University of New England
Armidale NSW 2350

Ph: 02 6773 3789
Fax: 02 6773 3266
EMail: kgore4@une.edu.au


Re: Problem with volatile function

От
Tom Lane
Дата:
Craig Ringer <craig@postnewspapers.com.au> writes:
> Personally I'd expect that to only evaluate once. It's saying "where
> f_name.counter in this row is equal to some single random value
> generated at the start of the query". The parameters of the random()
> function do not depend on the input, so Pg evaluates it once rather than
> with each row.

No, that's not the model that we use.  In my view, the semantics
specified by the SQL standard are that
    SELECT ... FROM a, b, c WHERE ...
are that you notionally form the Cartesian product of a X b X c
and then evaluate the WHERE condition at each row of the product
to find out which rows make it into the SELECT result.  Obviously,
the DBMS is going to try like heck to optimize the execution of that
model, but its optimizations should not change the results.

We're not perfect about that, in a number of ways --- in particular,
we allow short-circuit evaluation of ANDed conditions, which means
that we might not bother to execute a volatile function for some
row of the Cartesian product if we're able to exclude that row from
the result based on other WHERE conditions.

The OP's complaint is a bit different from this, in that (if I
understand it correctly) he doesn't want the WHERE condition to
be dropped down to the individual relation scan, because that amounts
to improperly assuming that the volatile WHERE condition will produce
the same result at every row of the Cartesian product that included
that row from the base relation.  This gripe isn't without foundation.
But I'm not seeing what properties you get from that that wouldn't
be broken by AND-shortcircuit evaluation.

Dropping the AND-shortcircuit optimization would mean that we revert
to the brain-dead "form the Cartesian product and then filter" approach
if even one volatile function appears anywhere in the WHERE conditions.
I doubt that either the OP or anyone else wants that.

            regards, tom lane

Re: Problem with volatile function

От
Artacus
Дата:
> You can force Pg to re-evaluate random() by adding a dummy parameter
> that depends on the input record, or (probably better) by writing a
> variant of it that tests the input against a randomly generated value
> and returns a boolean. Eg:
>
>
Thanks all. So here's the situation. I added a dummy parameter and
passed the id like you suggested. That had no effect. I still got one
name for males and one name for females. So I used the 3rd param in a
trivial way:

    select round($2*random()*$1+($3/10000000))::int;

And that actually forced it to execute for every row. However, it
returned unpredictable results. There should have been only one match
for first_name and last_name for each person but it return from 1 to 5
rows for each person.

 sis_id     gender     name        name        counter     counter
 ---------  ---------  ----------  ----------  ----------  ----------
 105607     M          Denis       Weber       19          671
 105666     M          Javier      Custodio    154         182
 105666     M          Javier      Nelson      154         250
 105839     M          Johnnie     Whicker     295         32
 105847     F          Trina       Garcia      259         155
 105847     F          Dione       Freeman     103         651
 105847     F          Dione       Harden      103         897
 105847     F          Cruz        Brannen     249         1240

So what I actually had to do was get the sis_id and the two random
numbers in a subselect.

SELECT stu.sis_id, stu.gender, f_name.name AS first_name, l_name.name AS
last_name
FROM usr_students stu
JOIN (
    SELECT sis_id, random(1,300) AS f_cnt, random(1,1700) AS l_cnt
    FROM usr_students s
) sub ON stu.sis_id = sub.sis_id
JOIN names f_name ON stu.gender = f_name.gender
   AND f_name.counter = sub.f_cnt
JOIN names l_name ON l_name.gender IS NULL
   AND l_name.counter = sub.l_cnt

So while that works, postgres isn't behaving how I'd expect (or how Tom
expects from the sounds of it)

Re: Problem with volatile function

От
Craig Ringer
Дата:
Tom Lane wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> Personally I'd expect that to only evaluate once. It's saying "where
>> f_name.counter in this row is equal to some single random value
>> generated at the start of the query". The parameters of the random()
>> function do not depend on the input, so Pg evaluates it once rather than
>> with each row.
>
> No, that's not the model that we use.  In my view, the semantics
> specified by the SQL standard are that
>     SELECT ... FROM a, b, c WHERE ...
> are that you notionally form the Cartesian product of a X b X c
> and then evaluate the WHERE condition at each row of the product
> to find out which rows make it into the SELECT result.  Obviously,
> the DBMS is going to try like heck to optimize the execution of that
> model, but its optimizations should not change the results.

Thanks for explaining that; I'd completely misunderstood the way Pg
handles evaluation of expressions that don't explicitly reference fields
in the records being scanned.

I think the workarounds I suggested should still work, despite the
incorrect reasoning on which they were based.

--
Craig Ringer