Обсуждение: pl/R questions

Поиск
Список
Период
Сортировка

pl/R questions

От
Mike Mascari
Дата:
Hello.

I've downloaded and installed pl/R version 0.4.4. My goal is to be
able to perform multivariate linear regression analysis. However, for
the sake of simplicity, I'm performing a single variable regression
and prediction as a test:

1) CREATE TABLE entries(float8 x, float8 y);

I have the following values:

 x  |   y
----+--------
  1 | 133890
  2 | 135000
  3 | 135790
  4 | 137300
  5 | 138130
  6 | 139100
  7 | 139900
  8 | 141120
  9 | 141890
 10 | 143230
 11 | 144000
 12 | 145290

2) CREATE TABLE predictions(float8 x);

I have the following new x values:

 x
----
 13
 14
 15
 16
 17

3) I have the following functions:

-- Reset R global variables

CREATE OR REPLACE FUNCTION r_resetlm() RETURNS integer AS '

 xs <<- c()
 ys <<- c()
 nxs <<- c()
 return(1)

' LANGUAGE 'plr' WITH (isStrict);

-- Add a new known x, f(x)

CREATE OR REPLACE FUNCTION r_initknowns(float8, float8) RETURNS
integer AS '

 xs <<- c(xs, arg1)
 ys <<- c(ys, arg2)
 return(1)

' LANGUAGE 'plr' WITH (isStrict);

-- Add a predicting x

CREATE OR REPLACE FUNCTION r_initpredicts(float8) RETURNS integer AS '

 nxs <<- c(nxs, arg1)
 return(1)

' LANGUAGE 'plr' WITH (isStrict);

-- Generate the predictions

CREATE OR REPLACE FUNCTION r_predict()
RETURNS SETOF RECORD AS '

 samples <- data.frame(xs=nxs)
 result <- predict(lm(ys ~ xs), samples)
 return (result)

' LANGUAGE 'plr' WITH (isStrict);

4) I perform the prediction like so:

select r_resetlm();
select r_initknowns(x, y) from entries;
select r_initpredicts(x) from predictions;
select * from r_predict() as trend(ny float8);

This works, but there are several potential problems using this method:

(A) The function r_resetlm() must be called to reset the global values
before each invocation. Not a big problem, but I would like to avoid
globals, if possible. The relations supplying the data are temporary
tables and thus I cannot refer to their names in static pl/R. I can't
figure out a way to use pg.spi.prepare()/pg.spi.execp() to initialize
R variables with the result of the executed queries. I would like to
do something like this, instead:

CREATE OR REPLACE FUNCTION r_predict(text, text)
RETURNS SETOF RECORD AS '

 sql <- paste("SELECT x, y FROM", arg1, "ORDER BY x")
 plan <- pg.spi.prepare(sql, NA)
 pg.spi.execp(plan, NA)

 ??? Read results into appropriate vectors

 samples <- data.frame(xs=nxs)
 result <- predict(lm(ys ~ xs), samples)
 return (result)

' LANGUAGE 'plr' WITH (isStrict);

(B) I suppose an unqualified SELECT will always invoke r_initknowns()
and r_initpredicts() but is this guaranteed? And guaranteed to only be
executed once for each tuple? If so, then I'm somewhat less bothered
by the use of R globals. Is using the VOLATILE attribute in the CREATE
FUNTION statement sufficient to guarantee that the call will always be
made?

(C) For the life of me, and this is an R question, I cannot figure out
how to get R to perform predictions on multivariate data:

ys <- c(133890, 135000, 135790, 137300, 138130, 139100, 139900,
141120, 141890, 143230, 144000, 145290)

xs1 <- c(1:12)
xs2 <- c(22, 24.5, 27, 33, 36.8, 40, 44, 57, 59, 62, 74, 77)
xm <- cbind(xs1, xs2)

nx1 <- c(13:17)
nx2 <- c(82, 85, 88.3, 90, 95)

samples <- data.frame(xs1=nx1, xs2=nx2)
f <- predict(lm(ys ~ xm), samples)

data.frame(f) yields:

         f
1  133949.8
2  134970.2
3  135990.6
4  137008.1
5  138027.5
6  139047.3
7  140066.5
8  141078.3
9  142099.1
10 143119.1
11 144131.7
12 145151.7

Not the predicted y's for the new x1's and x2's. I tried:

f <- predict.mlm(lm(ys ~ xm), samples) and got:

Error in object$coefficients[piv, ] : incorrect number of dimensions

And I have no mlm() to use.

Any clues? I'm a bit of an R newbie. I wrote my own multivariate
linear regression functions years ago in C++, but I'd prefer to
leverage R from PostgreSQL.

Mike Mascari
mascarm@mascari.com

PS: Thanks to Joe Conway for this procedural language.





Re: pl/R questions

От
Joe Conway
Дата:
Mike Mascari wrote:
> (A) The function r_resetlm() must be called to reset the global values
> before each invocation. Not a big problem, but I would like to avoid
> globals, if possible. The relations supplying the data are temporary
> tables and thus I cannot refer to their names in static pl/R. I can't
> figure out a way to use pg.spi.prepare()/pg.spi.execp() to initialize
> R variables with the result of the executed queries. I would like to
> do something like this, instead:
>
> CREATE OR REPLACE FUNCTION r_predict(text, text)
> RETURNS SETOF RECORD AS '
>
>  sql <- paste("SELECT x, y FROM", arg1, "ORDER BY x")
>  plan <- pg.spi.prepare(sql, NA)
>  pg.spi.execp(plan, NA)
>
>  ??? Read results into appropriate vectors
>
>  samples <- data.frame(xs=nxs)
>  result <- predict(lm(ys ~ xs), samples)
>  return (result)
>
> ' LANGUAGE 'plr' WITH (isStrict);

I don't think you can do a prepared plan if the table itself is going to
change, only when parameters change. Maybe something like this works:

CREATE OR REPLACE FUNCTION r_predict(text, text)
RETURNS SETOF RECORD AS '
  sql <- paste("SELECT x, y FROM", arg1, "ORDER BY x")
  xyknowns <- pg.spi.exec(sql)
  xs <- as.numeric(xyknowns[,1])
  ys <- as.numeric(xyknowns[,2])

  sql <- paste("SELECT x FROM", arg2, "ORDER BY x")
  xypred <- pg.spi.exec(sql)
  nxs <- as.numeric(xypred[,1])

  samples <- data.frame(xs=nxs)
  result <- predict(lm(ys ~ xs), samples)
  return (result)
' LANGUAGE 'plr' WITH (isStrict);

regression=# select * from r_predict('entries', 'predictions') as
trend(ny float8);
         ny
------------------
  146171.515151515
  147189.696969697
  148207.878787879
  149226.060606061
  150244.242424242
(5 rows)

> (B) I suppose an unqualified SELECT will always invoke r_initknowns()
> and r_initpredicts() but is this guaranteed? And guaranteed to only be
> executed once for each tuple? If so, then I'm somewhat less bothered
> by the use of R globals. Is using the VOLATILE attribute in the CREATE
> FUNTION statement sufficient to guarantee that the call will always be
> made?

Use the above -- I think your original multistep process is not the way
to go anyway

> (C) For the life of me, and this is an R question, I cannot figure out
> how to get R to perform predictions on multivariate data:

I'm sure there is support for multivariate linear regression in R, but
I'm still too new at R to know the answer myself. You should try posting
that one to R-help.

BTW, I created a PL/R specific mailing list on gborg, but no one is
subscribed currently. If people on this list find PL/R specific
questions too off-topic, perhaps we should move there. R specific
questions should definitely be posted to R-help though.

Regards,

Joe


Re: pl/R questions

От
Mike Mascari
Дата:
Joe Conway wrote:

> Mike Mascari wrote:
>
>> (A) The function r_resetlm() must be called to reset the global values
>> before each invocation. Not a big problem, but I would like to avoid
>> globals, if possible. The relations supplying the data are temporary
>> tables and thus I cannot refer to their names in static pl/R. I can't
>> figure out a way to use pg.spi.prepare()/pg.spi.execp() to initialize
>> R variables with the result of the executed queries. I would like to
>> do something like this, instead:
>
> I don't think you can do a prepared plan if the table itself is going to
> change, only when parameters change. Maybe something like this works:
>
> CREATE OR REPLACE FUNCTION r_predict(text, text)
> RETURNS SETOF RECORD AS '
>  sql <- paste("SELECT x, y FROM", arg1, "ORDER BY x")
>  xyknowns <- pg.spi.exec(sql)
>  xs <- as.numeric(xyknowns[,1])
>  ys <- as.numeric(xyknowns[,2])
>
>  sql <- paste("SELECT x FROM", arg2, "ORDER BY x")
>  xypred <- pg.spi.exec(sql)
>  nxs <- as.numeric(xypred[,1])
>
>  samples <- data.frame(xs=nxs)
>  result <- predict(lm(ys ~ xs), samples)
>  return (result)
> ' LANGUAGE 'plr' WITH (isStrict);
>
> regression=# select * from r_predict('entries', 'predictions') as
> trend(ny float8);
>         ny
> ------------------
>  146171.515151515
>  147189.696969697
>  148207.878787879
>  149226.060606061
>  150244.242424242
> (5 rows)

Very nice. Thanks a bunch!

> I'm sure there is support for multivariate linear regression in R, but
> I'm still too new at R to know the answer myself. You should try posting
> that one to R-help.

Okay.

> BTW, I created a PL/R specific mailing list on gborg, but no one is
> subscribed currently. If people on this list find PL/R specific
> questions too off-topic, perhaps we should move there. R specific
> questions should definitely be posted to R-help though.

Okay. Great! Thanks again!

Mike Mascari
mascarm@mascari.com