Thread: Functions in C with Ornate Data Structures

Functions in C with Ornate Data Structures

From
"Stephen P. Berry"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


I'm trying to write C functions to handle some of the number crunching
that I have been doing via backend processing.  Specifically, I want
to be able to write a function that can be used in a query like:

    select crunch_number(foo) from bar where [some condition];

...where `foo' is the name of some column and `bar' is the name of
some table.  The function needs to build some ornate data structures
(e.g., doubly linked lists) and output a summary statistic.

If my data types were simpler, I could simply use an AGGREGATE function.
Unfortunately, I don't know of any way to schlep something as complex
as a doubly-linked list of arrays of arbitrary precision numbers.

I suppose ideally I'd like some way of either:

    -Being able to call a function on each row (like most user-defined
     functions) which only returns a result on the last row;  or
    -Being able to pass the table name, column name, and selection
     conditions to the function, and walk through the matching rows
     inside the function, returning a single result upon completion

In terms of logical structure, this looks similar to functions to do
things like compute means or standard deviations.  The complication (as
far as I can tell) is that I can't get by with a simple accumulator
variable/transform function.

Is there any clean way to accomplish this in Postgres?  Any pointers
or suggestions would be appreciated.






- -Steve


-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.3 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE8SLyyG3kIaxeRZl8RAsNNAKCy8YDnMMZCIGrMYT6pt2IxqxtCJwCgxFp2
HFlA8B9X5BJRfnMDmQSh8Ss=
=Kx46
-----END PGP SIGNATURE-----

Re: Functions in C with Ornate Data Structures

From
Tom Lane
Date:
"Stephen P. Berry" <spb@meshuggeneh.net> writes:
> If my data types were simpler, I could simply use an AGGREGATE function.
> Unfortunately, I don't know of any way to schlep something as complex
> as a doubly-linked list of arrays of arbitrary precision numbers.

You could, but the amount of data copying needed would be annoying.
However, there's no law that says you can't cheat.  I'd suggest that
you build this as an aggregate function whose nominal state value is
simply a pointer to data structures that are off somewhere else.

For example, assuming that you are willing to cheat to the extent of
assuming sizeof(pointer) = sizeof(integer), try something like this:

CREATE AGGREGATE crunch_number (
 basetype = float8,  -- or whatever the input column type is
 sfunc = crunch_func,
 stype = integer,
 finalfunc = crunch_finish,
 initcond = 0);

where crunch_func(integer, float8) returns integer is your data accumulation
function, and its body looks like

    datstruct *ptr = (datstruct *) PG_GETARG_INT32(0);
    double newdataval = PG_GETARG_FLOAT8(1);

    if (ptr == NULL)
    {
        /* first call of query; initialize datastructures */
    }

    /* update datastructures using newdataval */

    PG_RETURN_INT32((int32) ptr);

Finally, crunch_finish(integer) returns float8 (or whatever is needed)
contains your code to compute the final result and release the working
datastructure.

Now, the important detail: you can't allocate your working
datastructures with a simple palloc(), because these functions will be
called in a short-lived memory context.  What I'd suggest is that in
your setup step, you create a private memory context that is a child
of TransactionCommandContext; then allocate all your datastructures in
that.  Then in the crunch_finish step, you needn't bother with retail
releasing of the data structures, just destroy the private context
and you're done.
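
The private-context idea can be modelled outside the backend.  What
follows is a minimal standalone sketch, not PostgreSQL's memory-context
API: `Arena', `arena_create', `arena_alloc' and `arena_destroy' are
hypothetical names invented for illustration, built on plain malloc().
It shows only the property that matters here: allocations made in the
private pool outlive the individual function calls and are all released
in one step at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of each allocation; the union keeps the
 * user data that follows it suitably aligned. */
typedef union ArenaBlock
{
    union ArenaBlock *next;
    long double       align_ld;
    long long         align_ll;
} ArenaBlock;

/* Hypothetical stand-in for a private memory context. */
typedef struct Arena
{
    ArenaBlock *blocks;     /* every live allocation, newest first */
} Arena;

/* Create an empty arena; returns NULL if out of memory. */
static Arena *arena_create(void)
{
    Arena *a = malloc(sizeof(Arena));

    if (a != NULL)
        a->blocks = NULL;
    return a;
}

/* Allocate size bytes that live until arena_destroy() is called. */
static void *arena_alloc(Arena *a, size_t size)
{
    ArenaBlock *b;

    assert(a != NULL);
    b = malloc(sizeof(ArenaBlock) + size);
    if (b == NULL)
        return NULL;
    b->next = a->blocks;
    a->blocks = b;
    return (void *) (b + 1);    /* user data starts after the header */
}

/* Release every allocation at once; returns how many were freed. */
static size_t arena_destroy(Arena *a)
{
    size_t      freed = 0;
    ArenaBlock *b, *next;

    assert(a != NULL);
    for (b = a->blocks; b != NULL; b = next)
    {
        next = b->next;
        free(b);
        freed++;
    }
    free(a);
    return freed;
}
```

In the backend itself these roles are played by the memory-context
routines (AllocSetContextCreate, MemoryContextAlloc,
MemoryContextDelete); their exact arguments depend on the release, so
check the headers for the version in use.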

            regards, tom lane

PS: this is not a novice-level question ;-).  You should be asking this
kind of stuff on pgsql-hackers, methinks.  There really isn't any other
list that discusses C coding inside the backend.

Re: Functions in C with Ornate Data Structures

From
"Stephen P. Berry"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


In message <11292.1011403856@sss.pgh.pa.us>, Tom Lane writes:

>You could, but the amount of data copying needed would be annoying.
>However, there's no law that says you can't cheat.  I'd suggest that
>you build this as an aggregate function whose nominal state value is
>simply a pointer to data structures that are off somewhere else.
>For example, assuming that you are willing to cheat to the extent of
>assuming sizeof(pointer) = sizeof(integer), try something like this:

I'd actually thought of doing something like this, but couldn't find
an actual explicit argument type for pointers[0], and I can't make
the assumption you describe for portability reasons (my three main
test platforms are alpha, sparc64, and x86).
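
The portability concern is easy to check with a few lines of standalone
C.  This is only a sketch: the helper names are hypothetical, and it
uses uintptr_t from <stdint.h> as an example of an integer type that is
wide enough for a pointer.  On an LP64 platform such as alpha or
sparc64 the first function is expected to return 0, because pointers
there are 64 bits wide.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if a data pointer fits in a 32-bit integer here, else 0. */
static int pointer_fits_in_int32(void)
{
    return sizeof(void *) <= sizeof(int32_t);
}

/* Round-trips a pointer through a pointer-sized unsigned integer. */
static void *roundtrip_uintptr(void *p)
{
    uintptr_t as_int;

    assert(sizeof(uintptr_t) >= sizeof(void *));
    as_int = (uintptr_t) p;     /* no bits are lost */
    return (void *) as_int;
}

/* Prints the widths the current platform provides. */
static void report_widths(void)
{
    printf("sizeof(void *) = %zu, sizeof(int32_t) = %zu\n",
           sizeof(void *), sizeof(int32_t));
}
```

Where pointer_fits_in_int32() returns 0, casting a pointer to a 32-bit
integer and back can discard its upper bits, which is why the state
value has to travel in something pointer-sized.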

I was also hoping that I could get away with not passing the problem
data structures internally at all...i.e., have a crunch_input() function
that initialises the linked list I need and populates it,
then a crunch_result() function that spits out the result.  The
accumulator is just a dummy variable, and the interesting data structure(s)
aren't in the argument list for any of the functions.  I tried
doing such a thing with an aggregate, but it didn't work---although,
interestingly, invoking the input and result functions manually
in a single session did.  I took this to mean that I really didn't
understand aggregates, so I was assuming it was a novice-level question.
I was sorta hoping this would turn out to be a standard question (although
I couldn't find any useful references in the mailing list archives or
via web searches).

>Now, the important detail: you can't allocate your working
>datastructures with a simple palloc(), because these functions will be
>called in a short-lived memory context.  What I'd suggest is that in
>your setup step, you create a private memory context that is a child
>of TransactionCommandContext; then allocate all your datastructures in
>that.  Then in the crunch_finish step, you needn't bother with retail
>releasing of the data structures, just destroy the private context
>and you're done.

Is there any way to keep `intermediate' data used by user-defined
functions around indefinitely?  I.e., have some sort of crunch_init()
function that creates a bunch of in-memory data structures, which
can then be used by subsequent (and independent) queries?  I'm
assuming not...and if I want to do that sort of thing I should populate
a temporary table with the data from these `intermediate' results.
Or keep all this fancy stuff in standalone applications rather than
in user-defined functions.

It seems like the general class of thing I'm trying to accomplish
isn't that esoteric.  Imagine trying to write a function to compute
the standard deviation of arbitrary precision numbers using the GMP
library or some such.  Note that I'm not saying that that's what I'm
trying to do...I'm just offering it as a simple sample problem in
which one can't pass everything as an argument in an aggregate.  How
does one set about doing such a thing in Postgres?







- -Steve

- -----
[0] I was hoping that there would be, since the macro widgetry in
    the version 1 function semantics clearly includes the concept
    of pointers as a distinct type.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.3 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE8SN5tG3kIaxeRZl8RAmPmAJ4ilTeyoC//MRG5JHf7AmNuR7oW/QCdHHqw
RoE/GplKts1rxNO85ADEebk=
=Oedz
-----END PGP SIGNATURE-----

Re: Functions in C with Ornate Data Structures

From
Tom Lane
Date:
"Stephen P. Berry" <spb@meshuggeneh.net> writes:
>> For example, assuming that you are willing to cheat to the extent of
>> assuming sizeof(pointer) = sizeof(integer), try something like this:

> I'd actually thought of doing something like this, but couldn't find
> an actual explicit argument type for pointers[0], and I can't make
> the assumption you describe for portability reasons (my three main
> test platforms are alpha, sparc64, and x86).

Fair enough.  I had actually thought better of that shortly after writing,
so here's how I'd really do it:

Still make the declaration of the state datatype be "integer" at the SQL
level, and say initcond = 0.  (If you don't do this, you have to fight
nodeAgg.c's ideas about what to do with a pass-by-reference datatype,
and it ain't worth the trouble.)  But in the C code, write acquisition
and return of the state value as

    datstruct *ptr = (datstruct *) PG_GETARG_POINTER(0);

    ...

    PG_RETURN_POINTER(ptr);

This relies on the fact that what you are *really* passing and returning
is not an int but a Datum, and Datum is by definition large enough for
pointers.  The only part of the above that's even slightly dubious is
the assumption that a Datum created from an int32 zero will read as a
pointer NULL --- but I am not aware of any platform where a zero bit
pattern doesn't read as a pointer NULL (and lots of pieces of Postgres
would break on such a platform).  You could get around that too by
making the initial state condition be a SQL NULL instead of a zero, but
I don't see the point.  Unless you need to treat NULL input values as
something other than "ignore", you really want to declare the sfunc as
strict, and that gets in the way of using a NULL initcond.
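
The shape of this scheme can be tried out in a standalone model.  The
sketch below is not backend code: `Datum' is redefined here as
uintptr_t purely as a stand-in for "an integer wide enough to hold a
pointer", the struct layout is hypothetical, and the summary statistic
(a plain mean over a doubly linked list) is chosen only to keep the
example short.  It shows a zero initial state reading as a NULL
pointer, the structure being built lazily on the first call, and the
final step computing the result and releasing everything.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t Datum;    /* stand-in, not the backend's definition */

typedef struct Node
{
    struct Node *prev;
    struct Node *next;
    double       value;
} Node;

/* All working state hangs off this one struct. */
typedef struct CrunchState
{
    Node *head;
    Node *tail;
    long  count;
} CrunchState;

/* Transition step: append one value, return the (possibly new) state. */
static Datum crunch_func(Datum state, double newdataval)
{
    CrunchState *ptr = (CrunchState *) state;
    Node        *n;

    if (ptr == NULL)
    {
        /* first call: the zero initial state reads as NULL */
        ptr = calloc(1, sizeof(CrunchState));
        assert(ptr != NULL);    /* sketch only: no real error handling */
    }

    n = malloc(sizeof(Node));
    assert(n != NULL);
    n->value = newdataval;
    n->next = NULL;
    n->prev = ptr->tail;
    if (ptr->tail != NULL)
        ptr->tail->next = n;
    else
        ptr->head = n;
    ptr->tail = n;
    ptr->count++;

    return (Datum) ptr;
}

/* Final step: walk the list from the tail, compute the mean, free all. */
static double crunch_finish(Datum state)
{
    CrunchState *ptr = (CrunchState *) state;
    double       sum = 0.0;
    double       result;
    Node        *n, *prev;

    if (ptr == NULL)
        return 0.0;             /* no input rows were seen */

    for (n = ptr->tail; n != NULL; n = prev)
    {
        prev = n->prev;
        sum += n->value;
        free(n);
    }
    result = sum / ptr->count;
    free(ptr);
    return result;
}
```

Feeding the values 1, 2, 3 and 6 through crunch_func() starting from a
state of 0 and then calling crunch_finish() yields their mean.  In a
real aggregate the malloc() and free() calls would be replaced by
allocation in a memory context that lives long enough.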

> Is there any way to keep `intermediate' data used by user-defined
> functions around indefinitely?  I.e., have some sort of crunch_init()
> function that creates a bunch of in-memory data structures, which
> can then be used by subsequent (and independent) queries?

You can if you can figure out how to find them again.  However, the
only obvious answer to that is to use static variables, which falls
down miserably if someone tries to run two independent instances of
your aggregate in one query.  I'd suggest hewing closely to the external
behavior of standard aggregates --- i.e., each one is an independent
calculation.  You can use the above techniques to build an efficient
implementation.  If you instead build something that has an API
involving state that persists across queries, I'm pretty sure you'll
regret it in the long run.
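
The failure mode with static variables is easy to reproduce in a
standalone sketch.  Everything below is hypothetical illustration code,
not backend code: one accumulator keeps its running sum in a static
variable, the other keeps it in a state struct supplied by the caller.
Interleaving two logical aggregates, as would happen if something like
crunch_number(a) and crunch_number(b) appeared in the same select
list, mixes the two sums in the first version and keeps them apart in
the second.

```c
#include <assert.h>
#include <stddef.h>

/* Version 1: state lives in a static variable shared by all callers. */
static double shared_sum = 0.0;

static void shared_reset(void)
{
    shared_sum = 0.0;
}

static double shared_accum(double v)
{
    shared_sum += v;
    return shared_sum;
}

/* Version 2: each aggregate instance owns its own state. */
typedef struct SumState
{
    double sum;
} SumState;

static double instance_accum(SumState *st, double v)
{
    assert(st != NULL);
    st->sum += v;
    return st->sum;
}
```

With column a holding 1 and 2 and column b holding 10 and 20, calling
shared_accum() alternately for a and b leaves a single total of 33,
whereas two SumState instances end at 3 and 30 respectively.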

> It seems like the general class of thing I'm trying to accomplish
> isn't that esoteric.  Imagine trying to write a function to compute
> the standard deviation of arbitrary precision numbers using the GMP
> library or some such.  Note that I'm not saying that that's what I'm
> trying to do...I'm just offering it as a simple sample problem in
> which one can't pass everything as an argument in an aggregate.  How
> does one set about doing such a thing in Postgres?

I blink not an eye to say that I'd do it exactly as described above.
Stick all the intermediate state into a data structure that's referenced
by a single master pointer, and pass the pointer as the "state value"
of the aggregate.

BTW, mlw posted some contrib code on pgsql-hackers just a day or two back
that does something similar to this.  He did some details differently
than I would've, notably this INT32-vs-POINTER business; but it's a
working example.

            regards, tom lane