Обсуждение: Functionscan estimates

От:
Josh Berkus
Дата:

Folks,

I'm wondering if it might be useful to be able to add estimated selectivity to
a function definition for purposes of query estimation.  Currently function
scans automatically return a flat default 1000 estimated rows.   It seems
like the DBA ought to be able to ALTER FUNCTION and give it a row estimate
for planning purposes.

Thoughts?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

От:
Michael Fuhr
Дата:

On Fri, Apr 08, 2005 at 03:15:50PM -0700, Josh Berkus wrote:
>
> I'm wondering if it might be useful to be able to add estimated selectivity to
> a function definition for purposes of query estimation.  Currently function
> scans automatically return a flat default 1000 estimated rows.   It seems
> like the DBA ought to be able to ALTER FUNCTION and give it a row estimate
> for planning purposes.

About a month ago I mentioned that I'd find that useful.  In a
followup, Christopher Kings-Lynne brought up the idea of a GUC
variable that could give hints about the expected row count.

http://archives.postgresql.org/pgsql-hackers/2005-03/msg00146.php
http://archives.postgresql.org/pgsql-hackers/2005-03/msg00153.php

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

От:
Alvaro Herrera
Дата:

On Fri, Apr 08, 2005 at 04:38:20PM -0600, Michael Fuhr wrote:
> On Fri, Apr 08, 2005 at 03:15:50PM -0700, Josh Berkus wrote:
> >
> > I'm wondering if it might be useful to be able to add estimated selectivity to
> > a function definition for purposes of query estimation.  Currently function
> > scans automatically return a flat default 1000 estimated rows.   It seems
> > like the DBA ought to be able to ALTER FUNCTION and give it a row estimate
> > for planning purposes.
>
> About a month ago I mentioned that I'd find that useful.  In a
> followup, Christopher Kings-Lynne brought up the idea of a GUC
> variable that could give hints about the expected row count.

That seems pretty limited ... what happens if the query contains more
that one SRF?

Maybe issuing some sort of special call to the function (say, with
some boolean in the call info struct) on which it returns planning data;
thus the planner can call the function itself.  The hard part would be
figuring out how to do it without breaking backwards compatibility with
functions that don't know how to handle that.  (And how to do it in
plpgsql).

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"La principal característica humana es la tontería"
(Augusto Monterroso)

От:
Josh Berkus
Дата:

Alvaro, Michael,

> > About a month ago I mentioned that I'd find that useful.  In a
> > followup, Christopher Kings-Lynne brought up the idea of a GUC
> > variable that could give hints about the expected row count.
>
> That seems pretty limited ... what happens if the query contains more
> that one SRF?

Yeah, I'd see that as a pretty bad idea too.  I don't want to tell the planner
how many rows I expect "all functions" to return, I want to tell it how many
*one particular* function will return.

> Maybe issuing some sort of special call to the function (say, with
> some boolean in the call info struct) on which it returns planning data;
> thus the planner can call the function itself.  The hard part would be
> figuring out how to do it without breaking backwards compatibility with
> functions that don't know how to handle that.  (And how to do it in
> plpgsql).

Or in pl/perl, or pl/python, or plsh .... doesn't sound feasable.

My solution would be a lot simpler, since we could simply populate
pg_proc.proestrows with "1000" by default if not changed by the DBA.  In an
even better world, we could tie it to a table, saying that, for example,
proestrows = my_table*0.02.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

От:
Alvaro Herrera
Дата:

On Fri, Apr 08, 2005 at 04:04:27PM -0700, Josh Berkus wrote:

> My solution would be a lot simpler, since we could simply populate
> pg_proc.proestrows with "1000" by default if not changed by the DBA.  In an
> even better world, we could tie it to a table, saying that, for example,
> proestrows = my_table*0.02.

The problem with that approach is that it can't differ depending on the
arguments to the function, so it too seems limited to me.

Ideally an estimator would be able to peek at other table statistics and
do some computation with them, just like other nodes are able to.

Another idea would be have an estimator function (pg_proc.proestimator)
for each regular function.  The estimator would be a very cheap function
to be called with the same arguments, and it would return the estimated
number of tuples the other function would return.  The default estimator
could be "return 1000".

--
Alvaro Herrera (<alvherre[@]dcc.uchile.cl>)
"A wizard is never late, Frodo Baggins, nor is he early.
 He arrives precisely when he means to."  (Gandalf, en LoTR FoTR)

От:
Tom Lane
Дата:

Not too many releases ago, there were several columns in pg_proc that
were intended to support estimation of the runtime cost and number of
result rows of set-returning functions.  I believe in fact that these
were the remains of Joe Hellerstein's thesis on expensive-function
evaluation, and are exactly what he was talking about here:
http://archives.postgresql.org/pgsql-hackers/2002-06/msg00085.php

But with all due respect to Joe, I think the reason that stuff got
trimmed is that it didn't work very well.  In most cases it's
*hard* to write an estimator for a SRF.  Let's see you produce
one for dblink() for instance ...

            regards, tom lane

От:
PFC
Дата:

> My solution would be a lot simpler, since we could simply populate
> pg_proc.proestrows with "1000" by default if not changed by the DBA.  In
> an
> even better world, we could tie it to a table, saying that, for example,
> proestrows = my_table*0.02.

    What if the estimated row is a function of a parameter ?
    Say a function takes as a parameter :
    - a number to use in a LIMIT
    - it's a function to generate a certain number of values from a
predetermined set (like, array -> set returning function)

    In all those cases it's no use to have just a fixed number.

    Id suggest two solutions :
    - The ideal solution which is impossible to do :
    The function tells the planner about its stats, looking at its parameters

    - A solution that would be possible to do
     pg_proc.proestrows is... the name of another function, defined by the
user, which takes the exact same parameters as the set returning function
we're talking about, and which returns estimates.

    For instance, in pseudo-sql :

CREATE FUNCTION int_array_srf( INTEGER[] ) RETURNS SETOF INTEGER LANGUAGE
plpgsql AS $$
BEGIN
    FOR _i IN 1..icount($1)
        RETURN NEXT $1[_i];
    END
END    In the two cases above, this would give :

CREATE FUNCTION array_srf_estimator( INTEGER[] ) RETURNS INTEGER LANGUAGE
plpgsql AS $$
BEGIN
    RETURN icount( $1 );
END;

ALTER FUNCTION array_srf SET ESTIMATOR array_srf_estimator;

    Another interesting case would be the famous "Top 5 by category" case
where we use a SRF to emulate an index skip scan. Say we have a table
Categories and a table Users, each User having columns "categories" and
"score" and we want the N users with best score in each category :

CREATE FUNCTION top_n_by_category( INTEGER ) RETURN SETOF users%ROWTYPE
LANGUAGE plpgsql AS $$
DECLARE
    _cat_id    INTEGER;
    _n ALIAS FOR $1;
    _user    users%ROWTYPE;
BEGIN
    FOR _cat_id IN SELECT category_id FROM categories DO
        FOR _user IN SELECT * FROM users WHERE category_id = _cat_id ORDER BY
score DESC LIMIT _n DO
            RETURN NEXT _user;
        END
    END
END

CREATE FUNCTION top_n_by_category_estimator( INTEGER ) RETURN INTEGER
LANGUAGE plpgsql AS $$
BEGIN
    RETURN $1 * (the estimated number of rows for the categories table taken
 from the table statistics);
END;

ALTER FUNCTION top_n_by_category SET ESTIMATOR top_n_by_category_estimator;

    Got it ?

    The array_srf case would be extremely useful as this type of function is
generally used to join against other tables, and having estimates is
useful for that.
    The top_n case would be useless if we're just returning the rows from the
function directly, but would be useful if we'll join them to other tables.

    This sounds pretty simple, powerful, and versatile.

    Additionally, in some cases (most notably the array case) it's easy to
estimate the statistics on the returned values because they're all in the
array already, so the mechanism could be extended to have a way of
returning a pseudo pg_stats for a Set Returning function.

    For instance, say you have a SRF which returns N random rows from a
table. It could have an estimator which would return a rowcount of N, and
a statistics estimator which would return the sats rows for the source
table, appropriately modified.

    This sounds harder to do.

    WHat do you think ?


















От:
PFC
Дата:

> But with all due respect to Joe, I think the reason that stuff got
> trimmed is that it didn't work very well.  In most cases it's
> *hard* to write an estimator for a SRF.  Let's see you produce
> one for dblink() for instance ...

    Good one...
    Well in some cases it'll be impossible, but suppose I have a function
get_id_for_something() which just grabs an ID using dblink, then I know it
returns one row, and pg would be interested in that information too !

От:
"Jim C. Nasby"
Дата:

On Sat, Apr 09, 2005 at 12:00:56AM -0400, Tom Lane wrote:
> Not too many releases ago, there were several columns in pg_proc that
> were intended to support estimation of the runtime cost and number of
> result rows of set-returning functions.  I believe in fact that these
> were the remains of Joe Hellerstein's thesis on expensive-function
> evaluation, and are exactly what he was talking about here:
> http://archives.postgresql.org/pgsql-hackers/2002-06/msg00085.php
>
> But with all due respect to Joe, I think the reason that stuff got
> trimmed is that it didn't work very well.  In most cases it's
> *hard* to write an estimator for a SRF.  Let's see you produce
> one for dblink() for instance ...

Actually, if the remote database supported a way to get a rows estimate
from the query passed to db_link, it would be trivial, since you'd just
pass that back.

In fact, having such a function (estimate_rows_for_sql(text)) would
probably be very useful to functions that wanted to support returning a
rows estimate.
--
Jim C. Nasby, Database Consultant               
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

От:
Tom Lane
Дата:

"Jim C. Nasby" <> writes:
> On Sat, Apr 09, 2005 at 12:00:56AM -0400, Tom Lane wrote:
>> But with all due respect to Joe, I think the reason that stuff got
>> trimmed is that it didn't work very well.  In most cases it's
>> *hard* to write an estimator for a SRF.  Let's see you produce
>> one for dblink() for instance ...

> Actually, if the remote database supported a way to get a rows estimate
> from the query passed to db_link, it would be trivial, since you'd just
> pass that back.

This assumes that (1) you have the complete query argument at the time
of estimation, and (2) it's OK to contact the remote database and do an
EXPLAIN at that time.  Both of these seem pretty shaky assumptions.

The larger point is that writing an estimator for an SRF is frequently a
task about as difficult as writing the SRF itself, and sometimes much
*more* difficult due to lack of information.  I don't foresee a whole
lot of use of an estimator hook designed as proposed here.  In
particular, if the API is such that we can only use the estimator when
all the function arguments are plan-time constants, it's not going to be
very helpful.

            regards, tom lane

От:
Neil Conway
Дата:

Tom Lane wrote:
> Not too many releases ago, there were several columns in pg_proc that
> were intended to support estimation of the runtime cost and number of
> result rows of set-returning functions.  I believe in fact that these
> were the remains of Joe Hellerstein's thesis on expensive-function
> evaluation

FYI, Hellerstein's thesis on xfunc optimization is available here:

     ftp://ftp.cs.wisc.edu/pub/tech-reports/reports/1996/tr1304.ps.Z

There's also a paper on this subject by Hellerstein that was published
in Transactions on Database Systems:

     http://www.cs.berkeley.edu/~jmh/miscpapers/todsxfunc.pdf

I haven't had a chance to digest either one yet, but it might be worth a
look.

-Neil

От:
Neil Conway
Дата:

Tom Lane wrote:
> The larger point is that writing an estimator for an SRF is frequently a
> task about as difficult as writing the SRF itself

True, although I think this doesn't necessarily kill the idea. If
writing an estimator for a given SRF is too difficult, the user is no
worse off than they are today. Hopefully there would be a fairly large
class of SRFs for which writing an estimator would be relatively simple,
and result in improved planner behavior.

> I don't foresee a whole lot of use of an estimator hook designed as
> proposed here.  In particular, if the API is such that we can only
> use the estimator when all the function arguments are plan-time
> constants, it's not going to be very helpful.

Yes :( One approach might be to break the function's domain into pieces
and have the estimator function calculate the estimated result set size
for each piece. So, given a trivial function like:

foo(int):
    if $1 < 10 then produce 100 rows
    else produce 10000 rows

If the planner has encoded the distribution of input tuples to the
function as a histogram, it could invoke the SRF's estimator function
for the boundary values of each histogram bucket, and use that to get an
idea of the function's likely result set size at runtime.

And yes, the idea as sketched is totally unworkable :) For one thing,
the difficulty of doing this grows rapidly as the number of arguments to
the function increases. But perhaps there is some variant of this idea
that might work...

Another thought is that the estimator could provide information on the
cost of evaluating the function, the number of tuples produced by the
function, and even the distribution of those tuples.

BTW, why is this on -performance? It should be on -hackers.

-Neil

От:
Josh Berkus
Дата:

People:

(HACKERS: Please read this entire thread at
http://archives.postgresql.org/pgsql-performance/2005-04/msg00179.php
Sorry for crossing this over.)

> > The larger point is that writing an estimator for an SRF is frequently a
> > task about as difficult as writing the SRF itself
>
> True, although I think this doesn't necessarily kill the idea. If
> writing an estimator for a given SRF is too difficult, the user is no
> worse off than they are today. Hopefully there would be a fairly large
> class of SRFs for which writing an estimator would be relatively simple,
> and result in improved planner behavior.

For that matter, even supplying an estimate constant would be a vast
improvement over current functionality.  I would suggest, in fact, that we
allow the use of either a constant number, or an estimator function, in that
column.  Among other things, this would allow implementing the constant
number right now and the use of an estimating function later, in case we can
do the one but not the other for 8.1.

To be more sophisticated about the estimator function, it could take a subset
of the main functions arguments, based on $1 numbering, for example:
CREATE FUNCTION some_func ( INT, TEXT, TEXT, INT, INT ) ...
ALTER FUNCTION some_func WITH ESTIMATOR some_func_est( $4, $5 )

This would make writing estimators which would work for several functions
easier.   Estimators would be a special type of functions which would take
any params and RETURN ESTIMATOR, which would be implicitly castable from some
general numeric type (like INT or FLOAT).

> > I don't foresee a whole lot of use of an estimator hook designed as
> > proposed here.  In particular, if the API is such that we can only
> > use the estimator when all the function arguments are plan-time
> > constants, it's not going to be very helpful.

Actually, 95% of the time I use SRFs they are accepting constants and not row
references.  And I use a lot of SRFs.

>
> Yes :( One approach might be to break the function's domain into pieces
> and have the estimator function calculate the estimated result set size
> for each piece. So, given a trivial function like:
>
> foo(int):
>     if $1 < 10 then produce 100 rows
>     else produce 10000 rows
>
> If the planner has encoded the distribution of input tuples to the
> function as a histogram, it could invoke the SRF's estimator function
> for the boundary values of each histogram bucket, and use that to get an
> idea of the function's likely result set size at runtime.
>
> And yes, the idea as sketched is totally unworkable :) For one thing,
> the difficulty of doing this grows rapidly as the number of arguments to
> the function increases. But perhaps there is some variant of this idea
> that might work...
>
> Another thought is that the estimator could provide information on the
> cost of evaluating the function, the number of tuples produced by the
> function, and even the distribution of those tuples.

Another possibility would be to support default values for all estimator
functions and have functions called in row context passed DEFAULT, thus
leaving it up to the estimator writer to supply median values for context
cases.  Or to simply take the "first" values and use those.

While any of these possibilites aren't ideal, they are an improvement over the
current "flat 1000" estimate.   As I said, even the ability to set a
per-function flat constant estimate would be an improvement.

> BTW, why is this on -performance? It should be on -hackers.

'cause I spend more time reading -performance, and I started the thread.
Crossed over now.

--
Josh Berkus
Aglio Database Solutions
San Francisco