Обсуждение: Function execution costs 'n all that
So I've been working on the scheme I suggested a few days ago of
representing "equivalence classes" of variables explicitly, and avoiding
the current ad-hocery of generating and then removing redundant clauses
in favor of generating only the ones we want in the first place.  Any
clause that looks like an equijoin gets sent to the EquivalenceClass
machinery by distribute_qual_to_rels, and not put into the
restrictlist/joinlist data structure at all.  Then we make passes over
the EquivalenceClass lists at appropriate times to generate the clauses
we want.  This is turning over well enough now to pass the regression
tests, but I noticed that one query in opr_sanity got a whole lot slower
than before.  The query is
SELECT p1.opcname, p1.opcfamily
FROM pg_opclass AS p1
WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS p2                WHERE p2.amopfamily = p1.opcfamily                  AND
binary_coercible(p1.opcintype,p2.amoplefttype));
 
and investigation showed that the plan changed from (8.2 and before)
Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)  Filter: (NOT (subplan))  SubPlan    ->  Seq Scan on
pg_amopp2  (cost=0.00..7.66 rows=2 width=0)          Filter: ((amopfamily = $0) AND binary_coercible($1,
amoplefttype))
to
Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)  Filter: (NOT (subplan))  SubPlan    ->  Seq Scan on
pg_amopp2  (cost=0.00..7.66 rows=2 width=0)          Filter: (binary_coercible($1, amoplefttype) AND (amopfamily =
$0))
thus resulting in many more calls of binary_coercible() which is a
pretty expensive function.  This is not too surprising: the clause
p2.amopfamily = p1.opcfamily got diverted through the EquivalenceClass
code for just long enough to end up behind the other one in the table's
restrictlist :-(
In short, this approach results in a whole lot less stability in the
order in which WHERE clauses are evaluated.  That might be a killer
objection to the whole thing, but on the other hand we've never made
any strong promises about WHERE evaluation order.
Instead, I'm thinking it might be time to re-introduce some notion of
function execution cost into the system, and make use of that info to
sort WHERE clauses into a reasonable execution order.  This example
would be fixed with even a very stupid rule-of-thumb about SQL functions
being more expensive than C functions, but if we're going to go to the
trouble it seems like it'd be a good idea to provide a way to label
user-defined functions with execution costs.
Would a simple constant value be workable, or do we need some more
complex model (and if so what)?
Comments?
        regards, tom lane
			
		Tom Lane wrote: > Instead, I'm thinking it might be time to re-introduce some notion of > function execution cost into the system, and make use of that info to > sort WHERE clauses into a reasonable execution order. That sounds like a straightforward idea. > This example > would be fixed with even a very stupid rule-of-thumb about SQL functions > being more expensive than C functions, but if we're going to go to the > trouble it seems like it'd be a good idea to provide a way to label > user-defined functions with execution costs. Agreed. > Would a simple constant value be workable, or do we need some more > complex model (and if so what)? A simple constant would probably be enough. If we want anything fancier than that, it should be up to the author of the function to define the cost model as well. I'm envisioning that you could attach a custom cost function to a user-defined function which would return the estimated CPU cost. And # of rows returned for a set-returning function. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> Would a simple constant value be workable, or do we need some more
>> complex model (and if so what)?
> A simple constant would probably be enough. If we want anything fancier 
> than that, it should be up to the author of the function to define the 
> cost model as well. I'm envisioning that you could attach a custom cost 
> function to a user-defined function which would return the estimated CPU 
> cost. And # of rows returned for a set-returning function.
But what will such an estimation function work on?  In general the
planner does not have the value(s) that will be passed to the actual
function at runtime.  It might have expressions or estimates but
those data structures are certainly not something we could pass to
non-C-coded functions.  Are we willing to restrict these functions
to being coded in C, as selectivity estimation functions are?
If we go this route it seems like we'll need four more columns in
pg_proc (cost estimation function OID, rowcount estimation function OID,
fallback cost constant, fallback rowcount constant).
BTW, I'm thinking that a "cost constant" probably ought to be measured
in units of cpu_operator_cost.  The default for built-in functions would
thus be 1, at least till such time as someone wants to refine the
estimates.  We'd probably want the default for PL and SQL functions to
be 10 or 100 or so.
        regards, tom lane
			
		Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> A simple constant would probably be enough. If we want anything fancier >> than that, it should be up to the author of the function to define the >> cost model as well. I'm envisioning that you could attach a custom cost >> function to a user-defined function which would return the estimated CPU >> cost. And # of rows returned for a set-returning function. > > But what will such an estimation function work on? In general the > planner does not have the value(s) that will be passed to the actual > function at runtime. It might have expressions or estimates but > those data structures are certainly not something we could pass to > non-C-coded functions. Are we willing to restrict these functions > to being coded in C, as selectivity estimation functions are? Yeah, I don't know. If the planner knows the actual value, that would certainly be the easiest for the UDF writer to work with. Anything more than that gets really complicated. > If we go this route it seems like we'll need four more columns in > pg_proc (cost estimation function OID, rowcount estimation function OID, > fallback cost constant, fallback rowcount constant). What would the fallbacks be for? > BTW, I'm thinking that a "cost constant" probably ought to be measured > in units of cpu_operator_cost. The default for built-in functions would > thus be 1, at least till such time as someone wants to refine the > estimates. We'd probably want the default for PL and SQL functions to > be 10 or 100 or so. Agreed. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Tom Lane wrote:
>> If we go this route it seems like we'll need four more columns in
>> pg_proc (cost estimation function OID, rowcount estimation function OID,
>> fallback cost constant, fallback rowcount constant).
> What would the fallbacks be for?
By "fallback" I meant "this is what to use if no estimation function is
provided".
I'm envisioning that the CREATE FUNCTION syntax would add optional
clauses
    COST function-name-or-numeric-constant    ROWS function-name-or-numeric-constant
that would be used to fill these columns.
        regards, tom lane
			
		On Mon, 15 Jan 2007, Tom Lane wrote: > So I've been working on the scheme I suggested a few days ago of > representing "equivalence classes" of variables explicitly, and avoiding > the current ad-hocery of generating and then removing redundant clauses > in favor of generating only the ones we want in the first place. Any > clause that looks like an equijoin gets sent to the EquivalenceClass > machinery by distribute_qual_to_rels, and not put into the > restrictlist/joinlist data structure at all. Then we make passes over > the EquivalenceClass lists at appropriate times to generate the clauses > we want. This is turning over well enough now to pass the regression > tests, That was quick... > In short, this approach results in a whole lot less stability in the > order in which WHERE clauses are evaluated. That might be a killer > objection to the whole thing, but on the other hand we've never made > any strong promises about WHERE evaluation order. Showing my ignorance here, but I've never been a fan of "syntax based optimization," though it is better than no optimization. If people are counting on order for optimization, then, hmmm... If you can provide a way to at least _try_ to do better, then don't worry about it. It will improve with time. > Instead, I'm thinking it might be time to re-introduce some notion of > function execution cost into the system, and make use of that info to > sort WHERE clauses into a reasonable execution order. Ingres did/does it that way, IIRC. It's a solid strategy. > This example > would be fixed with even a very stupid rule-of-thumb about SQL functions > being more expensive than C functions, but if we're going to go to the > trouble it seems like it'd be a good idea to provide a way to label > user-defined functions with execution costs. > > Would a simple constant value be workable, or do we need some more > complex model (and if so what)? Ingres would, if I'm not mistaken, gain through historical use through histograms. Short of that, you've got classes of functions, agregations, for example, and there's sure to be missing information to make a great decision at planning time. However, I take it that the cost here is primarily CPU and not I/O. I therefore propose that the engine evaluate - benchmark, if you will - all functions as they are ingested, or vacuum-like at some later date (when valid data for testing may exist), and assign a cost relative to what it already knows - the built-ins, for example. Doing so could allow this strategy to be functional in short order and be improved with time so all the work doesn't have to be implemented on day 1. And, DBA/sys-admin tweaking can always be done by updating the catalogues. HTH, Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote: > I therefore propose that the engine evaluate - > benchmark, if you will - all functions as they are ingested, or > vacuum-like at some later date (when valid data for testing may exist), > and assign a cost relative to what it already knows - the built-ins, for > example. That seems pretty unworkable. It is unsafe, for one: evaluating a function may have side effects (inside or outside the database), so the DBMS cannot just invoke user-defined functions at whim. Also, the relationship between a function's arguments and its performance will often be highly complex -- it would be very difficult, not too mention computationally infeasible, to reconstruct that relationship automatically, especially without any real knowledge about the function's behavior. -Neil
On Mon, 15 Jan 2007, Neil Conway wrote: > On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote: > > I therefore propose that the engine evaluate - > > benchmark, if you will - all functions as they are ingested, or > > vacuum-like at some later date (when valid data for testing may exist), > > and assign a cost relative to what it already knows - the built-ins, for > > example. > > That seems pretty unworkable. It is unsafe, for one: evaluating a > function may have side effects (inside or outside the database), so the > DBMS cannot just invoke user-defined functions at whim. Also, the > relationship between a function's arguments and its performance will > often be highly complex -- it would be very difficult, not too mention > computationally infeasible, to reconstruct that relationship > automatically, especially without any real knowledge about the > function's behavior. > > -Neil Hi Neil, Tom had already proposed: > > I'm envisioning that the CREATE FUNCTION syntax would add optional > clauses > > COST function-name-or-numeric-constant > ROWS function-name-or-numeric-constant > > that would be used to fill these columns. I was considering these ideas in the mix; let the user provide either a numeric or a function, the distinction here being that instead of running that function at planning-time, it could be run "off-line", so to speak. Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 rtroy@ScienceTools.com, http://ScienceTools.com/
Neil Conway wrote:<br /><blockquote cite="mid1168887288.6174.109.camel@localhost.localdomain" type="cite"><pre wrap="">OnMon, 2007-01-15 at 10:51 -0800, Richard Troy wrote: </pre><blockquote type="cite"><pre wrap="">I therefore proposethat the engine evaluate - benchmark, if you will - all functions as they are ingested, or vacuum-like at some later date (when valid data for testing may exist), and assign a cost relative to what it already knows - the built-ins, for example. </pre></blockquote><pre wrap=""> That seems pretty unworkable. It is unsafe, for one: evaluating a function may have side effects (inside or outside the database), so the DBMS cannot just invoke user-defined functions at whim. Also, the relationship between a function's arguments and its performance will often be highly complex -- it would be very difficult, not too mention computationally infeasible, to reconstruct that relationship automatically, especially without any real knowledge about the function's behavior. </pre></blockquote> Non-developer here, but we use a lot of plpgsql functions at work. And the functionswe use fall into two broad, ill-defined catagories- "expensive" functions and "cheap" functions. What I'd likeas a user is some way to tell the planner "this function is expensive- prefer plans which call this function less evenif they're otherwise more expensive" or "this function is cheap, prefer plans that are otherwise less expensive evenif they call this function more often". Precise cost estimates aren't that important, IMHO.<br /><br /> Brian<br /><br/>
Brian Hurt <bhurt@janestcapital.com> writes:
> Non-developer here, but we use a lot of plpgsql functions at work.  And 
> the functions we use fall into two broad, ill-defined catagories- 
> "expensive" functions and "cheap" functions.  What I'd like as a user is 
> some way to tell the planner "this function is expensive- prefer plans 
> which call this function less even if they're otherwise more expensive" 
> or "this function is cheap, prefer plans that are otherwise less 
> expensive even if they call this function more often".  Precise cost 
> estimates aren't that important, IMHO.
Right, so a plain constant cost would be plenty for your situation.
I suspect there's an 80/20 rule at work here --- the estimator-function
side of this will take most of the effort to design/implement, but not
get used nearly as much as the plain-constant form ... maybe we should
just do the constant for starters and see how many people really want to
write C-code estimators ...
        regards, tom lane
			
		On Mon, 2007-01-15 at 15:05 -0500, Tom Lane wrote: > maybe we should just do the constant for starters and see how many > people really want to write C-code estimators ... +1 BTW, your proposal would still pushdown all qualifiers, right? Hellerstein's xfunc work discusses situations in which it makes sense to pullup expensive qualifiers above joins, for example, in order to reduce the number of tuples the qualifier is applied to. Unfortunately, this would probably increase the optimizer's search space by a fairly significant degree, so it might need to be controlled by a GUC variable, or only applied when the estimated cost of applying a qualifier is particularly large relative to the total estimated cost of the plan. -Neil
Neil Conway <neilc@samurai.com> writes:
> BTW, your proposal would still pushdown all qualifiers, right?
Yeah, I have no intention of readopting xfunc in the near future ...
especially seeing that it's possible for the user to force that
sort of thing if he really has to.
SELECT * FROM (SELECT ... OFFSET 0) ss    WHERE expensive_function(...);
        regards, tom lane
			
		On Mon, 2007-01-15 at 15:05 -0500, Tom Lane wrote:
> Brian Hurt <bhurt@janestcapital.com> writes:
> > Non-developer here, but we use a lot of plpgsql functions at work.  And 
> > the functions we use fall into two broad, ill-defined catagories- 
> > "expensive" functions and "cheap" functions.  What I'd like as a user is 
> > some way to tell the planner "this function is expensive- prefer plans 
> > which call this function less even if they're otherwise more expensive" 
> > or "this function is cheap, prefer plans that are otherwise less 
> > expensive even if they call this function more often".  Precise cost 
> > estimates aren't that important, IMHO.
> 
> Right, so a plain constant cost would be plenty for your situation.
> 
> I suspect there's an 80/20 rule at work here --- the estimator-function
> side of this will take most of the effort to design/implement, but not
> get used nearly as much as the plain-constant form ... maybe we should
> just do the constant for starters and see how many people really want to
> write C-code estimators ...
> 
>             regards, tom lane
Hi Tom et al,
Having worked with stored procedures on large datasets for reporting, I
would say that it would be useful to have a non-constant estimator for
the number of rows, whereas a single CPU cost constant should be fine.
Where I have struggled with this has been joining onto slightly more
exotic queries when doing large scale data processing as part of a
custom report or an application upgrade.
Using PL/PGSQL I would find it useful to have access to the constants
passed into a function to be used to help provide a row count estimate
(typically useful for passing in table/column names), e.g.
SELECT * FROM my_func('my_table1') AS t1, my_table2 AS t2 WHERE t1.id =
t2.id;
CREATE FUNCTION my_func(text) AS $$
...
$$ LANGUAGE 'plpgsql' COST 1.0 ROWS my_func_row_cost;
In my cost function, I could then estimate the number of rows using
something like below, where all constants are passed into the cost
function as parameters, e.g.:
CREATE FUNCTION my_func_row_cost(text) AS $$
DECLAREfoo bigint;
BEGINEXECUTE INTO foo 'SELECT COUNT(*) FROM ' || quote_literal($1);RETURN foo;
END;
$$ LANGUAGE 'plpgsql';
In the case where a parameter was not a constant but a column name, then
it would be reasonable in my mind to simply replace that parameter with
NULL when passing to the row cost function, e.g.
SELECT * FROM my_table1 WHERE my_table1.junk = (SELECT
my_func(my_table1.name));
In this case, the text parameter passed to my_func_row_cost would be
replaced by NULL to indicate that it was non-constant.
Of course, even with constants passed upon input, it still may not be
possible to calculate a number of rows that can be returned - it could
be the case that the only parameter passed to cost function has been
converted to NULL because it is a column name. Perhaps in this case it
would be useful to specify returning NULL from my_func_row_cost means "I
can't return anything meaningful, so use the fallback values".
Kind regards,
Mark.
			
		> Tom Lane wrote: > > Instead, I'm thinking it might be time to re-introduce some notion of > > function execution cost into the system, and make use of that info to > > sort WHERE clauses into a reasonable execution order. I imagine you've thought of this already but just in case, the cost of the function call has to be combined with the selectivity to get this right. If you can do an expensive but very selective clause first and save 100 cheap calls that almost always return true it may still be worthwhile. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Gregory Stark <gsstark@mit.edu> writes:
> I imagine you've thought of this already but just in case, the cost of the
> function call has to be combined with the selectivity to get this right. If
> you can do an expensive but very selective clause first and save 100 cheap
> calls that almost always return true it may still be worthwhile.
I've thought of it, but I haven't figured out a reasonable algorithm for
ordering the clauses in view of that.  Have you?
        regards, tom lane
			
		"Tom Lane" <tgl@sss.pgh.pa.us> writes: > Gregory Stark <gsstark@mit.edu> writes: >> I imagine you've thought of this already but just in case, the cost of the >> function call has to be combined with the selectivity to get this right. If >> you can do an expensive but very selective clause first and save 100 cheap >> calls that almost always return true it may still be worthwhile. > > I've thought of it, but I haven't figured out a reasonable algorithm for > ordering the clauses in view of that. Have you? Hum, I hadn't tried. Now that I think about it it's certainly not obvious. And picturing the possible failure modes I would rather it execute cheap expressions more often than necessary than call some user-defined perl function that could be doing i/o or involve waiting on other resources any more than absolutely necessary. So I guess what you originally described is safest. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote: > > BTW, I'm thinking that a "cost constant" probably ought to be measured > in units of cpu_operator_cost. The default for built-in functions would > thus be 1, at least till such time as someone wants to refine the > estimates. We'd probably want the default for PL and SQL functions to > be 10 or 100 or so. Any chance that costs could eventually change to real-world units? It's hard for me to guess how many cpu_operator_cost units something might take; but relatively easy for me to measure or estimate in fractions-of-a-seconds how long something takes. I could imagine having the other planner costs be measured in seconds too - perhaps with the goal of eventually writing some auto-config code that tries to measure values like cpu_tuple_cost on a given piece of hardware.
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
> Tom Lane wrote:
>> BTW, I'm thinking that a "cost constant" probably ought to be measured
>> in units of cpu_operator_cost.
> Any chance that costs could eventually change to real-world units?
Define "real world units".
If you like you can try to adjust things so that cpu_operator_cost bears
some relation to elapsed-msec-on-your-own-machine, and scale everything
else accordingly; indeed that's why we added seq_page_cost in 8.2.
But it's awfully hard to see the numbers being very portable.
        regards, tom lane
			
		On Mon, 2007-01-15 at 13:54 -0500, Neil Conway wrote: > On Mon, 2007-01-15 at 10:51 -0800, Richard Troy wrote: > > I therefore propose that the engine evaluate - > > benchmark, if you will - all functions as they are ingested, or > > vacuum-like at some later date (when valid data for testing may exist), > > and assign a cost relative to what it already knows - the built-ins, for > > example. > > That seems pretty unworkable. It is unsafe, for one: evaluating a > function may have side effects (inside or outside the database), so the Would any form of cost estimate have meaning if the function has side effects? If it's a volatile function, doesn't that mean that the planner can't avoid or favor executing it? Regards,Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> Would any form of cost estimate have meaning if the function has side
> effects? If it's a volatile function, doesn't that mean that the planner
> can't avoid or favor executing it?
No, not really.  If the function is down inside a sub-select or
something like that, the number of executions could depend on any number
of factors (like whether we put it on the inside or outside of a join)
--- and if it's expensive then we will and should try to arrange the
query to minimize the number of executions.  We're certainly not going
to drop back to all-plain-nestloops just because the query contains one
volatile function.
(Now mind you, a smart user will probably avoid using a volatile
function like that, but I don't think it's an adequate argument for
saying that we don't need cost information.)
        regards, tom lane
			
		Neil Conway <neilc@samurai.com> writes:
> On Mon, 2007-01-15 at 15:05 -0500, Tom Lane wrote:
>> maybe we should just do the constant for starters and see how many
>> people really want to write C-code estimators ...
> +1
It seemed like that was the list's consensus, so I'll go off and do the
simple-constant case.  We can always extend it later.
For reference, the plan is to add these options to CREATE/ALTER
FUNCTION:
COST float-constant        (defaults to 1)ROWS float-constant        (defaults to 1000)
feeding into two float4 columns added to pg_proc.  ROWS is only
allowed/meaningful for set-returning functions.  COST is implicitly
scaled by cpu_operator_cost.
        regards, tom lane
			
		I complained about how:
> The query is
> SELECT p1.opcname, p1.opcfamily
> FROM pg_opclass AS p1
> WHERE NOT EXISTS(SELECT 1 FROM pg_amop AS p2
>                  WHERE p2.amopfamily = p1.opcfamily
>                    AND binary_coercible(p1.opcintype, p2.amoplefttype));
> and investigation showed that the plan changed from (8.2 and before)
>  Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
>            Filter: ((amopfamily = $0) AND binary_coercible($1, amoplefttype))
> to
>  Seq Scan on pg_opclass p1  (cost=0.00..393.94 rows=51 width=68)
>    Filter: (NOT (subplan))
>    SubPlan
>      ->  Seq Scan on pg_amop p2  (cost=0.00..7.66 rows=2 width=0)
>            Filter: (binary_coercible($1, amoplefttype) AND (amopfamily = $0))
Now that some function-cost smarts are in there, I expected to see the
plan go back to the first case, but what I actually see in CVS HEAD is
Seq Scan on pg_opclass p1  (cost=0.00..660.35 rows=51 width=68)  Filter: (NOT (subplan))  SubPlan    ->  Bitmap Heap
Scanon pg_amop p2  (cost=4.29..8.60 rows=2 width=0)          Recheck Cond: (amopfamily = $0)          Filter:
binary_coercible($1,amoplefttype)          ->  Bitmap Index Scan on pg_amop_fam_strat_index  (cost=0.00..4.29 rows=5
width=0)               Index Cond: (amopfamily = $0)
 
The reason this happens is that cost_qual_eval charges the entire cost of
evaluating all the arms of an AND, even though we'll drop out as soon as
something returns FALSE; and so the planner is led to avoid the seqscan
because it now appears to have a high filter-condition evaluation cost,
in favor of a plan that will evaluate the filter condition many fewer
times.  In reality those two plans will call binary_coercible() exactly
the same number of times, and so this is a bogus reason to switch.
I'm kind of inclined to leave it alone though, because the second plan
seems a bit more "failsafe".  To do anything differently, we'd have to
order the qual conditions the way we expect to execute them before
any use of cost_qual_eval, which sounds expensive; and as noted in
an upthread discussion with Greg, relying on the correctness of *both*
cost and selectivity estimates seems a tad fragile.
Comments?
        regards, tom lane
			
		Tom Lane wrote: > Would a simple constant value be workable, or do we need some more > complex model (and if so what)? Consider: ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03; ANALYZE myfunc(text,text) ON (SELECT myfunc(mt.a,mt.b) FROM mytable mt) WITH RATIO 1.071; ANALYZE myfunc(text,text) ON ( SELECT myfunc(mt.a,mt.b) FROM mytable mt UNION SELECT myfunc(ot.a,ot.b) FROM othertableot ) WITH RATIO 0.5; These commands could turn on function timing for the lifespan of the query, with statistics gathered about the given function's runtimes. The "WITH RATIO" clause would be there to translate runtimes (in milliseconds) into units of cpu_operator_cost. The "WITH RATIO" clause could be optional, with a default ratio taken from the postgresql.conf file, if any exists, and finally defaulting to a hardcoded "reasonable" value. Users would be well advised to adopt a consistent policy regarding system load at the time that various analyze functions are run. If the function has side effects, it would be the user's responsibility to not analyze the function unless those side effects are acceptable. The user can only analyze those queries that the user has permissions to run, so there shouldn't be any additional ability to generate side-effects beyond what the user already has permission to do. The syntax might need some adjusting to make the parser happy and to avoid new reserved words. The syntax used above is just an example. It seems to me that the above system would work perfectly well for collecting the number of rows returned from a set returning function, not just the run times. mark
Mark Dilger <pgsql@markdilger.com> writes:
> Tom Lane wrote:
>> Would a simple constant value be workable, or do we need some more
>> complex model (and if so what)?
> Consider:
> ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03;
> ...
> It seems to me that the above system would work perfectly well for
> collecting the number of rows returned from a set returning function,
> not just the run times.
I don't really think that data collection is the bottleneck.  If a
constant estimate isn't good enough for you, then you need some kind of
model of how the runtime or number of rows varies with the function's
inputs ... and I hardly see how something like the above is likely to
figure out how to fit a good model.  Or at least, if you think it can,
then you skipped all the interesting bits.
One other point is that we already know that sampling overhead and
measurement error are significant problems when trying to measure
intervals on the order of one Plan-node execution.  I'm afraid that
would get a great deal worse if we try to use a similar approach to
timing individual function calls.
        regards, tom lane
			
		Tom Lane wrote: > Mark Dilger <pgsql@markdilger.com> writes: >> Tom Lane wrote: >>> Would a simple constant value be workable, or do we need some more >>> complex model (and if so what)? > >> Consider: >> ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03; >> ... >> It seems to me that the above system would work perfectly well for >> collecting the number of rows returned from a set returning function, >> not just the run times. > > I don't really think that data collection is the bottleneck. Ahh, I'm not just thinking about data collection. I'm thinking about usability for non-hackers who know enough plpgsql to write a function and then want to train the system to plan for it appropriately. It's a much easier task for a novice user to say "go away and figure out how expensive this thing is" than for a novice to think about things like statistical variance, etc. We don't demand that users have that kind of knowledge to write queries or run analyze on tables, so why would they need that kind of knowledge to write a function? > If a > constant estimate isn't good enough for you, then you need some kind of > model of how the runtime or number of rows varies with the function's > inputs ... and I hardly see how something like the above is likely to > figure out how to fit a good model. Or at least, if you think it can, > then you skipped all the interesting bits. I am (perhaps naively) imagining that the user will train the database over the same query as the one that will actually get used most often in production. In the case that the query modifies the table, the user could train the database over a copy of that table. The data collected by the analyze phase would just be constant stuff like average and stddev. That would make the job of the planner / cost estimator easier, right? It could treat the function as a constant cost function. > One other point is that we already know that sampling overhead and > measurement error are significant problems when trying to measure > intervals on the order of one Plan-node execution. I'm afraid that > would get a great deal worse if we try to use a similar approach to > timing individual function calls. The query could be run with the arguments passed to "myfunc" being recorded to a temporary table. After the query is complete (and the temporary table populated), data from the temp table could be pulled into memory in batches, with the "myfunc" run on them again in a tight loop. The loop itself could be timed, rather than each iteration. The sum of all the timings for the various loops would then be the final runtime which would be divided by the total number of rows to get the average runtime per call. The downside is that I don't see how you retrieve the standard deviation. (I also don't know if the planner knows how to use standard deviation information, so perhaps this is a non issue.) A further refinement would be to batch the inputs based on properties of the input data. For text, you could run a batch of short text first, a batch of medium second, and a batch of long text last, and use best-fit linear algebra to determine the runtime cost vs. input text length function. I'm not sure how such a refinement would be done for fixed size datatypes. And for some text functions the runtime won't vary with length but with some other property anyway. mark