Thread: Maximum function call nesting depth for regression tests


Maximum function call nesting depth for regression tests

From
Tom Lane
Date:
A few days ago I added a regression test that involves a plpgsql
function calling a sql function, which recurses back to the plpgsql
function, etc, to a depth of 10 cycles (ie 10 plpgsql function calls
and 10 sql function calls).  There are three buildfarm members that
are failing with "stack depth limit exceeded" errors on this test.
What should we do about that?  Possibilities include:

1. Back off the recursion nesting depth of the test to whatever
it takes to get those buildfarm critters happy.

2. Lobby the buildfarm owners to increase their ulimit -s settings.

3. Chisel things enough to get the case to pass, eg by reducing the
no-doubt-generous value of STACK_DEPTH_SLOP.

I don't especially care for choice #1.  To me, one of the things that
the regression tests ought to flag is whether a machine is so limited
that "reasonable" coding might fail.  If you can't do twenty or so
levels of function call you've got a mighty limited machine.  For
comparison, the parallel regression tests will probably fail if you
can't support twenty concurrent sessions, and nobody's seriously
advocated cutting that.

One point worth noting is that the failing machines are running on
IA64 or PPC64, and some of them seem to be only failing in some
branches.  So maybe there is some platform-specific effect here
that could be fixed with a narrow hack.  I'm not too hopeful though.

Thoughts?
        regards, tom lane


Re: Maximum function call nesting depth for regression tests

From
Robert Haas
Date:
On Sat, Oct 30, 2010 at 10:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> A few days ago I added a regression test that involves a plpgsql
> function calling a sql function, which recurses back to the plpgsql
> function, etc, to a depth of 10 cycles (ie 10 plpgsql function calls
> and 10 sql function calls).  There are three buildfarm members that
> are failing with "stack depth limit exceeded" errors on this test.
> What should we do about that?  Possibilities include:
>
> 1. Back off the recursion nesting depth of the test to whatever
> it takes to get those buildfarm critters happy.
>
> 2. Lobby the buildfarm owners to increase their ulimit -s settings.
>
> 3. Chisel things enough to get the case to pass, eg by reducing the
> no-doubt-generous value of STACK_DEPTH_SLOP.
>
> I don't especially care for choice #1.  To me, one of the things that
> the regression tests ought to flag is whether a machine is so limited
> that "reasonable" coding might fail.  If you can't do twenty or so
> levels of function call you've got a mighty limited machine.

Agreed.  So how much stack space do 10 or 20 nested calls actually use?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Maximum function call nesting depth for regression tests

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Oct 30, 2010 at 10:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I don't especially care for choice #1.  To me, one of the things that
>> the regression tests ought to flag is whether a machine is so limited
>> that "reasonable" coding might fail.  If you can't do twenty or so
>> levels of function call you've got a mighty limited machine.

> Agreed.  So how much stack space do 10 or 20 nested calls actually use?

I just did some testing with git HEAD on RHEL-5 machines (gcc 4.1.2).
It appears the actual stack consumption for one cycle (plpgsql to sql
back to plpgsql) is 4112 bytes on ia64, 4000 bytes on ppc32, 8256 bytes
on ppc64.  Of course these numbers could be expected to vary some
depending on compiler version and options, but 4K to 8K looks like
the expected number.

Now the odd thing about this is that we're not running up against
the actual kernel stack limit, because we're not dumping core.
What we're hitting is the max_stack_depth check; and the reason
that is odd is that we *never* set max_stack_depth to less than 100kB,
no matter what insane reading we might get from getrlimit.  So it seems
like there should be enough room for 10 of these cycles, certainly so
on the ia64 machine although maybe ppc64 is marginal.  So I'm back to
suspecting funny business on the buildfarm machines.

Just for the record, this is the call stack cycle we're talking about:

#0  plpgsql_call_handler (fcinfo=0x60000fffffcbf9b0) at pl_handler.c:100
#1  0x40000000003ae660 in ExecMakeFunctionResult (fcache=0x60000000001e6180, econtext=0x60000000001e5fc8, isNull=0x60000000001e6bc8 "\177~\177\177\177\177\177\177\bW'", isDone=0x60000000001e6d08) at execQual.c:1836
#2  0x40000000003a4c70 in ExecTargetList (projInfo=<value optimized out>, isDone=0x60000fffffcbfd60) at execQual.c:5110
#3  ExecProject (projInfo=<value optimized out>, isDone=0x60000fffffcbfd60) at execQual.c:5325
#4  0x40000000003e36d0 in ExecResult (node=<value optimized out>) at nodeResult.c:155
#5  0x40000000003a3b10 in ExecProcNode (node=0x60000000001e5eb0) at execProcnode.c:361
#6  0x40000000003d64f0 in ExecLimit (node=0x60000000001e5b70) at nodeLimit.c:89
#7  0x40000000003a3e50 in ExecProcNode (node=0x60000000001e5b70) at execProcnode.c:480
#8  0x40000000003a0630 in ExecutePlan (queryDesc=<value optimized out>, direction=<value optimized out>, count=1) at execMain.c:1236
#9  standard_ExecutorRun (queryDesc=<value optimized out>, direction=<value optimized out>, count=1) at execMain.c:282
#10 0x40000000003c15b0 in postquel_getnext (fcinfo=0x60000fffffcbfe00) at functions.c:475
#11 fmgr_sql (fcinfo=0x60000fffffcbfe00) at functions.c:704
#12 0x40000000003ae660 in ExecMakeFunctionResult (fcache=0x60000000001dde80, econtext=0x60000000001ddc58, isNull=0x60000000001df1a0 "\177~\177\177\177\177\177\177\030\356#", isDone=0x60000000001df2e0) at execQual.c:1836
#13 0x40000000003a4c70 in ExecTargetList (projInfo=<value optimized out>, isDone=0x60000fffffcc01b0) at execQual.c:5110
#14 ExecProject (projInfo=<value optimized out>, isDone=0x60000fffffcc01b0) at execQual.c:5325
#15 0x40000000003e36d0 in ExecResult (node=<value optimized out>) at nodeResult.c:155
#16 0x40000000003a3b10 in ExecProcNode (node=0x60000000001ddb40) at execProcnode.c:361
#17 0x40000000003a0630 in ExecutePlan (queryDesc=<value optimized out>, direction=<value optimized out>, count=2) at execMain.c:1236
#18 standard_ExecutorRun (queryDesc=<value optimized out>, direction=<value optimized out>, count=2) at execMain.c:282
#19 0x40000000003fdc40 in _SPI_execute_plan (plan=<value optimized out>, paramLI=0x60000000001afb60, snapshot=0x0, crosscheck_snapshot=0x0, read_only=0 '\000', fire_triggers=<value optimized out>, tcount=2) at spi.c:2092
#20 0x40000000003fe5e0 in SPI_execute_plan_with_paramlist (plan=0x600000000020767c, params=0x60000000001afb60, read_only=0 '\000', tcount=2) at spi.c:423
#22 0x2000000006193820 in exec_eval_expr (estate=0x60000fffffcc05a8, expr=0x6000000000204480, isNull=0x60000fffffcc05b8 "\001", rettype=0x60000fffffcc05bc) at pl_exec.c:4222
#23 0x200000000619fbf0 in exec_stmt (estate=0x60000fffffcc05a8, stmts=<value optimized out>) at pl_exec.c:2148
#24 exec_stmts (estate=0x60000fffffcc05a8, stmts=<value optimized out>) at pl_exec.c:1239
#25 0x20000000061a1a30 in exec_stmt_if (estate=0x60000fffffcc05a8, stmt=<value optimized out>) at pl_exec.c:1479
#26 0x200000000619dd00 in exec_stmt (estate=0x60000fffffcc05a8, stmts=<value optimized out>) at pl_exec.c:1288
#27 exec_stmts (estate=0x60000fffffcc05a8, stmts=<value optimized out>) at pl_exec.c:1239
#28 0x200000000619c810 in exec_stmt_block (estate=0x60000fffffcbf9b0, block=0x0) at pl_exec.c:1177
#29 0x20000000061a3660 in plpgsql_exec_function (func=0x60000000001547c8, fcinfo=0x60000fffffcc09c0) at pl_exec.c:317
#30 0x2000000006186ae0 in plpgsql_call_handler (fcinfo=0x10000) at pl_handler.c:122
#31 0x40000000003ae660 in ExecMakeFunctionResult (fcache=0x60000000001cd230, econtext=0x60000000001cd078,

I haven't looked to see if any of these have an excessive amount of
local variables.
        regards, tom lane


Re: Maximum function call nesting depth for regression tests

From
Tom Lane
Date:
I wrote:
> I haven't looked to see if any of these have an excessive amount of
> local variables.

I poked through the call stack and found that the only function in
this nest that seems to have a large amount of local variables is
ExecMakeFunctionResult().  The space hog there is the local
FunctionCallInfoData struct, which requires ~500 bytes on a 32-bit
machine and ~900 bytes on a 64-bit one.  Now the interesting thing
about that is that we *also* keep a FunctionCallInfoData struct in
the FuncExprState.  The reason for this redundancy is stated to be:
    /*
     * For non-set-returning functions, we just use a local-variable
     * FunctionCallInfoData.  For set-returning functions we keep the callinfo
     * record in fcache->setArgs so that it can survive across multiple
     * value-per-call invocations.  (The reason we don't just do the latter
     * all the time is that plpgsql expects to be able to use simple
     * expression trees re-entrantly.  Which might not be a good idea, but the
     * penalty for not doing so is high.)
     */

AFAICS this argument is no longer valid given the changes we made last
week to avoid collisions due to reuse of fcinfo->flinfo->fn_extra.
I'm pretty strongly tempted to get rid of the local-variable
FunctionCallInfoData and just use the one in the FuncExprState always.
That would simplify and marginally speed up the function-call code,
which seems surely worth doing regardless of any savings in stack
depth.

I'm not sure that this would save enough stack space to make the
buildfarm critters happy, but it might.  However, I wouldn't care
to risk changing this in the back branches, so we'd still need some
other plan for fixing the buildfarm failures.

Any objections?
        regards, tom lane