Discussion: Why is infinite_recurse test suddenly failing?


Why is infinite_recurse test suddenly failing?

From: Tom Lane
Date:
In the past week, four different buildfarm members have shown
non-reproducing segfaults in the "select infinite_recurse()"
test case, rather than the expected detection of stack overrun
before we get to the point of a segfault.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bonito&dt=2019-05-01%2023%3A05%3A36
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=takin&dt=2019-05-01%2008%3A16%3A48
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=buri&dt=2019-04-27%2023%3A54%3A46
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-04-27%2014%3A55%3A52

They're all on HEAD, and they all look like

2019-05-01 23:11:00.145 UTC [13933:65] LOG:  server process (PID 17161) was terminated by signal 11: Segmentation fault
2019-05-01 23:11:00.145 UTC [13933:66] DETAIL:  Failed process was running: select infinite_recurse();

I scraped the buildfarm database and verified that there are no similar
failures for at least three months back; nor, offhand, can I remember ever
having seen this test fail in many years.  So it seems we broke something
recently.  No idea what though.

(Another possibility, seeing that these are all members of Mark's PPC64
flotilla, is that there's some common misconfiguration --- but it's hard
to credit that such a problem would only affect HEAD not the back
branches.)

Thoughts?

            regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

From: Andres Freund
Date:
Hi,

On 2019-05-02 11:02:03 -0400, Tom Lane wrote:
> In the past week, four different buildfarm members have shown
> non-reproducing segfaults in the "select infinite_recurse()"
> test case, rather than the expected detection of stack overrun
> before we get to the point of a segfault.

I was just staring at bonito's failure in confusion.


> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bonito&dt=2019-05-01%2023%3A05%3A36
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=takin&dt=2019-05-01%2008%3A16%3A48
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=buri&dt=2019-04-27%2023%3A54%3A46
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-04-27%2014%3A55%3A52
> 
> They're all on HEAD, and they all look like
> 
> 2019-05-01 23:11:00.145 UTC [13933:65] LOG:  server process (PID 17161) was terminated by signal 11: Segmentation fault
> 2019-05-01 23:11:00.145 UTC [13933:66] DETAIL:  Failed process was running: select infinite_recurse();
> 
> I scraped the buildfarm database and verified that there are no similar
> failures for at least three months back; nor, offhand, can I remember ever
> having seen this test fail in many years.  So it seems we broke something
> recently.  No idea what though.

I can't recall any recent changes to the relevant area of code.


> (Another possibility, seeing that these are all members of Mark's PPC64
> flotilla, is that there's some common misconfiguration --- but it's hard
> to credit that such a problem would only affect HEAD not the back
> branches.)

Hm, I just noticed:
                                       'HEAD' => [
                                                   'force_parallel_mode = regress'
                                                 ]

on all those animals. So it's not necessarily the case that HEAD and
backbranch runs are behaving all that identically.  Note that this isn't a
recent config change, so it's not an explanation as to why they started
to fail only recently.

Greetings,

Andres Freund



Re: Why is infinite_recurse test suddenly failing?

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Hm, I just noticed:
>                                        'HEAD' => [
>                                                    'force_parallel_mode = regress'
>                                                  ]

Oooh, I didn't see that.

> on all those animals. So it's not necessarily the case that HEAD and
> backbranch runs are behaving all that identically.  Note that this isn't a
> recent config change, so it's not an explanation as to why they started
> to fail only recently.

No, but it does point at another area of the code in which a relevant
change could've occurred.

While we're looking at this --- Mark, if you could install gdb
on your buildfarm hosts, that would be really handy.  I think that's
the only extra thing the buildfarm script needs to extract stack
traces from core dumps.  We'd likely already know where the problem
is if we had a stack trace ...

            regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

From: Mark Wong
Date:
On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Hm, I just noticed:
> >                                        'HEAD' => [
> >                                                    'force_parallel_mode = regress'
> >                                                  ]
> 
> Oooh, I didn't see that.
> 
> > on all those animals. So it's not necessarily the case that HEAD and
> > backbranch runs are behaving all that identically.  Note that this isn't a
> > recent config change, so it's not an explanation as to why they started
> > to fail only recently.
> 
> No, but it does point at another area of the code in which a relevant
> change could've occurred.
> 
> While we're looking at this --- Mark, if you could install gdb
> on your buildfarm hosts, that would be really handy.  I think that's
> the only extra thing the buildfarm script needs to extract stack
> traces from core dumps.  We'd likely already know where the problem
> is if we had a stack trace ...

Ok, I think I have gdb installed now...

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: Why is infinite_recurse test suddenly failing?

From: Tom Lane
Date:
Mark Wong <mark@2ndquadrant.com> writes:
> On Thu, May 02, 2019 at 11:45:34AM -0400, Tom Lane wrote:
>> While we're looking at this --- Mark, if you could install gdb
>> on your buildfarm hosts, that would be really handy.  I think that's
>> the only extra thing the buildfarm script needs to extract stack
>> traces from core dumps.  We'd likely already know where the problem
>> is if we had a stack trace ...

> Ok, I think I have gdb installed now...

Thanks!

            regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

From: Andres Freund
Date:
Hi,

On 2019-05-03 10:08:59 -0700, Mark Wong wrote:
> Ok, I think I have gdb installed now...

Thanks! Any chance you could turn on force_parallel_mode for the other
branches it applies to too? Makes it easier to figure out whether
breakage is related to that, or independent.

Greetings,

Andres Freund



Re: Why is infinite_recurse test suddenly failing?

From: Tom Lane
Date:
We just got another one of these, on yet another member of Mark's
ppc64 armada:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=shoveler&dt=2019-05-10%2014%3A04%3A34

Now we have a stack trace (thanks Mark!), but it is pretty unsurprising:

Core was generated by `postgres: debian regression [local] SELECT                                     '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
2748    malloc.c: No such file or directory.
#0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
#1  0x00003fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
#2  0x00003fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
#3  0x00000000106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
#4  0x00000000103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
#5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
#6  0x00000000103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
#7  0x00000000103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893
#8  0x00000000103d2cbc in eval_const_expressions_mutator (node=0x1000babe9c0, context=0x3fffe3da15d0) at clauses.c:3606
#9  0x00000000103510c8 in expression_tree_mutator (node=<optimized out>, mutator=<optimized out>, context=<optimized out>) at nodeFuncs.c:2942
#10 0x00000000103d2cbc in eval_const_expressions_mutator (node=0x1000babea40, context=0x3fffe3da15d0) at clauses.c:3606
#11 0x00000000103d2ae8 in eval_const_expressions (root=<optimized out>, node=<optimized out>) at clauses.c:2266
#12 0x00000000103b6264 in preprocess_expression (root=0x1000babee18, expr=<optimized out>, kind=1) at planner.c:1087
#13 0x00000000103b496c in subquery_planner (glob=<optimized out>, parse=<optimized out>, parent_root=<optimized out>, hasRecursion=<optimized out>, tuple_fraction=<optimized out>) at planner.c:769
#14 0x00000000103b3c58 in standard_planner (parse=<optimized out>, cursorOptions=<optimized out>, boundParams=<optimized out>) at planner.c:406
#15 0x00000000103b3a68 in planner (parse=<optimized out>, cursorOptions=<optimized out>, boundParams=<optimized out>) at planner.c:275
#16 0x00000000104cc2cc in pg_plan_query (querytree=0x1000babe7f8, cursorOptions=256, boundParams=0x0) at postgres.c:878
#17 0x00000000102ef850 in init_execution_state (lazyEvalOK=<optimized out>, queryTree_list=<optimized out>, fcache=<optimized out>) at functions.c:507
#18 init_sql_fcache (finfo=<optimized out>, collation=<optimized out>, lazyEvalOK=<optimized out>) at functions.c:770
#19 fmgr_sql (fcinfo=<optimized out>) at functions.c:1053
#20 0x00000000102cef24 in ExecInterpExpr (state=<optimized out>, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
#21 0x00000000102cddb8 in ExecInterpExprStillValid (state=0x1000bab41e8, econtext=0x1000bab3ed8, isNull=<optimized out>) at execExprInterp.c:1769
#22 0x0000000010314f10 in ExecEvalExprSwitchContext (state=0x1000bab41e8, econtext=0x1000bab3ed8, isNull=<optimized out>) at ../../../src/include/executor/executor.h:307
#23 ExecProject (projInfo=0x1000bab41e0) at ../../../src/include/executor/executor.h:341
#24 ExecResult (pstate=<optimized out>) at nodeResult.c:136
#25 0x00000000102e319c in ExecProcNodeFirst (node=0x1000bab3dc0) at execProcnode.c:445
#26 0x00000000102d9c94 in ExecProcNode (node=<optimized out>) at ../../../src/include/executor/executor.h:239
#27 ExecutePlan (estate=<optimized out>, planstate=<optimized out>, use_parallel_mode=false, operation=<optimized out>, numberTuples=<optimized out>, direction=<optimized out>, dest=<optimized out>, execute_once=<optimized out>, sendTuples=<optimized out>) at execMain.c:1648
#28 standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:365
#29 0x00000000102d9ac8 in ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:309
#30 0x00000000102efe84 in postquel_getnext (es=<optimized out>, fcache=<optimized out>) at functions.c:867
#31 fmgr_sql (fcinfo=<optimized out>) at functions.c:1153
#32 0x00000000102cef24 in ExecInterpExpr (state=<optimized out>, econtext=<optimized out>, isnull=<optimized out>) at execExprInterp.c:625
#33 0x00000000102cddb8 in ExecInterpExprStillValid (state=0x1000baa8158, econtext=0x1000baa7e48, isNull=<optimized out>) at execExprInterp.c:1769
#34 0x0000000010314f10 in ExecEvalExprSwitchContext (state=0x1000baa8158, econtext=0x1000baa7e48, isNull=<optimized out>) at ../../../src/include/executor/executor.h:307
#35 ExecProject (projInfo=0x1000baa8150) at ../../../src/include/executor/executor.h:341
#36 ExecResult (pstate=<optimized out>) at nodeResult.c:136
#37 0x00000000102e319c in ExecProcNodeFirst (node=0x1000baa7d30) at execProcnode.c:445

... lots and lots of repetitions ...

#11809 0x00000000102e319c in ExecProcNodeFirst (node=0x10008c01e90) at execProcnode.c:445
#11810 0x00000000102d9c94 in ExecProcNode (node=<optimized out>) at ../../../src/include/executor/executor.h:239
#11811 ExecutePlan (estate=<optimized out>, planstate=<optimized out>, use_parallel_mode=false, operation=<optimized out>, numberTuples=<optimized out>, direction=<optimized out>, dest=<optimized out>, execute_once=<optimized out>, sendTuples=<optimized out>) at execMain.c:1648
#11812 standard_ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:365
#11813 0x00000000102d9ac8 in ExecutorRun (queryDesc=<optimized out>, direction=<optimized out>, count=<optimized out>, execute_once=<optimized out>) at execMain.c:309
#11814 0x00000000104d39ec in PortalRunSelect (portal=0x10008be9de8, forward=<optimized out>, count=0, dest=<optimized out>) at pquery.c:929
#11815 0x00000000104d34c0 in PortalRun (portal=0x10008be9de8, count=<optimized out>, isTopLevel=<optimized out>, run_once=<optimized out>, dest=<optimized out>, altdest=<optimized out>, completionTag=0x3fffe3ecd6c0 "") at pquery.c:770
#11816 0x00000000104d1bc4 in exec_simple_query (query_string=<optimized out>) at postgres.c:1215
#11817 0x00000000104ced50 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:4249
#11818 0x00000000104110fc in BackendRun (port=<optimized out>) at postmaster.c:4430
#11819 BackendStartup (port=<optimized out>) at postmaster.c:4121
#11820 ServerLoop () at postmaster.c:1704
#11821 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at postmaster.c:1377
#11822 0x000000001034def4 in main (argc=8, argv=0x10008b7efb0) at main.c:228


So that lets out any theory that somehow we're getting into a weird
control path that misses calling check_stack_depth;
expression_tree_mutator does so for one, and it was called just nine
stack frames down from the crash.
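
For anyone following along, the guard being discussed looks roughly like
the sketch below.  This is only a simplified illustration of the pattern
(the real code is in tcop/postgres.c and is more careful about the stack
base reference point and growth direction), not the actual source:

/* Simplified sketch of backend stack-depth checking, illustration only. */
#include <stdio.h>
#include <stdlib.h>

static char *stack_base_ptr;           /* captured near process start */
static long max_stack_depth_bytes = 2 * 1024 * 1024;   /* default 2MB GUC */

static void
check_stack_depth(void)
{
    char        stack_top_loc;         /* address of a local ~= current depth */
    long        depth = stack_base_ptr - &stack_top_loc;

    if (depth < 0)
        depth = -depth;                /* tolerate either growth direction */
    if (depth > max_stack_depth_bytes)
    {
        /* PostgreSQL raises ERROR "stack depth limit exceeded" here,
         * which is what the infinite_recurse test expects to see. */
        fprintf(stderr, "stack depth limit exceeded\n");
        exit(1);
    }
}

static int
recurse(int n)
{
    volatile char pad[1024];           /* force real stack consumption */

    pad[0] = 0;
    check_stack_depth();
    return recurse(n + 1) + pad[0];
}

int
main(void)
{
    char        base;

    stack_base_ptr = &base;
    return recurse(0);                 /* errors out long before a SIGSEGV */
}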

Casting about for possible explanations, I notice that the failure
seems to have occurred at a nesting depth of 982 SQL-function calls
((11809 - 25)/12).  I'd previously scraped the buildfarm database
to find out what nesting depths we normally get to before erroring
out, by counting the number of "infinite_recurse" context messages
in the postmaster log.  Here's the entry for shoveler:

 shoveler      | 2019-05-03 14:19:19 |  1674

So this failed at substantially less stack depth than it's successfully
consumed in other runs, and *very* substantially less stack than ought
to be there, considering we pay attention to getrlimit in setting
max_stack_depth and add quite a generous amount of slop too.

I am wondering if, somehow, the stack depth limit seen by the postmaster
sometimes doesn't apply to its children.  That would be pretty wacko
kernel behavior, especially if it's only intermittently true.
But we're running out of other explanations.
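
One cheap way to test that theory would be to make the child report what
limit it actually sees, along these lines (plain POSIX getrlimit(), just a
throwaway standalone diagnostic, not anything in the tree):

/* Hypothetical diagnostic: print the stack rlimit as this process sees it. */
#include <stdio.h>
#include <sys/resource.h>

int
main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0)
    {
        perror("getrlimit(RLIMIT_STACK)");
        return 1;
    }
    /* RLIM_INFINITY prints as a huge number; that's fine for eyeballing. */
    printf("RLIMIT_STACK: cur=%llu max=%llu\n",
           (unsigned long long) rl.rlim_cur,
           (unsigned long long) rl.rlim_max);
    return 0;
}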

            regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

From: Andres Freund
Date:
Hi,

On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
> Core was generated by `postgres: debian regression [local] SELECT                                     '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> 2748    malloc.c: No such file or directory.
> #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> #1  0x00003fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
> #2  0x00003fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
> #3  0x00000000106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
> #4  0x00000000103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
> #5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
> #6  0x00000000103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
> #7  0x00000000103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893


> So that lets out any theory that somehow we're getting into a weird
> control path that misses calling check_stack_depth;
> expression_tree_mutator does so for one, and it was called just nine
> stack frames down from the crash.

Right. There's plenty places checking it...


> I am wondering if, somehow, the stack depth limit seen by the postmaster
> sometimes doesn't apply to its children.  That would be pretty wacko
> kernel behavior, especially if it's only intermittently true.
> But we're running out of other explanations.

I wonder if this is a SIGSEGV that actually signals an OOM
situation. Linux, if it can't actually extend the stack on-demand due to
OOM, sends a SIGSEGV.  The signal has that information, but
unfortunately the buildfarm code doesn't print it.  p $_siginfo would
show us some of that...
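
To spell out what that would tell us: the siginfo_t delivered with a
SIGSEGV carries si_code (SEGV_MAPERR for "address not mapped",
SEGV_ACCERR for "access not permitted") and si_addr (the faulting
address), which is exactly what gdb's $_siginfo exposes.  A minimal
standalone illustration, plain POSIX and nothing buildfarm-specific
(note that catching a real stack-overflow SIGSEGV would additionally
require sigaltstack()):

/* Illustration only: inspect the siginfo fields a SIGSEGV carries. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void
segv_handler(int signo, siginfo_t *si, void *ucontext)
{
    (void) ucontext;
    /* fprintf isn't async-signal-safe, but it's fine for a demo. */
    fprintf(stderr, "signal %d, si_code=%d (%s), si_addr=%p\n",
            signo, si->si_code,
            si->si_code == SEGV_MAPERR ? "address not mapped" :
            si->si_code == SEGV_ACCERR ? "access not permitted" : "other",
            si->si_addr);
    _exit(1);
}

int
main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    *(volatile int *) NULL = 0;        /* provoke a SEGV_MAPERR on purpose */
    return 0;
}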

Mark, how tight is the memory on that machine? Does dmesg have any other
information (often segfaults are logged by the kernel with the code
IIRC).

Greetings,

Andres Freund



Re: Why is infinite_recurse test suddenly failing?

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
>> I am wondering if, somehow, the stack depth limit seen by the postmaster
>> sometimes doesn't apply to its children.  That would be pretty wacko
>> kernel behavior, especially if it's only intermittently true.
>> But we're running out of other explanations.

> I wonder if this is a SIGSEGV that actually signals an OOM
> situation. Linux, if it can't actually extend the stack on-demand due to
> OOM, sends a SIGSEGV.  The signal has that information, but
> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
> show us some of that...

> Mark, how tight is the memory on that machine? Does dmesg have any other
> information (often segfaults are logged by the kernel with the code
> IIRC).

It does sort of smell like a resource exhaustion problem, especially
if all these buildfarm animals are VMs running on the same underlying
platform.  But why would that manifest as "you can't have a measly two
megabytes of stack" and not as any other sort of OOM symptom?

Mark, if you don't mind modding your local copies of the buildfarm
script, I think what Andres is asking for is a pretty trivial addition
in PGBuild/Utils.pm's sub get_stack_trace:

    my $cmdfile = "./gdbcmd";
    my $handle;
    open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
    print $handle "bt\n";
+    print $handle "p $_siginfo\n";
    close($handle);

            regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

From: Andrew Dunstan
Date:
On 5/10/19 3:35 PM, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
>>> I am wondering if, somehow, the stack depth limit seen by the postmaster
>>> sometimes doesn't apply to its children.  That would be pretty wacko
>>> kernel behavior, especially if it's only intermittently true.
>>> But we're running out of other explanations.
>> I wonder if this is a SIGSEGV that actually signals an OOM
>> situation. Linux, if it can't actually extend the stack on-demand due to
>> OOM, sends a SIGSEGV.  The signal has that information, but
>> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
>> show us some of that...
>> Mark, how tight is the memory on that machine? Does dmesg have any other
>> information (often segfaults are logged by the kernel with the code
>> IIRC).
> It does sort of smell like a resource exhaustion problem, especially
> if all these buildfarm animals are VMs running on the same underlying
> platform.  But why would that manifest as "you can't have a measly two
> megabytes of stack" and not as any other sort of OOM symptom?
>
> Mark, if you don't mind modding your local copies of the buildfarm
> script, I think what Andres is asking for is a pretty trivial addition
> in PGBuild/Utils.pm's sub get_stack_trace:
>
>     my $cmdfile = "./gdbcmd";
>     my $handle;
>     open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
>     print $handle "bt\n";
> +    print $handle "p $_siginfo\n";
>     close($handle);
>
>             


I think we'll need to write that as:


    print $handle 'p $_siginfo',"\n";


As you have it written perl will try to interpolate a variable called
$_siginfo.


cheers


andrew







Re: Why is infinite_recurse test suddenly failing?

From: Michael Paquier
Date:
On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote:
> I think we'll need to write that as:
>
>     print $handle 'p $_siginfo',"\n";
>
> As you have it written perl will try to interpolate a variable called
> $_siginfo.

Anything in double quotes with a dollar sign would be interpreted as a
variable, and single quotes make that safe.
--
Michael

Attachments

Re: Why is infinite_recurse test suddenly failing?

From: Andrew Dunstan
Date:
On 5/12/19 4:57 AM, Michael Paquier wrote:
> On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote:
>> I think we'll need to write that as:
>>
>>     print $handle 'p $_siginfo',"\n";
>>
>> As you have it written perl will try to interpolate a variable called
>> $_siginfo.
> Anything in double quotes with a dollar sign would be interpreted as a
> variable, and single quotes make that safe.


Yes, that's why I did it that way.


cheers


andrew




Re: Why is infinite_recurse test suddenly failing?

From: Mark Wong
Date:
On Fri, May 03, 2019 at 11:45:33AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-05-03 10:08:59 -0700, Mark Wong wrote:
> > Ok, I think I have gdb installed now...
> 
> Thanks! Any chance you could turn on force_parallel_mode for the other
> branches it applies to too? Makes it easier to figure out whether
> breakage is related to that, or independent.

Slowly catching up on my collection of ppc64le animals...

I still need to upgrade the build farm client (v8) on:
* dhole
* vulpes
* wobbegong
* cuon
* batfish
* devario
* cardinalfish

The following I've enabled force_parallel_mode for HEAD, 11, 10, and
9.6:

* buri
* urocryon
* ayu
* shoveler
* chimaera
* bonito
* takin
* bufflehead
* elasmobranch
* demoiselle
* cavefish

Regards,
Mark



Re: Why is infinite_recurse test suddenly failing?

From: Tom Lane
Date:
Mark Wong <mark@2ndQuadrant.com> writes:
> The following I've enabled force_parallel_mode for HEAD, 11, 10, and
> 9.6:

Thanks Mark!

In theory, the stack trace we now have from shoveler proves that parallel
mode has nothing to do with this, because the crash happens during
planning (specifically, inlining SQL functions).  I wonder though if
it's possible that previous force_parallel_mode queries have had some
undesirable effect on the process's stack allocation.  Just grasping
at straws, because it's sure not clear what's going wrong.

            regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

From: Tom Lane
Date:
Mark Wong <mark@2ndQuadrant.com> writes:
> Slowly catching up on my collection of ppc64le animals...

Oh, btw ... you didn't answer my question from before: are these animals
all running on a common platform (and if so, what is that), or are they
really different hardware?  If they're all VMs on one machine then it
might be that there's some common-mode effect from the underlying system.

(Another thing I notice, now, is that these are all Linux variants;
I'd been thinking you had some BSDen in there too, but now I see
that none of those are ppc64.  Hm.)

            regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

From: Mark Wong
Date:
On Fri, May 10, 2019 at 11:27:07AM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
> > Core was generated by `postgres: debian regression [local] SELECT                                     '.
> > Program terminated with signal SIGSEGV, Segmentation fault.
> > #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> > 2748    malloc.c: No such file or directory.
> > #0  sysmalloc (nb=8208, av=0x3fff916e0d28 <main_arena>) at malloc.c:2748
> > #1  0x00003fff915bedc8 in _int_malloc (av=0x3fff916e0d28 <main_arena>, bytes=8192) at malloc.c:3865
> > #2  0x00003fff915c1064 in __GI___libc_malloc (bytes=8192) at malloc.c:2928
> > #3  0x00000000106acfd8 in AllocSetContextCreateInternal (parent=0x1000babdad0, name=0x1085508c "inline_function", minContextSize=<optimized out>, initBlockSize=<optimized out>, maxBlockSize=8388608) at aset.c:477
> > #4  0x00000000103d5e00 in inline_function (funcid=20170, result_type=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, funcvariadic=<optimized out>, func_tuple=<optimized out>, context=0x3fffe3da15d0, args=<optimized out>) at clauses.c:4459
> > #5  simplify_function (funcid=<optimized out>, result_type=<optimized out>, result_typmod=<optimized out>, result_collid=<optimized out>, input_collid=<optimized out>, args_p=<optimized out>, funcvariadic=<optimized out>, process_args=<optimized out>, allow_non_const=<optimized out>, context=<optimized out>) at clauses.c:4040
> > #6  0x00000000103d2e74 in eval_const_expressions_mutator (node=0x1000babe968, context=0x3fffe3da15d0) at clauses.c:2474
> > #7  0x00000000103511bc in expression_tree_mutator (node=<optimized out>, mutator=0x103d2b10 <eval_const_expressions_mutator>, context=0x3fffe3da15d0) at nodeFuncs.c:2893
> 
> 
> > So that lets out any theory that somehow we're getting into a weird
> > control path that misses calling check_stack_depth;
> > expression_tree_mutator does so for one, and it was called just nine
> > stack frames down from the crash.
> 
> Right. There's plenty places checking it...
> 
> 
> > I am wondering if, somehow, the stack depth limit seen by the postmaster
> > sometimes doesn't apply to its children.  That would be pretty wacko
> > kernel behavior, especially if it's only intermittently true.
> > But we're running out of other explanations.
> 
> I wonder if this is a SIGSEGV that actually signals an OOM
> situation. Linux, if it can't actually extend the stack on-demand due to
> OOM, sends a SIGSEGV.  The signal has that information, but
> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
> show us some of that...
> 
> Mark, how tight is the memory on that machine?

There's about 2GB allocated:

debian@postgresql-debian:~$ cat /proc/meminfo
MemTotal:        2080704 kB
MemFree:         1344768 kB
MemAvailable:    1824192 kB


At the moment it looks like plenty. :)  Maybe I should set something up
to monitor these things.

> Does dmesg have any other
> information (often segfaults are logged by the kernel with the code
> IIRC).

It's been up for about 49 days:

debian@postgresql-debian:~$ uptime
 14:54:30 up 49 days, 14:59,  3 users,  load average: 0.00, 0.34, 1.04


I see one line from dmesg that is related to postgres:

[3939350.616849] postgres[17057]: bad frame in setup_rt_frame: 00003fffe3d9fe00 nip 00003fff915bdba0 lr 00003fff915bde9c


But only that one time in 49 days up.  Otherwise I see a half dozen
hung_task_timeout_secs messages around jbd2 and dhclient.

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: Why is infinite_recurse test suddenly failing?

From: Mark Wong
Date:
On Tue, May 14, 2019 at 10:52:07AM -0400, Tom Lane wrote:
> Mark Wong <mark@2ndQuadrant.com> writes:
> > Slowly catching up on my collection of ppc64le animals...
> 
> Oh, btw ... you didn't answer my question from before: are these animals
> all running on a common platform (and if so, what is that), or are they
> really different hardware?  If they're all VMs on one machine then it
> might be that there's some common-mode effect from the underlying system.

Sorry, I was almost there. :)

These systems are provisioned with OpenStack.  Additionally, a couple
more (cardinalfish, devario) are using docker under that.

> (Another thing I notice, now, is that these are all Linux variants;
> I'd been thinking you had some BSDen in there too, but now I see
> that none of those are ppc64.  Hm.)

Right, the BSDen I have are on different hardware.

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: Why is infinite_recurse test suddenly failing?

From: Mark Wong
Date:
On Fri, May 10, 2019 at 05:26:43PM -0400, Andrew Dunstan wrote:
> 
> On 5/10/19 3:35 PM, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> >> On 2019-05-10 11:38:57 -0400, Tom Lane wrote:
> >>> I am wondering if, somehow, the stack depth limit seen by the postmaster
> >>> sometimes doesn't apply to its children.  That would be pretty wacko
> >>> kernel behavior, especially if it's only intermittently true.
> >>> But we're running out of other explanations.
> >> I wonder if this is a SIGSEGV that actually signals an OOM
> >> situation. Linux, if it can't actually extend the stack on-demand due to
> >> OOM, sends a SIGSEGV.  The signal has that information, but
> >> unfortunately the buildfarm code doesn't print it.  p $_siginfo would
> >> show us some of that...
> >> Mark, how tight is the memory on that machine? Does dmesg have any other
> >> information (often segfaults are logged by the kernel with the code
> >> IIRC).
> > It does sort of smell like a resource exhaustion problem, especially
> > if all these buildfarm animals are VMs running on the same underlying
> > platform.  But why would that manifest as "you can't have a measly two
> > megabytes of stack" and not as any other sort of OOM symptom?
> >
> > Mark, if you don't mind modding your local copies of the buildfarm
> > script, I think what Andres is asking for is a pretty trivial addition
> > in PGBuild/Utils.pm's sub get_stack_trace:
> >
> >     my $cmdfile = "./gdbcmd";
> >     my $handle;
> >     open($handle, '>', $cmdfile) || die "opening $cmdfile: $!";
> >     print $handle "bt\n";
> > +    print $handle "p $_siginfo\n";
> >     close($handle);
> >
> >             
> 
> 
> I think we'll need to write that as:
> 
> 
>     print $handle 'p $_siginfo',"\n";

Ok, I have this added to everyone now.

I think I also have caught up on this thread, but let me know if I
missed anything.

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: Why is infinite_recurse test suddenly failing?

From: Andres Freund
Date:
Hi,

On 2019-05-14 08:31:37 -0700, Mark Wong wrote:
> Ok, I have this added to everyone now.
> 
> I think I also have caught up on this thread, but let me know if I
> missed anything.

I notice https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-05-19%2014%3A22%3A23
failed recently, but unfortunately does not appear to have gdb
installed? Or the buildfarm version is too old? Or ulimits are set
strictly on a system wide basis?

Greetings,

Andres Freund



Re: Why is infinite_recurse test suddenly failing?

From: Mark Wong
Date:
On Sun, May 19, 2019 at 02:38:26PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-05-14 08:31:37 -0700, Mark Wong wrote:
> > Ok, I have this added to everyone now.
> > 
> > I think I also have caught up on this thread, but let me know if I
> > missed anything.
> 
> I notice https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-05-19%2014%3A22%3A23
> failed recently, but unfortunately does not appear to have gdb
> installed? Or the buildfarm version is too old? Or ulimits are set
> strictly on a system wide basis?

It looks like I did have gdb on there:

opensuse@postgresql-opensuse-p9:~> gdb --version
GNU gdb (GDB; openSUSE Leap 15.0) 8.1


I'm on v9 of the build-farm here (have it on my todo list to get
everything up to 10.)


I hope nothing is overriding my core size ulimit:

opensuse@postgresql-opensuse-p9:~> ulimit -c
unlimited


This animal is using clang.  I wonder if gdb is disagreeing with the
clang binaries?

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: Why is infinite_recurse test suddenly failing?

From: Mark Wong
Date:
On Mon, May 20, 2019 at 12:15:49PM -0700, Mark Wong wrote:
> On Sun, May 19, 2019 at 02:38:26PM -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2019-05-14 08:31:37 -0700, Mark Wong wrote:
> > > Ok, I have this added to everyone now.
> > > 
> > > I think I also have caught up on this thread, but let me know if I
> > > missed anything.
> > 
> > I notice https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=demoiselle&dt=2019-05-19%2014%3A22%3A23
> > failed recently, but unfortunately does not appear to have gdb
> > installed? Or the buildfarm version is too old? Or ulimits are set
> > strictly on a system wide basis?
> 
> I'm on v9 of the build-farm here (have it on my todo list to get
> everything up to 10.)


Andrew let me know I need to get on v10.  I've upgraded demoiselle, and
am trying to work through the rest now...

Regards,
Mark

-- 
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



Re: Why is infinite_recurse test suddenly failing?

From: Thomas Munro
Date:
On Tue, May 21, 2019 at 9:22 AM Mark Wong <mark@2ndquadrant.com> wrote:
> Andrew let me know I need to get on v10.  I've upgraded demoiselle, and
> am trying to work through the rest now...

Here's another crash like that.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=cavefish&dt=2019-07-13%2003%3A49%3A38

2019-07-13 04:01:23.437 UTC [9365:70] LOG:  server process (PID 12951)
was terminated by signal 11: Segmentation fault
2019-07-13 04:01:23.437 UTC [9365:71] DETAIL:  Failed process was
running: select infinite_recurse();

-- 
Thomas Munro
https://enterprisedb.com



Re: Why is infinite_recurse test suddenly failing?

From: Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Here's another crash like that.

> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=cavefish&dt=2019-07-13%2003%3A49%3A38
> 2019-07-13 04:01:23.437 UTC [9365:70] LOG:  server process (PID 12951)
> was terminated by signal 11: Segmentation fault
> 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL:  Failed process was
> running: select infinite_recurse();

It occurred to me to scrape the buildfarm database for these failures,
and what I got was

   sysname    |    branch     |      snapshot       |      stage      |                                                data                                                 |   architecture
--------------+---------------+---------------------+-----------------+------------------------------------------------------------------------------------------------------+------------------
 demoiselle   | HEAD          | 2019-04-27 14:55:52 | pg_upgradeCheck | 2019-04-27 15:00:42.736 UTC [1457:66] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 buri         | HEAD          | 2019-04-27 23:54:46 | Check           | 2019-04-28 00:01:49.794 UTC [3041:66] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 takin        | HEAD          | 2019-05-01 08:16:48 | pg_upgradeCheck | 2019-05-01 08:23:27.159 UTC [32303:59] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le
 bonito       | HEAD          | 2019-05-01 23:05:36 | Check           | 2019-05-01 23:11:00.145 UTC [13933:66] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 shoveler     | HEAD          | 2019-05-10 14:04:34 | Check           | 2019-05-10 14:11:26.833 UTC [13456:73] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER8)
 demoiselle   | HEAD          | 2019-05-19 14:22:23 | pg_upgradeCheck | 2019-05-19 14:26:17.002 UTC [23275:80] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 vulpes       | HEAD          | 2019-06-15 09:16:45 | pg_upgradeCheck | 2019-06-15 09:22:22.268 UTC [4885:77] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le
 ayu          | HEAD          | 2019-06-19 22:13:23 | pg_upgradeCheck | 2019-06-19 22:18:16.805 UTC [2708:71] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER8)
 quokka       | HEAD          | 2019-07-10 14:20:13 | pg_upgradeCheck | 2019-07-10 15:24:06.102 BST [5d25f4fb.2644:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64
 cavefish     | HEAD          | 2019-07-13 03:49:38 | pg_upgradeCheck | 2019-07-13 04:01:23.437 UTC [9365:71] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 pintail      | REL_12_STABLE | 2019-07-13 19:36:51 | Check           | 2019-07-13 19:39:29.013 UTC [31086:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 bonito       | HEAD          | 2019-07-19 23:13:01 | Check           | 2019-07-19 23:16:33.330 UTC [24191:70] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 takin        | HEAD          | 2019-07-24 08:24:56 | Check           | 2019-07-24 08:28:01.735 UTC [16366:75] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le
 quokka       | HEAD          | 2019-07-31 02:00:07 | pg_upgradeCheck | 2019-07-31 03:04:04.043 BST [5d40f709.776a:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64
 elasmobranch | HEAD          | 2019-08-01 03:13:38 | Check           | 2019-08-01 03:19:05.394 UTC [22888:62] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 buri         | HEAD          | 2019-08-02 00:10:23 | Check           | 2019-08-02 00:17:11.075 UTC [28222:73] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 urocryon     | HEAD          | 2019-08-02 05:43:46 | Check           | 2019-08-02 05:51:51.944 UTC [2724:64] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le
 batfish      | HEAD          | 2019-08-04 19:02:36 | pg_upgradeCheck | 2019-08-04 19:08:11.728 UTC [23899:79] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le
 buri         | REL_12_STABLE | 2019-08-07 00:03:29 | pg_upgradeCheck | 2019-08-07 00:11:24.500 UTC [1405:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
 quokka       | REL_12_STABLE | 2019-08-08 02:43:45 | pg_upgradeCheck | 2019-08-08 03:47:38.115 BST [5d4b8d3f.cdd7:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64
 quokka       | HEAD          | 2019-08-08 14:00:08 | Check           | 2019-08-08 15:02:59.770 BST [5d4c2b88.cad9:5] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64
 mereswine    | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C  | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL:  Failed process was running: select infinite_recurse(); | ARMv7
 takin        | HEAD          | 2019-08-11 08:02:48 | Check           | 2019-08-11 08:05:57.789 UTC [11500:67] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le
 mereswine    | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck | 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL:  Failed process was running: select infinite_recurse(); | ARMv7
 mereswine    | HEAD          | 2019-08-11 11:29:27 | pg_upgradeCheck | 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL:  Failed process was running: select infinite_recurse(); | ARMv7
 demoiselle   | HEAD          | 2019-08-11 14:51:38 | pg_upgradeCheck | 2019-08-11 14:57:29.422 UTC [9436:70] DETAIL:  Failed process was running: select infinite_recurse(); | ppc64le (POWER9)
(26 rows)

This is from a scan going back 9 months (to mid-December), so the lack of
any matches before late April is pretty notable: it seems highly probable
that some change we made during April is related.

A cursory scan of commits during April finds only one that seems
conceivably related (though perhaps I just lack enough imagination):

Author: Tom Lane <tgl@sss.pgh.pa.us>
Branch: master Release: REL_12_BR [798070ec0] 2019-04-11 18:16:50 -0400

    Re-order some regression test scripts for more parallelism.

    Move the strings, numerology, insert, insert_conflict, select and
    errors tests to be parts of nearby parallel groups, instead of
    executing by themselves.

So that leads to the thought that "the infinite_recurse test is fine
if it runs by itself, but it tends to fall over if there are
concurrently-running backends".  I have absolutely no idea how that
would happen on anything that passes for a platform built in this
century.  Still, it's a place to start, which we hadn't before.

Also notable is that we now have a couple of hits on ARM, not
only ppc64.  Don't know what to make of that.

            regards, tom lane



Re: Why is infinite_recurse test suddenly failing?

From: Thomas Munro
Date:
On Thu, Aug 15, 2019 at 5:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So that leads to the thought that "the infinite_recurse test is fine
> if it runs by itself, but it tends to fall over if there are
> concurrently-running backends".  I have absolutely no idea how that
> would happen on anything that passes for a platform built in this
> century.  Still, it's a place to start, which we hadn't before.

Hmm.  mereswine's recent failure on REL_11_STABLE was running the
serial schedule.

I read about 3 ways to get SEGV from stack-related faults: you can
exceed RLIMIT_STACK (the total mapping size) and then you'll get SEGV
(per man pages), you can access a page that is inside the mapping but
is beyond the stack pointer (with some tolerance, exact details vary
by arch), and you can fail to allocate a page due to low memory.

The first kind of failure doesn't seem right -- we carefully set
max_stack_depth based on RLIMIT_STACK minus some slop, so that theory
would require child processes to have different stack limits than the
postmaster as you said (perhaps OpenStack, Docker, related tooling or
concurrent activity on the host system is capable of changing it?), or
a bug in our slop logic.  The second kind of failure would imply that
we have a bug -- we're accessing something below the stack pointer --
but that doesn't seem right either -- I think various address
sanitising tools would have told us about that, and it's hard to
believe there is a bug in the powerpc and arm implementation of the
stack pointer check (see Linux arch/{powerpc,arm}/mm/fault.c).  That
leaves the third explanation, except then I'd expect to see other
kinds of problems like OOM etc before you get to that stage, and why
just here?  Confused.
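
For reference, the slop logic amounts to something like the sketch below;
the real code lives in PostgreSQL's tcop/postgres.c and the GUC machinery,
and the 512kB slop figure here is from memory, so treat the details as
approximate:

/* Approximate sketch of how the usable stack budget relates to RLIMIT_STACK. */
#include <stdio.h>
#include <sys/resource.h>

#define STACK_DEPTH_SLOP (512 * 1024L)  /* safety margin kept below the rlimit */

int
main(void)
{
    struct rlimit rl;
    long        max_depth_kb;

    if (getrlimit(RLIMIT_STACK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
        max_depth_kb = -1;              /* unknown/unlimited: fall back to default */
    else
        max_depth_kb = ((long) rl.rlim_cur - STACK_DEPTH_SLOP) / 1024;

    /*
     * max_stack_depth (default 2048kB) is only accepted if it fits under this
     * ceiling, so a segfault at roughly 2MB of consumed stack implies the
     * child's effective limit was far smaller than what the postmaster saw.
     */
    printf("largest acceptable max_stack_depth: %ld kB\n", max_depth_kb);
    return 0;
}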

> Also notable is that we now have a couple of hits on ARM, not
> only ppc64.  Don't know what to make of that.

Yeah, that is indeed interesting.

-- 
Thomas Munro
https://enterprisedb.com