Обсуждение: Re: BUG #15033: Segmentation fault running a query

Поиск
Список
Период
Сортировка

Re: BUG #15033: Segmentation fault running a query

От
Tom Lane
Дата:
=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:
> I have encountered a segmentation fault running a particular query.

Given the shape of the query and the fact that it's calling a recursive
function, I'm suspicious that this is caused by a stack overrun that we've
missed checking for.  If so, it might not reproduce for someone else,
unless they were running with the exact same physical stack limit (not
max_stack_depth, but the kernel limit) as you are, and a similar compiler.

It would likely be easier all round if you could get a stack trace
from the crash as you observe it.  See
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

If my idea is right, the whole trace would be very long (probably
thousands of stack frames) but the first hundred or so ought to
be enough to diagnose.

BTW, seems like there's gotta be a more efficient way to solve your
problem than a 14000-way INTERSECT.

            regards, tom lane


Re: BUG #15033: Segmentation fault running a query

От
Andrew Grossman
Дата:
Oh, I have little doubt there's a better way to solve the actual problem--this actually resulted from some buggy code that passed a bunch of unexpected arguments to a query generating function. I would note that this did reproduce for me on a different system with some different specs.

I will follow-up with a stack trace. In the meantime, a more accessible version of the reproduction script has been placed at the following address:



On Sat, Jan 27, 2018 at 3:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
PG Bug reporting form <noreply@postgresql.org> writes:
> I have encountered a segmentation fault running a particular query.

Given the shape of the query and the fact that it's calling a recursive
function, I'm suspicious that this is caused by a stack overrun that we've
missed checking for.  If so, it might not reproduce for someone else,
unless they were running with the exact same physical stack limit (not
max_stack_depth, but the kernel limit) as you are, and a similar compiler.

It would likely be easier all round if you could get a stack trace
from the crash as you observe it.  See
https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend

If my idea is right, the whole trace would be very long (probably
thousands of stack frames) but the first hundred or so ought to
be enough to diagnose.

BTW, seems like there's gotta be a more efficient way to solve your
problem than a 14000-way INTERSECT.

                        regards, tom lane

Re: BUG #15033: Segmentation fault running a query

От
Tom Lane
Дата:
Andrew Grossman <agrossman@gmail.com> writes:
> I will follow-up with a stack trace. In the meantime, a more accessible
> version of the reproduction script has been placed at the following address:
> https://fleetinventory.com/media/static/pg10crash.sql

Thanks.  I pulled that down, and while it didn't reproduce for me
immediately, some fooling with the postmaster's "ulimit -s" setting
eventually produced a crash in the recursion in create_plan_recurse(),
which indeed lacks a check_stack_depth call and should have one.
But I wonder whether that's the identical crash site you're seeing.
This query is going to result in deep recursion in quite a few places,
and there might be other ones that need protection.  The amount of
stack consumed per recursion level could vary across machines, so
that the deepest stack growth might not be at the same place for
everybody.  (I'm actually rather surprised to see create_plan_recurse
be the weakest link --- I'd have figured that earlier planner phases
would consume more stack per setop.)

            regards, tom lane


Re: BUG #15033: Segmentation fault running a query

От
Andrew Grossman
Дата:
It looks like the same crash:


Program received signal SIGSEGV, Segmentation fault.
0x0000000000838205 in AllocSetAlloc (context=0x17e05b8, size=48) at aset.c:563
563     {
#0  0x0000000000838205 in AllocSetAlloc (context=0x17e05b8, size=48) at aset.c:563
#1  0x000000000083df0c in MemoryContextAllocZeroAligned (context=<optimized out>, size=size@entry=48) at mcxt.c:791
#2  0x00000000006463f6 in makeTargetEntry (expr=0x7f5e37898b18, resno=resno@entry=1, resname=resname@entry=0x0, resjunk=resjunk@entry=0 '\000') at makefuncs.c:240
#3  0x0000000000671f18 in build_path_tlist (root=root@entry=0x7f5e3a5dcc70, path=0x7f5e37898cf8, path=0x7f5e37898cf8) at createplan.c:749
#4  0x0000000000672eb2 in create_append_plan (best_path=0x7f5e37898cf8, root=0x7f5e3a5dcc70) at createplan.c:1005
#5  create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e37898cf8, flags=flags@entry=4) at createplan.c:384
#6  0x0000000000673eb1 in create_setop_plan (flags=0, best_path=0x7f5e37898ec8, root=0x7f5e3a5dcc70) at createplan.c:2240
#7  create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e37898ec8, flags=flags@entry=0) at createplan.c:462
#8  0x0000000000674a81 in create_projection_plan (best_path=0x7f5e37899210, root=0x7f5e3a5dcc70) at createplan.c:1559
#9  create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e37899210, flags=flags@entry=1) at createplan.c:394
#10 0x0000000000672ee8 in create_append_plan (best_path=0x7f5e3789b070, root=0x7f5e3a5dcc70) at createplan.c:1040
#11 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3789b070, flags=flags@entry=4) at createplan.c:384
#12 0x0000000000673eb1 in create_setop_plan (flags=0, best_path=0x7f5e3789b240, root=0x7f5e3a5dcc70) at createplan.c:2240
#13 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3789b240, flags=flags@entry=0) at createplan.c:462
#14 0x0000000000674a81 in create_projection_plan (best_path=0x7f5e3789b588, root=0x7f5e3a5dcc70) at createplan.c:1559
#15 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3789b588, flags=flags@entry=1) at createplan.c:394
#16 0x0000000000672ee8 in create_append_plan (best_path=0x7f5e3789d3e8, root=0x7f5e3a5dcc70) at createplan.c:1040
#17 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3789d3e8, flags=flags@entry=4) at createplan.c:384
#18 0x0000000000673eb1 in create_setop_plan (flags=0, best_path=0x7f5e3789d5b8, root=0x7f5e3a5dcc70) at createplan.c:2240
#19 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3789d5b8, flags=flags@entry=0) at createplan.c:462
#20 0x0000000000674a81 in create_projection_plan (best_path=0x7f5e3789d900, root=0x7f5e3a5dcc70) at createplan.c:1559
....
#65425 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e31288480, flags=flags@entry=0) at createplan.c:462
#65426 0x0000000000674a81 in create_projection_plan (best_path=0x7f5e312887c8, root=0x7f5e3a5dcc70) at createplan.c:1559
#65427 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e312887c8, flags=flags@entry=1) at createplan.c:394
#65428 0x0000000000672ee8 in create_append_plan (best_path=0x7f5e3128ac28, root=0x7f5e3a5dcc70) at createplan.c:1040
#65429 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3128ac28, flags=flags@entry=4) at createplan.c:384
#65430 0x0000000000673eb1 in create_setop_plan (flags=0, best_path=0x7f5e3128adf8, root=0x7f5e3a5dcc70) at createplan.c:2240
#65431 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3128adf8, flags=flags@entry=0) at createplan.c:462
#65432 0x0000000000674a81 in create_projection_plan (best_path=0x7f5e3128b140, root=0x7f5e3a5dcc70) at createplan.c:1559
#65433 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3128b140, flags=flags@entry=1) at createplan.c:394
#65434 0x0000000000672ee8 in create_append_plan (best_path=0x7f5e3128d5a0, root=0x7f5e3a5dcc70) at createplan.c:1040
#65435 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3128d5a0, flags=flags@entry=5) at createplan.c:384
#65436 0x0000000000673eb1 in create_setop_plan (flags=1, best_path=0x7f5e3128d770, root=0x7f5e3a5dcc70) at createplan.c:2240
#65437 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3128d770, flags=flags@entry=1) at createplan.c:462
#65438 0x0000000000675a39 in create_plan (root=root@entry=0x7f5e3a5dcc70, best_path=<optimized out>) at createplan.c:315
#65439 0x00000000006801de in standard_planner (parse=0x7f5e415ee570, cursorOptions=256, boundParams=0x0) at planner.c:341
#65440 0x0000000000715d3e in pg_plan_query (querytree=querytree@entry=0x7f5e415ee570, cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:796
#65441 0x0000000000715dfe in pg_plan_queries (querytrees=<optimized out>, cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:862
#65442 0x00000000007162aa in exec_simple_query (query_string=0x7f5e5adac048 "SELECT zone_id FROM fleetzoneancestorswithself('81379') INTERSECT SELECT zone_id FROM fleetzoneancestorswithself('85078') INTERSECT SELECT zone_id FROM fleetzoneancestorswithself('141578') INTERSECT S"...) at postgres.c:1027
#65443 0x000000000071745c in PostgresMain (argc=<optimized out>, argv=argv@entry=0x180e640, dbname=0x180e508 "fleetinv", username=<optimized out>) at postgres.c:4088
#65444 0x000000000047ad1c in BackendRun (port=0x1807fc0) at postmaster.c:4357
#65445 BackendStartup (port=0x1807fc0) at postmaster.c:4029
#65446 ServerLoop () at postmaster.c:1753
#65447 0x00000000006aea2f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x17e0300) at postmaster.c:1361
#65448 0x000000000047bb4f in main (argc=3, argv=0x17e0300) at main.c:228


On Sat, Jan 27, 2018 at 4:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Grossman <agrossman@gmail.com> writes:
> I will follow-up with a stack trace. In the meantime, a more accessible
> version of the reproduction script has been placed at the following address:
> https://fleetinventory.com/media/static/pg10crash.sql

Thanks.  I pulled that down, and while it didn't reproduce for me
immediately, some fooling with the postmaster's "ulimit -s" setting
eventually produced a crash in the recursion in create_plan_recurse(),
which indeed lacks a check_stack_depth call and should have one.
But I wonder whether that's the identical crash site you're seeing.
This query is going to result in deep recursion in quite a few places,
and there might be other ones that need protection.  The amount of
stack consumed per recursion level could vary across machines, so
that the deepest stack growth might not be at the same place for
everybody.  (I'm actually rather surprised to see create_plan_recurse
be the weakest link --- I'd have figured that earlier planner phases
would consume more stack per setop.)

                        regards, tom lane

Re: BUG #15033: Segmentation fault running a query

От
Andrew Grossman
Дата:
At least to my eye.

On Sat, Jan 27, 2018 at 11:38 PM Andrew Grossman <agrossman@gmail.com> wrote:
It looks like the same crash:


Program received signal SIGSEGV, Segmentation fault.
0x0000000000838205 in AllocSetAlloc (context=0x17e05b8, size=48) at aset.c:563
563     {
#0  0x0000000000838205 in AllocSetAlloc (context=0x17e05b8, size=48) at aset.c:563
#1  0x000000000083df0c in MemoryContextAllocZeroAligned (context=<optimized out>, size=size@entry=48) at mcxt.c:791
#2  0x00000000006463f6 in makeTargetEntry (expr=0x7f5e37898b18, resno=resno@entry=1, resname=resname@entry=0x0, resjunk=resjunk@entry=0 '\000') at makefuncs.c:240
#3  0x0000000000671f18 in build_path_tlist (root=root@entry=0x7f5e3a5dcc70, path=0x7f5e37898cf8, path=0x7f5e37898cf8) at createplan.c:749
#4  0x0000000000672eb2 in create_append_plan (best_path=0x7f5e37898cf8, root=0x7f5e3a5dcc70) at createplan.c:1005
#5  create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e37898cf8, flags=flags@entry=4) at createplan.c:384
#6  0x0000000000673eb1 in create_setop_plan (flags=0, best_path=0x7f5e37898ec8, root=0x7f5e3a5dcc70) at createplan.c:2240
#7  create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e37898ec8, flags=flags@entry=0) at createplan.c:462
#8  0x0000000000674a81 in create_projection_plan (best_path=0x7f5e37899210, root=0x7f5e3a5dcc70) at createplan.c:1559
#9  create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e37899210, flags=flags@entry=1) at createplan.c:394
#10 0x0000000000672ee8 in create_append_plan (best_path=0x7f5e3789b070, root=0x7f5e3a5dcc70) at createplan.c:1040
#11 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3789b070, flags=flags@entry=4) at createplan.c:384
#12 0x0000000000673eb1 in create_setop_plan (flags=0, best_path=0x7f5e3789b240, root=0x7f5e3a5dcc70) at createplan.c:2240
#13 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3789b240, flags=flags@entry=0) at createplan.c:462
#14 0x0000000000674a81 in create_projection_plan (best_path=0x7f5e3789b588, root=0x7f5e3a5dcc70) at createplan.c:1559
#15 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3789b588, flags=flags@entry=1) at createplan.c:394
#16 0x0000000000672ee8 in create_append_plan (best_path=0x7f5e3789d3e8, root=0x7f5e3a5dcc70) at createplan.c:1040
#17 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3789d3e8, flags=flags@entry=4) at createplan.c:384
#18 0x0000000000673eb1 in create_setop_plan (flags=0, best_path=0x7f5e3789d5b8, root=0x7f5e3a5dcc70) at createplan.c:2240
#19 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3789d5b8, flags=flags@entry=0) at createplan.c:462
#20 0x0000000000674a81 in create_projection_plan (best_path=0x7f5e3789d900, root=0x7f5e3a5dcc70) at createplan.c:1559
....
#65425 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e31288480, flags=flags@entry=0) at createplan.c:462
#65426 0x0000000000674a81 in create_projection_plan (best_path=0x7f5e312887c8, root=0x7f5e3a5dcc70) at createplan.c:1559
#65427 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e312887c8, flags=flags@entry=1) at createplan.c:394
#65428 0x0000000000672ee8 in create_append_plan (best_path=0x7f5e3128ac28, root=0x7f5e3a5dcc70) at createplan.c:1040
#65429 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3128ac28, flags=flags@entry=4) at createplan.c:384
#65430 0x0000000000673eb1 in create_setop_plan (flags=0, best_path=0x7f5e3128adf8, root=0x7f5e3a5dcc70) at createplan.c:2240
#65431 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3128adf8, flags=flags@entry=0) at createplan.c:462
#65432 0x0000000000674a81 in create_projection_plan (best_path=0x7f5e3128b140, root=0x7f5e3a5dcc70) at createplan.c:1559
#65433 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3128b140, flags=flags@entry=1) at createplan.c:394
#65434 0x0000000000672ee8 in create_append_plan (best_path=0x7f5e3128d5a0, root=0x7f5e3a5dcc70) at createplan.c:1040
#65435 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3128d5a0, flags=flags@entry=5) at createplan.c:384
#65436 0x0000000000673eb1 in create_setop_plan (flags=1, best_path=0x7f5e3128d770, root=0x7f5e3a5dcc70) at createplan.c:2240
#65437 create_plan_recurse (root=root@entry=0x7f5e3a5dcc70, best_path=0x7f5e3128d770, flags=flags@entry=1) at createplan.c:462
#65438 0x0000000000675a39 in create_plan (root=root@entry=0x7f5e3a5dcc70, best_path=<optimized out>) at createplan.c:315
#65439 0x00000000006801de in standard_planner (parse=0x7f5e415ee570, cursorOptions=256, boundParams=0x0) at planner.c:341
#65440 0x0000000000715d3e in pg_plan_query (querytree=querytree@entry=0x7f5e415ee570, cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:796
#65441 0x0000000000715dfe in pg_plan_queries (querytrees=<optimized out>, cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0) at postgres.c:862
#65442 0x00000000007162aa in exec_simple_query (query_string=0x7f5e5adac048 "SELECT zone_id FROM fleetzoneancestorswithself('81379') INTERSECT SELECT zone_id FROM fleetzoneancestorswithself('85078') INTERSECT SELECT zone_id FROM fleetzoneancestorswithself('141578') INTERSECT S"...) at postgres.c:1027
#65443 0x000000000071745c in PostgresMain (argc=<optimized out>, argv=argv@entry=0x180e640, dbname=0x180e508 "fleetinv", username=<optimized out>) at postgres.c:4088
#65444 0x000000000047ad1c in BackendRun (port=0x1807fc0) at postmaster.c:4357
#65445 BackendStartup (port=0x1807fc0) at postmaster.c:4029
#65446 ServerLoop () at postmaster.c:1753
#65447 0x00000000006aea2f in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x17e0300) at postmaster.c:1361
#65448 0x000000000047bb4f in main (argc=3, argv=0x17e0300) at main.c:228


On Sat, Jan 27, 2018 at 4:57 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Grossman <agrossman@gmail.com> writes:
> I will follow-up with a stack trace. In the meantime, a more accessible
> version of the reproduction script has been placed at the following address:
> https://fleetinventory.com/media/static/pg10crash.sql

Thanks.  I pulled that down, and while it didn't reproduce for me
immediately, some fooling with the postmaster's "ulimit -s" setting
eventually produced a crash in the recursion in create_plan_recurse(),
which indeed lacks a check_stack_depth call and should have one.
But I wonder whether that's the identical crash site you're seeing.
This query is going to result in deep recursion in quite a few places,
and there might be other ones that need protection.  The amount of
stack consumed per recursion level could vary across machines, so
that the deepest stack growth might not be at the same place for
everybody.  (I'm actually rather surprised to see create_plan_recurse
be the weakest link --- I'd have figured that earlier planner phases
would consume more stack per setop.)

                        regards, tom lane

Re: BUG #15033: Segmentation fault running a query

От
Tom Lane
Дата:
Andrew Grossman <agrossman@gmail.com> writes:
> It looks like the same crash:

Yeah, I agree, create_plan_recurse is at fault there.  Will fix it,
thanks for confirming!

Of course, the "fix" will just result in producing a "stack overflowed"
error instead of dumping core.  There's still the question of why this
case worked for you before and doesn't now.  Seemingly, current code
needs more stack to process this query than 9.5 did.  Is it significantly
more, or were you just unlucky enough to overrun the limit when you'd
not quite done so before?  And if it is significantly more, is there
anything we can reasonably do about that?  These questions remain unclear
at this point.

            regards, tom lane


Re: BUG #15033: Segmentation fault running a query

От
Tom Lane
Дата:
I wrote:
> Andrew Grossman <agrossman@gmail.com> writes:
>> It looks like the same crash:

> Yeah, I agree, create_plan_recurse is at fault there.  Will fix it,
> thanks for confirming!
> Of course, the "fix" will just result in producing a "stack overflowed"
> error instead of dumping core.

I've pushed a fix for that, and also recurse_set_operations which proved
to also be capable of causing a crash at some stack sizes.

> There's still the question of why this
> case worked for you before and doesn't now.  Seemingly, current code
> needs more stack to process this query than 9.5 did.  Is it significantly
> more, or were you just unlucky enough to overrun the limit when you'd
> not quite done so before?  And if it is significantly more, is there
> anything we can reasonably do about that?  These questions remain unclear
> at this point.

It seems that the direct cause of the change is that we now generate a
Path representation of a setop tree, where we didn't before.  The Path
representation is much deeper than the setop tree --- about three
nested Path nodes per setop step --- and it's the deeper recursion
involved in processing that that causes the extra stack demand.
So I'm afraid there's little to be done about it.

On the bright side, I find that with sufficient stack, current code runs
this query in about 4 seconds where 9.5 took 95 seconds.  Seems to be
thanks to commit 25c539233 ("Improve performance in freeing memory
contexts").

            regards, tom lane