Discussion: BUG #13490: Segmentation fault on pg_stat_activity
The following bug has been logged on the website:

Bug reference:      13490
Logged by:          Michael Bommarito
Email address:      michael@bommaritollc.com
PostgreSQL version: 9.5alpha1
Operating system:   Ubuntu 14.04.2 LTS

Description:

Environment: up-to-date instance running Trusty and using the apt.postgresql.org postgresql-9.5-alpha1 packages. The database was upgraded using pg_dump/restore. All other application behavior is as expected.

Not sure if this is related to #13286 or #12918, but if so, it is still affecting the binaries in apt. This is pretty critical; we are flying blind without access to this view.

Crash example 1:
2015-07-07 16:42:45 UTC [26785-2] LOG:  server process (PID 11836) was terminated by signal 11: Segmentation fault
2015-07-07 16:42:45 UTC [26785-3] DETAIL:  Failed process was running: select * from pg_stat_activity;

Crash example 2:
https://github.com/ankane/pghero/issues/30
On Wed, Jul 8, 2015 at 1:52 AM, <michael@bommaritollc.com> wrote:
> Environment: Up-to-date instance running Trusty and using apt.postgresql.org
> postgresql-9.5-alpha1. Database was upgraded using pg_dump/restore. All
> other application behavior as expected.
>
> Not sure if this is related to #13286 or #12918, but if so, still affecting
> the binaries in apt. Pretty critical to fly blind without access to this.
>
> Crash example 1:
> 2015-07-07 16:42:45 UTC [26785-2] LOG: server process (PID 11836) was
> terminated by signal 11: Segmentation fault
> 2015-07-07 16:42:45 UTC [26785-3] DETAIL: Failed process was running:
> select * from pg_stat_activity ;
>
> Crash example 2:
> https://github.com/ankane/pghero/issues/30

If this link goes away we would lose the test case, so it is better to include it directly in the email:

SELECT pid, state, application_name AS source, age(now(), xact_start) AS duration,
       waiting, query, xact_start AS started_at
FROM pg_stat_activity
WHERE query <> '<insufficient privilege>' AND state <> 'idle'
  AND pid <> pg_backend_pid() AND now() - query_start > interval '60 seconds'
ORDER BY query_start DESC;

This seems unrelated to #13286, which refers to pg_terminate_backend() crashing; the query behind crash example 1 above is simpler. It may, however, be related to #12918. Could you provide at least a stack trace so we can see which code path is failing?
--
Michael
On Wed, Jul 8, 2015 at 10:22 AM, Michael Bommarito <michael@bommaritollc.com> wrote:
> I'll try to generate an isolated core dump and stack trace this weekend
> when the system is not in use.

That would be nice. I let pgbench -C run for an hour with "select * from pg_stat_activity" executing every second (\watch 1) in parallel, but I could not reproduce the issue on HEAD.
--
Michael
I'll try to generate an isolated core dump and stack trace this weekend when the system is not in use.

I also confirmed that the earlier patch 2897e069c1f4309a237a1fa4bf5d2fcd9c4b9868 for #12918 is applied in our version, so either the patch didn't handle all cases or it's unrelated.

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
Web: http://www.bommaritollc.com
Mobile: +1 (646) 450-3387

On Tue, Jul 7, 2015 at 9:17 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Jul 8, 2015 at 1:52 AM, <michael@bommaritollc.com> wrote:
> > Environment: Up-to-date instance running Trusty and using apt.postgresql.org
> > postgresql-9.5-alpha1. Database was upgraded using pg_dump/restore. All
> > other application behavior as expected.
> >
> > Not sure if this is related to #13286 or #12918, but if so, still affecting
> > the binaries in apt. Pretty critical to fly blind without access to this.
> >
> > Crash example 1:
> > 2015-07-07 16:42:45 UTC [26785-2] LOG: server process (PID 11836) was
> > terminated by signal 11: Segmentation fault
> > 2015-07-07 16:42:45 UTC [26785-3] DETAIL: Failed process was running:
> > select * from pg_stat_activity ;
> >
> > Crash example 2:
> > https://github.com/ankane/pghero/issues/30
>
> If this link gets away we would lost the test case, it is better to
> directly add it in the email:
> SELECT pid, state, application_name AS source, age(now(), xact_start)
> AS duration, waiting, query, xact_start AS started_at FROM
> pg_stat_activity WHERE query <> '<insufficient privilege>' AND state
> <> 'idle' AND pid <> pg_backend_pid() AND now() - query_start >
> interval '60 seconds' ORDER BY query_start DESC;
>
> This seems unrelated to #13286 which refers to pg_terminate_backend()
> crashing, and in the case 1 presented above the query used for the
> crash is more simple. Now this may be related to #12918, so could it
> be possible to see at least a stack trace to see what is the code path
> failing?
> --
> Michael
On Sat, Jul 11, 2015 at 10:07 PM, Michael Bommarito <michael@bommaritollc.com> wrote:
> Here is the offending query and gdb session/stacktrace output. Please
> let me know if we can provide anything else from gdb or logs that can be
> anonymized.

Thanks.

> 2015-07-11 12:57:41 UTC [12803-7] LOG: server process (PID 20696) was
> terminated by signal 11: Segmentation fault
> 2015-07-11 12:57:41 UTC [12803-8] DETAIL: Failed process was running:
> SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()

12803 is the PID of the parent process, right?

> [...]
> Core was generated by `postgres: postgres databasename 127.0.0.1(42063)
> BIND '.

Interesting. So you are using the extended query protocol.

> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
> 2832
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:
> No such file or directory.
> (gdb) bt
> #0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
> #1  0x00007fd0d47cb9dd in pullup_replace_vars_callback
> (var=0x7fd0d5d9e958, context=0x7fff52170620) at
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:2074
> [...]
> #15 0x00007fd0d4929760 in BuildCachedPlan (plansource=plansource@entry=0x7fd0d5d7d940,
> qlist=0x7fd0d5d30cf8, qlist@entry=0x0, boundParams=boundParams@entry=0x0)
> at
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:951

And the crash happens in the planner while building a cached plan. This may be enough information to come up with a reproducible test case. I'll take a closer look on Monday.
--
Michael
Hello Michael,

Here is the offending query and gdb session/stacktrace output. Please let me know if we can provide anything else from gdb or logs that can be anonymized.

Query:
2015-07-11 12:57:41 UTC [12803-7] LOG:  server process (PID 20696) was terminated by signal 11: Segmentation fault
2015-07-11 12:57:41 UTC [12803-8] DETAIL:  Failed process was running: SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()

GDB:
root@hostname:/var/lib/postgresql/9.5/main# gdb -q -c /var/lib/postgresql/9.5/main/core /usr/lib/postgresql/9.5/bin/postgres
Reading symbols from /usr/lib/postgresql/9.5/bin/postgres...Reading symbols from /usr/lib/debug/.build-id/7b/273fca376bfad4ea4bb0c4d3099d62ecc9c124.debug...done.
done.
[New LWP 16028]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: postgres databasename 127.0.0.1(42063) BIND '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
2832	/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c: No such file or directory.
(gdb) bt
#0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
#1  0x00007fd0d47cb9dd in pullup_replace_vars_callback (var=0x7fd0d5d9e958, context=0x7fff52170620) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:2074
#2  0x00007fd0d481c3ea in replace_rte_variables_mutator (node=<optimized out>, context=0x7fff52170620) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteManip.c:1149
#3  0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908, mutator=0x7fd0d481c390 <replace_rte_variables_mutator>, context=0x7fff52170620) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769
#4  0x00007fd0d47812b3 in expression_tree_mutator (node=<optimized out>, mutator=0x7fd0d481c390 <replace_rte_variables_mutator>, context=0x7fff52170620) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2675
#5  0x00007fd0d481cc64 in replace_rte_variables (node=<optimized out>, target_varno=<optimized out>, sublevels_up=sublevels_up@entry=0, callback=callback@entry=0x7fd0d47cb880 <pullup_replace_vars_callback>, callback_arg=callback_arg@entry=0x7fff521706c0, outer_hasSubLinks=0x7fd0d5d30d6e "") at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteManip.c:1115
#6  0x00007fd0d47cd1c7 in pullup_replace_vars (context=0x7fff521706c0, expr=<optimized out>) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:1982
#7  pull_up_simple_subquery (deletion_ok=<optimized out>, containing_appendrel=0x0, lowest_nulling_outer_join=0x0, lowest_outer_join=0x0, rte=0x7fd0d5d30ea8, jtnode=<optimized out>, root=0x7fd0d5d9ee48) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:1030
#8  pull_up_subqueries_recurse (root=root@entry=0x7fd0d5d9ee48, jtnode=<optimized out>, lowest_outer_join=lowest_outer_join@entry=0x0, lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0, containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized out>) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:696
#9  0x00007fd0d47cc989 in pull_up_subqueries_recurse (root=root@entry=0x7fd0d5d9ee48, jtnode=0x7fd0d5d9e6c0, lowest_outer_join=lowest_outer_join@entry=0x0, lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0, containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized out>, deletion_ok@entry=0 '\000') at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:762
#10 0x00007fd0d47cd639 in pull_up_subqueries (root=root@entry=0x7fd0d5d9ee48) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:614
#11 0x00007fd0d47c5014 in subquery_planner (glob=glob@entry=0x7fd0d5d9edb8, parse=parse@entry=0x7fd0d5d30d48, parent_root=parent_root@entry=0x0, hasRecursion=hasRecursion@entry=0 '\000', tuple_fraction=0, subroot=subroot@entry=0x7fff52170908) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:374
#12 0x00007fd0d47c5975 in standard_planner (parse=0x7fd0d5d30d48, cursorOptions=0, boundParams=0x0) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:229
#13 0x00007fd0d4848034 in pg_plan_query (querytree=<optimized out>, cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:809
#14 0x00007fd0d4848124 in pg_plan_queries (querytrees=querytrees@entry=0x7fd0d5d30cf8, cursorOptions=0, boundParams=boundParams@entry=0x0) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:868
#15 0x00007fd0d4929760 in BuildCachedPlan (plansource=plansource@entry=0x7fd0d5d7d940, qlist=0x7fd0d5d30cf8, qlist@entry=0x0, boundParams=boundParams@entry=0x0) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:951
#16 0x00007fd0d4929a98 in GetCachedPlan (plansource=plansource@entry=0x7fd0d5d7d940, boundParams=boundParams@entry=0x0, useResOwner=useResOwner@entry=0 '\000') at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:1165
#17 0x00007fd0d48497ab in exec_bind_message (input_message=0x7fff52170be0) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:1774
#18 PostgresMain (argc=<optimized out>, argv=argv@entry=0x7fd0d5c8d950, dbname=0x7fd0d5c8d840 "databasename", username=<optimized out>) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:4071
#19 0x00007fd0d45f239c in BackendRun (port=0x7fd0d5cd2c00) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:4159
#20 BackendStartup (port=0x7fd0d5cd2c00) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:3835
#21 ServerLoop () at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1609
#22 0x00007fd0d47f18e1 in PostmasterMain (argc=5, argv=<optimized out>) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1254
#23 0x00007fd0d45f30cd in main (argc=5, argv=0x7fd0d5c8c970) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/main/main.c:221

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
Web: http://www.bommaritollc.com
Mobile: +1 (646) 450-3387

On Tue, Jul 7, 2015 at 11:49 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Wed, Jul 8, 2015 at 10:22 AM, Michael Bommarito
> <michael@bommaritollc.com> wrote:
> > I'll try to generate an isolated core dump and stack trace this weekend
> > when the system is not in use.
>
> That would be nice. I have let pgbench -C run for one hour with select
> * from pg_stat_activity running every second (\watch 1) in parallel
> but I could not reproduce the issue on HEAD.
> --
> Michael
Yes, 12803 was the PID of the parent postgres process.

While the extended query protocol was being used by this client and query, we've also been able to trigger it via unprepared statements in `psql` or pgAdmin.

Great, thank you! I'll keep the core around for now so we can provide additional details; if we need to test another execution path for the segfault, it will need to wait until the weekend again.

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
Web: http://www.bommaritollc.com
Mobile: +1 (646) 450-3387

On Sat, Jul 11, 2015 at 9:18 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Sat, Jul 11, 2015 at 10:07 PM, Michael Bommarito
> <michael@bommaritollc.com> wrote:
>> Here is the offending query and gdb session/stacktrace output. Please
>> let me know if we can provide anything else from gdb or logs that can be
>> anonymized.
>
> Thanks.
>
>> 2015-07-11 12:57:41 UTC [12803-7] LOG: server process (PID 20696) was
>> terminated by signal 11: Segmentation fault
>> 2015-07-11 12:57:41 UTC [12803-8] DETAIL: Failed process was running:
>> SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()
>
> 12803 is the number of the process PID, right?
>
>> [...]
>> Core was generated by `postgres: postgres databasename 127.0.0.1(42063)
>> BIND '.
>
> Interesting. So you are using the extended query protocol.
>
>> Program terminated with signal SIGSEGV, Segmentation fault.
>> #0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
>> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
>> 2832
>> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:
>> No such file or directory.
>> (gdb) bt
>> #0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
>> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
>> #1  0x00007fd0d47cb9dd in pullup_replace_vars_callback
>> (var=0x7fd0d5d9e958, context=0x7fff52170620) at
>> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:2074
>> [...]
>> #15 0x00007fd0d4929760 in BuildCachedPlan (plansource=plansource@entry=0x7fd0d5d7d940,
>> qlist=0x7fd0d5d30cf8, qlist@entry=0x0, boundParams=boundParams@entry=0x0)
>> at
>> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:951
>
> And the crash happens in the planner when building a cached plan. This may
> be enough information for coming up with a reproducible test case. I'll get
> a closer look on Monday.
> --
> Michael
Michael Bommarito <michael@bommaritollc.com> writes:
> Hello Michael,
> Here is the offending query and gdb session/stacktrace output. Please
> let me know if we can provide anything else from gdb or logs that can be
> anonymized.

> Query:
> 2015-07-11 12:57:41 UTC [12803-7] LOG: server process (PID 20696) was
> terminated by signal 11: Segmentation fault
> 2015-07-11 12:57:41 UTC [12803-8] DETAIL: Failed process was running:
> SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()

I tried tracing through the logic using the line numbers shown in this stack trace, and soon decided that they didn't seem to match the query you show above --- but on closer look, that's to be expected, because this core is evidently from PID 16028:

> [New LWP 16028]

If you can identify from your logs the exact query that 16028 was running, that would be helpful. Or, if you still have this same core file lying about, "p debug_query_string" in gdb would probably be a more trustworthy guide than trying to match up log entries.

>> That would be nice. I have let pgbench -C run for one hour with select
>> * from pg_stat_activity running every second (\watch 1) in parallel
>> but I could not reproduce the issue on HEAD.

Yeah, I tried similar experiments without result. Presumably there is some other triggering condition here, but it's hard to guess what. I tried things like doing VACUUM FULL on pg_database and pg_authid to force replan cycles on the view, but no crash.

			regards, tom lane
Michael Bommarito <michael@bommaritollc.com> writes:
> Here's the session with debug_query_string:
> (gdb) printf "%s\n", debug_query_string
> SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
> total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid() GROUP
> BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
> client_addr ASC

Thanks. This still doesn't match the stack trace: in particular, this stack frame

#3  0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908, mutator=0x7fd0d481c390 <replace_rte_variables_mutator>, context=0x7fff52170620) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769

indicates that we found a PlaceHolderInfo node in the expression tree that pullup_replace_vars() was applied to, but so far as I can see no such node should exist in the query tree generated by this query. The most likely theory seems to be that something clobbered the query tree while it was sitting in the plancache, causing this recursive function to follow a bogus pointer. But that doesn't leave us with a lot to go on.

What can you tell us about the environment this is happening in? How is the client-side code executing the failing queries? (We know it's using extended query protocol, but is it preparing a statement and then executing it repeatedly, or just using a one-shot unnamed prepared statement?) What nondefault settings are in use on the server side? Do you have any extensions loaded, such as pg_stat_statements or auto_explain?

			regards, tom lane
This particular instance is from pghero, which is a monitoring tool, but it can be reproduced simply by querying pg_stat_activity in psql as well. From a quick skim of their GitHub repo, pghero is using prepared statements via Ruby.

We have pg_stat_statements enabled, and can reproduce without the pghero setup as well. No other extensions are loaded.

On Jul 12, 2015 2:37 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> Michael Bommarito <michael@bommaritollc.com> writes:
> > Here's the session with debug_query_string:
> > (gdb) printf "%s\n", debug_query_string
> > SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
> > total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid() GROUP
> > BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
> > client_addr ASC
>
> Thanks. This still doesn't match the stack trace: in particular, this
> stack frame
>
> #3  0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
> mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
> context=0x7fff52170620) at
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769
>
> indicates that we found a PlaceHolderInfo node in the expression tree that
> pullup_replace_vars() was applied to, but so far as I can see no such node
> should exist in the query tree generated by this query. The most likely
> theory seems to be that something clobbered the query tree while it was
> sitting in the plancache, causing this recursive function to follow a
> bogus pointer. But that doesn't leave us with a lot to go on.
>
> What can you tell us about the environment this is happening in?
> How is the client-side code executing the failing queries? (We know
> it's using extended query protocol, but is it preparing a statement
> and then executing it repeatedly, or just using a one-shot unnamed
> prepared statement?) What nondefault settings are in use on the
> server side? Do you have any extensions loaded, such as
> pg_stat_statements or auto_explain?
>
> regards, tom lane
Here's the session with debug_query_string:

ubuntu@hostname:~⟫ sudo gdb -q -c /var/lib/postgresql/9.5/main/core /usr/lib/postgresql/9.5/bin/postgres
Reading symbols from /usr/lib/postgresql/9.5/bin/postgres...Reading symbols from /usr/lib/debug/.build-id/7b/273fca376bfad4ea4bb0c4d3099d62ecc9c124.debug...done.
done.
[New LWP 16028]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: postgres database 127.0.0.1(42063) BIND '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
2832	/tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c: No such file or directory.
(gdb) printf "%s\n", debug_query_string
SELECT application_name AS source, client_addr AS ip, COUNT(*) AS total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid() GROUP BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC, client_addr ASC
(gdb)

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
Web: http://www.bommaritollc.com
Mobile: +1 (646) 450-3387

On Sat, Jul 11, 2015 at 6:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Bommarito <michael@bommaritollc.com> writes:
> > Hello Michael,
> > Here is the offending query and gdb session/stacktrace output. Please
> > let me know if we can provide anything else from gdb or logs that can be
> > anonymized.
>
> > Query:
> > 2015-07-11 12:57:41 UTC [12803-7] LOG: server process (PID 20696) was
> > terminated by signal 11: Segmentation fault
> > 2015-07-11 12:57:41 UTC [12803-8] DETAIL: Failed process was running:
> > SELECT COUNT(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid()
>
> I tried tracing through the logic using the line numbers shown in this
> stack trace, and soon decided that they didn't seem to match the query
> you show above --- but on closer look, that's to be expected, because
> this core is evidently from PID 16028:
>
> > [New LWP 16028]
>
> If you can identify the exact query that 16028 was running from your logs,
> that would be helpful. Or, if you still have this same core file laying
> about, "p debug_query_string" in gdb would probably be a more trustworthy
> guide than trying to match up log entries.
>
> >> That would be nice. I have let pgbench -C run for one hour with select
> >> * from pg_stat_activity running every second (\watch 1) in parallel
> >> but I could not reproduce the issue on HEAD.
>
> Yeah, I tried similar experiments without result. Presumably there is
> some other triggering condition here, but it's hard to guess what.
> I tried things like doing VACUUM FULL on pg_database and pg_authid
> to force replan cycles on the view, but no crash.
>
> regards, tom lane
On Mon, Jul 13, 2015 at 4:16 AM, Michael Bommarito <michael@bommaritollc.com> wrote:
> This particular instance is from pghero, which is a monitoring tool. It
> can be reproduced simply by querying stat_activity in psql as well. Pghero
> is using prepared statements via ruby from a quick skim on their github
> repo.
>
> We have pg_stat_statements enabled, and can reproduce without pghero setup
> as well. No other extensions loaded.
>
> On Jul 12, 2015 2:37 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> Michael Bommarito <michael@bommaritollc.com> writes:
>> > Here's the session with debug_query_string:
>> > (gdb) printf "%s\n", debug_query_string
>> > SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
>> > total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid()
>> > GROUP
>> > BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC,
>> > client_addr ASC
>>
>> Thanks. This still doesn't match the stack trace: in particular, this
>> stack frame
>>
>> #3  0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
>> mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
>> context=0x7fff52170620) at
>> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769
>>
>> indicates that we found a PlaceHolderInfo node in the expression tree that
>> pullup_replace_vars() was applied to, but so far as I can see no such node
>> should exist in the query tree generated by this query. The most likely
>> theory seems to be that something clobbered the query tree while it was
>> sitting in the plancache, causing this recursive function to follow a
>> bogus pointer. But that doesn't leave us with a lot to go on.
>>
>> What can you tell us about the environment this is happening in?
>> How is the client-side code executing the failing queries? (We know
>> it's using extended query protocol, but is it preparing a statement
>> and then executing it repeatedly, or just using a one-shot unnamed
>> prepared statement?) What nondefault settings are in use on the
>> server side? Do you have any extensions loaded, such as
>> pg_stat_statements or auto_explain?

FWIW, I have been fooling around with the query reported in the backtrace upthread, playing a bit with the extended query protocol to send BIND messages with PQdescribePrepared and PQsendDescribePrepared, as well as with psql, and while I am able to reproduce stack traces close to what you had, I am not seeing any crashes. I have also played a bit with pghero with pgbench running in parallel and there were no problems, with and without pg_stat_statements loaded.

In the backtrace you sent previously (http://www.postgresql.org/message-id/CAN=rtBipwKdHCtmXH3r4GNfUhF9e4ZfJbqcj7s_Ec9e2Mbf_LA@mail.gmail.com), what is the value of MyProcPid? Is it 12803 or 20696? If it is the former, do you have a backtrace for process 20696?

What we may be looking at now is actually a side effect of the real problem, and as long as we do not have a real test case, I am afraid that finding the root cause will be rather difficult.
--
Michael
Here are locals per frame, all the way back up to BackendStartup, in case it helps as well. I will be trying to reproduce this upcoming weekend with a "baseline" version of postgresql.conf and another system with clean import. (gdb) bt #0 get_tle_by_resno (tlist=3D0x7fd0d5da27c0, resno=3Dresno@entry=3D6) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_rel= ation.c:2832 #1 0x00007fd0d47cb9dd in pullup_replace_vars_callback (var=3D0x7fd0d5d9e95= 8, context=3D0x7fff52170620) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/p= repjointree.c:2074 #2 0x00007fd0d481c3ea in replace_rte_variables_mutator (node=3D<optimized out>, context=3D0x7fff52170620) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteM= anip.c:1149 #3 0x00007fd0d478152c in expression_tree_mutator (node=3D0x7fd0d5d9e908, mutator=3D0x7fd0d481c390 <replace_rte_variables_mutator>, context=3D0x7fff52170620) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.= c:2769 #4 0x00007fd0d47812b3 in expression_tree_mutator (node=3D<optimized out>, mutator=3D0x7fd0d481c390 <replace_rte_variables_mutator>, context=3D0x7fff52170620) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.= c:2675 #5 0x00007fd0d481cc64 in replace_rte_variables (node=3D<optimized out>, target_varno=3D<optimized out>, sublevels_up=3Dsublevels_up@entry=3D0, callback=3Dcallback@entry=3D0x7fd0d47cb880 <pullup_replace_vars_callback>, callback_arg=3Dcallback_arg@entry=3D0x7fff521706c0, outer_hasSubLinks=3D0x7fd0d5d30d6e "") at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/rewrite/rewriteM= anip.c:1115 #6 0x00007fd0d47cd1c7 in pullup_replace_vars (context=3D0x7fff521706c0, expr=3D<optimized out>) at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/p= repjointree.c:1982 #7 pull_up_simple_subquery (deletion_ok=3D<optimized out>, containing_appendrel=3D0x0, 
lowest_nulling_outer_join=0x0, lowest_outer_join=0x0, rte=0x7fd0d5d30ea8,
jtnode=<optimized out>, root=0x7fd0d5d9ee48)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:1030
#8  pull_up_subqueries_recurse (root=root@entry=0x7fd0d5d9ee48,
jtnode=<optimized out>, lowest_outer_join=lowest_outer_join@entry=0x0,
lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0,
containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized out>)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:696
#9  0x00007fd0d47cc989 in pull_up_subqueries_recurse (root=root@entry=0x7fd0d5d9ee48,
jtnode=0x7fd0d5d9e6c0, lowest_outer_join=lowest_outer_join@entry=0x0,
lowest_nulling_outer_join=lowest_nulling_outer_join@entry=0x0,
containing_appendrel=containing_appendrel@entry=0x0, deletion_ok=<optimized out>,
deletion_ok@entry=0 '\000')
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:762
#10 0x00007fd0d47cd639 in pull_up_subqueries (root=root@entry=0x7fd0d5d9ee48)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/prep/prepjointree.c:614
#11 0x00007fd0d47c5014 in subquery_planner (glob=glob@entry=0x7fd0d5d9edb8,
parse=parse@entry=0x7fd0d5d30d48, parent_root=parent_root@entry=0x0,
hasRecursion=hasRecursion@entry=0 '\000', tuple_fraction=0,
subroot=subroot@entry=0x7fff52170908)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:374
#12 0x00007fd0d47c5975 in standard_planner (parse=0x7fd0d5d30d48,
cursorOptions=0, boundParams=0x0)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/optimizer/plan/planner.c:229
#13 0x00007fd0d4848034 in pg_plan_query (querytree=<optimized out>,
cursorOptions=cursorOptions@entry=0, boundParams=boundParams@entry=0x0)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:809
#14 0x00007fd0d4848124 in pg_plan_queries (querytrees=querytrees@entry=0x7fd0d5d30cf8,
cursorOptions=0, boundParams=boundParams@entry=0x0)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:868
#15 0x00007fd0d4929760 in BuildCachedPlan (plansource=plansource@entry=0x7fd0d5d7d940,
qlist=0x7fd0d5d30cf8, qlist@entry=0x0, boundParams=boundParams@entry=0x0)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:951
#16 0x00007fd0d4929a98 in GetCachedPlan (plansource=plansource@entry=0x7fd0d5d7d940,
boundParams=boundParams@entry=0x0, useResOwner=useResOwner@entry=0 '\000')
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/utils/cache/plancache.c:1165
#17 0x00007fd0d48497ab in exec_bind_message (input_message=0x7fff52170be0)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:1774
#18 PostgresMain (argc=<optimized out>, argv=argv@entry=0x7fd0d5c8d950,
dbname=0x7fd0d5c8d840 "databasename", username=<optimized out>)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/tcop/postgres.c:4071
#19 0x00007fd0d45f239c in BackendRun (port=0x7fd0d5cd2c00)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:4159
#20 BackendStartup (port=0x7fd0d5cd2c00)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:3835
#21 ServerLoop ()
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1609
#22 0x00007fd0d47f18e1 in PostmasterMain (argc=5, argv=<optimized out>)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/postmaster/postmaster.c:1254
#23 0x00007fd0d45f30cd in main (argc=5, argv=0x7fd0d5c8c970)
at /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/main/main.c:221
(gdb) info locals
tle = 0x0
l = 0x7fd0d5da2940
(gdb) select-frame 1
(gdb) info locals
tle = <optimized out>
rcon = 0x7fff521706c0
varattno = 6
newnode = <optimized out>
__func__ = "pullup_replace_vars_callback"
(gdb) select-frame 2
(gdb) info locals
newnode = <optimized out>
var = <optimized out>
__func__ = "replace_rte_variables_mutator"
(gdb) select-frame 3
(gdb) info locals
phinfo = <optimized out>
newnode = <optimized out>
__func__ = "expression_tree_mutator"
(gdb) select-frame 4
(gdb) info locals
resultlist = 0x0
temp = 0x7fd0d5d9e8e8
__func__ = "expression_tree_mutator"
(gdb) select-frame 5
(gdb) info locals
result = <optimized out>
context = {callback = 0x7fd0d47cb880 <pullup_replace_vars_callback>,
  callback_arg = 0x7fff521706c0, target_varno = 1, sublevels_up = 0,
  inserted_sublink = 0 '\000'}
__func__ = "replace_rte_variables"
(gdb) select-frame 6
(gdb) info locals
No locals.
(gdb) select-frame 7
(gdb) info locals
parse = 0x7fd0d5d30d48
subquery = <optimized out>
rvcontext = {root = 0x7fd0d5d9ee48, targetlist = 0x7fd0d5da27c0,
  target_rte = 0x7fd0d5d30ea8, relids = 0x0,
  outer_hasSubLinks = 0x7fd0d5d30d6e "", varno = 1, need_phvs = 0 '\000',
  wrap_non_vars = 0 '\000', rv_cache = 0x7fd0d5da2960}
varno = 1
subroot = 0x7fd0d5da2560
lc = <optimized out>
(gdb) select-frame 8
(gdb) info locals
varno = <optimized out>
rte = 0x7fd0d5d30ea8
__func__ = "pull_up_subqueries_recurse"
(gdb) select-frame 9
(gdb) info locals
sub_deletion_ok = <optimized out>
f = 0x7fd0d5d9e6c0
have_undeleted_child = 0 '\000'
l = 0x7fd0d5d9e720
__func__ = "pull_up_subqueries_recurse"
(gdb) select-frame 10
(gdb) info locals
No locals.
(gdb) select-frame 11
(gdb) info locals
root = 0x7fd0d5d9ee48
plan = <optimized out>
newWithCheckOptions = <optimized out>
newHaving = <optimized out>
hasOuterJoins = <optimized out>
l = <optimized out>
(gdb) select-frame 12
(gdb) info locals
result = <optimized out>
glob = 0x7fd0d5d9edb8
tuple_fraction = <optimized out>
root = 0x7fd0d5d9ed18
top_plan = <optimized out>
lp = <optimized out>
lr = <optimized out>
(gdb) select-frame 13
(gdb) info locals
plan = <optimized out>
(gdb) select-frame 14
(gdb) info locals
query = <optimized out>
stmt = <optimized out>
stmt_list = 0x0
query_list = 0x7fd0d5d30d28
(gdb) select-frame 15
(gdb) info locals
plan = <optimized out>
plist = <optimized out>
snapshot_set = 0 '\000'
spi_pushed = <optimized out>
plan_context = <optimized out>
oldcxt = 0x7fd0d5c8ccb0
(gdb) select-frame 16
(gdb) info locals
plan = <optimized out>
qlist = 0x0
customplan = 0 '\000'
__func__ = "GetCachedPlan"
(gdb) select-frame 17
(gdb) info locals
pformats = 0x0
psrc = 0x7fd0d5d7d940
portal = 0x7fd0d5ca6da0
query_string = 0x7fd0d5cd17a0 "SELECT application_name AS source, client_addr AS ip, COUNT(*) AS total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid() GROUP BY application_name, ip ORDER BY COUNT(*) DESC, applicatio"...
portal_name = 0x7fd0d5d308d0 ""
stmt_name = 0x7fd0d5d308d1 ""
numPFormats = 0
saved_stmt_name = 0x0
rformats = 0x7fd0d5d30ce0
params = 0x0
save_log_statement_stats = 0 '\000'
msec_str = "SELECT 1\000\061\000\324\320\177\000\000\220\r\027R\377\177\000\000ᕔ\324\320\177\000"
numParams = 0
numRFormats = 1
cplan = <optimized out>
snapshot_set = <optimized out>
(gdb) select-frame 18
(gdb) info locals
firstchar = -707588896
input_message = {data = 0x7fd0d5d308d0 "", len = 10, maxlen = 1024, cursor = 10}
local_sigjmp_buf = {{__jmpbuf = {140734570630176, -6783158886466612332,
  140534916634656, 1, 0, 140534916918272, -6783158886368046188,
  -6808758974801407084}, __mask_was_saved = 1, __saved_mask = {__val = {0,
  140534916918272, 140534862700952, 140534916634800, 206158430256,
  140734570630512, 140734570630304, 140534916918272, 52, 140734570630432,
  140534895011200, 1024, 140734570630464, 140734570630628, 0,
  140734570630400}}}}
send_ready_for_query = 0 '\000'
__func__ = "PostgresMain"
(gdb) select-frame 19
(gdb) info locals
ac = 1
secs = 489934357
usecs = 656467
i = 1
av = 0x7fd0d5c8d950
maxac = <optimized out>

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Mon, Jul 13, 2015 at 2:54 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Mon, Jul 13, 2015 at 4:16 AM, Michael Bommarito
> <michael@bommaritollc.com> wrote:
> > This particular instance is from pghero, which is a monitoring tool. It
> > can be reproduced simply by querying stat_activity in psql as well. Pghero
> > is using prepared statements via ruby from a quick skim on their github
> > repo.
> >
> > We have pg_stat_statements enabled, and can reproduce without pghero setup
> > as well. No other extensions loaded.
> >
> > On Jul 12, 2015 2:37 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> >>
> >> Michael Bommarito <michael@bommaritollc.com> writes:
> >> > Here's the session with debug_query_string:
> >> > (gdb) printf "%s\n", debug_query_string
> >> > SELECT application_name AS source, client_addr AS ip, COUNT(*) AS
> >> > total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid()
> >> > GROUP BY application_name, ip ORDER BY COUNT(*) DESC,
> >> > application_name ASC, client_addr ASC
> >>
> >> Thanks. This still doesn't match the stack trace: in particular, this
> >> stack frame
> >>
> >> #3 0x00007fd0d478152c in expression_tree_mutator (node=0x7fd0d5d9e908,
> >> mutator=0x7fd0d481c390 <replace_rte_variables_mutator>,
> >> context=0x7fff52170620) at
> >> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/nodes/nodeFuncs.c:2769
> >>
> >> indicates that we found a PlaceHolderInfo node in the expression tree that
> >> pullup_replace_vars() was applied to, but so far as I can see no such node
> >> should exist in the query tree generated by this query. The most likely
> >> theory seems to be that something clobbered the query tree while it was
> >> sitting in the plancache, causing this recursive function to follow a
> >> bogus pointer. But that doesn't leave us with a lot to go on.
> >>
> >> What can you tell us about the environment this is happening in?
> >> How is the client-side code executing the failing queries? (We know
> >> it's using extended query protocol, but is it preparing a statement
> >> and then executing it repeatedly, or just using a one-shot unnamed
> >> prepared statement?) What nondefault settings are in use on the
> >> server side? Do you have any extensions loaded, such as
> >> pg_stat_statements or auto_explain?
>
> FWIW, I have been fooling around with the query reported in the back
> trace upthread by playing a bit with the extended query protocol to
> send BIND messages with PQdescribePrepared and PQsendDescribePrepared,
> as well as with psql, and while I am able to reproduce stack traces
> close to what you had I am not seeing any crashes. I have as well
> played a bit with pghero with pgbench running in parallel and there
> were no problems, with and without pg_stat_statements loaded.
>
> In the backtrace you sent previously
> (http://www.postgresql.org/message-id/CAN=rtBipwKdHCtmXH3r4GNfUhF9e4ZfJbqcj7s_Ec9e2Mbf_LA@mail.gmail.com),
> what is the value of MyProcPid? Is it 12803 or 20696? If it is the
> former, do you have a backtrace for process 20696? What we may be
> looking at now is actually a side effect of the real problem, and as
> long as we do not have a real test case, I am afraid that finding the
> root problem is rather difficult.
> --
> Michael
>
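For the archives, the prepare-once, execute-repeatedly pattern discussed above can be approximated from plain psql with a named prepared statement. This is only a rough stand-in: PREPARE/EXECUTE travel over the simple protocol rather than true extended-protocol Bind messages, and the statement name below is arbitrary, so it may well not trigger the bug.

```sql
-- Rough psql approximation of a client preparing once and executing repeatedly.
PREPARE activity_by_source AS
  SELECT application_name AS source, client_addr AS ip,
         COUNT(*) AS total_connections
  FROM pg_stat_activity
  WHERE pid <> pg_backend_pid()
  GROUP BY application_name, ip
  ORDER BY COUNT(*) DESC, application_name ASC, client_addr ASC;

EXECUTE activity_by_source;  -- repeat to exercise the cached plan
EXECUTE activity_by_source;

DEALLOCATE activity_by_source;
```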
On Mon, Jul 13, 2015 at 10:43 PM, Michael Bommarito
<michael@bommaritollc.com> wrote:
> (gdb) bt
> #0 get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at
> /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832
> [...]
> (gdb) info locals
> tle = 0x0
> l = 0x7fd0d5da2940

So basically this is crashing because this tle is NULL: the target list
handed to get_tle_by_resno contains a NULL element, which is dereferenced
as soon as it is extracted from the list. I am wondering what is breaking
that.
--
Michael
Michael Bommarito <michael@bommaritollc.com> writes: > If you can provide a patch that performs input validation in > get_tle_by_resno and logs the condition, I can compile and test with it. Wouldn't prove anything one way or another. Somehow, a corrupt query tree is being fed to the planner; what we need to understand is what conditions cause that to happen. I doubt that getting more details at the point where the code trips over the corruption will teach us that. regards, tom lane
If you can provide a patch that performs input validation in get_tle_by_resno and logs the condition, I can compile and test with it. Thanks, Michael J. Bommarito II, CEO Bommarito Consulting, LLC *Web:* http://www.bommaritollc.com *Mobile:* +1 (646) 450-3387 On Tue, Jul 14, 2015 at 3:33 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Mon, Jul 13, 2015 at 10:43 PM, Michael Bommarito > <michael@bommaritollc.com> wrote: > > (gdb) bt > > #0 get_tle_by_resno (tlist=0x7fd0d5da27c0, resno=resno@entry=6) at > > > /tmp/buildd/postgresql-9.5-9.5~alpha1/build/../src/backend/parser/parse_relation.c:2832 > > [...] > > (gdb) info locals > > tle = 0x0 > > l = 0x7fd0d5da2940 > > So basically this is crashing because this tle is NULL, and > get_tle_by_resno does not like that much when this item is extracted > from the list. I am wondering what is breaking that. > -- > Michael >
Michael Bommarito <michael@bommaritollc.com> writes: > Compiled from source with --debug using 9.5alpha1 with -O0 -ggdb > -fno-omit-frame-pointer -mno-red-zone. Reset to default postgresql.conf. > Was able to generate segfaults repeatedly loading pghero dashboard. First, > some errors from the pg logs. > 2015-07-18 15:07:58 UTC [27112-1] postgres@database ERROR: attribute > number 2 exceeds number of columns 0 > 2015-07-18 15:08:23 UTC [27112-3] postgres@database ERROR: invalid > varattno 66 > 2015-07-18 15:17:19 UTC [3605-1] postgres@database ERROR: tupdesc > reference 0x2bdd8a8 is not owned by resource owner Portal > 2015-07-18 15:48:38 UTC [10281-1] postgres@database ERROR: tupdesc > reference 0xf77248 is not owned by resource owner Portal I'm starting to wonder about flaky hardware and/or compiler bugs. 9.5 seems completely unreliable for you, but we're not seeing any similar reports from anyone else. Is 9.4 reliable on the same machine + same build environment? > Program received signal SIGQUIT, Quit. Catching a SIGQUIT isn't very helpful; that just means you're one of the server processes that got force-terminated after some other one crashed. regards, tom lane
Compiled from source with --debug using 9.5alpha1 with -O0 -ggdb -fno-omit-frame-pointer -mno-red-zone. Reset to default postgresql.conf. Was able to generate segfaults repeatedly loading pghero dashboard. First, some errors from the pg logs. ======================================================================== ======================================================== 2015-07-18 15:07:58 UTC [27112-1] postgres@database ERROR: attribute number 2 exceeds number of columns 0 2015-07-18 15:07:58 UTC [27112-2] postgres@database STATEMENT: SELECT application_name AS source, client_addr AS ip, COUNT(*) AS total_connections FROM pg_stat_activity WHERE pid <> pg_backend_pid() GROUP BY application_name, ip ORDER BY COUNT(*) DESC, application_name ASC, client_addr ASC 2015-07-18 15:08:23 UTC [27112-3] postgres@database ERROR: invalid varattno 66 2015-07-18 15:08:23 UTC [27112-4] postgres@database STATEMENT: SELECT relname AS table, indexrelname AS index, pg_size_pretty(pg_relation_size(i.indexrelid)) AS index_size, idx_scan as index_scans FROM pg_stat_user_indexes ui INNER JOIN pg_index i ON ui.indexrelid = i.indexrelid WHERE NOT indisunique AND idx_scan < 50 ORDER BY pg_relation_size(i.indexrelid) DESC, relname ASC 2015-07-18 15:17:19 UTC [3605-1] postgres@database ERROR: tupdesc reference 0x2bdd8a8 is not owned by resource owner Portal ======================================================================== ======================================================== Next, attached to the backend PID under sudo gdb and `cont`: ======================================================================== ======================================================== 2015-07-18 15:48:38 UTC [10281-1] postgres@database ERROR: tupdesc reference 0xf77248 is not owned by resource owner Portal 2015-07-18 15:48:54 UTC [8812-4] LOG: server process (PID 10538) was terminated by signal 11: Segmentation fault 2015-07-18 15:48:54 UTC [8812-5] LOG: terminating any other active server processes 
2015-07-18 15:48:54 UTC [10523-1] postgres@database WARNING: terminating connection because of crash of another server process 2015-07-18 15:48:54 UTC [10523-2] postgres@database DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2015-07-18 15:48:54 UTC [10523-3] postgres@database HINT: In a moment you should be able to reconnect to the database and repeat your command. 2015-07-18 15:48:54 UTC [10239-1] postgres@database WARNING: terminating connection because of crash of another server process 2015-07-18 15:48:54 UTC [10239-2] postgres@database DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2015-07-18 15:48:54 UTC [10239-3] postgres@database HINT: In a moment you should be able to reconnect to the database and repeat your command. 2015-07-18 15:48:54 UTC [10522-1] postgres@database WARNING: terminating connection because of crash of another server process 2015-07-18 15:48:54 UTC [10522-2] postgres@database DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2015-07-18 15:48:54 UTC [10522-3] postgres@database HINT: In a moment you should be able to reconnect to the database and repeat your command. 2015-07-18 15:48:54 UTC [10409-1] postgres@database WARNING: terminating connection because of crash of another server process 2015-07-18 15:48:54 UTC [10409-2] postgres@database DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 
2015-07-18 15:48:54 UTC [10409-3] postgres@database HINT: In a moment you should be able to reconnect to the database and repeat your command. 2015-07-18 15:48:54 UTC [10408-1] postgres@database WARNING: terminating connection because of crash of another server process 2015-07-18 15:48:54 UTC [10408-2] postgres@database DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. 2015-07-18 15:48:54 UTC [10408-3] postgres@database HINT: In a moment you should be able to reconnect to the database and repeat your command. Program received signal SIGQUIT, Quit. 0x00007f84fe78c110 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:81 81 in ../sysdeps/unix/syscall-template.S (gdb) bt full Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x7ffc65188bd8: #0 0x00007f84fe78c110 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:81 No locals. Cannot access memory at address 0x7ffc65188bd8 ======================================================================== ======================================================== Thanks, Michael J. Bommarito II, CEO Bommarito Consulting, LLC *Web:* http://www.bommaritollc.com *Mobile:* +1 (646) 450-3387 On Tue, Jul 14, 2015 at 8:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Bommarito <michael@bommaritollc.com> writes: > > If you can provide a patch that performs input validation in > > get_tle_by_resno and logs the condition, I can compile and test with it. > > Wouldn't prove anything one way or another. Somehow, a corrupt query tree > is being fed to the planner; what we need to understand is what conditions > cause that to happen. I doubt that getting more details at the point > where the code trips over the corruption will teach us that. > > regards, tom lane >
After exporting and re-importing (pg9.5 -> 9.5), this has gone away. We
changed quite a few variables in the import (non-parallel, letting full
index builds proceed, etc.), so I'm not 100% sure what the issue was; as
you suggested, it doesn't seem that pg9.5 itself was the problem (though
I'm not sure about pg_dump/restore).

Thanks,
Michael J. Bommarito II, CEO
Bommarito Consulting, LLC
*Web:* http://www.bommaritollc.com
*Mobile:* +1 (646) 450-3387

On Sat, Jul 18, 2015 at 12:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Bommarito <michael@bommaritollc.com> writes:
> > Compiled from source with --debug using 9.5alpha1 with -O0 -ggdb
> > -fno-omit-frame-pointer -mno-red-zone. Reset to default postgresql.conf.
> > Was able to generate segfaults repeatedly loading pghero dashboard. First,
> > some errors from the pg logs.
> > 2015-07-18 15:07:58 UTC [27112-1] postgres@database ERROR: attribute
> > number 2 exceeds number of columns 0
> > 2015-07-18 15:08:23 UTC [27112-3] postgres@database ERROR: invalid
> > varattno 66
> > 2015-07-18 15:17:19 UTC [3605-1] postgres@database ERROR: tupdesc
> > reference 0x2bdd8a8 is not owned by resource owner Portal
> > 2015-07-18 15:48:38 UTC [10281-1] postgres@database ERROR: tupdesc
> > reference 0xf77248 is not owned by resource owner Portal
>
> I'm starting to wonder about flaky hardware and/or compiler bugs. 9.5
> seems completely unreliable for you, but we're not seeing any similar
> reports from anyone else. Is 9.4 reliable on the same machine + same
> build environment?
>
> > Program received signal SIGQUIT, Quit.
>
> Catching a SIGQUIT isn't very helpful; that just means you're one of
> the server processes that got force-terminated after some other one
> crashed.
>
> regards, tom lane
>
Michael Bommarito wrote: > After exporting and re-importing (pg9.5 -> 9.5), this has gone away. We > changed quite a few variables in the import (non-parallel, letting full > index builds proceed, etc.), so not 100% sure what the issue was; as you > suggested, doesn't that pg9.5 itself was the problem (though not sure on > pg_dump/restore). Maybe you had an inconsistent build. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
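On the inconsistent-build theory: mixing object files from two different configurations of the same source tree is an easy way to get exactly this kind of random corruption. A sketch of a from-scratch rebuild; the source path, install prefix, and flags below are placeholders, not the reporter's actual values.

```
# Hypothetical clean rebuild; adjust paths and options to the real environment.
cd /path/to/postgresql-9.5alpha1
make distclean              # drop every object left over from a previous configure
./configure --prefix=/usr/local/pgsql95 --enable-debug \
    CFLAGS="-O0 -ggdb -fno-omit-frame-pointer -mno-red-zone"
make && make check          # rebuild everything and run the regression tests
```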