Обсуждение: Debugging a backend stuck consuming CPU
Hi PostgreSQL community, I am investigating a problem with a backend that appears to be stuck and spinning while performing a "DISCARD ALL" command. The system is running an older release 9.2.2. Are there any bugs that could be causing this behavior? How can I tell what the process is actually doing? It does not respond to either pg_cancel_backend() or to pg_terminate_backend() so it is acting like it is not receiving the signals. Any ideas on how to debug this would be appriciated. Regards, Ken
"ktm@rice.edu" <ktm@rice.edu> writes: > I am investigating a problem with a backend that appears to be stuck > and spinning while performing a "DISCARD ALL" command. The system is > running an older release 9.2.2. You do realize that the current release in that series is 9.2.17. > Are there any bugs that could be causing this behavior? Known bugs are summarized here: http://www.postgresql.org/docs/9.2/static/release.html > How can I tell what the process is actually doing? Getting a stack trace with gdb might be informative: https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend regards, tom lane
On 05/19/2016 06:46 AM, ktm@rice.edu wrote: > Hi PostgreSQL community, > > I am investigating a problem with a backend that appears to be stuck > and spinning while performing a "DISCARD ALL" command. The system is > running an older release 9.2.2. Are there any bugs that could be > causing this behavior? How can I tell what the process is actually > doing? It does not respond to either pg_cancel_backend() or to > pg_terminate_backend() so it is acting like it is not receiving the > signals. Any ideas on how to debug this would be appriciated. What does the Postgres log show around the time of the command and/or currently? What OS and does it's system log show anything? Can you query pg_stat_activity? http://www.postgresql.org/docs/9.2/static/monitoring-stats.html#PG-STAT-ACTIVITY-VIEW > > Regards, > Ken > > -- Adrian Klaver adrian.klaver@aklaver.com
On Thu, May 19, 2016 at 09:58:45AM -0400, Tom Lane wrote: > "ktm@rice.edu" <ktm@rice.edu> writes: > > I am investigating a problem with a backend that appears to be stuck > > and spinning while performing a "DISCARD ALL" command. The system is > > running an older release 9.2.2. > > You do realize that the current release in that series is 9.2.17. > > > Are there any bugs that could be causing this behavior? > > Known bugs are summarized here: > http://www.postgresql.org/docs/9.2/static/release.html > > > How can I tell what the process is actually doing? > > Getting a stack trace with gdb might be informative: > https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend > > regards, tom lane > Hi, We do have an upgrade scheduled. I will get a stack trace and see what information that provides. Regards, Ken
On Thu, May 19, 2016 at 8:46 AM, ktm@rice.edu <ktm@rice.edu> wrote: > Hi PostgreSQL community, > > I am investigating a problem with a backend that appears to be stuck > and spinning while performing a "DISCARD ALL" command. The system is > running an older release 9.2.2. Are there any bugs that could be > causing this behavior? How can I tell what the process is actually > doing? It does not respond to either pg_cancel_backend() or to > pg_terminate_backend() so it is acting like it is not receiving the > signals. Any ideas on how to debug this would be appriciated. 1. Grab a stack trace and a 'perf top' for forensic analysis in case it happens again 2. Install latest 9.2 binaries 3. Restart database There's not much value in analyzing such an out of date database; there's a very high probability your bug has already been fixed. merlin
On Thu, May 19, 2016 at 09:58:45AM -0400, Tom Lane wrote: > "ktm@rice.edu" <ktm@rice.edu> writes: > > I am investigating a problem with a backend that appears to be stuck > > and spinning while performing a "DISCARD ALL" command. The system is > > running an older release 9.2.2. > > You do realize that the current release in that series is 9.2.17. > > > Are there any bugs that could be causing this behavior? > > Known bugs are summarized here: > http://www.postgresql.org/docs/9.2/static/release.html > > > How can I tell what the process is actually doing? > > Getting a stack trace with gdb might be informative: > https://wiki.postgresql.org/wiki/Generating_a_stack_trace_of_a_PostgreSQL_backend > > regards, tom lane > Hi, The stack trace just appeared to be what I would expect while a 'DISCARD ALL' command was being run: Continuing. Program received signal SIGINT, Interrupt. 0x000000000073bc7c in MemoryContextSetParent () #0 0x000000000073bc7c in MemoryContextSetParent () #1 0x000000000073bde3 in MemoryContextDelete () #2 0x000000000054e3a9 in DropAllPreparedStatements () #3 0x00000000005365f3 in DiscardCommand () #4 0x00000000006582c7 in ?? () #5 0x00000000006592bd in ?? () #6 0x0000000000659a42 in PortalRun () #7 0x000000000065603d in ?? () #8 0x0000000000656ed0 in PostgresMain () #9 0x0000000000613b91 in ?? () #10 0x00000000006167fc in PostmasterMain () #11 0x00000000005b5290 in main () Continuing. Program received signal SIGINT, Interrupt. 0x000000000073bc7c in MemoryContextSetParent () #0 0x000000000073bc7c in MemoryContextSetParent () #1 0x000000000073bde3 in MemoryContextDelete () #2 0x000000000054e3a9 in DropAllPreparedStatements () #3 0x00000000005365f3 in DiscardCommand () #4 0x00000000006582c7 in ?? () #5 0x00000000006592bd in ?? () #6 0x0000000000659a42 in PortalRun () #7 0x000000000065603d in ?? () #8 0x0000000000656ed0 in PostgresMain () #9 0x0000000000613b91 in ?? () #10 0x00000000006167fc in PostmasterMain () #11 0x00000000005b5290 in main () Continuing. Program received signal SIGINT, Interrupt. 0x000000000073bc7c in MemoryContextSetParent () #0 0x000000000073bc7c in MemoryContextSetParent () #1 0x000000000073bde3 in MemoryContextDelete () #2 0x000000000070e7df in DropCachedPlan () #3 0x000000000054e3a9 in DropAllPreparedStatements () #4 0x00000000005365f3 in DiscardCommand () #5 0x00000000006582c7 in ?? () #6 0x00000000006592bd in ?? () #7 0x0000000000659a42 in PortalRun () #8 0x000000000065603d in ?? () #9 0x0000000000656ed0 in PostgresMain () #10 0x0000000000613b91 in ?? () #11 0x00000000006167fc in PostmasterMain () #12 0x00000000005b5290 in main () Continuing. Program received signal SIGINT, Interrupt. 0x000000000073bc7c in MemoryContextSetParent () #0 0x000000000073bc7c in MemoryContextSetParent () #1 0x000000000073bde3 in MemoryContextDelete () #2 0x000000000054e3a9 in DropAllPreparedStatements () #3 0x00000000005365f3 in DiscardCommand () #4 0x00000000006582c7 in ?? () #5 0x00000000006592bd in ?? () #6 0x0000000000659a42 in PortalRun () #7 0x000000000065603d in ?? () #8 0x0000000000656ed0 in PostgresMain () #9 0x0000000000613b91 in ?? () #10 0x00000000006167fc in PostmasterMain () #11 0x00000000005b5290 in main () Continuing. Program received signal SIGINT, Interrupt. 0x000000000070e7ff in DropCachedPlan () #0 0x000000000070e7ff in DropCachedPlan () #1 0x000000000054e3a9 in DropAllPreparedStatements () #2 0x00000000005365f3 in DiscardCommand () #3 0x00000000006582c7 in ?? () #4 0x00000000006592bd in ?? () #5 0x0000000000659a42 in PortalRun () #6 0x000000000065603d in ?? () #7 0x0000000000656ed0 in PostgresMain () #8 0x0000000000613b91 in ?? () #9 0x00000000006167fc in PostmasterMain () #10 0x00000000005b5290 in main () Detaching from program: /usr/pgsql-9.2/bin/postgres, process 38604 Undefined command: "exit". Try "help". Continuing. Program received signal SIGINT, Interrupt. 0x000000000070e7ff in DropCachedPlan () #0 0x000000000070e7ff in DropCachedPlan () #1 0x000000000054e3a9 in DropAllPreparedStatements () #2 0x00000000005365f3 in DiscardCommand () #3 0x00000000006582c7 in ?? () #4 0x00000000006592bd in ?? () #5 0x0000000000659a42 in PortalRun () #6 0x000000000065603d in ?? () #7 0x0000000000656ed0 in PostgresMain () #8 0x0000000000613b91 in ?? () #9 0x00000000006167fc in PostmasterMain () #10 0x00000000005b5290 in main () Continuing. Program received signal SIGINT, Interrupt. 0x000000000070e7ff in DropCachedPlan () #0 0x000000000070e7ff in DropCachedPlan () #1 0x000000000054e3a9 in DropAllPreparedStatements () #2 0x00000000005365f3 in DiscardCommand () #3 0x00000000006582c7 in ?? () #4 0x00000000006592bd in ?? () #5 0x0000000000659a42 in PortalRun () #6 0x000000000065603d in ?? () #7 0x0000000000656ed0 in PostgresMain () #8 0x0000000000613b91 in ?? () #9 0x00000000006167fc in PostmasterMain () #10 0x00000000005b5290 in main () Continuing. Program received signal SIGINT, Interrupt. 0x000000000073bc7c in MemoryContextSetParent () #0 0x000000000073bc7c in MemoryContextSetParent () #1 0x000000000073bde3 in MemoryContextDelete () #2 0x000000000070e7df in DropCachedPlan () #3 0x000000000054e3a9 in DropAllPreparedStatements () #4 0x00000000005365f3 in DiscardCommand () #5 0x00000000006582c7 in ?? () #6 0x00000000006592bd in ?? () #7 0x0000000000659a42 in PortalRun () #8 0x000000000065603d in ?? () #9 0x0000000000656ed0 in PostgresMain () #10 0x0000000000613b91 in ?? () #11 0x00000000006167fc in PostmasterMain () #12 0x00000000005b5290 in main () Continuing. Does a DISCARD command take alot of time, or is it like TRUNCATE? The backend does have a very large memory footprint (12GB). Regards, Ken
"ktm@rice.edu" <ktm@rice.edu> writes: > The stack trace just appeared to be what I would expect while a 'DISCARD ALL' > command was being run: > #0 0x000000000073bc7c in MemoryContextSetParent () > #1 0x000000000073bde3 in MemoryContextDelete () > #2 0x000000000054e3a9 in DropAllPreparedStatements () > #3 0x00000000005365f3 in DiscardCommand () Hmm, what it seems from these traces is that you've got a whole heck of a lot of prepared statements. > The backend does have a very large memory footprint (12GB). Um. The most likely explanation is that you are hitting O(N^2) behavior as a consequence of MemoryContextSetParent being O(N) in the number of sibling contexts of the context to be deleted. We fixed that for 9.6 (commit 25c539233044c235e97fd7c9dc600fb5f08fe065) but there's no easy solution in older branches, short of not using so many prepared statements. I'm a bit surprised that you could have gotten up to 12GB worth of prepared statements in an application that sends DISCARD ALL periodically. regards, tom lane
On Thu, May 19, 2016 at 05:52:26PM -0400, Tom Lane wrote: > "ktm@rice.edu" <ktm@rice.edu> writes: > > The stack trace just appeared to be what I would expect while a 'DISCARD ALL' > > command was being run: > > > #0 0x000000000073bc7c in MemoryContextSetParent () > > #1 0x000000000073bde3 in MemoryContextDelete () > > #2 0x000000000054e3a9 in DropAllPreparedStatements () > > #3 0x00000000005365f3 in DiscardCommand () > > Hmm, what it seems from these traces is that you've got a whole heck of > a lot of prepared statements. > > > The backend does have a very large memory footprint (12GB). > > Um. > > The most likely explanation is that you are hitting O(N^2) behavior as > a consequence of MemoryContextSetParent being O(N) in the number of > sibling contexts of the context to be deleted. We fixed that for 9.6 > (commit 25c539233044c235e97fd7c9dc600fb5f08fe065) but there's no easy > solution in older branches, short of not using so many prepared > statements. I'm a bit surprised that you could have gotten up to 12GB > worth of prepared statements in an application that sends DISCARD ALL > periodically. > > regards, tom lane > Hi, The DISCARD ALL is only sent by pgbouncer at the end of the processing. The actual process builds up a cache to be used later whose size is proportional to the number of items. The initial run is large, but the regular runs are much smaller and cleanup quickly. I was more concerned with incorrect behavior leading to DB corruption. Thank you for your suggestions and assistance. Regards, Ken