Thread: diagnosing a db crash - server exit code 2

diagnosing a db crash - server exit code 2

From: "Burgholzer, Robert (DEQ)"
Date: 23 September 2011, 19:12:12

I am trying to get my head around why I keep getting crashes of my PG 8.3.7 database on CentOS - Linux version 2.6.18-164.el5. There are 3 slightly different (I think) circumstances leading to a crash of the database, all related in some way to long-running PHP scripts with intensive activity on PG connections.
#1 - execution of an R call via PL/R (sometimes this may crash it all by itself)
#2 - execution of a PostGIS query (possible)
#3 - random occurrences, all related to the same long-running PHP scripts

I have read that perhaps hardware and/or system settings may cause this. I believe the system is running i9 processors, which may be set into some sort of virtual multi-threading mode. Thanks for any insight you all can give in tracking this down.

r.b.

Dump transcript from postgresql log file:

LOG: server process (PID 5978) exited with exit code 2
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
<snip>message repeats 14 times </snip>
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2011-09-23 14:15:04 EDT
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 4AD/CFE63B98
LOG: record with zero length at 4AD/D0063FD8
LOG: redo done at 4AD/D0063FA8
LOG: last completed transaction was at log time 2011-09-23 14:19:45.972521-04
LOG: autovacuum launcher started
LOG: database system is ready to accept connections

Re: diagnosing a db crash - server exit code 2

From: Joe Conway
Date: 23 September 2011, 19:23:27

On 09/23/2011 12:02 PM, Burgholzer, Robert (DEQ) wrote:
> I am trying to get my head around why I keep getting crashes to my PG
> 8.3.7 database on CentOS - Linux version 2.6.18-164.el5.  There are 3
> slightly different (I think) circumstances leading to a crash of the
> database, all related in some way to long running PHP scripts with
> intensive activity on PG connections.
> #1 - execution of an R call via pLR (sometimes this may crash it all by
> itself)
> #2 - execution of a postGIS query (possible)
> #3 - random occurrences, all related to the same long running PHP scripts
>
> I have read that perhaps hardware and/or system settings may cause
> this.  I believe the system is running i9 processors, that may be set
> into some sort of virtual multi-threading mode.  Thanks for any insight
> you all can give in tracking this down.

Are you maybe getting bitten by the OOM killer?

http://www.postgresql.org/docs/8.3/interactive/kernel-resources.html#AEN22246

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support

Re: diagnosing a db crash - server exit code 2

From: Scott Marlowe
Date: 23 September 2011, 19:34:24

On Fri, Sep 23, 2011 at 1:23 PM, Joe Conway <mail@joeconway.com> wrote:
> On 09/23/2011 12:02 PM, Burgholzer, Robert (DEQ) wrote:
>> I am trying to get my head around why I keep getting crashes to my PG
>> 8.3.7 database on CentOS - Linux version 2.6.18-164.el5.  There are 3
>> slightly different (I think) circumstances leading to a crash of the
>> database, all related in some way to long running PHP scripts with
>> intensive activity on PG connections.
>> #1 - execution of an R call via pLR (sometimes this may crash it all by
>> itself)
>> #2 - execution of a postGIS query (possible)
>> #3 - random occurrences, all related to the same long running PHP scripts
>>
>> I have read that perhaps hardware and/or system settings may cause
>> this.  I believe the system is running i9 processors, that may be set
>> into some sort of virtual multi-threading mode.  Thanks for any insight
>> you all can give in tracking this down.
>
> Are you maybe getting bitten by the OOM killer?
>
> http://www.postgresql.org/docs/8.3/interactive/kernel-resources.html#AEN22246

If OP needs a duct tape fix, just create a giant swap file and add it
to swap.  Unless there are runaway recursive things happening, then
he's gotta fix those.

Re: diagnosing a db crash - server exit code 2

From: "Burgholzer, Robert (DEQ)"
Date: 23 September 2011, 19:38:27

Joe,
Thanks - I will try to check into this. However, we have done some tuning on the memory over the last 2 years and gotten it such that it seldom if ever has to dip into its swap too substantially -- according to "top", we remain under 1% swap usage most of the time. Generally speaking, I run 3-5 of these large PHP processes simultaneously with no issue; however, when I issue even a "median" function call, after several calls (no consistent pattern that I can discern), the backend crashes.

If this were the OOM killer, how would I diagnose it?

regards,
r.b.


Re: diagnosing a db crash - server exit code 2

From: Scott Marlowe
Date: 23 September 2011, 19:43:03

On Fri, Sep 23, 2011 at 1:38 PM, Burgholzer, Robert (DEQ)
<Robert.Burgholzer@deq.virginia.gov> wrote:
> Joe,
> Thanks - I will try to check into this - however, we have done some tuning
> on the memory over the last 2 years and gotten it such that it is seldom if
> ever having to dip into its swap too substantially -- according to "top",
> we remain under 1% swap usage during most times.   Generally speaking, I run
> 3-5 of these large PHP processes simultaneously with no issue, however, when
> I issue even a "median" function call, after several calls (no consistent
> pattern that I can discern), the backend crashes.
>
> If this were the OOM killer - any way I would diagnose it?

If it's the OOM killer it'll be in your /var/log/messages log.
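
A minimal sketch of that check, assuming a syslog-style file at /var/log/messages and kernel lines that contain phrases such as "Out of memory" or "oom-killer" (the exact wording varies by kernel version). The helper name find_oom_events is hypothetical, not something from this thread.

```shell
# Hypothetical helper: print lines in a syslog file that look like
# kernel OOM-killer activity. The patterns are assumptions; adjust
# them to whatever your kernel actually logs.
find_oom_events() {
    logfile=${1:-/var/log/messages}
    grep -iE 'out of memory|oom-killer' "$logfile"
}

# Usage (reading /var/log/messages usually requires root):
#   find_oom_events /var/log/messages
```

If this prints nothing around the time of a crash, the OOM killer becomes a less likely suspect, though rotated logs (for example /var/log/messages.1) may need checking too.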

Re: diagnosing a db crash - server exit code 2

From: Joe Conway
Date: 23 September 2011, 20:31:22

On 09/23/2011 12:38 PM, Burgholzer, Robert (DEQ) wrote:
> Thanks - I will try to check into this - however, we have done some
> tuning on the memory over the last 2 years and gotten it such that it is
> seldom if ever having to dip into its swap too substantially --
> according to "top", we remain under 1% swap usage during most times.
> Generally speaking, I run 3-5 of these large PHP processes
> simultaneously with no issue, however, when I issue even a "median"
> function call, after several calls (no consistent pattern that I can
> discern), the backend crashes.

Is this always related to use of PL/R or is it sometimes happening
independently?

If you believe it is PL/R specific, please give us R version and PL/R
version. Also, any chance we can get a debugger on a core file? Can you
reproduce this on a development machine?

Thanks,

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support

Re: diagnosing a db crash - server exit code 2

From: "Burgholzer, Robert (DEQ)"
Date: 23 September 2011, 20:46:04

Scott - thanks. In poring over the logs, I have not found anything certain, but have turned up a ton of messages about my sysadmin's SELinux security and php/pg (don't know if they're my app or script kiddies...). I will keep looking for things pertaining to these crashes. The message relating to php and pgsql:
Sep 20 11:37:32 deq1 setroubleshoot: SELinux is preventing php (httpd_sys_script_t) "setopt" to <Unknown> (httpd_sys_script_t). For complete SELinux messages. run sealert -l 199268c4-b84d-4a33-a073-29bd4461f875

Joe - it appears that it ALWAYS involves PL/R - even a simple median call has caused it, though I must say it is calculating the median of somewhere around 10,000-20,000 pieces of data, if that makes any difference. I would be delighted to run any kind of debugging necessary and share the info. I have an identical system that can reproduce the errors (I am pretty certain that they HAVE occurred there previously). What I DON'T have is any knowledge of the stack-trace/debugger things, but I'm willing to learn, and I have a sysadmin who may be able to lend a hand.

Thanks a bunch gents - I have been nibbling around the edges of this problem for quite some time and I am ready to take a bite,
r.b.


Re: diagnosing a db crash - server exit code 2

From: Scott Marlowe
Date: 23 September 2011, 20:51:47

On Fri, Sep 23, 2011 at 2:45 PM, Burgholzer, Robert (DEQ)
<Robert.Burgholzer@deq.virginia.gov> wrote:
> Scott - thanks, in poring over the logs, I have not found anything certain,
> but have turned up a ton of messages about my sysadmin's SELinux security and
> php/pg (don't know if they're my app or script kiddies...).  I will keep
> looking for things pertaining to these crashes.  The message relative to php
> and pgsql:
> Sep 20 11:37:32 deq1 setroubleshoot: SELinux is preventing php
> (httpd_sys_script_t) "setopt" to <Unknown> (httpd_sys_script_t). For
> complete SELinux messages. run sealert -l
> 199268c4-b84d-4a33-a073-29bd4461f875

Take a peek here:
http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

If you can get a trace I'm sure Joe can figure out what's making it
croak.  Not that I speak for Joe, I just have faith in him. :)

Re: diagnosing a db crash - server exit code 2

From: "Burgholzer, Robert (DEQ)"
Date: 23 September 2011, 20:55:54

Thanks Scott - I'm on the MFer. That is, I WILL be on it next Monday when I have time to sit down, gin up a test case and break this sucker. :)

Stay tuned, and thanks bunches,

r.b.


Re: diagnosing a db crash - server exit code 2

From: Joe Conway
Date: 23 September 2011, 21:03:50

On 09/23/2011 01:45 PM, Burgholzer, Robert (DEQ) wrote:
> Joe - it appears that it ALWAYS involves pLR - even a simple median call
> has caused it, though I must say it is something that is calculating the
> median of somewhere around 10-20,000 pieces of data if that makes any
> difference.  I would be delighted to run any kind of debugging necessary
> and share the info.  I have an identical system that can reproduce the
> errors (I am pretty certain that they HAVE previously).  What I DON'T
> have is any knowledge of the stack-trace/debugger things, but I'm
> willing to learn, and I have a sysadmin who may be able to lend a hand.

There is some good information about using gdb with postgres here:

http://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

If you need a hand, I would be happy to help you through the debugging
via phone or even log in remotely if you can allow it. Just contact me
off-list if you want to pursue that.

Note that I made a new PL/R release just a few weeks ago which fixed
several known crash-bugs. In particular these two pop out at me:

- Fix missing calls to UNPROTECT.
- Don't try to free an array element value when the
  array element is NULL

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support

Re: diagnosing a db crash - server exit code 2

From: Tom Lane
Date: 25 September 2011, 23:54:29

"Burgholzer, Robert (DEQ)" <Robert.Burgholzer@deq.virginia.gov> writes:
> I am trying to get my head around why I keep getting crashes to my PG
> 8.3.7 database on CentOS - Linux version 2.6.18-164.el5.
> LOG:  server process (PID 5978) exited with exit code 2

Just like it says, this implies that a backend process exited with
"exit(2)".  A quick grep through the 8.3 sources says that the only such
call in the backend code is in the SIGQUIT signal handler.  Now it's
possible that what you're seeing is an exit(2) somewhere in the R
library rather than in Postgres code.  But I think it's more likely that
some external source is SIGQUIT'ing the backend.  Joe suggested an OOM
kill, but I've never heard of the OOM code using anything but SIGKILL
(signal 9).  Have you got any other software that runs as root and
thinks it's licensed to kill processes in the name of
something-or-other?

Anyway, I concur with the plan of getting a stack trace to determine
where the exit call is, so that we can positively eliminate (or not)
the R library.

            regards, tom lane
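
The distinction between a process that calls exit(2) and one that is killed by a signal can be seen from any parent process. The sketch below is only an illustration in plain shell, not PostgreSQL code; the helper name describe_exit is hypothetical. It relies on the shell convention that a child killed by signal N is reported with status 128 + N.

```shell
# Hypothetical helper: run a command string in a child shell and
# report how the child ended, as seen by its parent.
describe_exit() {
    rc=0
    sh -c "$1" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -gt 128 ]; then
        # By shell convention, 128 + N means "killed by signal N".
        echo "killed by signal $((rc - 128))"
    else
        echo "exited with exit code $rc"
    fi
}

describe_exit 'exit 2'        # child calls exit(2) on its own
describe_exit 'kill -9 $$'    # child is killed by SIGKILL, as an OOM kill would do
```

The first call reports an exit code of 2 and the second reports signal 9. A log line that says "exited with exit code 2" therefore points at an explicit exit call rather than at a SIGKILL, which is the reasoning in the message above.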

Re: diagnosing a db crash - server exit code 2

From: "Burgholzer, Robert (DEQ)"
Date: 26 September 2011, 12:36:31

Thanks to everyone, Tom, Joe, Scott, I will be in touch today as I move through this.

Joe - if I need to have you log in for assistance, I am more than happy to make that happen.

Regards,
r.b.


Re: diagnosing a db crash - server exit code 2

From: Robert Burgholzer
Date: 29 September 2011, 00:38:21

Just a quick check-in on this problem. Thus far, I have managed to install dbg and recompile postgresql with the appropriate debugging headers/variables.

I have been following the wiki page that Scott sent, and attempted to trace one of my pg processes while making it crash. I have "succeeded" in causing the crash on my dev server, which suggests at least that it is not due to some spurious piece of faulty hardware on my primary. However, I had failed to turn on logging in the tracing session, so no log file resulted. Also, the ssh session that was monitoring the process died partway through due to a local network routing glitch.

If anyone has any suggestions as to how to run the trace via a nohup command or something, that would be cool, since then I could let it run in the background. I can reproduce the crash, but it is somewhat episodic; it seems that I can run the same query several times before things blow up.

So, in short, I am quite confident that I can get this finished shortly, but I am very short on time to devote to it for the next couple of days.

Thanks again for the help, and sorry that I am drawing this process out,

r.b.


--
--
Robert W. Burgholzer
http://www.findingfreestyle.com/
On Facebook - http://www.facebook.com/pages/Finding-Freestyle/151918511505970
Twitter - http://www.twitter.com/findfreestyle
What's a tweeted swim set? A Sweet? No, a #swaiku! Get them by following http://twitter.com/findfreestyle

Re: diagnosing a db crash - server exit code 2

From: bricklen
Date: 29 September 2011, 16:08:58

On Wed, Sep 28, 2011 at 12:54 PM, Robert Burgholzer <rburghol@vt.edu> wrote:
> If anyone has any suggestions as to how to run the trace via a nohup command
> or something, that would be cool, since then I could let it run in the
> background.

If you have "screen" installed, maybe try it in a screen session.

Re: diagnosing a db crash - server exit code 2

From: Armin Resch
Date: 29 September 2011, 17:00:45

In a Bourne-type shell you could use:

$ nohup [your_cmd] >[your_log_file] 2>&1 &

Then you should be able to safely disconnect your terminal.

Cheers,

-ar
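
Applied to the gdb trace discussed in this thread, the nohup pattern might look like the sketch below. It is an assumption-laden illustration, not a tested recipe: the helper name trace_backend and the GDB override variable are hypothetical, the gdb options used are -p (attach to a PID), -batch, and -ex (run a command), and the choice to ignore SIGUSR1 and break on exit is a guess at what is useful for an "exit code 2" crash.

```shell
# Hypothetical helper: attach gdb to one backend in the background and
# log a backtrace if that backend calls exit() or stops on a signal.
# GDB may be set to point at a different gdb binary (an assumption,
# not something from this thread).
trace_backend() {
    pid=$1
    logfile=${2:-debuglog.txt}
    nohup "${GDB:-gdb}" -p "$pid" -batch \
        -ex 'handle SIGUSR1 nostop noprint pass' \
        -ex 'break exit' \
        -ex 'continue' \
        -ex 'bt' \
        >"$logfile" 2>&1 &
}

# Usage (run as root or as the postgres user):
#   trace_backend <backend_pid> /tmp/debuglog.txt
```

Because the job is started under nohup with its output redirected, a dropped ssh session should not take the trace down with it; the backtrace, if any, ends up in the log file.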


Re: diagnosing a db crash - server exit code 2

From: Joe Conway
Date: 29 September 2011, 17:30:09

On 09/28/2011 12:54 PM, Robert Burgholzer wrote:
> Just a quick checkin on this problem.  Thus far, I have managed to
> install dbg and recompile postgresql with the appropriate debugging
> headers/variables.

I might have missed it, but are you already running the latest version
of PL/R (8.3.0.13)?

Joe


--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support

Re: diagnosing a db crash - server exit code 2

From: "Burgholzer, Robert (DEQ)"
Date: 03 October 2011, 16:54:54

OK,
So I have upgraded to the latest PL/R, restarted my PG backend, and
reloaded my PG functions that call R via PL/R.  I am still able to make
the process crash by running my R functions.  When trying to obtain the
stack trace, however, I notice that the process that is CAUSING the
crash does not appear to be the one SUFFERING it, as per the message
below, which appeared while executing the offending SQL/PL/R.  I had my
PHP process running in the background, making no R calls but generating
heavy traffic on the PG server.  I opened a psql console and executed
the troublesome command (a large quantile call), then ran a trace on the
psql console session, and got the following at the PG command line while
waiting for the query to finish:

<psql>
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
</psql>

This is backed up by the debuglog.txt giving me no feedback:

<debuglog.txt>
Continuing.
Program exited normally.
</debuglog.txt>

Is it even possible that something like this is happening -- i.e.,
calling R from a different process kills another backend? How do I
figure out which process to watch?

Thanks,
r.b.

Robert W. Burgholzer
Surface Water Modeler
Office of Water Supply and Planning
Virginia Department of Environmental Quality
rwburgholzer@deq.virginia.gov
804-698-4405
Open Source Modeling Tools:
http://sourceforge.net/projects/npsource/


Re: diagnosing a db crash - server exit code 2

From: Robert Burgholzer
Date: 03 October 2011, 17:11:09

FWIW - I am currently trying this while tracing the process that I assume is the postmaster (/usr/bin/postgres -D /home/postgres/data), since its PID indicates that it was recently restarted, although the other PG processes (writer, wal writer, autovacuum, stats collector) all still have their older PIDs, indicating that they survived.

regards,

r.b.


Re: diagnosing a db crash - server exit code 2

From: Joe Conway
Date: 03 October 2011, 17:28:56

On 10/03/2011 10:10 AM, Robert Burgholzer wrote:
> FWIW - I am currently trying this while tracing the process that I
> assume is the postmaster (/usr/bin/postgres -D /home/postgres/data),
> since this process number indicates that it was recently restarted -
> although the other PG processes, writer, wal writer, autovacuum, stats
> collector all have their older pids indicating that they still survive.

Sounds like you are attaching to the wrong process. Try something like
the below...

Joe



Session #1: (connect to db and load PL/R)
-----------------
# psql contrib_regression
psql (9.2devel)
Type "help" for help.

contrib_regression=# load '$libdir/plr';
LOAD

Session #2: (use ps to find backend and attach)
-----------------
# ps -ef |grep postgres
postgres 17001     1  0 Sep24 ?        00:00:16 /usr/local/pgsql-head/bin/postgres -D /usr/local/pgsql-head/data -p 55437 -i
postgres 17006 17001  0 Sep24 ?        00:02:18 postgres: writer process
postgres 17007 17001  0 Sep24 ?        00:01:50 postgres: wal writer process
postgres 17008 17001  0 Sep24 ?        00:00:37 postgres: autovacuum launcher process
postgres 17009 17001  0 Sep24 ?        00:00:47 postgres: stats collector process
postgres 26631 17001  0 10:22 ?        00:00:00 postgres: postgres contrib_regression [local] idle

# gdb /usr/local/pgsql-head/bin/postgres 26631
(gdb) continue
Continuing.


Session #1: (run crashing function)
-----------------
run your PL/R function that causes the crash


--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support

Re: diagnosing a db crash - server exit code 2

From: Robert Burgholzer
Date: 03 October 2011, 17:34:30

Thanks Joe - yeah, I am now tracing the postmaster -- will post up shortly.

r.b.


Re: diagnosing a db crash - server exit code 2

From: Tom Lane
Date: 03 October 2011, 23:11:03

Joe Conway <mail@joeconway.com> writes:
> On 10/03/2011 10:10 AM, Robert Burgholzer wrote:
>> FWIW - I am currently trying this while tracing the process that I
>> assume is the postmaster (/usr/bin/postgres -D /home/postgres/data),
>> since this process number indicates that it was recently restarted -
>> although the other PG processes, writer, wal writer, autovacuum, stats
>> collector all have their older pids indicating that they still survive.

> Sounds like you are attaching to the wrong process. Try something like
> the below...

No need to guess about it ... use "select pg_backend_pid();" and then
attach to that process.  (BTW, the "postmaster" is the parent process.
The one you want to debug is a backend.)

            regards, tom lane
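
One way to apply this is to have the session report its own PID and then pause, which leaves time to attach the debugger before the crashing statement runs. The following is a minimal sketch under stated assumptions: debug_session_sql and my_plr_function are hypothetical names, and contrib_regression is simply the database used in the example earlier in the thread.

```shell
#!/bin/sh
# debug_session_sql SECONDS STATEMENT (hypothetical helper)
# Print a psql script that reports the backend PID, sleeps so a debugger
# can be attached to that PID, and then runs the statement under test.
debug_session_sql() {
    printf 'select pg_backend_pid();\n'
    printf 'select pg_sleep(%d);\n' "$1"
    printf '%s\n' "$2"
}

# Usage (needs a running server, so it is left commented out):
#   Terminal 1:
#     debug_session_sql 30 'select my_plr_function();' \
#       | psql -X -A -t contrib_regression
#   Terminal 2, as the postgres OS user, using the PID printed in terminal 1:
#     gdb -p <pid>
#     (gdb) continue
```

Because all three statements run in the same psql session, the PID printed first is the PID of the backend that later executes the statement.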

Re: diagnosing a db crash - server exit code 2

From

Robert Burgholzer

Date:

October 4, 2011, 19:18:18

Thanks, Tom, that's much better than guessing.

r.b.


Re: diagnosing a db crash - server exit code 2

From

Robert Burgholzer

Date:

October 4, 2011, 19:29:30

OK,
So, now I am starting to think that I did something wrong when restarting the PG server, because after the most recent crash on my dev server, it is now seemingly uncrashable. That makes me think it is FIXED by the upgrade to 0.13, which is awesome. I was pretty certain that I had restarted the PG process, but of course it is possible that I missed it.

FWIW, I used the following command:
pg_ctl -D /var/lib/pgsql/data restart -l logfile

I have run the test query 1,000 times in a loop on my dev server without a crash now, and it managed to crash the live server in a test after about 20 invocations earlier today.

So, thanks to everyone. I will chalk this up as a great learning experience: I have managed to get my stack traces and so forth, which is awesome. Your patience and willingness to help have been awesome.

r.b.
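
A repeated test like the one described here can be scripted. This is a minimal sketch, not the loop actually used in the thread: repeat_until_failure is a hypothetical helper, and my_plr_function and the database name in the usage comment are placeholders.

```shell
#!/bin/sh
# repeat_until_failure COUNT COMMAND [ARGS...] (hypothetical helper)
# Run COMMAND up to COUNT times and stop at the first non-zero exit status.
repeat_until_failure() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        if ! "$@" >/dev/null 2>&1; then
            echo "failed on iteration $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $count iterations without failure"
}

# Usage (needs a running server, so it is left commented out):
#   repeat_until_failure 1000 \
#     psql -X -A -t -c 'select my_plr_function();' contrib_regression
```

Each iteration opens a fresh connection, so this exercises a new backend every time. A crash that only appears after many calls within one long-lived session would need the loop to run inside a single psql session instead.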


Re: diagnosing a db crash - server exit code 2

From

Joe Conway

Date:

October 4, 2011, 19:41:47

On 10/04/2011 12:29 PM, Robert Burgholzer wrote:
> So, now I am starting to think that I did something erroneous when
> restarting the PG server, because after the most recent crash on my dev
> server, it is now seemingly uncrashable.  Which makes me think it is
> FIXED by the upgrade to 0.13 - which is awesome.  Though I was pretty
> certain that I had restarted the PG process - but it is possible that I
> missed it of course.

Are you pre-loading plr?

> FWIW, I used the following command:
>    pg_ctl -D /var/lib/pgsql/data restart -l logfile
>
> I have run the test query 1,000 times in a loop on my dev server without
> a crash now, and it managed to crash the live server in a test after
> about 20 invocations earlier today.
>
> So, thanks to everyone, I will chalk this up as a great learning
> experience, I have managed to get my stack traces and so forth, so that
> is awesome.  Your patience and willingness to help has been awesome.

Sounds good -- let us know if the problem resurfaces.

Joe

