Thread: strange hung processes

strange hung processes

From
Jeremy Ashcraft
Date:
We're running 7.4.9 and have run into something strange.

A group of processes seems to have hung and cannot be killed.  netstat
showed that there were no active TCP connections at the time.  Sending
SIGTERM to the parent process caused PG to begin its shutdown, but it
never finished.  We then ran kill -9 on the postmaster, which caused it
to die, but these child procs remain, even after a clean restart.  Our
next step is to reboot the box, but before we do, I'm curious whether
anyone else has seen this, why it happened, and whether anyone knows
ways to prevent it from happening again.

thanks

--
jeremy ashcraft
operations/development
EDucation GATEways
jashcraft@edgate.com


Re: strange hung processes

From
Jeremy Ashcraft
Date:
I forgot to include the process list:

postgres 13214  5142  0 Jan09 ?        00:00:00 postgres: postgre edgate 10.1.1.3 authentication
postgres 13215  5142  0 Jan09 ?        00:00:00 postgres: postgre edgate 10.1.1.1 authentication
postgres 13216  5142  0 Jan09 ?        00:00:00 postgres: postgre sn_master 10.1.1.3 authentication
postgres 13217  5142  0 Jan09 ?        00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13218  5142  0 Jan09 ?        00:00:00 postgres: postgre edgate 10.1.1.1 authentication
postgres 13219  5142  0 Jan09 ?        00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13220  5142  0 Jan09 ?        00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13221  5142  0 Jan09 ?        00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13222  5142  0 Jan09 ?        00:00:00 postgres: postgre sn_master 10.1.1.3 authentication
postgres 13223  5142  0 Jan09 ?        00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13224  5142  0 Jan09 ?        00:00:00 postgres: postgre sn_master 10.1.1.3 authentication
postgres 13225  5142  0 Jan09 ?        00:00:00 postgres: postgre sn_master 10.1.1.3 authentication
postgres 13226  5142  0 Jan09 ?        00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13227  5142  0 Jan09 ?        00:00:00 postgres: snuser sn_master 10.1.1.3 authentication
postgres 13228  5142  0 Jan09 ?        00:00:00 postgres: postgre sn_master 10.1.1.3 authentication
postgres 13229  5142  0 Jan09 ?        00:00:00 postgres: postgre sn_master 10.1.1.3 authentication


--
jeremy ashcraft
operations/development
EDucation GATEways
jashcraft@edgate.com


Re: strange hung processes

From
Ben Kim
Date:
>We're running 7.4.9 and have run into something strange.
>A group of processes seems to have hung and cannot be killed.  netstat
>showed that there were no active TCP connections at the time.  Sending
>SIGTERM to the parent process caused PG to begin its shutdown, but it
>never finished.  We then ran kill -9 on the postmaster, which caused it
>to die, but these child procs remain, even after a clean restart.  Our
>next step is to reboot the box, but before we do, I'm curious whether
>anyone else has seen this, why it happened, and whether anyone knows
>ways to prevent it from happening again.

If it's not sensitive information, what does this show?

    lsof | grep 'pid of hung process'

Regards,

Ben Kim
Developer
http://benix.tamu.edu


Re: strange hung processes

From
"Larry Rosenman"
Date:
Ben Kim wrote:
>> We're running 7.4.9 and have run into something strange.
>> A group of processes seems to have hung and cannot be killed.  netstat
>> showed that there were no active TCP connections at the time.
>> Sending SIGTERM to the parent process caused PG to begin its
>> shutdown, but it never finished.  We then ran kill -9 on the
>> postmaster, which caused it to die, but these child procs remain,
>> even after a clean restart.  Our next step is to reboot the box, but
>> before we do, I'm curious whether anyone else has seen this, why it
>> happened, and whether anyone knows ways to prevent it from happening
>> again.
>
> If it's not sensitive information, what does this show?
>
>     lsof | grep 'pid of hung process'
>
from the save-a-process committee:

lsof -p 'pid of hung process'

will give the same info :)
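
For anyone who would rather script this than run lsof by hand, here is a minimal sketch in Python. It assumes a Linux box where /proc/<pid>/fd holds one symlink per open descriptor, and it only covers numbered descriptors, not the cwd/txt/mem rows lsof also prints. The function name list_open_fds and the proc_root parameter are made up for illustration.

```python
import os


def list_open_fds(pid, proc_root="/proc"):
    """Map each open file descriptor number of `pid` to its symlink target.

    Assumes the Linux /proc layout; proc_root is a hypothetical parameter
    that exists only so the function can be pointed at a test directory.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = {}
    for name in os.listdir(fd_dir):
        if not name.isdigit():
            continue  # skip anything that is not a descriptor number
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were looking
    return targets
```

Calling `list_open_fds(13263)` as the postgres user or root should return something like a dict of descriptor numbers to paths, with sockets showing up as `socket:[inode]` targets.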

LER
--
Larry Rosenman
Database Support Engineer

PERVASIVE SOFTWARE. INC.
12365B RIATA TRACE PKWY
3015
AUSTIN TX  78727-6531

Tel: 512.231.6173
Fax: 512.231.6597
Email: Larry.Rosenman@pervasive.com
Web: www.pervasive.com

Re: strange hung processes

From
Jeremy Ashcraft
Date:
> If it's not sensitive information, what does this show?
>
>     lsof | grep 'pid of hung process'
>
From my systems guy:

in /proc

codd 13262 # cat status     {proc "filesystem"}
Name:   postmaster
State:  D (disk sleep)

Because of the "D" state, it can't be killed as it is not interruptible
(waiting on I/O?).
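
The same check can be scripted. Below is a small sketch in Python that pulls the state letter out of the text of /proc/<pid>/status, assuming the "State:" line format shown above. The function names are invented for this example. A process in state D does not act on signals, kill -9 included, until the kernel call it is blocked in returns, which fits what we saw.

```python
def parse_proc_state(status_text):
    """Return the one-letter state code from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            fields = line.split()
            # fields looks like ["State:", "D", "(disk", "sleep)"]
            return fields[1] if len(fields) > 1 else None
    return None


def is_uninterruptible(status_text):
    """True if the status text reports state D (uninterruptible sleep)."""
    return parse_proc_state(status_text) == "D"
```

Feeding it the two lines quoted above, `is_uninterruptible("Name:\tpostmaster\nState:\tD (disk sleep)\n")` is true.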


codd opt # lsof -p 13263
COMMAND     PID     USER   FD   TYPE DEVICE    SIZE     NODE NAME
postmaste 13263 postgres  cwd    DIR   8,17     464     4030 /app1/user/postgres
postmaste 13263 postgres  rtd    DIR    8,3     488        2 /
postmaste 13263 postgres  txt    REG   8,17 2552859     5012 /app1/pg/7.4.9/bin/postgres
postmaste 13263 postgres  mem    REG    0,0                0 [heap] (stat: No such file or directory)
postmaste 13263 postgres  DEL    REG    0,7                0 /SYSV0052e2c1
postmaste 13263 postgres  mem    REG    8,3   35428     5729 /lib/libnss_files-2.3.4.so
postmaste 13263 postgres  mem    REG    8,5   18852     6895 /usr/lib/libgpm.so.1.19.0
postmaste 13263 postgres  mem    REG    8,3  266284     5965 /lib/libncurses.so.5.4
postmaste 13263 postgres  mem    REG    8,3 1167808     5977 /lib/libc-2.3.4.so
postmaste 13263 postgres  mem    REG    8,3  151896     5755 /lib/libm-2.3.4.so
postmaste 13263 postgres  mem    REG    8,3   10620     5987 /lib/libdl-2.3.4.so
postmaste 13263 postgres  mem    REG    8,3   75848     6003 /lib/libnsl-2.3.4.so
postmaste 13263 postgres  mem    REG    8,3   60884     5998 /lib/libresolv-2.3.4.so
postmaste 13263 postgres  mem    REG    8,3   18424     6006 /lib/libcrypt-2.3.4.so
postmaste 13263 postgres  mem    REG    8,3  173996     6007 /lib/libreadline.so.4.3
postmaste 13263 postgres  mem    REG    8,3   63204     5976 /lib/libz.so.1.2.2
postmaste 13263 postgres  mem    REG    8,3   95392     5915 /lib/ld-2.3.4.so
postmaste 13263 postgres    0r   CHR    1,3             2767 /dev/null
postmaste 13263 postgres    1w   REG    8,7 7513385     3649 /var/log/pg/ruby.log.bak1
postmaste 13263 postgres    2w   REG    8,7 7513385     3649 /var/log/pg/ruby.log.bak1
postmaste 13263 postgres    5u  sock    0,4         11790654 can't identify protocol


Re: strange hung processes

From
Tom Lane
Date:
Jeremy Ashcraft <jashcraft@edgate.com> writes:
> in /proc

> codd 13262 # cat status     {proc "filesystem"}
> Name:   postmaster
> State:  D (disk sleep)

> Because of the "D" state, it can't be killed as it is not interruptible
> (waiting on I/O?).

If the process is stuck in D state then it's not Postgres' fault.
You're looking at a hardware problem, or if the database is mounted
via NFS then it might be an NFS-protocol-level problem.  In any case
you need to call out the kernel and hardware troops, not us database
weenies ...

            regards, tom lane
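
When handing a case like this to the kernel or hardware people, it may help to list every process stuck in D state along with where it is sleeping. The following is a rough sketch in Python, not a tested tool. It assumes a Linux /proc with per-process status files, and treats /proc/<pid>/wchan as optional since it may be absent, unreadable, or uninformative depending on kernel and permissions. The function name and the proc_root parameter are hypothetical.

```python
import os


def find_uninterruptible(proc_root="/proc"):
    """Return sorted (pid, name, wchan) tuples for processes in state D.

    proc_root is a hypothetical parameter so the scan can be pointed at a
    test directory; wchan is "" when the file cannot be read.
    """
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "status")) as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:
            continue  # process exited or status is not readable
        state = fields.get("State", "").split()
        if not state or state[0] != "D":
            continue
        try:
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = ""
        stuck.append((int(entry), fields.get("Name", "").strip(), wchan))
    return sorted(stuck)
```

Running `find_uninterruptible()` a few times over a minute or so separates processes that are genuinely stuck from ones that only pass through D state briefly during normal disk I/O.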