Discussion: Unkillable Backend Processes
I've encountered an oddity on a postgres cluster that results in an unresponsive postmaster and, frequently, unkillable backend processes.

I'm having a difficult time isolating the queries that are related to this scenario because by the time the scenario occurs, max_connections have been reached, and no superuser connections are available. Because the query doesn't finish, I don't think it's getting logged (since logging is only done at the query level on a duration or error basis).

In the current iteration, I can tell that it's an INSERT that's causing the problem, and the INSERT is coming from an Apache process on a machine on the same network. In recent occurrences, though, I'm almost positive I've seen a SELECT.

But as troubled as I am by the cause, I'm similarly troubled by my inability to treat the symptoms effectively. When this occurs, I have tried shutting down the pgpools and postmaster (using pg_ctl). Unfortunately, pgpool frequently hangs during the shutdown attempt.

When I kill these off individually using kill and then shut down the postmaster with pg_ctl immediate mode, I will occasionally find a backend process that cannot be killed, even with a KILL (-9) signal.

Is this likely to be caused by something at a lower level than postgres?

Here are the specs:

PostgreSQL 8.1.3
pgpool 3.0.1
Debian GNU/Linux 3.1
Linux 2.6.10 #8 SMP
system: ext3 RAID 1
WAL: jfs RAID 10
data: jfs RAID 10

There's also an NFS mount point.

I'm still trying to do the forensics on the root cause (a related oddity: the system can run in production for days or weeks without any issues), but I'm just as interested in why I can't kill postgres backend processes that have no postmaster.

If I can provide more information related to recovery, please let me know.

--
Thomas F. O'Connell
Database Architecture and Programming
Sitening, LLC
http://www.sitening.com/
3004 B Poston Avenue
Nashville, TN 37203-1314
615-260-0005 (cell)
615-469-5150 (office)
615-469-5151 (fax)
Thomas F. O'Connell wrote:

> When I kill these off individually using kill and then shut down the
> postmaster with pg_ctl immediate mode, I will occasionally find a
> backend process that cannot be killed, even with a KILL (-9) signal.
>
> Is this likely to be caused by something at a lower level than postgres?

Nothing Postgres does is able to block a SIGKILL (-9) signal. You can be certain that it is stuck in a system call, most likely reading something from disk.

> Here are the specs:
>
> PostgreSQL 8.1.3
> pgpool 3.0.1
> Debian GNU/Linux 3.1
> Linux 2.6.10 #8 SMP
> system: ext3 RAID 1
> WAL: jfs RAID 10
> data: jfs RAID 10
>
> There's also an NFS mount point.

JFS is not a very common sight around here, I think. And NFS mounts are known to cause filesystem-level problems.

> I'm still trying to do the forensics on the root cause (a related
> oddity: the system can run in production for days or weeks without
> any issues), but I'm just as interested in why I can't kill postgres
> backend processes that have no postmaster.

Backend processes are pretty much independent of the postmaster. If you SIGKILL the postmaster, backends will happily continue with life, AFAIK. (And anyway, if you can't kill them with SIGKILL, the postmaster won't be able to either.)

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
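The point about SIGKILL can be checked from user space. The following is a minimal Python sketch, not anything taken from the thread; the helper name `can_install_handler` is hypothetical. It assumes a Linux or other POSIX system, where the kernel rejects any attempt to install a handler for SIGKILL. A backend that survives `kill -9` has therefore not blocked the signal; the more likely explanation is that it is sleeping uninterruptibly inside a system call and will only die once that call returns.

```python
import signal


def can_install_handler(signum):
    """Return True if this process may install its own handler for signum."""
    try:
        previous = signal.signal(signum, lambda signo, frame: None)
    except (OSError, ValueError):
        # OSError: the kernel refused (this is what happens for SIGKILL).
        # ValueError: Python refused, e.g. we are not in the main thread.
        return False
    # Restore the earlier disposition; None means it was not set from Python.
    signal.signal(signum, previous if previous is not None else signal.SIG_DFL)
    return True


if __name__ == "__main__":
    for sig in (signal.SIGTERM, signal.SIGKILL):
        print(sig.name, "catchable:", can_install_handler(sig))
```

A process can catch or ignore SIGTERM, which is why a polite shutdown can hang, but it has no equivalent option for SIGKILL.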
On May 22, 2006, at 8:16 PM, Alvaro Herrera wrote:

> Thomas F. O'Connell wrote:
>
>> When I kill these off individually using kill and then shut down the
>> postmaster with pg_ctl immediate mode, I will occasionally find a
>> backend process that cannot be killed, even with a KILL (-9) signal.
>>
>> Is this likely to be caused by something at a lower level than
>> postgres?
>
> Nothing Postgres does is able to block a SIGKILL (-9) signal. You can
> be certain that it is stuck in a system call, most likely reading
> something from disk.

That's what I thought. Thanks for the confirmation.

--
Thomas F. O'Connell
> > There's also an NFS mount point.
>

It's a long time since I used NFS, but when I last did, you had the choice between hard and soft mounts. A hard mount would behave like a physical drive (and would more or less never give up trying to commit a write), while a soft mount would return an error to the calling process should the mount become unavailable.

Next time this happens you should be able to run

    ps -eo pid,wchan,user,args -u postgres

or similar (you need the wchan somehow or other; top can show it too, but I cannot remember the letter to toggle its display :-)), and obviously I assume that postgres is the user pg runs under.

Your rogue process should show a system call in the wchan column, and if you keep listing the process it won't change. This shows you are stuck (and also the args should show you at least whether it's INSERT or SELECT etc.).

Even if your NFS mount doesn't have pg stuff on it, it could cause problems if you manage to stop a process in the kernel that has locked a resource that pg ends up needing (but I don't see this as likely, because I would imagine a long queue of other processes backing up in the same way).

Unless you really need the NFS mount, I might be inclined to turn it off for a while and see what happens ...

Cheers,

Robin
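The check Robin describes can also be scripted. Below is a minimal Python sketch under stated assumptions: a Linux system with `/proc` mounted, enough privilege to read other users' `/proc/<pid>/wchan` (typically root on newer kernels), and hypothetical helper names (`parse_stat`, `read_text`, `uninterruptible_processes`) that do not come from the thread. It lists processes whose state is `D` (uninterruptible sleep), which is the state in which a pending SIGKILL is not acted on, together with the kernel function each one is waiting in.

```python
import os


def parse_stat(stat_text):
    """Extract (pid, command, state) from the text of /proc/<pid>/stat.

    The command is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' rather than on whitespace.
    """
    head, _, tail = stat_text.rpartition(")")
    pid_text, _, comm = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), comm, state


def read_text(path):
    """Read a small /proc file; return '' if it vanished or is unreadable."""
    try:
        with open(path, errors="replace") as handle:
            return handle.read().strip()
    except OSError:
        return ""


def uninterruptible_processes(proc_root="/proc"):
    """Yield (pid, command, wchan, cmdline) for every process in state 'D'."""
    if not os.path.isdir(proc_root):
        return
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        base = os.path.join(proc_root, entry)
        stat_text = read_text(os.path.join(base, "stat"))
        if not stat_text:
            continue  # process exited between listdir() and the read
        pid, comm, state = parse_stat(stat_text)
        if state != "D":
            continue
        wchan = read_text(os.path.join(base, "wchan"))
        # cmdline separates arguments with NUL bytes.
        cmdline = read_text(os.path.join(base, "cmdline"))
        yield pid, comm, wchan, cmdline.replace("\0", " ").strip()


if __name__ == "__main__":
    for pid, comm, wchan, cmdline in uninterruptible_processes():
        print(pid, comm, wchan or "-", cmdline)
```

Running it a few times in a row mirrors the advice above: a backend that appears on every run with the same wchan value is probably stuck in the kernel, and its command line (the PostgreSQL process title) should indicate whether it was running an INSERT or a SELECT.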