Discussion: Unkillable Backend Processes

Unkillable Backend Processes

From
"Thomas F. O'Connell"
Date:
I've encountered an oddity on a postgres cluster that results in an
unresponsive postmaster and, frequently, unkillable backend
processes. I'm having a difficult time isolating the queries that are
related to this scenario because by the time the scenario occurs,
max_connections has been reached, and no superuser connections are
available. Because the query doesn't finish, I don't think it's
getting logged (since logging is only done at the query level on a
duration or error basis). In the current iteration, I can tell that
it's an INSERT that's causing the problem, and the INSERT is coming
from an Apache process on a machine on the same network. In recent
occurrences, though, I'm almost positive I've seen a SELECT.

But as troubled as I am by the cause, I'm similarly troubled by my
inability to treat the symptoms effectively. When this occurs, I have
tried shutting down the pgpools and postmaster (using pg_ctl).
Unfortunately, pgpool frequently hangs during the shutdown attempt.
When I kill these off individually using kill and then shut down the
postmaster with pg_ctl immediate mode, I will occasionally find a
backend process that cannot be killed, even with a KILL (-9) signal.

Is this likely to be caused by something at a lower level than postgres?

Here are the specs:

PostgreSQL 8.1.3
pgpool 3.0.1
Debian GNU/Linux 3.1
Linux 2.6.10 #8 SMP
system: ext3 RAID 1
WAL: jfs RAID 10
data: jfs RAID 10

There's also an NFS mount point.

I'm still trying to do the forensics on the root cause (a related
oddity: the system can run in production for days or weeks without
any issues), but I'm just as interested in why I can't kill postgres
backend processes that have no postmaster. If I can provide more
information related to recovery, please let me know.

--
Thomas F. O'Connell
Database Architecture and Programming
Sitening, LLC

http://www.sitening.com/
3004 B Poston Avenue
Nashville, TN 37203-1314
615-260-0005 (cell)
615-469-5150 (office)
615-469-5151 (fax)


Re: Unkillable Backend Processes

From
Alvaro Herrera
Date:
Thomas F. O'Connell wrote:

> When I kill these off individually using kill and then shut down the
> postmaster with pg_ctl immediate mode, I will occasionally find a
> backend process that cannot be killed, even with a KILL (-9) signal.
>
> Is this likely to be caused by something at a lower level than postgres?

Nothing Postgres does is able to block a SIGKILL (-9) signal.  You can
be certain that it is stuck in a system call, most likely reading
something from disk.

> Here are the specs:
>
> PostgreSQL 8.1.3
> pgpool 3.0.1
> Debian GNU/Linux 3.1
> Linux 2.6.10 #8 SMP
> system: ext3 RAID 1
> WAL: jfs RAID 10
> data: jfs RAID 10
>
> There's also an NFS mount point.

JFS is not a very common sight around here, I think.  And NFS mounts are
known troublemakers when it comes to filesystem-level problems.

> I'm still trying to do the forensics on the root cause (a related
> oddity: the system can run in production for days or weeks without
> any issues), but I'm just as interested in why I can't kill postgres
> backend processes that have no postmaster.

Backend processes are pretty much independent from postmaster.  If you
SIGKILL the postmaster, backends will happily continue with life AFAIK.
(And anyway, if you can't kill them with SIGKILL, the postmaster won't
be able to either.)

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Unkillable Backend Processes

From
"Thomas F. O'Connell"
Date:
On May 22, 2006, at 8:16 PM, Alvaro Herrera wrote:

> Thomas F. O'Connell wrote:
>
>> When I kill these off individually using kill and then shut down the
>> postmaster with pg_ctl immediate mode, I will occasionally find a
>> backend process that cannot be killed, even with a KILL (-9) signal.
>>
>> Is this likely to be caused by something at a lower level than
>> postgres?
>
> Nothing Postgres does is able to block a SIGKILL (-9) signal.  You can
> be certain that it is stuck in a system call, most likely reading
> something from disk.

That's what I thought. Thanks for the confirmation.


Re: Unkillable Backend Processes

From
Robin Iddon
Date:
>
> There's also an NFS mount point.
>
It's a long time since I used NFS but when I last did you had the choice
between hard and soft mounts.  A hard mount would behave like a physical
drive (and would more-or-less never give up trying to commit a write)
while a soft mount would return an error to the calling process should
the mount become unavailable.
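
To check which flavour a given mount uses, something along these lines
might help. It is only a sketch: it assumes a Linux box where
/proc/mounts exists, the function name nfs_mount_mode is made up for
illustration, and it treats a mount with no explicit "soft" option as
hard, which is the documented Linux default.

```shell
# Reads /proc/mounts-style lines on stdin and, for every nfs or nfs4
# filesystem, prints "<mountpoint> hard" or "<mountpoint> soft".
# nfs_mount_mode is a hypothetical helper name, not a standard tool.
nfs_mount_mode() {
    awk '$3 ~ /^nfs4?$/ {
        mode = "hard"                  # assumed default when "soft" is absent
        n = split($4, opts, ",")       # 4th field holds the mount options
        for (i = 1; i <= n; i++)
            if (opts[i] == "soft") mode = "soft"
        print $2, mode
    }'
}
```

Run it as `nfs_mount_mode < /proc/mounts`. A hard mount in that output
is a candidate for the kind of indefinite blocking described above.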

Next time this happens you should be able to run

    ps -o pid,wchan,user,args -u postgres

Or similar (you need the wchan somehow or other - top can show it too,
but I cannot remember the letter to toggle its display :-)), and
obviously I assume that postgres is the user pg runs under.

Your rogue process should show the kernel function it is blocked in under the wchan column,
and if you keep listing the process it won't change.  This shows you are
stuck (and also the args should show you at least whether it's INSERT or
SELECT etc.).
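
A rough way to narrow the listing down to processes that are actually
stuck is to filter on the process state as well. The sketch below
assumes a procps-style ps that supports the stat and wchan output
fields; both function names are made up for illustration. A "D" as the
first character of STAT is the usual marker for uninterruptible sleep,
the state in which a process does not act on SIGKILL until the kernel
call it is in returns.

```shell
# Reads "pid stat wchan args..." lines on stdin and keeps only those
# whose STAT field starts with "D" (uninterruptible sleep).
# filter_uninterruptible is a hypothetical helper name.
filter_uninterruptible() {
    awk '$2 ~ /^D/'
}

# Lists D-state processes owned by the given OS user (default: postgres,
# assumed here to be the account the server runs under).
# The empty "name=" headers suppress the ps header line.
stuck_backends() {
    ps -o pid= -o stat= -o wchan= -o args= -u "${1:-postgres}" |
        filter_uninterruptible
}
```

Calling `stuck_backends postgres` a few times in a row should keep
showing the same pid and the same wchan value if a backend really is
wedged in the kernel.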

Even if your NFS mount doesn't have pg stuff on it, it could cause
problems if you manage to stop a process in the kernel that has locked a
resource that pg ends up needing (but I don't see this as likely because
I would imagine a long queue of other processes backing up in the same way).

Unless you really need the NFS mount I might be inclined to turn it off
for a while and see what happens ...

Cheers,
Robin