Thread: auto removing stale pid for postmaster NT service

auto removing stale pid for postmaster NT service

From
Tony_Chao@putnam.com
Date:
Hi,

I've seen this question a couple of times in the archives, but I wasn't
able to find a solution. Please advise if you know of a workaround.

I have postmaster installed as a service through cygrunsrv on my win2k
machine. The postmaster service starts and stops fine most of the time.
But if the server crashes without a proper shutdown, the postmaster.pid
is left behind and the postmaster service fails to start at the next boot.

Is there a way to delete a stale postmaster.pid at boot-up, before
the system attempts to start the postmaster service?

Thanks
-Tony



Re: auto removing stale pid for postmaster NT service

From
Tom Lane
Date:
Tony_Chao@putnam.com writes:
> I have postmaster installed as a service through cygrunsrv on my win2k
> machine. The postmaster service starts and stops fine most of the time.
> But if the server crashes without a proper shutdown, the postmaster.pid
> is left behind and the postmaster service fails to start at the next boot.

It should manage to start anyway --- why exactly does it refuse to
start?

            regards, tom lane

Re: auto removing stale pid for postmaster NT service

From
Simone Tellini
Date:
On Mon, 16 Sep 2002 09:23:49 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

TL> > But if the server crashes without a proper shutdown, the postmaster.pid
TL> > is left behind and the postmaster service fails to start at the next boot.
TL>
TL> It should manage to start anyway --- why exactly does it refuse to
TL> start?

It happens on Linux as well: if there's a stale file at boot, it refuses
to start, saying that it's already running.

--

Simone Tellini
E-mail: tellini@areabusiness.it
http://www.areabusiness.it


Re: auto removing stale pid for postmaster NT service

From
Andrew Sullivan
Date:
On Mon, Sep 16, 2002 at 04:56:26PM +0200, Simone Tellini wrote:
>
> It happens on Linux as well: if there's a stale file at boot, it refuses
> to start, saying that it's already running.

Not exactly.

If there is a stale pid file, it checks whether a process with that
pid exists.  Only if one does, _then_ it refuses to start.

This is because there is a process with the same pid as the
postmaster.  This will happen in cases where the machine crashes and
starts up again -- something else happens to get the (former)
postgres pid at startup, and so when postgres checks for a process
with that pid, one exists.  And kerplooey.
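
The check described above can be sketched roughly as follows. This is an
illustrative Python sketch, not PostgreSQL's actual code: the function names
`pid_exists` and `startup_refused` are hypothetical, it assumes a POSIX
system, and it assumes the PID is stored on the first line of the pid file.

```python
import os

def pid_exists(pid):
    """Return True if some process currently holds this PID (POSIX only)."""
    try:
        os.kill(pid, 0)          # signal 0 only performs error checking
    except ProcessLookupError:   # ESRCH: no process has this PID
        return False
    except PermissionError:      # EPERM: a process exists, owned by another user
        return True
    return True

def startup_refused(pidfile_path, exists=pid_exists):
    """Return True if the pid file names a PID that is currently in use."""
    try:
        with open(pidfile_path) as f:
            old_pid = int(f.readline().strip())  # assumption: PID on first line
    except (FileNotFoundError, ValueError):
        return False             # no usable pid file, nothing blocks startup
    # Any live process with that PID blocks startup, postmaster or not.
    return exists(old_pid)
```

After a crash and reboot, `startup_refused` returns True whenever an unrelated
process happens to have been given the old PID, which is the failure being
reported in this thread.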

I seem to recall that someone (maybe Tom Lane?) suggested an
extension to the current pidfile check, so that it will also check to
see if the process really is PostgreSQL.  But I don't know if it was
implemented.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110


Re: auto removing stale pid for postmaster NT service

From
Tom Lane
Date:
Andrew Sullivan <andrew@libertyrms.info> writes:
> This is because there is a process with the same pid as the
> postmaster.  This will happen in cases where the machine crashes and
> starts up again -- something else happens to get the (former)
> postgres pid at startup, and so when postgres checks for a process
> with that pid, one exists.  And kerplooey.

FYI, sendmail has the same restart failure mode; I imagine a lot of
other Unix daemons do too.

> I seem to recall that someone (maybe Tom Lane?) suggested an
> extension to the current pidfile check, so that it will also check to
> see if the process really is PostgreSQL.  But I don't know if it was
> implemented.

It wasn't yet, mainly because it's not obvious how to tell reliably
whether some other process is a postmaster or not.

I think I had suggested distinguishing EPERM from other kill() errors,
which would tell us whether the other process is under the same userid
as us or not; if not, we could perhaps safely assume that it's not a
postmaster (or at least not one likely to be using our data directory).

Unfortunately, that doesn't really improve the odds very much.  The
typical scenario for this problem is that the PID we get assigned will
wobble around by one or two counts from one boot cycle to the next,
depending on just how fast other startup processes manage to finish.
(If we get the exact same PID as before, there's no problem; the code
is smart enough to notice that case.)  But the PID(s) adjacent to the
postmaster's will likely also belong to the postgres user --- consider
the shell that launched us, for example.  The shell, or whatever it
might launch right after the postmaster, would look enough like a
postmaster to fool this simplistic test.
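
To make the proposed test concrete, here is a minimal Python sketch of it.
It is not the postmaster's code: the name `looks_like_conflict` and the
injectable `kill` parameter are hypothetical, and it assumes a POSIX system
and a caller that is not root (root can signal any process, so EPERM would
never be seen).

```python
import os

def looks_like_conflict(old_pid, my_pid, kill=os.kill):
    """Return True if old_pid might belong to a conflicting postmaster."""
    if old_pid == my_pid:
        return False             # we were assigned the same PID again
    try:
        kill(old_pid, 0)         # signal 0: existence/permission probe only
    except ProcessLookupError:
        return False             # ESRCH: no such process, the file is stale
    except PermissionError:
        return False             # EPERM: process runs under another user id
    return True                  # signalable, so it runs under our user id
```

A shell running as the postgres user with a neighbouring PID reaches the final
`return True`, so this test would still report a conflict in the typical
scenario described above.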

So I'm at a loss how the postmaster can improve the reliability of this
check, without throwing the baby out with the bathwater by making a
check that might fail to recognize a conflicting postmaster.  The
consequences of that would be *dire*.

The best solution is probably to forcibly unlink the postmaster.pid
file in some startup script --- but it has to be a script that is *only*
run during boot, never anytime later.  The postgres start script is
not the place for this.
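
A boot-only cleanup step along these lines might look like the Python sketch
below. The function name `remove_pidfile_at_boot` is hypothetical, and the
data directory is passed in by the caller because no particular path is
assumed here. Hooking it into the boot sequence, and nowhere else, is left to
the local init system.

```python
import os

def remove_pidfile_at_boot(data_dir):
    """Unlink data_dir/postmaster.pid; return True if a file was removed.

    Only safe at boot, when no postmaster can possibly be running yet.
    """
    pidfile = os.path.join(data_dir, "postmaster.pid")
    try:
        os.unlink(pidfile)
    except FileNotFoundError:
        return False             # clean shutdown last time, nothing to do
    return True
```

The function performs no liveness check at all, which is why calling it from
the regular postgres start script would be unsafe.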

            regards, tom lane

Re: auto removing stale pid for postmaster NT service

From
Andrew Sullivan
Date:
On Mon, Sep 16, 2002 at 05:27:38PM -0400, Tom Lane wrote:

> FYI, sendmail has the same restart failure mode; I imagine a lot of
> other Unix daemons do too.

Yes, as far as I know atd, klogd, and ypbind also fail this way, at
least on some flavours of Linux (where I've had it happen).  And ISTR
that some bit of NFS didn't recover correctly under Solaris 2.6, but
I can't recall for sure now.

> It wasn't yet, mainly because it's not obvious how to tell reliably
> whether some other process is a postmaster or not.

I had a feeling this might be the case.  I think the suggestion of a
boot-time "cleaning script" is a good idea -- something run by root
before switching runlevels is the obvious answer -- but that's the
sort of thing that probably should be hand-crafted by a competent
sysadmin for each case.  In some environments, there are good reasons
not to restart things in case of crash.  (If the hardware is flakey,
for instance, you might not want the service to be going up and
down.)

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110