Thread: Re: postmaster.pid

Re: postmaster.pid

From
"Magnus Hagander"
Date:
> >> > But sure, we don't really care if it's a postmaster. Then
> >> > OpenProcess() is probably the best way, yes.
> >>
> >> Au contraire!!  One of the problems with the Unix implementation is
> >> that you *can't* tell for sure if the target process is a postmaster.
> >> See past discussions about how startup occasionally fails because we
> >> get fooled by the PID mentioned in postmaster.pid now belonging to
> >> pg_ctl or some other Postgres-owned process.
> >>
> >> This is a place where the Windows version can actually be better than
> >> the Unix one.  Please fix it and stop imagining that your charter is
> >> to duplicate a particular Unix syscall bug-for-bug.
> >
> > Ok, if you say so :-) I had the general impression we wanted that. But
> > then let's go with the
> > send-signal-0-down-the-pipe-and-ignore-it-in-the-backend. :-)
> >
>
> [away from my desk so can't check right now] What do we get
> back down the pipe? Unless it's something that identifies
> that we are talking to a postmaster will we be further
> advanced than the Unix case? (I agree that talking on the
> pipe would be more robust than the simple OpenProcess() test,
> regardless of this point).

We send the signal number back down the pipe. If we send SIGHUP, we get
SIGHUP back. Etc.

And it would still be further advanced - we'd know it was a pg process
because it was listening on \\.\pipe\pgsignal_<pid>.
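
To make the echo idea concrete, here is a minimal sketch. It is an illustration only: a local socket pair stands in for the Windows named pipe, and the names echo_one_signal, probe_signal_echo and demo are made up for this example, not taken from the PostgreSQL source.

```python
import socket
import threading


def echo_one_signal(conn):
    """Stand-in for the listening side: read one signal byte, send it back."""
    data = conn.recv(1)
    if data:
        conn.sendall(data)
    conn.close()


def probe_signal_echo(conn, signum, timeout=2.0):
    """Send a signal number and report whether the same number came back.

    Any error, timeout or early close counts as "no echo".
    """
    conn.settimeout(timeout)
    try:
        conn.sendall(bytes([signum]))
        reply = conn.recv(1)
    except OSError:
        return False
    return reply == bytes([signum])


def demo(signum=0):
    """Run one probe against an echoing peer (hypothetical helper)."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=echo_one_signal, args=(server,))
    worker.start()
    try:
        return probe_signal_echo(client, signum)
    finally:
        client.close()
        worker.join()
```

Signal 0 is used as the probe value because the receiving side can echo it and then ignore it. A peer that does not answer, or answers with something else, simply yields False.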


//Magnus

Re: postmaster.pid

From
Joerg Hessdoerfer
Date:
Hi,

On Wednesday 25 August 2004 16:21, Magnus Hagander wrote:
> > >> > But sure, we don't really care if it's a postmaster. Then
> > >> > OpenProcess() is probably the best way, yes.
> > >>
> > >> Au contraire!!  One of the problems with the Unix implementation is
> > >> that you *can't* tell for sure if the target process is a postmaster.
> > >> See past discussions about how startup occasionally fails because we
> > >> get fooled by the PID mentioned in postmaster.pid now belonging to
> > >> pg_ctl or some other Postgres-owned process.
> > >>
> > >> This is a place where the Windows version can actually be better than
> > >> the Unix one.  Please fix it and stop imagining that your charter is
> > >> to duplicate a particular Unix syscall bug-for-bug.
> > >
> > > Ok, if you say so :-) I had the general impression we wanted that. But
> > > then let's go with the
> > > send-signal-0-down-the-pipe-and-ignore-it-in-the-backend. :-)

Well, wouldn't it be better then to do an OS-dependent check for a running
postmaster, which could use kill() on IMHO broken systems where it's not easy
to determine the process name for a PID, and more elaborate checking on
others. On Windows, there's OpenProcess et al, on Linux, one could resort
to /proc. I didn't develop on too many others, but there should be
possibilities for those, too.
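
A rough sketch of what such a two-level check could look like on a POSIX system. The function names and the list of expected process names are assumptions made for this example. The /proc/<pid>/comm lookup is Linux-specific, and the sketch returns None ("can't tell") when that file is unavailable rather than guessing. Do not run the kill() part on Windows, where Python's os.kill has different semantics.

```python
import os


def pid_is_alive(pid):
    """kill(pid, 0) test: tells us that *some* process owns the PID, nothing more."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def process_name(pid):
    """Linux only: read the command name from /proc, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/comm") as f:
            return f.read().strip()
    except OSError:
        return None


def looks_like_postmaster(pid, expected=("postgres", "postmaster")):
    """Return True, False, or None when the name cannot be determined."""
    if not pid_is_alive(pid):
        return False
    name = process_name(pid)
    if name is None:
        return None
    return name in expected
```

The None result matters: on a system without a usable /proc the caller only knows what kill() already told it.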

Greetings,
 Jörg
--
Leading SW developer  - S.E.A GmbH
Mail: joerg.hessdoerfer@sea-gmbh.com
WWW:  http://www.sea-gmbh.com

Re: postmaster.pid

From
Andrew Dunstan
Date:

Joerg Hessdoerfer wrote:

>Well, wouldn't it be better then to do an OS-dependent check for a running
>postmaster, which could use kill() on IMHO broken systems where it's not easy
>to determine the process name for a PID, and more elaborate checking on
>others. On Windows, there's OpenProcess et al, on Linux, one could resort
>to /proc. I didn't develop on too many others, but there should be
>possibilities for those, too.

At this stage in the dev cycle I don't think so - might be worth
improving the robustness post 8.0. Assuming you have access to /proc can
be dangerous too - even if it's there (and in some jail/chroot type
environments it isn't).

One thought I did have was that it might be worth ignoring the .pid file
if its mtime was older than the system boot time, assuming that both
could be determined portably and reliably.
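
As a sketch of the mtime idea, assuming a Linux box where the boot time can be read from the btime line of /proc/stat (other platforms would need their own lookup). Both function names are made up for this example. When the boot time is unknown the answer is "don't ignore the file".

```python
import os


def linux_boot_time():
    """Boot time in seconds since the epoch from /proc/stat, or None if unknown."""
    try:
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("btime "):
                    return float(line.split()[1])
    except (OSError, ValueError, IndexError):
        pass
    return None


def pid_file_predates_boot(pid_path, boot_time):
    """True only if the file exists and was last modified before boot_time."""
    if boot_time is None:
        return False
    try:
        mtime = os.stat(pid_path).st_mtime
    except OSError:
        return False
    return mtime < boot_time
```

A call would look like pid_file_predates_boot("postmaster.pid", linux_boot_time()). Clock adjustments between the write and the reboot can skew the comparison, which is part of the "reliably" caveat.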

cheers

andrew

Re: postmaster.pid

From
Joerg Hessdoerfer
Date:
Hi,

On Wednesday 25 August 2004 19:53, Andrew Dunstan wrote:
> Joerg Hessdoerfer wrote:
> >Well, wouldn't it be better then to do an OS-dependent check for a running
> >postmaster, which could use kill() on IMHO broken systems where it's not
> >easy to determine the process name for a PID, and more elaborate checking
> >on others. On Windows, there's OpenProcess et al, on Linux, one could
> >resort to /proc. I didn't develop on too many others, but there should be
> >possibilities for those, too.
>
> At this stage in the dev cycle I don't think so - might be worth
> improving the robustness post 8.0. Assuming you have access to /proc can
> be dangerous too - even if it's there (and in some jail/chroot type
> environments it isn't).
>
> One thought I did have was that it might be worth ignoring the .pid file
> if its mtime was older than the system boot time, assuming that both
> could be determined portably and reliably.
>
> cheers
>
> andrew

Ok, your objections are sound. But I just thought a little bit more about
this. What's your opinion of this:

On successful startup, postmaster opens a special TCP socket listening on
127.0.0.1 only, and notes the port number in postmaster.pid, too.

On startup, postmaster reads postmaster.pid, if present, and tries to connect
to the mentioned port. If the connection fails, no postmaster is present, so
continue startup. If connection is accepted, the 'original' postmaster sends
a defined message à la 'PostgreSQL postmaster version 8.1.0' down the socket
and closes the connection. Only if this is received in a reasonable time, we
are sure to have a postmaster running and should abort startup, else we can
safely continue.

This should be highly portable, and also catches the case where the postmaster
just crashed without the system rebooting (where the mtime check would fail,
too).
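
A minimal sketch of the proposed handshake, for illustration only. The banner text follows the example above; the function names and the newline terminator are assumptions of this sketch. The probe returns True only when the banner actually arrives within the timeout; any other outcome just means "not confirmed".

```python
import socket
import threading

BANNER = b"PostgreSQL postmaster version 8.1.0\n"


def serve_banner_once(listener):
    """'Original' postmaster side: accept one connection, send the banner, close."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(BANNER)


def postmaster_confirmed(port, timeout=2.0):
    """Connect to 127.0.0.1:port and check for the banner within the timeout."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout) as conn:
            data = b""
            while not data.endswith(b"\n"):
                chunk = conn.recv(64)
                if not chunk:
                    break
                data += chunk
    except OSError:
        # Refused, timed out, reset: the banner was not received.
        return False
    return data.startswith(b"PostgreSQL postmaster")
```

The listening side would bind to ("127.0.0.1", 0), record the port the OS assigned, and run serve_banner_once in its own thread or event loop.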

Greetings,
 Jörg

Re: postmaster.pid

From
Tom Lane
Date:
Joerg Hessdoerfer <Joerg.Hessdoerfer@sea-gmbh.com> writes:
> On startup, postmaster reads postmaster.pid, if present, and tries to connect
> to the mentioned port. If the connection fails, no postmaster is present,

... or the kernel is filtering the port, or we couldn't resolve "localhost"
(cf various reports of stats collector not working), or the postmaster
is present but overloaded enough to be missing connection attempts, or ...

> Only if this is received in a reasonable time, we
> are sure to have a postmaster running and should abort startup, else we can
> safely continue.

The real point here is that the behavior has to be to default to
failure, not success.  The worst case if we fail incorrectly is that a
small amount of manual intervention is needed to start the postmaster,
ie, remove the lockfile and try again.  The worst (and very probable)
case if we succeed incorrectly is extensive, unrecoverable data
corruption.  We must *never* have multiple postmasters running against
the same data directory.  So taking an attitude of "prove that there is
a working postmaster out there" is quite backwards.  You have to think
in terms of "prove that there isn't".
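
Put as a sketch (the Probe enum and may_start are names invented for this illustration, not anything in the source tree): every check reports one of three outcomes, and startup is allowed only when each of them has positively proven absence. "Unknown" is treated exactly like "present".

```python
from enum import Enum


class Probe(Enum):
    PROVEN_ABSENT = "proven absent"
    PRESENT = "present"
    UNKNOWN = "unknown"  # timeout, filtered port, permission error, ...


def may_start(probe_results):
    """Allow startup only if at least one probe ran and all proved absence."""
    results = list(probe_results)
    return bool(results) and all(r is Probe.PROVEN_ABSENT for r in results)
```

For example, may_start([Probe.PROVEN_ABSENT, Probe.UNKNOWN]) is False: one inconclusive check is enough to refuse, and the cost is a manual restart rather than a corrupted data directory.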

(For the same reason, I am highly suspicious of the quick-fix proposals
we occasionally see to add an "rm $PGDATA/postmaster.pid" to pg_ctl or
the init script.  That is nothing but a large-caliber pistol loaded,
cocked, and aimed at your foot.)

I've occasionally thought about abandoning the PID test, in favor of
relying completely on the shmem-existence test.  If the shmem segment
named in the lockfile doesn't exist or has zero processes connected to
it, we could safely assume that the original postmaster is gone.
(If it has processes connected, we must abort anyway, to cover the case
where the postmaster crashed but backends remain alive.)  The risk here
is that we are then *completely* at the mercy of the OS having a correct
emulation of the SysV shmem semantics, in particular the ability to
detect whether a shmem segment has other processes connected to it.
I'm not sure whether this is true on all the supported platforms.
(This being the win32 list: what about Windows?)
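
The decision rule in that paragraph, written out as a small sketch. How the two inputs are obtained is left out on purpose: on SysV systems the attach count would come from shmctl() with IPC_STAT (the shm_nattch field), which the Python standard library does not wrap. The function name and the use of None for "could not find out" are assumptions of this example.

```python
def old_postmaster_gone(segment_exists, attach_count):
    """Decide from the shmem probe alone whether the old postmaster is gone.

    segment_exists: True, False, or None if the lookup itself failed.
    attach_count:   number of attached processes, or None if unknown.
    """
    if segment_exists is False:
        # The segment named in the lockfile no longer exists.
        return True
    if segment_exists is True and attach_count == 0:
        # The segment is there, but nobody is attached to it.
        return True
    # Attached processes (e.g. orphaned backends) or any uncertainty: abort.
    return False
```

Everything hinges on attach_count being trustworthy, which is exactly the platform question raised above.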

            regards, tom lane

Re: postmaster.pid

From
Joerg Hessdoerfer
Date:
On Thursday 26 August 2004 16:25, Tom Lane wrote:
[...]
> The real point here is that the behavior has to be to default to
> failure, not success.  The worst case if we fail incorrectly is that a
> small amount of manual intervention is needed to start the postmaster,
> ie, remove the lockfile and try again.  The worst (and very probable)
> case if we succeed incorrectly is extensive, unrecoverable data
> corruption.  We must *never* have multiple postmasters running against
> the same data directory.  So taking an attitude of "prove that there is
> a working postmaster out there" is quite backwards.  You have to think
> in terms of "prove that there isn't".
[...]

Of course you're right, as always ;-)  Data integrity has to be the absolute
priority, 'twas too early in the morning.

I just got sidetracked because of spurious 'why does our XXX not work?'
questions from people who leave their production DB servers (this is for
industrial manufacturing processes, like laser welding) running on the
machine floor, and whose boxes get shot down every now and then, and
sometimes... pg doesn't start. Then the 'small amount of manual intervention'
sometimes is not so small - depending on OS and/or configuration, or even
remote access to the box. Mind you, these are mostly non-computer-savvy
people, and those sometimes get upset when 'the system does not start up
correctly' - because that means they can't currently produce a car!

We're working around this by adding a shell script that removes
'postmaster.pid' as the last action at system *shutdown*, so we can tell them
to 'restart the machine', and everything usually just works fine. But a
postmaster-internal but safe mechanism would be great. Just daydreaming...

Greetings,
 Jörg