Discussion: Re: postmaster.pid
> >> > But sure, we don't really care if it's a postmaster. Then
> >> > OpenProcess() is probably the best way, yes.
> >>
> >> Au contraire!! One of the problems with the Unix implementation is
> >> that you *can't* tell for sure if the target process is a postmaster.
> >> See past discussions about how startup occasionally fails because we
> >> get fooled by the PID mentioned in postmaster.pid now belonging to
> >> pg_ctl or some other Postgres-owned process.
> >>
> >> This is a place where the Windows version can actually be better than
> >> the Unix one. Please fix it and stop imagining that your charter is
> >> to duplicate a particular Unix syscall bug-for-bug.
> >
> > Ok, if you say so :-) I had the general impression we wanted that. But
> > then let's go with the
> > send-signal-0-down-the-pipe-and-ignore-it-in-the-backend. :-)
>
> [away from my desk so can't check right now] What do we get
> back down the pipe? Unless it's something that identifies
> that we are talking to a postmaster will we be further
> advanced than the Unix case? (I agree that talking on the
> pipe would be more robust than the simple OpenProcess() test,
> regardless of this point).

We send the signal number back down the pipe. If we send SIGHUP, we get
SIGHUP back. Etc.

And it would still be further advanced - we'd know it was a pg process
because it was listening on \\.\pipe\pgsignal_<pid>.

//Magnus
Hi,

On Wednesday 25 August 2004 16:21, Magnus Hagander wrote:
> > >> > But sure, we don't really care if it's a postmaster. Then
> > >> > OpenProcess() is probably the best way, yes.
> > >>
> > >> Au contraire!! One of the problems with the Unix implementation is
> > >> that you *can't* tell for sure if the target process is a postmaster.
> > >> See past discussions about how startup occasionally fails because we
> > >> get fooled by the PID mentioned in postmaster.pid now belonging to
> > >> pg_ctl or some other Postgres-owned process.
> > >>
> > >> This is a place where the Windows version can actually be better than
> > >> the Unix one. Please fix it and stop imagining that your charter is
> > >> to duplicate a particular Unix syscall bug-for-bug.
> > >
> > > Ok, if you say so :-) I had the general impression we wanted that. But
> > > then let's go with the
> > > send-signal-0-down-the-pipe-and-ignore-it-in-the-backend. :-)

Well, wouldn't it be better then to do an OS-dependent check for a running
postmaster, which could use kill() on IMHO broken systems where it's not
easy to determine the process name for a PID, and more elaborate checking
on others. On Windows, there's OpenProcess et al, on Linux, one could resort
to /proc. I didn't develop on too many others, but there should be
possibilities for those, too.

Greetings,
Jörg
--
Leading SW developer - S.E.A GmbH
Mail: joerg.hessdoerfer@sea-gmbh.com
WWW:  http://www.sea-gmbh.com
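The two-tier check suggested above could look roughly like the following. This is only a sketch under assumptions, not PostgreSQL code: it is POSIX-only (on Windows, Python's os.kill() with signal 0 does something else entirely), the helper names are hypothetical, and the default expected name "postmaster" is an illustrative guess. When the process name cannot be read, the sketch reports "might be a postmaster" rather than "is not".

```python
import os
from typing import Optional


def pid_is_alive(pid: int) -> bool:
    """POSIX kill(pid, 0): delivers no signal, only tests that the PID exists."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one PID
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def process_name(pid: int) -> Optional[str]:
    """Linux only: read /proc/<pid>/comm. Returns None if /proc is unusable."""
    try:
        with open(f"/proc/{pid}/comm", encoding="utf-8") as f:
            return f.read().strip()
    except OSError:
        return None


def looks_like_postmaster(pid: int, expected: str = "postmaster") -> bool:
    """True if the PID is alive and is not provably some other program."""
    if not pid_is_alive(pid):
        return False
    name = process_name(pid)
    if name is None:
        return True  # cannot tell: err on the side of "still running"
    return name == expected
```

Note that pid_is_alive() alone has exactly the weakness discussed in this thread: it cannot distinguish a postmaster from any other process that has inherited the PID.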
Joerg Hessdoerfer wrote:
> Well, wouldn't it be better then to do an OS-dependent check for a running
> postmaster, which could use kill() on IMHO broken systems where it's not
> easy to determine the process name for a PID, and more elaborate checking
> on others. On Windows, there's OpenProcess et al, on Linux, one could
> resort to /proc. I didn't develop on too many others, but there should be
> possibilities for those, too.

At this stage in the dev cycle I don't think so - might be worth
improving the robustness post 8.0. Assuming you have access to /proc can
be dangerous too - even if it's there (and in some jail/chroot type
environments it isn't).

One thought I did have was that it might be worth ignoring the .pid file
if its mtime was older than the system boot time, assuming that both
could be determined portably and reliably.

cheers

andrew
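The mtime-versus-boot-time idea can be sketched as below. This is an illustration only: the function names and the slack parameter are hypothetical, and the boot-time helper is Linux-specific (it reads the btime line of /proc/stat), which is precisely the portability concern raised above. If the boot time cannot be determined, the sketch never declares the file stale.

```python
import os
from typing import Optional


def linux_boot_time() -> Optional[float]:
    """Boot time in epoch seconds from the 'btime' line of /proc/stat (Linux only)."""
    try:
        with open("/proc/stat", encoding="ascii") as f:
            for line in f:
                if line.startswith("btime "):
                    return float(line.split()[1])
    except OSError:
        pass
    return None  # not Linux, or /proc unavailable (jail/chroot)


def pidfile_is_stale(pid_mtime: float, boot_time: Optional[float],
                     slack: float = 0.0) -> bool:
    """True only if the lockfile was provably last written before this boot.

    slack (seconds) is a hypothetical safety margin for clock skew.
    """
    if boot_time is None:
        return False  # unknown boot time: keep honouring the lockfile
    return pid_mtime + slack < boot_time


def check_pidfile(path: str) -> bool:
    """Convenience wrapper: is the lockfile at 'path' older than the last boot?"""
    return pidfile_is_stale(os.stat(path).st_mtime, linux_boot_time())
```

A lockfile written after the last boot tells us nothing either way, so this test can only ever supplement the existing checks.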
Hi,

On Wednesday 25 August 2004 19:53, Andrew Dunstan wrote:
> Joerg Hessdoerfer wrote:
> > Well, wouldn't it be better then to do an OS-dependent check for a
> > running postmaster, which could use kill() on IMHO broken systems where
> > it's not easy to determine the process name for a PID, and more
> > elaborate checking on others. On Windows, there's OpenProcess et al, on
> > Linux, one could resort to /proc. I didn't develop on too many others,
> > but there should be possibilities for those, too.
>
> At this stage in the dev cycle I don't think so - might be worth
> improving the robustness post 8.0. Assuming you have access to /proc can
> be dangerous too - even if it's there (and in some jail/chroot type
> environments it isn't).
>
> One thought I did have was that it might be worth ignoring the .pid file
> if its mtime was older than the system boot time, assuming that both
> could be determined portably and reliably.
>
> cheers
>
> andrew

Ok, your objections are sound. But I just thought a little bit more about
this. What's your opinion of this:

On successful startup, the postmaster opens a special TCP socket listening
on 127.0.0.1 only, and also records the port number in postmaster.pid.

On startup, the postmaster reads postmaster.pid, if present, and tries to
connect to the port recorded there. If the connection fails, no postmaster
is present, so continue startup. If the connection is accepted, the
'original' postmaster sends a defined message à la 'PostgreSQL postmaster
version 8.1.0' down the socket and closes the connection. Only if this is
received within a reasonable time are we sure to have a postmaster running
and should abort startup; otherwise we can safely continue.

This should be highly portable, and also catches the case where the
postmaster just crashed without the system rebooting (where the mtime check
would fail, too).

Greetings,
Jörg
--
Leading SW developer - S.E.A GmbH
Mail: joerg.hessdoerfer@sea-gmbh.com
WWW:  http://www.sea-gmbh.com
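For concreteness, the proposed handshake might be sketched as follows. Everything here is an assumption made for illustration: the function names are hypothetical, the banner is simply the example string from the message, and a real postmaster would serve this from its own event loop rather than a thread. A False result from the probe only means "no banner was received in time"; it does not by itself establish that no postmaster exists.

```python
import socket
import threading
from typing import Tuple

# Example banner taken from the proposal; purely illustrative.
BANNER = b"PostgreSQL postmaster version 8.1.0\n"


def start_identity_listener() -> Tuple[socket.socket, int]:
    """Running postmaster side: listen on 127.0.0.1, send BANNER, close."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    srv.listen(5)
    port = srv.getsockname()[1]     # this is what would go into postmaster.pid

    def serve() -> None:
        while True:
            try:
                conn, _ = srv.accept()
            except OSError:
                return              # listening socket was closed
            with conn:
                conn.sendall(BANNER)

    threading.Thread(target=serve, daemon=True).start()
    return srv, port


def postmaster_answers(port: int, timeout: float = 2.0) -> bool:
    """Starting postmaster side: did something send the expected banner?"""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=timeout) as s:
            data = b""
            while len(data) < len(BANNER):
                chunk = s.recv(len(BANNER) - len(data))
                if not chunk:
                    break           # peer closed early
                data += chunk
            return data == BANNER
    except OSError:                 # refused, unreachable, or timed out
        return False
```

Usage would be roughly: the running server calls start_identity_listener() and writes the returned port into the lockfile; a starting server calls postmaster_answers(port) with the port it read back.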
Joerg Hessdoerfer <Joerg.Hessdoerfer@sea-gmbh.com> writes:
> On startup, postmaster reads postmaster.pid, if present, and tries to
> connect to the mentioned port. If the connection fails, no postmaster is
> present,

... or the kernel is filtering the port, or we couldn't resolve
"localhost" (cf various reports of stats collector not working),
or the postmaster is present but overloaded enough to be missing
connection attempts, or ...

> Only if this is received in a reasonable time, we
> are sure to have a postmaster running and should abort startup, else we
> can safely continue.

The real point here is that the behavior has to be to default to
failure, not success. The worst case if we fail incorrectly is that a
small amount of manual intervention is needed to start the postmaster,
ie, remove the lockfile and try again. The worst (and very probable)
case if we succeed incorrectly is extensive, unrecoverable data
corruption. We must *never* have multiple postmasters running against
the same data directory. So taking an attitude of "prove that there is
a working postmaster out there" is quite backwards. You have to think
in terms of "prove that there isn't".

(For the same reason, I am highly suspicious of the quick-fix proposals
we occasionally see to add an "rm $PGDATA/postmaster.pid" to pg_ctl or
the init script. That is nothing but a large-caliber pistol loaded,
cocked, and aimed at your foot.)

I've occasionally thought about abandoning the PID test, in favor of
relying completely on the shmem-existence test. If the shmem segment
named in the lockfile doesn't exist or has zero processes connected
to it, we could safely assume that the original postmaster is gone.
(If it has processes connected, we must abort anyway, to cover the
case where the postmaster crashed but backends remain alive.)

The risk here is that we are then *completely* at the mercy of the OS
having a correct emulation of the SysV shmem semantics, in particular
the ability to detect whether a shmem segment has other processes
connected to it. I'm not sure whether this is true on all the supported
platforms. (This being the win32 list: what about Windows?)

			regards, tom lane
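The "prove that there isn't" rule combined with the shmem-existence test can be written down as a small decision function. This is a sketch only: the names are hypothetical, and the inputs are assumed to come from some platform-specific probe (on SysV systems, the attach count is the shm_nattch field returned by shmctl() with IPC_STAT). None stands for "could not be determined", and every undetermined case refuses to start.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    START = "start"   # proven: no postmaster or backend is using the data dir
    ABORT = "abort"   # anything else, including "could not find out"


def decide_startup(lockfile_present: bool,
                   segment_exists: Optional[bool],
                   attached: Optional[int]) -> Decision:
    """Default to failure: start only when absence has been proven."""
    if not lockfile_present:
        return Decision.START      # no earlier postmaster left a lockfile
    if segment_exists is None:
        return Decision.ABORT      # probe failed: do not guess
    if not segment_exists:
        return Decision.START      # segment named in the lockfile is gone
    if attached is None:
        return Decision.ABORT      # OS cannot report the attach count
    if attached > 0:
        return Decision.ABORT      # postmaster or orphaned backends still attached
    return Decision.START          # segment exists but nobody is attached
```

For example, decide_startup(True, True, 3) yields Decision.ABORT, which covers the crashed-postmaster-with-live-backends case, while a platform that cannot report the attach count always ends up at ABORT.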
On Thursday 26 August 2004 16:25, Tom Lane wrote:
[...]
> The real point here is that the behavior has to be to default to
> failure, not success. The worst case if we fail incorrectly is that a
> small amount of manual intervention is needed to start the postmaster,
> ie, remove the lockfile and try again. The worst (and very probable)
> case if we succeed incorrectly is extensive, unrecoverable data
> corruption. We must *never* have multiple postmasters running against
> the same data directory. So taking an attitude of "prove that there is
> a working postmaster out there" is quite backwards. You have to think
> in terms of "prove that there isn't".
[...]

Of course you're right, as always ;-) Data integrity has to be the absolute
priority, 'twas too early in the morning.

I just got sidetracked because of spurious 'why does our XXX not work?'
questions from people who leave their production DB servers (this is for
industrial manufacturing processes, like laser welding) running on the
machine floor, and whose boxes get shot down every now and then, and
sometimes... pg doesn't start.

Then the 'small amount of manual intervention' sometimes is not so small -
depending on OS and/or configuration, or even remote access to the box.

Mind you, these are mostly non-computer-savvy people, and they sometimes
get upset when 'the system does not start up correctly' - because that means
they can't currently produce a car!

We're working around this by adding a shell script that removes
'postmaster.pid' as the last action at system *shutdown*, so we can tell
them to 'restart the machine', and everything usually just works fine.

But a postmaster-internal yet safe mechanism would be great. Just
daydreaming...

Greetings,
Jörg
--
Leading SW developer - S.E.A GmbH
Mail: joerg.hessdoerfer@sea-gmbh.com
WWW:  http://www.sea-gmbh.com