Обсуждение: Stats collection on Windows
Hi all, I think I've found the cause (or one of the causes) why stats collection is unreliable on Windows and I'm wondering about the best way to go about fixing it. The problem is that process IDs on Windows seem to be assigned without much rhyme or reason and it seems to happen relatively frequently that a new process will be assigned the same process ID as a process which recently died. If this happens before the backend has been expired out of pgstat.c's pgStatBeDead hash, the backend will be missed. I was thinking the postmaster could maintain a backend sequence number with similar semantics to a UNIXish process ID which could then be used as the key for pgStatBeDead instead of the actual process ID. Does that sound reasonable? Peter
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > I think I've found the cause (or one of the causes) why stats > collection is unreliable on Windows and I'm wondering about the best way > to go about fixing it. > The problem is that process IDs on Windows seem to be assigned without > much rhyme or reason and it seems to happen relatively frequently that a > new process will be assigned the same process ID as a process which > recently died. If this happens before the backend has been expired out > of pgstat.c's pgStatBeDead hash, the backend will be missed. That's an interesting theory, but do you have any actual evidence for it? The evidence I've seen says that our big problem on Windows is the stats collector process just quitting due to unexplained piperead() failures. (I mean, I'd love to blame Microsoft for everything, but even the Redmond crowd should be able to figure out that recycling process IDs instantly would be a stupid idea...) regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > >> The problem is that process IDs on Windows seem to be assigned without >> much rhyme or reason and it seems to happen relatively frequently that a >> new process will be assigned the same process ID as a process which >> recently died. > > That's an interesting theory, but do you have any actual evidence for it? > I can confirm that on Windows 2000 the process ID is recycled instantly. Regards, Qingqing
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > Redmond crowd should be able to figure out that recycling process IDs > instantly would be a stupid idea...) > Can you explain more of this? IMHO, if we rely on feature like this, the difference is unstable-every-day vs. unstable-every-year. Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > "Tom Lane" <tgl@sss.pgh.pa.us> wrote >> Redmond crowd should be able to figure out that recycling process IDs >> instantly would be a stupid idea...) > Can you explain more of this? IMHO, if we rely on feature like this, the > difference is unstable-every-day vs. unstable-every-year. The mere existence of the kill() primitive should bring to mind reasons why it's a bad idea. regards, tom lane
> >> Redmond crowd should be able to figure out that recycling > process IDs > >> instantly would be a stupid idea...) > > > Can you explain more of this? IMHO, if we rely on feature > like this, > > the difference is unstable-every-day vs. unstable-every-year. > > The mere existence of the kill() primitive should bring to > mind reasons why it's a bad idea. Except the kill() primitive *does not exist* on windows. That said, how did you go about to confirm that the pid is recycled instantly? I was under the impression that it assignes any unused pid in random order, which is also what a quick glance at my XP box looks like (don't have a 2000 box around, but I wasn't aware of such a change between those - but it's certainly not impossible). But if oyu had some better method of determining it, please let me know :-) If that's how, several other OSes do the same thing AFAIK - for security reasons. For example OpenBSD. So if we rely heavily on that, we may be in trouble elsewhere as well. //Magnus
On Tue, Apr 04, 2006 at 11:02:11PM -0400, Tom Lane wrote: > "Peter Brant" <Peter.Brant@wicourts.gov> writes: > > I think I've found the cause (or one of the causes) why stats > > collection is unreliable on Windows and I'm wondering about the best way > > to go about fixing it. > > The problem is that process IDs on Windows seem to be assigned without > > much rhyme or reason and it seems to happen relatively frequently that a > > new process will be assigned the same process ID as a process which > > recently died. If this happens before the backend has been expired out > > of pgstat.c's pgStatBeDead hash, the backend will be missed. > That's an interesting theory, but do you have any actual evidence for it? > The evidence I've seen says that our big problem on Windows is the stats > collector process just quitting due to unexplained piperead() failures. > (I mean, I'd love to blame Microsoft for everything, but even the > Redmond crowd should be able to figure out that recycling process IDs > instantly would be a stupid idea...) Why? :-) They use HANDLE. The process ID isn't nearly as useful as it is on UNIX. I haven't looked at that stuff in a long time, but process "ID" on Windows may be a compatibility method. Process "ID" isn't necessarily a good way of identifying tasks, precisely because they may be reused. Using a serial allocated at process start might make more sense. Relying on it not being reused, is not dissimilar to the old malloc() "tricks" of assuming that malloc() will not return something recently free()d. It's bad. Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bindthem... http://mark.mielke.cc/
On Tue, Apr 04, 2006 at 11:17:49PM -0400, Tom Lane wrote: > "Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > > "Tom Lane" <tgl@sss.pgh.pa.us> wrote > >> Redmond crowd should be able to figure out that recycling process IDs > >> instantly would be a stupid idea...) > > Can you explain more of this? IMHO, if we rely on feature like this, the > > difference is unstable-every-day vs. unstable-every-year. > The mere existence of the kill() primitive should bring to mind reasons > why it's a bad idea. CreateProcess - "The process is assigned a process identifier. The identifier is valid until the process terminates. It can be used to identify the process, or specified in the OpenProcess function to open a handle to the process. The initial thread in the process is also assigned a thread identifier. ..." TerminateProcess - "Terminates the specified process and all of its threads." TerminateProcess takes a HANDLE, not a process identifier. Yes, they provide the "kill" primitive, but only as a compatibility measure. A "good" Windows process, should maintain a HANDLE to the process, and kill the process using the HANDLE. This way, there is no race. The HANDLE is also how you wait for the process to terminate normally. I prefer the "Redmond" way, in that I find UNIX's use of integer identifiers to be encouraging of race conditions. UNIX requires hacks like minimizing PID reuse, because UNIX is the one that is broken. :-) Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bindthem... http://mark.mielke.cc/
On Wed, Apr 05, 2006 at 03:20:47AM -0400, mark@mark.mielke.cc wrote: > TerminateProcess takes a HANDLE, not a process identifier. Yes, they > provide the "kill" primitive, but only as a compatibility measure. A > "good" Windows process, should maintain a HANDLE to the process, and > kill the process using the HANDLE. This way, there is no race. The > HANDLE is also how you wait for the process to terminate normally. Which presents the solution, we should use the HANDLE on windows rather than the process identifier. > I prefer the "Redmond" way, in that I find UNIX's use of integer > identifiers to be encouraging of race conditions. UNIX requires hacks > like minimizing PID reuse, because UNIX is the one that is broken. :-) Eh? A HANDLE is (or can be mapped to) an integer too. I don't see anything on that page about handle reuse. If you run a machine long enough I'm sure it can be reused also... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
On Wed, Apr 05, 2006 at 09:30:06AM +0200, Martijn van Oosterhout wrote: > On Wed, Apr 05, 2006 at 03:20:47AM -0400, mark@mark.mielke.cc wrote: > > TerminateProcess takes a HANDLE, not a process identifier. Yes, they > > provide the "kill" primitive, but only as a compatibility measure. A > > "good" Windows process, should maintain a HANDLE to the process, and > > kill the process using the HANDLE. This way, there is no race. The > > HANDLE is also how you wait for the process to terminate normally. > Which presents the solution, we should use the HANDLE on windows rather > than the process identifier. Yes. > > I prefer the "Redmond" way, in that I find UNIX's use of integer > > identifiers to be encouraging of race conditions. UNIX requires hacks > > like minimizing PID reuse, because UNIX is the one that is broken. :-) > Eh? A HANDLE is (or can be mapped to) an integer too. I don't see > anything on that page about handle reuse. If you run a machine long > enough I'm sure it can be reused also... Once upon a time, when I played with this stuff (I mostly use UNIX, not Windows), I concluded to myself that HANDLE was process-local, and that it was allocated. Meaning - it won't be re-used until you CloseHandle(). It's best then, to think of HANDLE as a opaque object. Regardless, of whether it is process-local or not, until you run CloseHandle(), it is yours to keep, and it won't be re-used. Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bindthem... http://mark.mielke.cc/
On Wed, Apr 05, 2006 at 03:38:28AM -0400, mark@mark.mielke.cc wrote: > Once upon a time, when I played with this stuff (I mostly use UNIX, not > Windows), I concluded to myself that HANDLE was process-local, and that > it was allocated. Meaning - it won't be re-used until you CloseHandle(). > It's best then, to think of HANDLE as a opaque object. Regardless, of > whether it is process-local or not, until you run CloseHandle(), it is > yours to keep, and it won't be re-used. HANDLE is process local? That is worse then, because then there's no guarentee that each process will see a different identifier. The stats collector identifies processes by their process id, which they get using getpid(). If instead they used a handle for their own process (GetCurrentProcess() always returns -1, but you can apparently clone it to get a real handle), you have no idea whether that handle is unique amongst backends, because it's process local. The stats collector doesn't have any open handles for the backend, it's just a way for backends to identify themselves. It appears that process handles are not up to the task either... Do we have a plan C? -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
> > Once upon a time, when I played with this stuff (I mostly use UNIX, > > not Windows), I concluded to myself that HANDLE was > process-local, and > > that it was allocated. Meaning - it won't be re-used until > you CloseHandle(). > > It's best then, to think of HANDLE as a opaque object. > Regardless, of > > whether it is process-local or not, until you run > CloseHandle(), it is > > yours to keep, and it won't be re-used. > > HANDLE is process local? That is worse then, because then > there's no guarentee that each process will see a different > identifier. HANDLE is process local. What you need to do is run DuplicateHandle() on it specifying it should "also be valid for process Y" (for which you need a HANDLE opened, in this case the stats collector). This will give you a new handle whichi s valid in the *target process*, but it is *not* valid in your own process. A quick look shows that pids appear to be reused quite fast on Windows 2000, but *not* on XP or 2003. Don't have any other versions around to test. AFAIK, the pid on unix is also only vaild while the process is running. It's not likely to be reused quickly, which appears to be the same for XP+ on win32. > Do we have a plan C? Well, first we need to know that this really *is* the problem. I'm unsure if we've seen "proof" that this actually causes a real problem. (It's certainly not the main issue with teh stats system on win32, which is the fact that it crashes now and then) How likely is it that the arrive-in-wrong-order really happens? And can we do something about that one? (I assume the problem is that the backend-die message is sent from postmaster, meaning that the die can appear before the alive sent from the backend itself. What if we changed it so we had one message sent from a backend doing a controlled shutdown (99.9% of the time) and a different one when the postmaster picks up a dead backend. And we apply the "dead counter" only to the messages from the postmaster, and expire the record right away when we receive one from a controlled shutdown (which would happen in-line with the backend-start message) //Magnus //Magnus
> > >> Redmond crowd should be able to figure out that > recycling process > > >> IDs instantly would be a stupid idea...) > > > Can you explain more of this? IMHO, if we rely on feature > like this, > > > the difference is unstable-every-day vs. unstable-every-year. > > The mere existence of the kill() primitive should bring to mind > > reasons why it's a bad idea. > > CreateProcess - "The process is assigned a process > identifier. The identifier is valid until the process > terminates. It can be used to identify the process, or > specified in the OpenProcess function to open a handle to the > process. The initial thread in the process is also assigned a > thread identifier. ..." > > TerminateProcess - "Terminates the specified process and all > of its threads." > > TerminateProcess takes a HANDLE, not a process identifier. Yes. You get the handle by doing OpenProcess() with PROCESS_TERMINATE access. The functions used to enumerate processes all return the process id, not a HANDLE.(See Process32First/Process32Next). You use the HANDLE only when you need to *modify* something (for example, killing it). > Yes, they provide the "kill" primitive, but only as a > compatibility measure. They do? Where is it? Certainly not in the documented SDK that I can see. > A "good" Windows process, should > maintain a HANDLE to the process, and kill the process using > the HANDLE. This way, there is no race. The HANDLE is also > how you wait for the process to terminate normally. A "good" Windows process would be using threads ;-) That's what the Windows API was written for, and that's why anything that deals with multiple processes working together is a lot harder than on Unix. //Magnus
On Wed, Apr 05, 2006 at 09:58:54AM +0200, Martijn van Oosterhout wrote: > On Wed, Apr 05, 2006 at 03:38:28AM -0400, mark@mark.mielke.cc wrote: > > Once upon a time, when I played with this stuff (I mostly use UNIX, not > > Windows), I concluded to myself that HANDLE was process-local, and that > > it was allocated. Meaning - it won't be re-used until you CloseHandle(). > > It's best then, to think of HANDLE as a opaque object. Regardless, of > > whether it is process-local or not, until you run CloseHandle(), it is > > yours to keep, and it won't be re-used. > HANDLE is process local? That is worse then, because then there's no > guarentee that each process will see a different identifier. It's no different from a file descriptor on UNIX. Neither UNIX nor Windows promise that a process identifier is valid beyond the life of the process. UNIX avoids it from happening, as it is necessary to avoid races with system calls such as kill(). Windows does not have this problem. > The stats collector identifies processes by their process id, which > they get using getpid(). If instead they used a handle for their own > process (GetCurrentProcess() always returns -1, but you can apparently > clone it to get a real handle), you have no idea whether that handle is > unique amongst backends, because it's process local. > The stats collector doesn't have any open handles for the backend, it's > just a way for backends to identify themselves. It appears that process > handles are not up to the task either... > Do we have a plan C? Sure. Serial. Allocate on process start. Or, back to another topic from months ago - UUID generation... :-) Process identifier should not be used beyond the life of the process. As soon as wait() removes the process on UNIX, the process identifier is no longer valid, and could be reused. That the operating system tries to prevent problems by avoiding recycling isn't necessarily a good reason to exploit this capability. Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bindthem... http://mark.mielke.cc/
On Wed, Apr 05, 2006 at 10:31:37AM +0200, Magnus Hagander wrote: > > TerminateProcess takes a HANDLE, not a process identifier. > Yes. You get the handle by doing OpenProcess() with PROCESS_TERMINATE > access. > The functions used to enumerate processes all return the process id, not > a HANDLE.(See Process32First/Process32Next). You use the HANDLE only > when you need to *modify* something (for example, killing it). Enumerating processes isn't a common operationg for programs other than 'top'. :-) But even so, one can use OpenProcess() to evaluate that the process being referenced is actually the one you think it is, before calling TerminateProcess(), eliminating the race. > > Yes, they provide the "kill" primitive, but only as a > > compatibility measure. > They do? Where is it? Certainly not in the documented SDK that I can > see. I believe it is called KillProcess(). > > A "good" Windows process, should > > maintain a HANDLE to the process, and kill the process using > > the HANDLE. This way, there is no race. The HANDLE is also > > how you wait for the process to terminate normally. > A "good" Windows process would be using threads ;-) > That's what the Windows API was written for, and that's why anything > that deals with multiple processes working together is a lot harder than > on Unix. Haha. Good point. :-) Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bindthem... http://mark.mielke.cc/
On Wed, Apr 05, 2006 at 06:03:31AM -0400, mark@mark.mielke.cc wrote: > It's no different from a file descriptor on UNIX. > > Neither UNIX nor Windows promise that a process identifier is valid > beyond the life of the process. UNIX avoids it from happening, as it > is necessary to avoid races with system calls such as kill(). Windows > does not have this problem. But consider, why is the process id there? (Amongst other reasons) so that users can monitor pg_stat_activity and kill a backend that's out of control. The equivalent to this in windows would be to. 1. Get HANDLE from the process ID. 2. TerminateProcess with that HANDLE. Presumably users have a little GUI app that displays processes on the system with the process ID so they can kill it. If a process ID is quickly reused, they may end up killing the wrong one. The race condition in this case involves the user and you can't solve that programmatically. The non-reuse of pids is more for user-friendlyness than anything else. The Window use of HANDLE doesn't solve this problem at all. > Sure. Serial. Allocate on process start. > > Or, back to another topic from months ago - UUID generation... :-) Neither of which solve the "I'm a user and want to kill *that* backend" problem. Because even on windows you'll have to get the process id, convert it to a handle an kill it. Same race condition. > Process identifier should not be used beyond the life of the process. > As soon as wait() removes the process on UNIX, the process identifier > is no longer valid, and could be reused. That the operating system tries > to prevent problems by avoiding recycling isn't necessarily a good > reason to exploit this capability. Yeah, but it's very useful as a user. Consider this scenerio: <process 1234 goes AWOL> kill -INT 1234 <check if process still there> kill -TERM 1234 <process still there, damn it!> kill -9 1234 There gone. With no quick PID reuse I can be sure I won't kill the wrong one. This is presumably why recent versions of windows don't reuse pids quickly either... It for *users* not programs. -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
On Wed, Apr 05, 2006 at 12:20:49PM +0200, Martijn van Oosterhout wrote: > On Wed, Apr 05, 2006 at 06:03:31AM -0400, mark@mark.mielke.cc wrote: > > It's no different from a file descriptor on UNIX. > > Neither UNIX nor Windows promise that a process identifier is valid > > beyond the life of the process. UNIX avoids it from happening, as it > > is necessary to avoid races with system calls such as kill(). Windows > > does not have this problem. > But consider, why is the process id there? (Amongst other reasons) so > that users can monitor pg_stat_activity and kill a backend that's out > of control. The equivalent to this in windows would be to. > 1. Get HANDLE from the process ID. > 2. TerminateProcess with that HANDLE. You missed 1.5. Ensure that HANDLE matches the process you intend. :-) This is a little bit of a distraction though, as any system that requires the user to be able to kill broken backends, is only indicative of a broken backend. We're talking about how to deal with a broken process, after the process owner (PostgreSQL) has forgotten about it. What would be wrong with having a PostgreSQL generated serial associated with each backend, that can be used by the backend process owner, to map to a HANDLE, which uses TerminateProcess() underneath? > Presumably users have a little GUI app that displays processes on the > system with the process ID so they can kill it. If a process ID is > quickly reused, they may end up killing the wrong one. Sure. But the problem here, is that PostgreSQL is broken, so they find a need to go to their process listing GUI. And who is to say that the GUI doesn't do as I've suggested above? Ensure that TerminateProcess() is against the intended process? I have no idea if it does - but it could. > The race condition in this case involves the user and you can't solve > that programmatically. The non-reuse of pids is more for > user-friendlyness than anything else. The Window use of HANDLE doesn't > solve this problem at all. Sure you can. But it also shouldn't matter. > > Sure. Serial. Allocate on process start. > > Or, back to another topic from months ago - UUID generation... :-) > Neither of which solve the "I'm a user and want to kill *that* backend" > problem. Because even on windows you'll have to get the process id, > convert it to a handle an kill it. Same race condition. No, and not if PostgreSQL kills the backend for you cleanly, using the HANDLE, that it owns for the process. > > Process identifier should not be used beyond the life of the process. > > As soon as wait() removes the process on UNIX, the process identifier > > is no longer valid, and could be reused. That the operating system tries > > to prevent problems by avoiding recycling isn't necessarily a good > > reason to exploit this capability. > Yeah, but it's very useful as a user. Consider this scenerio: > <process 1234 goes AWOL> > kill -INT 1234 > <check if process still there> > kill -TERM 1234 > <process still there, damn it!> > kill -9 1234 > There gone. With no quick PID reuse I can be sure I won't kill the > wrong one. This is presumably why recent versions of windows don't reuse > pids quickly either... It for *users* not programs. Users shouldn't need to kill programs. If they do - on the off chance that they do, they should be more careful than you list above. That's a kneejerk reaction to a problem that shouldn't occur in the first place. It's a habit trained by UNIX users. I rarely ever need to kill a process on my Windows box, and when I have, the process listing in the GUI hasn't offered me a process ID. I click on the item I wish to kill, and I right click "End Process". A lot more user friendly, in my opinion. And if they happen to fix the race in the background using the method I suggest above? All the better... Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bindthem... http://mark.mielke.cc/
> > It's no different from a file descriptor on UNIX. > > > > Neither UNIX nor Windows promise that a process identifier is valid > > beyond the life of the process. UNIX avoids it from > happening, as it > > is necessary to avoid races with system calls such as > kill(). Windows > > does not have this problem. > > But consider, why is the process id there? (Amongst other > reasons) so that users can monitor pg_stat_activity and kill > a backend that's out of control. The equivalent to this in > windows would be to. > > 1. Get HANDLE from the process ID. > 2. TerminateProcess with that HANDLE. > > Presumably users have a little GUI app that displays > processes on the system with the process ID so they can kill > it. If a process ID is quickly reused, they may end up > killing the wrong one. Actualy, not quite. That's the equivalent of kill -9. What you'd do is "pg_ctl kill TERM <pid>". Or whatever other signal you want to send. But you still need the pid, there is no way to get the required information from a plain handle when you're not in that process. Though I'd say I'm usually more interested in noticing what process it is, rather than killing it. Possibly canceling it, but very seldomly actually kill it. But it still needs the process id, yes. (BTW, the GUI would generaelly be Task Manager, so it's not like it's a special tool) <snip rest> //Magnus
> > > Yes, they provide the "kill" primitive, but only as a > compatibility > > > measure. > > They do? Where is it? Certainly not in the documented SDK > that I can > > see. > > I believe it is called KillProcess(). No such function documented on my MSDN, other than one in SQL Server DMO. Are you referring to TerminateProcess()? If so, it can just terminate the process, not send it arbitrary signals. Considering windows doesnt' reallyi have the concept of signals, that's not very surprising though :-) //Magnus
> This is a little bit of a distraction though, as any system > that requires the user to be able to kill broken backends, is > only indicative of a broken backend. We're talking about how > to deal with a broken process, after the process owner > (PostgreSQL) has forgotten about it. Don't confuse the postmaster with the statistics system. the postmaster knows about the backends per handle, but the stats system runs outside of the postmaster, and can possibly get confused. > What would be wrong with having a PostgreSQL generated serial > associated with each backend, that can be used by the backend > process owner, to map to a HANDLE, which uses > TerminateProcess() underneath? It might be a good idea to use the combination pid+internalid. Note taht we *don't* just want to TerminateProcess(), we need the pid to send actual signals with. And in this case, that's not even the problem - the problem is that we're not dropping the process from our list (I think). > > Presumably users have a little GUI app that displays > processes on the > > system with the process ID so they can kill it. If a process ID is > > quickly reused, they may end up killing the wrong one. > > Sure. But the problem here, is that PostgreSQL is broken, so > they find a need to go to their process listing GUI. And who > is to say that the GUI doesn't do as I've suggested above? > Ensure that TerminateProcess() is against the intended > process? I have no idea if it does - but it could. Because the GUI is most likely either Task Manager or Process Explorer. There is no need for a special GUI. > > > Process identifier should not be used beyond the life of > the process. > > > As soon as wait() removes the process on UNIX, the process > > > identifier is no longer valid, and could be reused. That the > > > operating system tries to prevent problems by avoiding recycling > > > isn't necessarily a good reason to exploit this capability. > > Yeah, but it's very useful as a user. Consider this scenerio: > > <process 1234 goes AWOL> > > kill -INT 1234 > > <check if process still there> > > kill -TERM 1234 > > <process still there, damn it!> > > kill -9 1234 > > There gone. With no quick PID reuse I can be sure I won't kill the > > wrong one. This is presumably why recent versions of windows don't > > reuse pids quickly either... It for *users* not programs. > > Users shouldn't need to kill programs. And normally they don't. But again, this isn't where the problem is. If your process is stuck and you need to kill it, it most likely will still be around once you get to killing it. If not, you were trying to kill it too soon. > I rarely ever need to kill a process on my Windows box, and > when I have, the process listing in the GUI hasn't offered me > a process ID. I click on the item I wish to kill, and I right > click "End Process". A lot more user friendly, in my opinion. > And if they happen to fix the race in the background using > the method I suggest above? All the better... AFAIK, Task Manager uses the pid to kill the task. Try tracing it. (And it will happily show you the pid as well - just enable it in the checkbox. It's the only reasonable way to differ between two processes off the same exe, for example two different notepad:s...) //Magnus
"Magnus Hagander" <mha@sollentuna.net> writes: >> HANDLE is process local? That is worse then, because then >> there's no guarentee that each process will see a different >> identifier. > HANDLE is process local. What you need to do is run DuplicateHandle() on > it specifying it should "also be valid for process Y" (for which you > need a HANDLE opened, in this case the stats collector). This will give > you a new handle whichi s valid in the *target process*, but it is *not* > valid in your own process. What happens if process Y goes away between the time you obtain a handle for it and the time you try to run this DuplicateHandle call? regards, tom lane
> >> HANDLE is process local? That is worse then, because then > there's no > >> guarentee that each process will see a different identifier. > > > HANDLE is process local. What you need to do is run > DuplicateHandle() > > on it specifying it should "also be valid for process Y" (for which > > you need a HANDLE opened, in this case the stats > collector). This will > > give you a new handle whichi s valid in the *target > process*, but it > > is *not* valid in your own process. > > What happens if process Y goes away between the time you > obtain a handle for it and the time you try to run this > DuplicateHandle call? I don't know offhand. I would assume one of two things: 1) DuplicateHandle() fails. 2) You get a handle back that is valid in the dead process (meaning it's not valid). I can put together some quick test-code for this if you need me to? //Magnus
2006/4/5, Tom Lane <tgl@sss.pgh.pa.us>:
The structures for a process should remain as long as there are any open HANDLEs, even if the process has ended. Threads behave the same way, which is why a creator needs to remember to always close the HANDLE returned from CreateThread.
What happens if process Y goes away between the time you obtain a
handle for it and the time you try to run this DuplicateHandle call?
The structures for a process should remain as long as there are any open HANDLEs, even if the process has ended. Threads behave the same way, which is why a creator needs to remember to always close the HANDLE returned from CreateThread.
"Magnus Hagander" <mha@sollentuna.net> writes: >> What happens if process Y goes away between the time you >> obtain a handle for it and the time you try to run this >> DuplicateHandle call? > I can put together some quick test-code for this if you need me to? Nah, it was just a rhetorical question meant to poke a hole in the claim that Windows can avoid race conditions by using HANDLEs. AFAICS, don't-reuse-PIDs-too-quick has exact analogs that Windows has to solve by ensuring it doesn't reuse HANDLEs too quick. regards, tom lane
2006/4/5, Tom Lane <tgl@sss.pgh.pa.us>:
There's a disconnect here. handles aren't process identifiers: they're reference counted "pointers" to the kernel structures for the process. If you are holding a handle (ie: from CreateProcess or OpenProcess) that handle cannot and will not be reclaimed until you call CloseHandle (or your process itself exits). You should never retain a handle after you've called CloseHandle on it.
Which brings an interesting thought: are process ID's reclaimed while open handles remain? I'm willing to bet the answer is no. In that case, the stats collector could retain the handle until it's done with the process ID.
AFAICS, don't-reuse-PIDs-too-quick has exact analogs that Windows has
to solve by ensuring it doesn't reuse HANDLEs too quick.
There's a disconnect here. handles aren't process identifiers: they're reference counted "pointers" to the kernel structures for the process. If you are holding a handle (ie: from CreateProcess or OpenProcess) that handle cannot and will not be reclaimed until you call CloseHandle (or your process itself exits). You should never retain a handle after you've called CloseHandle on it.
Which brings an interesting thought: are process ID's reclaimed while open handles remain? I'm willing to bet the answer is no. In that case, the stats collector could retain the handle until it's done with the process ID.
> >> What happens if process Y goes away between the time you obtain a > >> handle for it and the time you try to run this > DuplicateHandle call? > > > I can put together some quick test-code for this if you need me to? > > Nah, it was just a rhetorical question meant to poke a hole > in the claim that Windows can avoid race conditions by using HANDLEs. That's what I thought, which is why I asked first. > AFAICS, don't-reuse-PIDs-too-quick has exact analogs that > Windows has to solve by ensuring it doesn't reuse HANDLEs too quick. Windows doesn't "reuse HANDLEs" in that way. Handles are like file descriptors. The kernel structure represnting the process has a *process id*, not a handle. That is what uniquely identifies the process. (In the kernel it's often referred to as a "client id"). The process structure will be held as long as anybody has an open handle to the process. Thus, as long as this happens, the pid *cannot* be reused. So one way to assure the pid isn't recycled too soon is to keep the postmasters handle open "long enough" (the postmaster already has a handle there - it's just the pgstats process that doesn't have one). RIght now we close the postmaster handle first (win32_removechild at ~ 2086 in postmaster.c) and then fire off the message (CleanupBackend at ~2239). //Magnus
> There's a disconnect here. handles aren't process > identifiers: they're reference counted "pointers" to the > kernel structures for the process. If you are holding a > handle (ie: from CreateProcess or OpenProcess) that handle > cannot and will not be reclaimed until you call CloseHandle > (or your process itself exits). You should never retain a > handle after you've called CloseHandle on it. > > Which brings an interesting thought: are process ID's > reclaimed while open handles remain? I'm willing to bet the > answer is no. In that case, the stats collector could retain > the handle until it's done with the process ID. Since the process id lives in the kernel structure they point to, they can't very well be recycled... (classical handle-leak-kills-windows-system-by-running-out-of-kernel-space scenario). The problem is, the stats collector doesn't have the handle in the first place. I guess it could open one upon receiving the bestart message, but it seems like a kludge. Would be nicer if it could be done cleaner :-) //Magnus
"Peter Brant" <Peter.Brant@wicourts.gov> writes: > I added some strategic printfs to pgstat.c. Attached is the output when > a little program is run which, in a loop, makes 10 connections, sleeps 3 > seconds, closes them, sleeps another 3 seconds. My workstation (Windows > XP) was otherwise idle. > Search for "is known to be dead, ignoring" to find the re-used process > IDs. Things start out clean, but after a few cycles anywhere between 1 > and 5 backends are being missed. Looking at the pgstats code, I notice that once it makes an entry in the dead-backends hashtable, it keeps that entry (rejecting any messages with the same PID) for 10 seconds. That seems like approximately forever on modern machines, certainly much more than any plausible out-of-order condition in the UDP packet stream. It could easily be enough to get us in trouble on Unix machines, never mind Windows. A conservative suggestion would be to trim down the destroy interval. A more radical one is to question whether we need the destroy delay mechanism at all. What if we got rid of all that logic and simply let the collector delete stuff when it's told to? Out-of-order messages could cause entries to be re-created after they've been deleted, but I'm not sure that I see any harm in that. Bogus DB and table entries are already ignored in the pgstats views (because they won't join to anything in the system catalogs) and we also have a filter for bogus backend entries. There are also mechanisms that ensure these entries will go away eventually: pgstat_vacuum_tabstat for DB and table entries, and eventual re-use of a BackendId slot for backends. So I'm sort of thinking that the destroy delay has outlived its usefulness. regards, tom lane
> > Search for "is known to be dead, ignoring" to find the > re-used process > > IDs. Things start out clean, but after a few cycles > anywhere between > > 1 and 5 backends are being missed. > > Looking at the pgstats code, I notice that once it makes an > entry in the dead-backends hashtable, it keeps that entry > (rejecting any messages with the same PID) for 10 seconds. > That seems like approximately forever on modern machines, > certainly much more than any plausible out-of-order condition > in the UDP packet stream. It could easily be enough to get > us in trouble on Unix machines, never mind Windows. > > A conservative suggestion would be to trim down the destroy interval. > A more radical one is to question whether we need the destroy > delay mechanism at all. What if we got rid of all that logic > and simply let the collector delete stuff when it's told to? > Out-of-order messages could cause entries to be re-created > after they've been deleted, but I'm not sure that I see any > harm in that. Bogus DB and table entries are already ignored > in the pgstats views (because they won't join to anything in > the system catalogs) and we also have a filter for bogus > backend entries. There are also mechanisms that ensure these > entries will go away eventually: pgstat_vacuum_tabstat for DB > and table entries, and eventual re-use of a BackendId slot > for backends. > So I'm sort of thinking that the destroy delay has outlived > its usefulness. After reviewing my own patch I just sent in (stats write delay), I realised I had just upped that from 10 seconds to 5 minutes. Which is then even longer than forever :) Which prompted me to consider suggesting just this - do we really need the destroy functionality. From what I could tell it did look reasonable to get rid of it for these reasons. //Magnus
"Magnus Hagander" <mha@sollentuna.net> writes: > After reviewing my own patch I just sent in (stats write delay), I > realised I had just upped that from 10 seconds to 5 minutes. Which is > then even longer than forever :) > Which prompted me to consider suggesting just this - do we really need > the destroy functionality. Which was in fact exactly my train of thought ... regards, tom lane
On Wed, Apr 05, 2006 at 10:05:56AM -0400, Tom Lane wrote: > What happens if process Y goes away between the time you obtain a > handle for it and the time you try to run this DuplicateHandle call? Think of Windows HANDLE like UNIX fd, but Windows HANDLE works for everything - not just sockets, files, pipes, and character devices. Process Y doesn't go away until all references to it, via HANDLE, have been closed. It may not be running. It may have an exit status available. It doesn't go away, though, until you are done with it, and everybody who has a reference to it does CloseHandle(). Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bindthem... http://mark.mielke.cc/
On Wed, Apr 05, 2006 at 10:30:36AM -0400, Tom Lane wrote: > "Magnus Hagander" <mha@sollentuna.net> writes: > >> What happens if process Y goes away between the time you > >> obtain a handle for it and the time you try to run this > >> DuplicateHandle call? > > I can put together some quick test-code for this if you need me to? > Nah, it was just a rhetorical question meant to poke a hole in the > claim that Windows can avoid race conditions by using HANDLEs. > AFAICS, don't-reuse-PIDs-too-quick has exact analogs that Windows has > to solve by ensuring it doesn't reuse HANDLEs too quick. No. It means you don't understand what a HANDLE is. But that's fine - because you understand DB stuff to compensate... :-) Cheers, mark -- mark@mielke.cc / markm@ncf.ca / markm@nortel.com __________________________ . . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder |\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ | | | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada One ring to rule them all, one ring to find them, one ring to bring them all and in the darkness bindthem... http://mark.mielke.cc/
I'm going to rip out the destroy code and see how it goes. Patch will be forthcoming if things turn out well. Pete >>> Tom Lane <tgl@sss.pgh.pa.us> 04/05/06 4:49 pm >>> So I'm sort of thinking that the destroy delay has outlived its usefulness. regards, tom lane