Обсуждение: Stats collection on Windows

Поиск
Список
Период
Сортировка

Stats collection on Windows

От
"Peter Brant"
Дата:
Hi all,

I think I've found the cause (or one of the causes) why stats
collection is unreliable on Windows and I'm wondering about the best way
to go about fixing it.

The problem is that process IDs on Windows seem to be assigned without
much rhyme or reason and it seems to happen relatively frequently that a
new process will be assigned the same process ID as a process which
recently died.  If this happens before the backend has been expired out
of pgstat.c's pgStatBeDead hash, the backend will be missed.

I was thinking the postmaster could maintain a backend sequence number
with similar semantics to a UNIXish process ID which could then be used
as the key for pgStatBeDead instead of the actual process ID.  Does that
sound reasonable?

Peter


Re: Stats collection on Windows

От
Tom Lane
Дата:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> I think I've found the cause (or one of the causes) why stats
> collection is unreliable on Windows and I'm wondering about the best way
> to go about fixing it.

> The problem is that process IDs on Windows seem to be assigned without
> much rhyme or reason and it seems to happen relatively frequently that a
> new process will be assigned the same process ID as a process which
> recently died.  If this happens before the backend has been expired out
> of pgstat.c's pgStatBeDead hash, the backend will be missed.

That's an interesting theory, but do you have any actual evidence for it?
The evidence I've seen says that our big problem on Windows is the stats
collector process just quitting due to unexplained piperead() failures.

(I mean, I'd love to blame Microsoft for everything, but even the
Redmond crowd should be able to figure out that recycling process IDs
instantly would be a stupid idea...)
        regards, tom lane


Re: Stats collection on Windows

От
"Qingqing Zhou"
Дата:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
>> The problem is that process IDs on Windows seem to be assigned without
>> much rhyme or reason and it seems to happen relatively frequently that a
>> new process will be assigned the same process ID as a process which
>> recently died.
>
> That's an interesting theory, but do you have any actual evidence for it?
>

I can confirm that on Windows 2000 the process ID is recycled instantly.

Regards,
Qingqing




Re: Stats collection on Windows

От
"Qingqing Zhou"
Дата:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> Redmond crowd should be able to figure out that recycling process IDs
> instantly would be a stupid idea...)
>

Can you explain more of this? IMHO, if we rely on feature like this, the 
difference is unstable-every-day vs. unstable-every-year.

Regards,
Qingqing 




Re: Stats collection on Windows

От
Tom Lane
Дата:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> wrote
>> Redmond crowd should be able to figure out that recycling process IDs
>> instantly would be a stupid idea...)

> Can you explain more of this? IMHO, if we rely on feature like this, the 
> difference is unstable-every-day vs. unstable-every-year.

The mere existence of the kill() primitive should bring to mind reasons
why it's a bad idea.
        regards, tom lane


Re: Stats collection on Windows

От
"Magnus Hagander"
Дата:
> >> Redmond crowd should be able to figure out that recycling
> process IDs
> >> instantly would be a stupid idea...)
>
> > Can you explain more of this? IMHO, if we rely on feature
> like this,
> > the difference is unstable-every-day vs. unstable-every-year.
>
> The mere existence of the kill() primitive should bring to
> mind reasons why it's a bad idea.

Except the kill() primitive *does not exist* on windows.

That said, how did you go about to confirm that the pid is recycled
instantly? I was under the impression that it assignes any unused pid in
random order, which is also what a quick glance at my XP box looks like
(don't have a 2000 box around, but I wasn't aware of such a change
between those - but it's certainly not impossible). But if oyu had some
better method of determining it, please let me know :-)


If that's how, several other OSes do the same thing AFAIK - for security
reasons. For example OpenBSD. So if we rely heavily on that, we may be
in trouble elsewhere as well.

//Magnus


Re: Stats collection on Windows

От
mark@mark.mielke.cc
Дата:
On Tue, Apr 04, 2006 at 11:02:11PM -0400, Tom Lane wrote:
> "Peter Brant" <Peter.Brant@wicourts.gov> writes:
> > I think I've found the cause (or one of the causes) why stats
> > collection is unreliable on Windows and I'm wondering about the best way
> > to go about fixing it.
> > The problem is that process IDs on Windows seem to be assigned without
> > much rhyme or reason and it seems to happen relatively frequently that a
> > new process will be assigned the same process ID as a process which
> > recently died.  If this happens before the backend has been expired out
> > of pgstat.c's pgStatBeDead hash, the backend will be missed.
> That's an interesting theory, but do you have any actual evidence for it?
> The evidence I've seen says that our big problem on Windows is the stats
> collector process just quitting due to unexplained piperead() failures.
> (I mean, I'd love to blame Microsoft for everything, but even the
> Redmond crowd should be able to figure out that recycling process IDs
> instantly would be a stupid idea...)

Why? :-)

They use HANDLE. The process ID isn't nearly as useful as it is on UNIX.
I haven't looked at that stuff in a long time, but process "ID" on Windows
may be a compatibility method.

Process "ID" isn't necessarily a good way of identifying tasks,
precisely because they may be reused. Using a serial allocated at
process start might make more sense. Relying on it not being reused,
is not dissimilar to the old malloc() "tricks" of assuming that
malloc() will not return something recently free()d. It's bad.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



Re: Stats collection on Windows

От
mark@mark.mielke.cc
Дата:
On Tue, Apr 04, 2006 at 11:17:49PM -0400, Tom Lane wrote:
> "Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> > "Tom Lane" <tgl@sss.pgh.pa.us> wrote
> >> Redmond crowd should be able to figure out that recycling process IDs
> >> instantly would be a stupid idea...)
> > Can you explain more of this? IMHO, if we rely on feature like this, the 
> > difference is unstable-every-day vs. unstable-every-year.
> The mere existence of the kill() primitive should bring to mind reasons
> why it's a bad idea.

CreateProcess - "The process is assigned a process identifier. The
identifier is valid until the process terminates. It can be used to
identify the process, or specified in the OpenProcess function to open
a handle to the process. The initial thread in the process is also
assigned a thread identifier. ..."

TerminateProcess - "Terminates the specified process and all of its
threads."

TerminateProcess takes a HANDLE, not a process identifier. Yes, they
provide the "kill" primitive, but only as a compatibility measure.  A
"good" Windows process, should maintain a HANDLE to the process, and
kill the process using the HANDLE. This way, there is no race.  The
HANDLE is also how you wait for the process to terminate normally.

I prefer the "Redmond" way, in that I find UNIX's use of integer
identifiers to be encouraging of race conditions. UNIX requires hacks
like minimizing PID reuse, because UNIX is the one that is broken. :-)

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



Re: Stats collection on Windows

От
Martijn van Oosterhout
Дата:
On Wed, Apr 05, 2006 at 03:20:47AM -0400, mark@mark.mielke.cc wrote:
> TerminateProcess takes a HANDLE, not a process identifier. Yes, they
> provide the "kill" primitive, but only as a compatibility measure.  A
> "good" Windows process, should maintain a HANDLE to the process, and
> kill the process using the HANDLE. This way, there is no race.  The
> HANDLE is also how you wait for the process to terminate normally.

Which presents the solution, we should use the HANDLE on windows rather
than the process identifier.

> I prefer the "Redmond" way, in that I find UNIX's use of integer
> identifiers to be encouraging of race conditions. UNIX requires hacks
> like minimizing PID reuse, because UNIX is the one that is broken. :-)

Eh? A HANDLE is (or can be mapped to) an integer too. I don't see
anything on that page about handle reuse. If you run a machine long
enough I'm sure it can be reused also...

Have a nice day,

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Stats collection on Windows

От
mark@mark.mielke.cc
Дата:
On Wed, Apr 05, 2006 at 09:30:06AM +0200, Martijn van Oosterhout wrote:
> On Wed, Apr 05, 2006 at 03:20:47AM -0400, mark@mark.mielke.cc wrote:
> > TerminateProcess takes a HANDLE, not a process identifier. Yes, they
> > provide the "kill" primitive, but only as a compatibility measure.  A
> > "good" Windows process, should maintain a HANDLE to the process, and
> > kill the process using the HANDLE. This way, there is no race.  The
> > HANDLE is also how you wait for the process to terminate normally.
> Which presents the solution, we should use the HANDLE on windows rather
> than the process identifier.

Yes.

> > I prefer the "Redmond" way, in that I find UNIX's use of integer
> > identifiers to be encouraging of race conditions. UNIX requires hacks
> > like minimizing PID reuse, because UNIX is the one that is broken. :-)
> Eh? A HANDLE is (or can be mapped to) an integer too. I don't see
> anything on that page about handle reuse. If you run a machine long
> enough I'm sure it can be reused also...

Once upon a time, when I played with this stuff (I mostly use UNIX, not
Windows), I concluded to myself that HANDLE was process-local, and that
it was allocated. Meaning - it won't be re-used until you CloseHandle().
It's best then, to think of HANDLE as a opaque object. Regardless, of
whether it is process-local or not, until you run CloseHandle(), it is
yours to keep, and it won't be re-used.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



Re: Stats collection on Windows

От
Martijn van Oosterhout
Дата:
On Wed, Apr 05, 2006 at 03:38:28AM -0400, mark@mark.mielke.cc wrote:
> Once upon a time, when I played with this stuff (I mostly use UNIX, not
> Windows), I concluded to myself that HANDLE was process-local, and that
> it was allocated. Meaning - it won't be re-used until you CloseHandle().
> It's best then, to think of HANDLE as a opaque object. Regardless, of
> whether it is process-local or not, until you run CloseHandle(), it is
> yours to keep, and it won't be re-used.

HANDLE is process local? That is worse then, because then there's no
guarentee that each process will see a different identifier.

The stats collector identifies processes by their process id, which
they get using getpid(). If instead they used a handle for their own
process (GetCurrentProcess() always returns -1, but you can apparently
clone it to get a real handle), you have no idea whether that handle is
unique amongst backends, because it's process local.

The stats collector doesn't have any open handles for the backend, it's
just a way for backends to identify themselves. It appears that process
handles are not up to the task either...

Do we have a plan C?
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Stats collection on Windows

От
"Magnus Hagander"
Дата:
> > Once upon a time, when I played with this stuff (I mostly use UNIX,
> > not Windows), I concluded to myself that HANDLE was
> process-local, and
> > that it was allocated. Meaning - it won't be re-used until
> you CloseHandle().
> > It's best then, to think of HANDLE as a opaque object.
> Regardless, of
> > whether it is process-local or not, until you run
> CloseHandle(), it is
> > yours to keep, and it won't be re-used.
>
> HANDLE is process local? That is worse then, because then
> there's no guarentee that each process will see a different
> identifier.

HANDLE is process local. What you need to do is run DuplicateHandle() on
it specifying it should "also be valid for process Y" (for which you
need a HANDLE opened, in this case the stats collector). This will give
you a new handle whichi s valid in the *target process*, but it is *not*
valid in your own process.

A quick look shows that pids appear to be reused quite fast on Windows
2000, but *not* on XP or 2003. Don't have any other versions around to
test.

AFAIK, the pid on unix is also only vaild while the process is running.
It's not likely to be reused quickly, which appears to be the same for
XP+ on win32.


> Do we have a plan C?

Well, first we need to know that this really *is* the problem. I'm
unsure if we've seen "proof" that this actually causes a real problem.
(It's certainly not the main issue with teh stats system on win32, which
is the fact that it crashes now and then)

How likely is it that the arrive-in-wrong-order really happens? And can
we do something about that one? (I assume the problem is that the
backend-die message is sent from postmaster, meaning that the die can
appear before the alive sent from the backend itself. What if we changed
it so we had one message sent from a backend doing a controlled shutdown
(99.9% of the time) and a different one when the postmaster picks up a
dead backend. And we apply the "dead counter" only to the messages from
the postmaster, and expire the record right away when we receive one
from a controlled shutdown (which would happen in-line with the
backend-start message)

//Magnus


//Magnus


Re: Stats collection on Windows

От
"Magnus Hagander"
Дата:
> > >> Redmond crowd should be able to figure out that
> recycling process
> > >> IDs instantly would be a stupid idea...)
> > > Can you explain more of this? IMHO, if we rely on feature
> like this,
> > > the difference is unstable-every-day vs. unstable-every-year.
> > The mere existence of the kill() primitive should bring to mind
> > reasons why it's a bad idea.
>
> CreateProcess - "The process is assigned a process
> identifier. The identifier is valid until the process
> terminates. It can be used to identify the process, or
> specified in the OpenProcess function to open a handle to the
> process. The initial thread in the process is also assigned a
> thread identifier. ..."
>
> TerminateProcess - "Terminates the specified process and all
> of its threads."
>
> TerminateProcess takes a HANDLE, not a process identifier.

Yes. You get the handle by doing OpenProcess() with PROCESS_TERMINATE
access.

The functions used to enumerate processes all return the process id, not
a HANDLE.(See Process32First/Process32Next). You use the HANDLE only
when you need to *modify* something (for example, killing it).


> Yes, they provide the "kill" primitive, but only as a
> compatibility measure.

They do? Where is it? Certainly not in the documented SDK that I can
see.


> A "good" Windows process, should
> maintain a HANDLE to the process, and kill the process using
> the HANDLE. This way, there is no race.  The HANDLE is also
> how you wait for the process to terminate normally.

A "good" Windows process would be using threads ;-)
That's what the Windows API was written for, and that's why anything
that deals with multiple processes working together is a lot harder than
on Unix.


//Magnus


Re: Stats collection on Windows

От
mark@mark.mielke.cc
Дата:
On Wed, Apr 05, 2006 at 09:58:54AM +0200, Martijn van Oosterhout wrote:
> On Wed, Apr 05, 2006 at 03:38:28AM -0400, mark@mark.mielke.cc wrote:
> > Once upon a time, when I played with this stuff (I mostly use UNIX, not
> > Windows), I concluded to myself that HANDLE was process-local, and that
> > it was allocated. Meaning - it won't be re-used until you CloseHandle().
> > It's best then, to think of HANDLE as a opaque object. Regardless, of
> > whether it is process-local or not, until you run CloseHandle(), it is
> > yours to keep, and it won't be re-used.
> HANDLE is process local? That is worse then, because then there's no
> guarentee that each process will see a different identifier.

It's no different from a file descriptor on UNIX.

Neither UNIX nor Windows promise that a process identifier is valid
beyond the life of the process. UNIX avoids it from happening, as it
is necessary to avoid races with system calls such as kill(). Windows
does not have this problem.

> The stats collector identifies processes by their process id, which
> they get using getpid(). If instead they used a handle for their own
> process (GetCurrentProcess() always returns -1, but you can apparently
> clone it to get a real handle), you have no idea whether that handle is
> unique amongst backends, because it's process local. 

> The stats collector doesn't have any open handles for the backend, it's
> just a way for backends to identify themselves. It appears that process
> handles are not up to the task either...

> Do we have a plan C?

Sure. Serial. Allocate on process start.

Or, back to another topic from months ago - UUID generation... :-)

Process identifier should not be used beyond the life of the process.
As soon as wait() removes the process on UNIX, the process identifier
is no longer valid, and could be reused. That the operating system tries
to prevent problems by avoiding recycling isn't necessarily a good
reason to exploit this capability.

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



Re: Stats collection on Windows

От
mark@mark.mielke.cc
Дата:
On Wed, Apr 05, 2006 at 10:31:37AM +0200, Magnus Hagander wrote:
> > TerminateProcess takes a HANDLE, not a process identifier. 
> Yes. You get the handle by doing OpenProcess() with PROCESS_TERMINATE
> access.
> The functions used to enumerate processes all return the process id, not
> a HANDLE.(See Process32First/Process32Next). You use the HANDLE only
> when you need to *modify* something (for example, killing it).

Enumerating processes isn't a common operationg for programs other
than 'top'. :-) But even so, one can use OpenProcess() to evaluate
that the process being referenced is actually the one you think it
is, before calling TerminateProcess(), eliminating the race.

> > Yes, they provide the "kill" primitive, but only as a 
> > compatibility measure. 
> They do? Where is it? Certainly not in the documented SDK that I can
> see.

I believe it is called KillProcess().

> > A "good" Windows process, should 
> > maintain a HANDLE to the process, and kill the process using 
> > the HANDLE. This way, there is no race.  The HANDLE is also 
> > how you wait for the process to terminate normally.
> A "good" Windows process would be using threads ;-)

> That's what the Windows API was written for, and that's why anything
> that deals with multiple processes working together is a lot harder than
> on Unix.

Haha. Good point. :-)

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



Re: Stats collection on Windows

От
Martijn van Oosterhout
Дата:
On Wed, Apr 05, 2006 at 06:03:31AM -0400, mark@mark.mielke.cc wrote:
> It's no different from a file descriptor on UNIX.
>
> Neither UNIX nor Windows promise that a process identifier is valid
> beyond the life of the process. UNIX avoids it from happening, as it
> is necessary to avoid races with system calls such as kill(). Windows
> does not have this problem.

But consider, why is the process id there? (Amongst other reasons) so
that users can monitor pg_stat_activity and kill a backend that's out
of control. The equivalent to this in windows would be to.

1. Get HANDLE from the process ID.
2. TerminateProcess with that HANDLE.

Presumably users have a little GUI app that displays processes on the
system with the process ID so they can kill it. If a process ID is
quickly reused, they may end up killing the wrong one.

The race condition in this case involves the user and you can't solve
that programmatically. The non-reuse of pids is more for
user-friendlyness than anything else. The Window use of HANDLE doesn't
solve this problem at all.

> Sure. Serial. Allocate on process start.
>
> Or, back to another topic from months ago - UUID generation... :-)

Neither of which solve the "I'm a user and want to kill *that* backend"
problem. Because even on windows you'll have to get the process id,
convert it to a handle an kill it. Same race condition.

> Process identifier should not be used beyond the life of the process.
> As soon as wait() removes the process on UNIX, the process identifier
> is no longer valid, and could be reused. That the operating system tries
> to prevent problems by avoiding recycling isn't necessarily a good
> reason to exploit this capability.

Yeah, but it's very useful as a user. Consider this scenerio:

<process 1234 goes AWOL>
kill -INT 1234
<check if process still there>
kill -TERM 1234
<process still there, damn it!>
kill -9 1234

There gone. With no quick PID reuse I can be sure I won't kill the
wrong one. This is presumably why recent versions of windows don't reuse
pids quickly either... It for *users* not programs.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: Stats collection on Windows

От
mark@mark.mielke.cc
Дата:
On Wed, Apr 05, 2006 at 12:20:49PM +0200, Martijn van Oosterhout wrote:
> On Wed, Apr 05, 2006 at 06:03:31AM -0400, mark@mark.mielke.cc wrote:
> > It's no different from a file descriptor on UNIX.
> > Neither UNIX nor Windows promise that a process identifier is valid
> > beyond the life of the process. UNIX avoids it from happening, as it
> > is necessary to avoid races with system calls such as kill(). Windows
> > does not have this problem.
> But consider, why is the process id there? (Amongst other reasons) so
> that users can monitor pg_stat_activity and kill a backend that's out
> of control. The equivalent to this in windows would be to.
> 1. Get HANDLE from the process ID.
> 2. TerminateProcess with that HANDLE.

You missed 1.5. Ensure that HANDLE matches the process you intend. :-)

This is a little bit of a distraction though, as any system that requires
the user to be able to kill broken backends, is only indicative of a
broken backend. We're talking about how to deal with a broken process,
after the process owner (PostgreSQL) has forgotten about it.

What would be wrong with having a PostgreSQL generated serial associated
with each backend, that can be used by the backend process owner, to
map to a HANDLE, which uses TerminateProcess() underneath?

> Presumably users have a little GUI app that displays processes on the
> system with the process ID so they can kill it. If a process ID is
> quickly reused, they may end up killing the wrong one.

Sure. But the problem here, is that PostgreSQL is broken, so they find a
need to go to their process listing GUI. And who is to say that the GUI
doesn't do as I've suggested above? Ensure that TerminateProcess() is
against the intended process? I have no idea if it does - but it could.

> The race condition in this case involves the user and you can't solve
> that programmatically. The non-reuse of pids is more for
> user-friendlyness than anything else. The Window use of HANDLE doesn't
> solve this problem at all.

Sure you can. But it also shouldn't matter.

> > Sure. Serial. Allocate on process start.
> > Or, back to another topic from months ago - UUID generation... :-)
> Neither of which solve the "I'm a user and want to kill *that* backend"
> problem. Because even on windows you'll have to get the process id,
> convert it to a handle an kill it. Same race condition.

No, and not if PostgreSQL kills the backend for you cleanly, using the
HANDLE, that it owns for the process.

> > Process identifier should not be used beyond the life of the process.
> > As soon as wait() removes the process on UNIX, the process identifier
> > is no longer valid, and could be reused. That the operating system tries
> > to prevent problems by avoiding recycling isn't necessarily a good
> > reason to exploit this capability.
> Yeah, but it's very useful as a user. Consider this scenerio:
> <process 1234 goes AWOL>
> kill -INT 1234
> <check if process still there>
> kill -TERM 1234
> <process still there, damn it!>
> kill -9 1234
> There gone. With no quick PID reuse I can be sure I won't kill the
> wrong one. This is presumably why recent versions of windows don't reuse
> pids quickly either... It for *users* not programs.

Users shouldn't need to kill programs.

If they do - on the off chance that they do, they should be more
careful than you list above.

That's a kneejerk reaction to a problem that shouldn't occur in the
first place. It's a habit trained by UNIX users.

I rarely ever need to kill a process on my Windows box, and when I have,
the process listing in the GUI hasn't offered me a process ID. I click
on the item I wish to kill, and I right click "End Process". A lot more
user friendly, in my opinion. And if they happen to fix the race in the
background using the method I suggest above? All the better...

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



Re: Stats collection on Windows

От
"Magnus Hagander"
Дата:
> > It's no different from a file descriptor on UNIX.
> >
> > Neither UNIX nor Windows promise that a process identifier is valid
> > beyond the life of the process. UNIX avoids it from
> happening, as it
> > is necessary to avoid races with system calls such as
> kill(). Windows
> > does not have this problem.
>
> But consider, why is the process id there? (Amongst other
> reasons) so that users can monitor pg_stat_activity and kill
> a backend that's out of control. The equivalent to this in
> windows would be to.
>
> 1. Get HANDLE from the process ID.
> 2. TerminateProcess with that HANDLE.
>
> Presumably users have a little GUI app that displays
> processes on the system with the process ID so they can kill
> it. If a process ID is quickly reused, they may end up
> killing the wrong one.

Actualy, not quite. That's the equivalent of kill -9. What you'd do is
"pg_ctl kill TERM <pid>". Or whatever other signal you want to send. But
you still need the pid, there is no way to get the required information
from a plain handle when you're not in that process.

Though I'd say I'm usually more interested in noticing what process it
is, rather than killing it. Possibly canceling it, but very seldomly
actually kill it. But it still needs the process id, yes.

(BTW, the GUI would generaelly be Task Manager, so it's not like it's a
special tool)


<snip rest>

//Magnus


Re: Stats collection on Windows

От
"Magnus Hagander"
Дата:
> > > Yes, they provide the "kill" primitive, but only as a
> compatibility
> > > measure.
> > They do? Where is it? Certainly not in the documented SDK
> that I can
> > see.
>
> I believe it is called KillProcess().

No such function documented on my MSDN, other than one in SQL Server
DMO.
Are you referring to TerminateProcess()? If so, it can just terminate
the process, not send it arbitrary signals. Considering windows doesnt'
reallyi have the concept of signals, that's not very surprising though
:-)


//Magnus


Re: Stats collection on Windows

От
"Magnus Hagander"
Дата:
> This is a little bit of a distraction though, as any system
> that requires the user to be able to kill broken backends, is
> only indicative of a broken backend. We're talking about how
> to deal with a broken process, after the process owner
> (PostgreSQL) has forgotten about it.

Don't confuse the postmaster with the statistics system. the postmaster
knows about the backends per handle, but the stats system runs outside
of the postmaster, and can possibly get confused.


> What would be wrong with having a PostgreSQL generated serial
> associated with each backend, that can be used by the backend
> process owner, to map to a HANDLE, which uses
> TerminateProcess() underneath?

It might be a good idea to use the combination pid+internalid. Note taht
we *don't* just want to TerminateProcess(), we need the pid to send
actual signals with. And in this case, that's not even the problem - the
problem is that we're not dropping the process from our list (I think).


> > Presumably users have a little GUI app that displays
> processes on the
> > system with the process ID so they can kill it. If a process ID is
> > quickly reused, they may end up killing the wrong one.
>
> Sure. But the problem here, is that PostgreSQL is broken, so
> they find a need to go to their process listing GUI. And who
> is to say that the GUI doesn't do as I've suggested above?
> Ensure that TerminateProcess() is against the intended
> process? I have no idea if it does - but it could.

Because the GUI is most likely either Task Manager or Process Explorer.
There is no need for a special GUI.


> > > Process identifier should not be used beyond the life of
> the process.
> > > As soon as wait() removes the process on UNIX, the process
> > > identifier is no longer valid, and could be reused. That the
> > > operating system tries to prevent problems by avoiding recycling
> > > isn't necessarily a good reason to exploit this capability.
> > Yeah, but it's very useful as a user. Consider this scenerio:
> > <process 1234 goes AWOL>
> > kill -INT 1234
> > <check if process still there>
> > kill -TERM 1234
> > <process still there, damn it!>
> > kill -9 1234
> > There gone. With no quick PID reuse I can be sure I won't kill the
> > wrong one. This is presumably why recent versions of windows don't
> > reuse pids quickly either... It for *users* not programs.
>
> Users shouldn't need to kill programs.

And normally they don't. But again, this isn't where the problem is. If
your process is stuck and you need to kill it, it most likely will still
be around once you get to killing it. If not, you were trying to kill it
too soon.


> I rarely ever need to kill a process on my Windows box, and
> when I have, the process listing in the GUI hasn't offered me
> a process ID. I click on the item I wish to kill, and I right
> click "End Process". A lot more user friendly, in my opinion.
> And if they happen to fix the race in the background using
> the method I suggest above? All the better...

AFAIK, Task Manager uses the pid to kill the task. Try tracing it. (And
it will happily show you the pid as well - just enable it in the
checkbox. It's the only reasonable way to differ between two processes
off the same exe, for example two different notepad:s...)

//Magnus


Re: Stats collection on Windows

От
Tom Lane
Дата:
"Magnus Hagander" <mha@sollentuna.net> writes:
>> HANDLE is process local? That is worse then, because then 
>> there's no guarentee that each process will see a different 
>> identifier.

> HANDLE is process local. What you need to do is run DuplicateHandle() on
> it specifying it should "also be valid for process Y" (for which you
> need a HANDLE opened, in this case the stats collector). This will give
> you a new handle whichi s valid in the *target process*, but it is *not*
> valid in your own process.

What happens if process Y goes away between the time you obtain a
handle for it and the time you try to run this DuplicateHandle call?
        regards, tom lane


Re: Stats collection on Windows

От
"Magnus Hagander"
Дата:
> >> HANDLE is process local? That is worse then, because then
> there's no
> >> guarentee that each process will see a different identifier.
>
> > HANDLE is process local. What you need to do is run
> DuplicateHandle()
> > on it specifying it should "also be valid for process Y" (for which
> > you need a HANDLE opened, in this case the stats
> collector). This will
> > give you a new handle whichi s valid in the *target
> process*, but it
> > is *not* valid in your own process.
>
> What happens if process Y goes away between the time you
> obtain a handle for it and the time you try to run this
> DuplicateHandle call?

I don't know offhand. I would assume one of two things:
1) DuplicateHandle() fails.
2) You get a handle back that is valid in the dead process (meaning it's
not valid).

I can put together some quick test-code for this if you need me to?

//Magnus


Re: Stats collection on Windows

От
"stephen joseph butler"
Дата:
2006/4/5, Tom Lane <tgl@sss.pgh.pa.us>:
What happens if process Y goes away between the time you obtain a
handle for it and the time you try to run this DuplicateHandle call?

The structures for a process should remain as long as there are any open HANDLEs, even if the process has ended. Threads behave the same way, which is why a creator needs to remember to always close the HANDLE returned from CreateThread.

Re: Stats collection on Windows

От
Tom Lane
Дата:
"Magnus Hagander" <mha@sollentuna.net> writes:
>> What happens if process Y goes away between the time you 
>> obtain a handle for it and the time you try to run this 
>> DuplicateHandle call?

> I can put together some quick test-code for this if you need me to?

Nah, it was just a rhetorical question meant to poke a hole in the
claim that Windows can avoid race conditions by using HANDLEs.

AFAICS, don't-reuse-PIDs-too-quick has exact analogs that Windows has
to solve by ensuring it doesn't reuse HANDLEs too quick.
        regards, tom lane


Re: Stats collection on Windows

От
"stephen joseph butler"
Дата:
2006/4/5, Tom Lane <tgl@sss.pgh.pa.us>:
AFAICS, don't-reuse-PIDs-too-quick has exact analogs that Windows has
to solve by ensuring it doesn't reuse HANDLEs too quick.


There's a disconnect here. handles aren't process identifiers: they're reference counted "pointers" to the kernel structures for the process. If you are holding a handle (ie: from CreateProcess or OpenProcess) that handle cannot and will not be reclaimed until you call CloseHandle (or your process itself exits). You should never retain a handle after you've called CloseHandle on it.

Which brings an interesting thought: are process ID's reclaimed while open handles remain? I'm willing to bet the answer is no. In that case, the stats collector could retain the handle until it's done with the process ID.

Re: Stats collection on Windows

От
"Magnus Hagander"
Дата:
> >> What happens if process Y goes away between the time you obtain a
> >> handle for it and the time you try to run this
> DuplicateHandle call?
>
> > I can put together some quick test-code for this if you need me to?
>
> Nah, it was just a rhetorical question meant to poke a hole
> in the claim that Windows can avoid race conditions by using HANDLEs.

That's what I thought, which is why I asked first.


> AFAICS, don't-reuse-PIDs-too-quick has exact analogs that
> Windows has to solve by ensuring it doesn't reuse HANDLEs too quick.

Windows doesn't "reuse HANDLEs" in that way. Handles are like file
descriptors.
The kernel structure represnting the process has a *process id*, not a
handle. That is what uniquely identifies the process. (In the kernel
it's often referred to as a "client id").

The process structure will be held as long as anybody has an open handle
to the process. Thus, as long as this happens, the pid *cannot* be
reused. So one way to assure the pid isn't recycled too soon is to keep
the postmasters handle open "long enough" (the postmaster already has a
handle there - it's just the pgstats process that doesn't have one).
RIght now we close the postmaster handle first (win32_removechild at ~
2086 in postmaster.c) and then fire off the message (CleanupBackend at
~2239).

//Magnus


Re: Stats collection on Windows

От
"Magnus Hagander"
Дата:
> There's a disconnect here. handles aren't process
> identifiers: they're reference counted "pointers" to the
> kernel structures for the process. If you are holding a
> handle (ie: from CreateProcess or OpenProcess) that handle
> cannot and will not be reclaimed until you call CloseHandle
> (or your process itself exits). You should never retain a
> handle after you've called CloseHandle on it.
>
> Which brings an interesting thought: are process ID's
> reclaimed while open handles remain? I'm willing to bet the
> answer is no. In that case, the stats collector could retain
> the handle until it's done with the process ID.

Since the process id lives in the kernel structure they point to, they
can't very well be recycled... (classical
handle-leak-kills-windows-system-by-running-out-of-kernel-space
scenario).

The problem is, the stats collector doesn't have the handle in the first
place. I guess it could open one upon receiving the bestart message, but
it seems like a kludge. Would be nicer if it could be done cleaner :-)

//Magnus


Re: Stats collection on Windows

От
Tom Lane
Дата:
"Peter Brant" <Peter.Brant@wicourts.gov> writes:
> I added some strategic printfs to pgstat.c.  Attached is the output when
> a little program is run which, in a loop, makes 10 connections, sleeps 3
> seconds, closes them, sleeps another 3 seconds.  My workstation (Windows
> XP) was otherwise idle.

> Search for "is known to be dead, ignoring" to find the re-used process
> IDs.  Things start out clean, but after a few cycles anywhere between 1
> and 5 backends are being missed.

Looking at the pgstats code, I notice that once it makes an entry in the
dead-backends hashtable, it keeps that entry (rejecting any messages
with the same PID) for 10 seconds.  That seems like approximately
forever on modern machines, certainly much more than any plausible
out-of-order condition in the UDP packet stream.  It could easily be
enough to get us in trouble on Unix machines, never mind Windows.

A conservative suggestion would be to trim down the destroy interval.
A more radical one is to question whether we need the destroy delay
mechanism at all.  What if we got rid of all that logic and simply let
the collector delete stuff when it's told to?  Out-of-order messages
could cause entries to be re-created after they've been deleted, but
I'm not sure that I see any harm in that.  Bogus DB and table entries
are already ignored in the pgstats views (because they won't join to
anything in the system catalogs) and we also have a filter for bogus
backend entries.  There are also mechanisms that ensure these entries
will go away eventually: pgstat_vacuum_tabstat for DB and table
entries, and eventual re-use of a BackendId slot for backends.
So I'm sort of thinking that the destroy delay has outlived its
usefulness.
        regards, tom lane


Re: Stats collection on Windows

От
"Magnus Hagander"
Дата:
> > Search for "is known to be dead, ignoring" to find the
> re-used process
> > IDs.  Things start out clean, but after a few cycles
> anywhere between
> > 1 and 5 backends are being missed.
>
> Looking at the pgstats code, I notice that once it makes an
> entry in the dead-backends hashtable, it keeps that entry
> (rejecting any messages with the same PID) for 10 seconds.
> That seems like approximately forever on modern machines,
> certainly much more than any plausible out-of-order condition
> in the UDP packet stream.  It could easily be enough to get
> us in trouble on Unix machines, never mind Windows.
>
> A conservative suggestion would be to trim down the destroy interval.
> A more radical one is to question whether we need the destroy
> delay mechanism at all.  What if we got rid of all that logic
> and simply let the collector delete stuff when it's told to?
> Out-of-order messages could cause entries to be re-created
> after they've been deleted, but I'm not sure that I see any
> harm in that.  Bogus DB and table entries are already ignored
> in the pgstats views (because they won't join to anything in
> the system catalogs) and we also have a filter for bogus
> backend entries.  There are also mechanisms that ensure these
> entries will go away eventually: pgstat_vacuum_tabstat for DB
> and table entries, and eventual re-use of a BackendId slot
> for backends.
> So I'm sort of thinking that the destroy delay has outlived
> its usefulness.

After reviewing my own patch I just sent in (stats write delay), I
realised I had just upped that from 10 seconds to 5 minutes. Which is
then even longer than forever :)
Which prompted me to consider suggesting just this - do we really need
the destroy functionality. From what I could tell it did look reasonable
to get rid of it for these reasons.

//Magnus


Re: Stats collection on Windows

От
Tom Lane
Дата:
"Magnus Hagander" <mha@sollentuna.net> writes:
> After reviewing my own patch I just sent in (stats write delay), I
> realised I had just upped that from 10 seconds to 5 minutes. Which is
> then even longer than forever :)
> Which prompted me to consider suggesting just this - do we really need
> the destroy functionality.

Which was in fact exactly my train of thought ...
        regards, tom lane


Re: Stats collection on Windows

От
mark@mark.mielke.cc
Дата:
On Wed, Apr 05, 2006 at 10:05:56AM -0400, Tom Lane wrote:
> What happens if process Y goes away between the time you obtain a
> handle for it and the time you try to run this DuplicateHandle call?

Think of Windows HANDLE like UNIX fd, but Windows HANDLE works for
everything - not just sockets, files, pipes, and character devices.

Process Y doesn't go away until all references to it, via HANDLE, have
been closed. It may not be running. It may have an exit status available.
It doesn't go away, though, until you are done with it, and everybody
who has a reference to it does CloseHandle().

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



Re: Stats collection on Windows

От
mark@mark.mielke.cc
Дата:
On Wed, Apr 05, 2006 at 10:30:36AM -0400, Tom Lane wrote:
> "Magnus Hagander" <mha@sollentuna.net> writes:
> >> What happens if process Y goes away between the time you 
> >> obtain a handle for it and the time you try to run this 
> >> DuplicateHandle call?
> > I can put together some quick test-code for this if you need me to?
> Nah, it was just a rhetorical question meant to poke a hole in the
> claim that Windows can avoid race conditions by using HANDLEs.
> AFAICS, don't-reuse-PIDs-too-quick has exact analogs that Windows has
> to solve by ensuring it doesn't reuse HANDLEs too quick.

No. It means you don't understand what a HANDLE is.

But that's fine - because you understand DB stuff to compensate... :-)

Cheers,
mark

-- 
mark@mielke.cc / markm@ncf.ca / markm@nortel.com     __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   | 
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada
 One ring to rule them all, one ring to find them, one ring to bring them all                      and in the darkness
bindthem...
 
                          http://mark.mielke.cc/



Re: Stats collection on Windows

От
"Peter Brant"
Дата:
I'm going to rip out the destroy code and see how it goes.  Patch will
be forthcoming if things turn out well.

Pete

>>> Tom Lane <tgl@sss.pgh.pa.us> 04/05/06 4:49 pm >>>
So I'm sort of thinking that the destroy delay has outlived its
usefulness.
        regards, tom lane