Обсуждение: "pgstat wait timeout" just got a lot more common on Windows

Поиск
Список
Период
Сортировка

"pgstat wait timeout" just got a lot more common on Windows

От
Tom Lane
Дата:
Last night I changed the stats collector process to use
WaitLatchOrSocket instead of a periodic forced wakeup to see whether
the postmaster has died.  This morning I observe that several Windows
buildfarm members are showing regression test failures caused by
unexpected "pgstat wait timeout" warnings.  Everybody else is fine.

This suggests that there is something broken in the Windows
implementation of WaitLatchOrSocket.  I wonder whether it also
tells us something we did not know about the underlying cause of
those messages.  Not sure what though.  Ideas?  Can anyone who
knows Windows take another look at WaitLatchOrSocket?
        regards, tom lane


Re: "pgstat wait timeout" just got a lot more common on Windows

От
Tom Lane
Дата:
I wrote:
> Last night I changed the stats collector process to use
> WaitLatchOrSocket instead of a periodic forced wakeup to see whether
> the postmaster has died.  This morning I observe that several Windows
> buildfarm members are showing regression test failures caused by
> unexpected "pgstat wait timeout" warnings.  Everybody else is fine.

> This suggests that there is something broken in the Windows
> implementation of WaitLatchOrSocket.  I wonder whether it also
> tells us something we did not know about the underlying cause of
> those messages.  Not sure what though.  Ideas?  Can anyone who
> knows Windows take another look at WaitLatchOrSocket?

Anybody have any clues about that?  If not, I think I'll have to revert
the pgstat changes for beta1, which isn't really forward progress.

I spent some time staring at the Windows WaitLatchOrSocket code myself.
The only thing I could find that seemed wrong is that in the event
array, we list the latch's event before pgwin32_signal_event.  The
Microsoft documentation I looked at says that if more than one event
is ready, WaitforMultipleObjects reports the first such array member.
This means that if the latch is already set when control gets here,
signal handlers will not be serviced.  That doesn't match what would
happen on a Unix machine, so it seems like at least a violation of the
POLA.  Hence I think we oughta swap the order of those two array
elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
pgwin32_select.)  I do not however see a way that that would explain the
pgstat failures, because the stats collector's latch really shouldn't
ever get set during normal regression test runs.
        regards, tom lane


Re: "pgstat wait timeout" just got a lot more common on Windows

От
Magnus Hagander
Дата:
<p><br /> On May 10, 2012 4:59 PM, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>>
wrote:<br/> ><br /> > I wrote:<br /> > > Last night I changed the stats collector process to use<br /> >
>WaitLatchOrSocket instead of a periodic forced wakeup to see whether<br /> > > the postmaster has died.  This
morningI observe that several Windows<br /> > > buildfarm members are showing regression test failures caused
by<br/> > > unexpected "pgstat wait timeout" warnings.  Everybody else is fine.<br /> ><br /> > > This
suggeststhat there is something broken in the Windows<br /> > > implementation of WaitLatchOrSocket.  I wonder
whetherit also<br /> > > tells us something we did not know about the underlying cause of<br /> > > those
messages. Not sure what though.  Ideas?  Can anyone who<br /> > > knows Windows take another look at
WaitLatchOrSocket?<br/> ><br /> > Anybody have any clues about that?  If not, I think I'll have to revert<br />
>the pgstat changes for beta1, which isn't really forward progress.<p>Haven't had time to look at the code itself,
andwon't before wrap time. Sorry. <br /><p>> I spent some time staring at the Windows WaitLatchOrSocket code
myself.<br/> > The only thing I could find that seemed wrong is that in the event<br /> > array, we list the
latch'sevent before pgwin32_signal_event.  The<br /> > Microsoft documentation I looked at says that if more than
oneevent<br /> > is ready, WaitforMultipleObjects reports the first such array member.<br /> > This means that if
thelatch is already set when control gets here,<br /> > signal handlers will not be serviced.<p>Yeah, that does seem
wrong.<p>>  That doesn't match what would<br /> > happen on a Unix machine, so it seems like at least a violation
ofthe<br /> > POLA.  Hence I think we oughta swap the order of those two array<br /> > elements.  (Same issue in
PGSemaphoreLock,btw, and I'm suspicious of<br /> > pgwin32_select.)  I do not however<p>Maybe we need a loop that
checksfor all events? <p>> see a way that that would explain the<br /> > pgstat failures, because the stats
collector'slatch really shouldn't<br /> > ever get set during normal regression test runs.<br /><p>So could there be
somethingwrong in the other end, meaning the latch *does* get set? <p>/Magnus <br /> 

Re: "pgstat wait timeout" just got a lot more common on Windows

От
Tom Lane
Дата:
Magnus Hagander <magnus@hagander.net> writes:
> On May 10, 2012 4:59 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>> I spent some time staring at the Windows WaitLatchOrSocket code myself.
>> The only thing I could find that seemed wrong is that in the event
>> array, we list the latch's event before pgwin32_signal_event.  The
>> Microsoft documentation I looked at says that if more than one event
>> is ready, WaitforMultipleObjects reports the first such array member.
>> This means that if the latch is already set when control gets here,
>> signal handlers will not be serviced.

> Yeah, that does seem wrong.

>> That doesn't match what would
>> happen on a Unix machine, so it seems like at least a violation of the
>> POLA.  Hence I think we oughta swap the order of those two array
>> elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
>> pgwin32_select.)  I do not however

> Maybe we need a loop that checks for all events?

I don't think so.  It's already the case that WaitLatch doesn't
guarantee that all possible flags are set in its result.  In connection
with Peter G's observation that we could simplify the API by rechecking
PostmasterIsAlive for WL_POSTMASTER_DEATH, I was planning to clarify
the API spec as "result bits that are set are guaranteed to reflect
reality, but it's not guaranteed that we set every bit that could
possibly be set".  This should not break any caller since the same
result could occur given a slight change in timing anyway; the caller
has to be prepared to come back and check for more conditions after it
services whatever WaitLatch does report.  However, signal service is
not a condition the caller is supposed to deal with, so I think we
want a guarantee that that happens inside WaitLatch.

>> see a way that that would explain the
>> pgstat failures, because the stats collector's latch really shouldn't
>> ever get set during normal regression test runs.

> So could there be something wrong in the other end, meaning the latch
> *does* get set?

Even if it did, it'd get cleared at the top of the loop, so that the
next call ought to handle things.  Tis a puzzlement.  AFAICS the only
condition WaitforMultipleObjects is going to see in these tests is
read-ready on the socket; surely it wouldn't fail to notice that?
        regards, tom lane


Re: "pgstat wait timeout" just got a lot more common on Windows

От
Tom Lane
Дата:
I wrote:
> Hence I think we oughta swap the order of those two array
> elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
> pgwin32_select.)

Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
nasty bug: it does not reset ImmediateInterruptOK before returning.
How is it that Windows machines aren't falling over constantly?
        regards, tom lane


Re: "pgstat wait timeout" just got a lot more common on Windows

От
Magnus Hagander
Дата:
On Thu, May 10, 2012 at 6:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Hence I think we oughta swap the order of those two array
>> elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
>> pgwin32_select.)
>
> Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
> nasty bug: it does not reset ImmediateInterruptOK before returning.
> How is it that Windows machines aren't falling over constantly?

Hmm. the commit you made to fix it says it changes how
ImmediateInterruptOK is handled, but there was not a single line of
code that actually changed that? Or am I misreading this completely?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: "pgstat wait timeout" just got a lot more common on Windows

От
Tom Lane
Дата:
Magnus Hagander <magnus@hagander.net> writes:
> On Thu, May 10, 2012 at 6:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
>> nasty bug: it does not reset ImmediateInterruptOK before returning.
>> How is it that Windows machines aren't falling over constantly?

> Hmm. the commit you made to fix it says it changes how
> ImmediateInterruptOK is handled, but there was not a single line of
> code that actually changed that? Or am I misreading this completely?

Exit is now out the bottom of the loop, not by a raw "return;".
        regards, tom lane


Re: "pgstat wait timeout" just got a lot more common on Windows

От
Magnus Hagander
Дата:
On Fri, May 11, 2012 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> On Thu, May 10, 2012 at 6:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Oh ... while hacking win32 PGSemaphoreLock I saw that it has a *seriously*
>>> nasty bug: it does not reset ImmediateInterruptOK before returning.
>>> How is it that Windows machines aren't falling over constantly?
>
>> Hmm. the commit you made to fix it says it changes how
>> ImmediateInterruptOK is handled, but there was not a single line of
>> code that actually changed that? Or am I misreading this completely?
>
> Exit is now out the bottom of the loop, not by a raw "return;".

oh, d'uh. Sorry, missed that one completely.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/