Re: "pgstat wait timeout" just got a lot more common on Windows

From Tom Lane
Subject Re: "pgstat wait timeout" just got a lot more common on Windows
Date
Msg-id 426.1336661906@sss.pgh.pa.us
In reply to "pgstat wait timeout" just got a lot more common on Windows  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: "pgstat wait timeout" just got a lot more common on Windows  (Magnus Hagander <magnus@hagander.net>)
Re: "pgstat wait timeout" just got a lot more common on Windows  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
> Last night I changed the stats collector process to use
> WaitLatchOrSocket instead of a periodic forced wakeup to see whether
> the postmaster has died.  This morning I observe that several Windows
> buildfarm members are showing regression test failures caused by
> unexpected "pgstat wait timeout" warnings.  Everybody else is fine.

> This suggests that there is something broken in the Windows
> implementation of WaitLatchOrSocket.  I wonder whether it also
> tells us something we did not know about the underlying cause of
> those messages.  Not sure what though.  Ideas?  Can anyone who
> knows Windows take another look at WaitLatchOrSocket?

Anybody have any clues about that?  If not, I think I'll have to revert
the pgstat changes for beta1, which isn't really forward progress.

I spent some time staring at the Windows WaitLatchOrSocket code myself.
The only thing I could find that seemed wrong is that in the event
array, we list the latch's event before pgwin32_signal_event.  The
Microsoft documentation I looked at says that if more than one event
is ready, WaitForMultipleObjects reports the first such array member.
This means that if the latch is already set when control gets here,
signal handlers will not be serviced.  That doesn't match what would
happen on a Unix machine, so it seems like at least a violation of the
POLA.  Hence I think we oughta swap the order of those two array
elements.  (Same issue in PGSemaphoreLock, btw, and I'm suspicious of
pgwin32_select.)  I do not however see a way that that would explain the
pgstat failures, because the stats collector's latch really shouldn't
ever get set during normal regression test runs.
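
To make the ordering point concrete, here is a minimal sketch.  It is not
the actual WaitLatchOrSocket code and makes no Win32 calls; it only models
the documented rule that when several handles are signaled at once,
WaitForMultipleObjects reports the one with the smallest array index.  The
enum and function names are hypothetical, made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two event handles discussed above. */
enum wait_source
{
	SRC_LATCH,					/* the latch's event */
	SRC_SIGNAL					/* pgwin32_signal_event */
};

/*
 * Model of the documented rule: if more than one object is signaled,
 * the smallest array index is the one reported.  Returns -1 if nothing
 * is signaled (a real wait would block or time out instead).
 */
static int
first_signaled(const bool *signaled, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (signaled[i])
			return (int) i;
	}
	return -1;
}

/*
 * Given the order in which the two sources are placed in the event
 * array, return which source the wait would report, or -1 if neither
 * is ready.
 */
static int
reported_source(const enum wait_source order[2],
				bool latch_set, bool signal_pending)
{
	bool		signaled[2];
	int			idx;

	for (size_t i = 0; i < 2; i++)
		signaled[i] = (order[i] == SRC_LATCH) ? latch_set : signal_pending;

	idx = first_signaled(signaled, 2);
	return (idx < 0) ? -1 : (int) order[idx];
}
```

With the order {SRC_LATCH, SRC_SIGNAL} and both sources ready, this model
reports SRC_LATCH, so the pending signal goes unnoticed on that wakeup.
With the order swapped to {SRC_SIGNAL, SRC_LATCH}, the signal is reported
first, and the latch is still reported whenever no signal is pending.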
        regards, tom lane

