Re: Possible explanation for Win32 stats regression test failures

From Tom Lane
Subject Re: Possible explanation for Win32 stats regression test failures
Date
Msg-id 20392.1153074880@sss.pgh.pa.us
In reply to Possible explanation for Win32 stats regression test failures  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Possible explanation for Win32 stats regression test  (korry <korry@appx.com>)
List pgsql-hackers
I wrote:
> But ... AFAICS the only signal that could plausibly be arriving at the
> stats collector is SIGALRM from its own use of setitimer() to schedule
> stats file writes.  So it seems that this failure occurs when the alarm
> fires between the select() and recv() calls; which is possible but it
> seems a mighty narrow window.  So I'm not 100% convinced that this is
> the correct explanation of the problem --- we've seen snake fail this
> way repeatedly, and here we have trout doing it three times within one
> regression run.  Can anyone think of a reason why the timing might fall
> just so with a higher probability than one would expect?  Perhaps
> pgwin32_select() has got a problem that makes it not dispatch signals
> as it seems to be trying to do?

Ah-hah, I see it.  pgwin32_select() uses WaitForMultipleObjectsEx() with
an event for the socket read-ready plus an event for signal arrival.
It returns EINTR if the return code from WaitForMultipleObjectsEx shows
the signal-arrival event as fired.  However, WaitForMultipleObjectsEx is
defined to return the number of the *first* event in the list that is
fired.  This means that if the socket comes read-ready at the same time
the SIGALRM arrives, pgwin32_select() will ignore the signal, and it'll
be processed by the subsequent pgwin32_recv().
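
To make the failure mode concrete, here is a minimal, portable sketch of the logic described above. It does not call the Win32 API and is not the PostgreSQL source: wait_any(), the event indexes, and the SELECT_* return codes are hypothetical stand-ins, and the socket event is assumed to precede the signal event in the list, as the description implies. The second variant, which re-checks the signal event explicitly, is only one possible way to avoid the problem and is not claimed to be what was committed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real Win32 or PostgreSQL definitions. */
enum { EV_SOCKET = 0, EV_SIGNAL = 1, EV_COUNT = 2 };
enum { SELECT_TIMEOUT = 0, SELECT_READY = 1, SELECT_EINTR = -1 };

/*
 * Models a "wait for any" call: returns the index of the FIRST signaled
 * event in the list, or -1 if none is signaled.
 */
static int
wait_any(const bool signaled[], size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
    {
        if (signaled[i])
            return (int) i;
    }
    return -1;
}

/*
 * Trusts only the returned index. If both events are signaled, the
 * socket event wins and the signal goes unnoticed here.
 */
static int
select_index_only(const bool signaled[EV_COUNT])
{
    int first = wait_any(signaled, EV_COUNT);

    if (first == EV_SIGNAL)
        return SELECT_EINTR;
    if (first == EV_SOCKET)
        return SELECT_READY;
    return SELECT_TIMEOUT;
}

/*
 * Illustrative alternative: after the wait returns, look at the signal
 * event directly instead of relying on which index was reported.
 */
static int
select_recheck_signal(const bool signaled[EV_COUNT])
{
    int first = wait_any(signaled, EV_COUNT);

    if (first < 0)
        return SELECT_TIMEOUT;
    if (signaled[EV_SIGNAL])
        return SELECT_EINTR;
    return SELECT_READY;
}
```

With both events signaled, select_index_only() reports the socket as ready and the pending signal is left for whatever call comes next, which is the situation described above. select_recheck_signal() reports the interruption in the same case.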

Now I don't know anything about the Windows scheduler, but I suppose it
gives processes time quantums like everybody else does.  So "at the same
time" really means "within the same scheduler clock tick", which is not
so unlikely after all.  In short, before the just-committed patch, the
Windows stats collector would fail if a stats message arrived during the
same clock tick that its SIGALRM timeout expired.
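
As a rough back-of-the-envelope model (an assumption for illustration, not a measurement): suppose stats messages arrive as a Poisson process with rate \lambda per second, and a scheduler tick lasts \tau seconds. Then the chance that at least one message lands in the tick in which the timeout expires is

```latex
P(\text{message in the timeout tick}) = 1 - e^{-\lambda \tau}
```

With hypothetical values \tau = 0.01 s and \lambda = 1 message/s this is about 0.01 per timeout; with \lambda = 100 messages/s it is 1 - e^{-1}, roughly 0.63. The actual tick length and arrival pattern on any given Windows machine may differ, but the dependence on load has the same shape.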

I think this explains not only the intermittent stats regression
failures, but the reports we've heard from Merlin and others about the
stats collector being unstable under load on Windows.  The heavier the
load of stats messages, the more likely one is to arrive during the tick
when the timeout expires.
        regards, tom lane

