Re: Latch implementation that wakes on postmaster death on both win32 and Unix

Поиск
Список
Период
Сортировка
От Florian Pflug
Тема Re: Latch implementation that wakes on postmaster death on both win32 and Unix
Дата
Msg-id AEA230E3-4A98-4C1B-8318-C5859A03C63E@phlo.org
обсуждение исходный текст
Ответ на Re: Latch implementation that wakes on postmaster death on both win32 and Unix  (Peter Geoghegan <peter@2ndquadrant.com>)
Ответы Re: Latch implementation that wakes on postmaster death on both win32 and Unix
Re: Latch implementation that wakes on postmaster death on both win32 and Unix
Список pgsql-hackers
On Jul4, 2011, at 23:11 , Peter Geoghegan wrote:
> On 4 July 2011 17:36, Florian Pflug <fgp@phlo.org> wrote:
>> On Jul4, 2011, at 17:53 , Heikki Linnakangas wrote:
>>>>       Under Linux, select() may report a socket file descriptor as "ready for
>>>>       reading",  while nevertheless a subsequent read blocks.  This could for
>>>>       example happen when data has arrived but  upon  examination  has  wrong
>>>>       checksum and is discarded.  There may be other circumstances in which a
>>>>       file descriptor is spuriously reported as ready.  Thus it may be  safer
>>>>       to use O_NONBLOCK on sockets that should not block.
>>>
>>> So in theory, on Linux you might WaitLatch might sometimes incorrectly return WL_POSTMASTER_DEATH. None of the
callerscheck for WL_POSTMASTER_DEATH return code, they call PostmasterIsAlive() before assuming the postmaster has
died,so that won't affect correctness at the moment. I doubt that scenario can even happen in our case, select() on a
pipethat is never written to. But maybe we should add add an assertion to WaitLatch to assert that if select() reports
thatthe postmaster pipe has been closed, PostmasterIsAlive() also returns false. 
>>
>> The correct solution would be to read() from the pipe after select()
>> returns, and only return WL_POSTMASTER_DEATCH if the read doesn't return
>> EAGAIN. To prevent that read() from blocking if the read event was indeed
>> spurious, O_NONBLOCK must be set on the pipe but that patch does that already.
>
> Let's have some perspective on this. We're talking about a highly
> doubtful chance that latches may wake when they shouldn't.

Yeah, as long as there's just a spurious wake up, sure. However,
having WaitLatch() indicate a postmaster death in that case seems
dangerous. Some caller, sooner or later, is bound to get it wrong,
i.e. forget to re-check PostmasterIsAlive.

> Maybe we should restore the return value of WaitLatch to its previous
> format (so it doesn't return a bitmask)? That way, we don't report
> that the Postmaster died, and therefore clients are required to call
> PostmasterIsAlive() to be sure.

That'd solve the issue too.

> In any case, I'm in favour of the assertion.

I don't really see the value in that assertion. It'd cause spurious
assertion failures in the case of spurious events reported by select().
If we do expect such event, we should close the hole instead of asserting.
If we don't, then what's the point of the assert.

BTW, do we currently retry the select() on EINTR (meaning a signal has
arrived)? If we don't, that'd be an additional source of spurious returns
from select.

>> Btw, with the death-watch / life-sign / whatever infrastructure in place,
>> shouldn't PostmasterIsAlive() be using that instead of getppid() / kill(0)?
>
> Hmm, maybe. That seems like a separate issue though, that can be
> addressed with another patch. It does have the considerable
> disadvantage of making Heikki's proposed assertion failure useless. Is
> the implementation of PostmasterIsAlive() really a problem at the
> moment?

I'm not sure that there is currently a guarantee that PostmasterIsAlive
will returns false immediately after select() indicates postmaster
death. If e.g. the postmaster's parent is still running (which happens
for example if you launch postgres via daemontools), the re-parenting of
backends to init might not happen until the postmaster zombie has been
vanquished by its parent's call of waitpid(). It's not entirely
inconceivable for getppid() to then return the (dead) postmaster's pid
until that waitpid() call has occurred.

But agreed, this is probably best handled by a separate patch.

best regards,
Florian Pflug



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: Re: %ENV warnings during builds
Следующее
От: Florian Pflug
Дата:
Сообщение: Re: Review of patch Bugfix for XPATH() if expression returns a scalar value