Обсуждение: corner case about replication and shutdown

Поиск
Список
Период
Сортировка

corner case about replication and shutdown

От
Fujii Masao
Дата:
Hi,

When I read the shutdown code to create the smart shutdown patch for sync rep,
I found the corner case where shutdown can get stuck infinitely. This happens
when postmaster reaches PM_WAIT_BACKENDS state before walsender marks
itself as WAL sender process for streaming WAL (i.e., before walsender calls
MarkPostmasterChildWalSender). In this case,CountChildren(NORMAL) in
PostmasterStateMachine() returns non-zero because normal backend (i.e.,
would-be walsender) is running, and postmaster in PM_WAIT_BACKENDS state
gets out of PostmasterStateMachine(). Then the backend receives
START_REPLICATION command, declares itself as walsender and
CountChildren(NORMAL) returns zero.

The problem is; that declaration doesn't trigger
PostmasterStateMachine() at all.
So, even though there is no normal backends, postmaster cannot call
PostmasterStateMachine() and move its state from PM_WAIT_BACKENDS.

I think this problem is harmless in practice since it doesn't happen
too often. But
that can happen...

The simple fix is to change ServerLoop() so that it periodically calls
PostmasterStateMachine() while shutdown is running. Though I was thinking to
change PostmasterStateMachine(), that looked complicated. Thought?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: corner case about replication and shutdown

От
Robert Haas
Дата:
[ sorry for not responding sooner ]

On Wed, Mar 23, 2011 at 9:46 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> When I read the shutdown code to create the smart shutdown patch for sync rep,
> I found the corner case where shutdown can get stuck infinitely. This happens
> when postmaster reaches PM_WAIT_BACKENDS state before walsender marks
> itself as WAL sender process for streaming WAL (i.e., before walsender calls
> MarkPostmasterChildWalSender). In this case,CountChildren(NORMAL) in
> PostmasterStateMachine() returns non-zero because normal backend (i.e.,
> would-be walsender) is running, and postmaster in PM_WAIT_BACKENDS state
> gets out of PostmasterStateMachine(). Then the backend receives
> START_REPLICATION command, declares itself as walsender and
> CountChildren(NORMAL) returns zero.
> The problem is; that declaration doesn't trigger
> PostmasterStateMachine() at all.
> So, even though there is no normal backends, postmaster cannot call
> PostmasterStateMachine() and move its state from PM_WAIT_BACKENDS.

Good catch.

> I think this problem is harmless in practice since it doesn't happen
> too often. But
> that can happen...
>
> The simple fix is to change ServerLoop() so that it periodically calls
> PostmasterStateMachine() while shutdown is running.

One idea I had was to have a backend that changes state from regular
backend to walsender kick the postmaster in some way - for example by
writing to a socket the other end of which the postmaster is holding
open.  Florian suggested that  might be useful anyway as a means of
detecting when the postmaster has gone belly-up, so maybe we could
kill two birds with one stone.  That seems like too much rejiggering
to do this late in the release cycle, though.  But I don't think the
idea of calling PostmasterStateMachine() periodically is very
appealing either - that's a significant change in how that code is
being used now, and even if it doesn't break anything else, it'll
allow for hangs of up to 60 seconds, which doesn't sound exciting
either.

The root of this problem in some sense is that we don't distinguish
between regular backends and backends that haven't yet decided whether
they are regular backends or walsenders.  But even if we created such
a distinction it won't fix the problem unless the postmaster somehow
gets notified of the state change.  And if we have that, then we're
back to not needing to distinguish.

Anyone have a good idea?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: corner case about replication and shutdown

От
Fujii Masao
Дата:
On Fri, Apr 1, 2011 at 4:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think this problem is harmless in practice since it doesn't happen
>> too often. But
>> that can happen...
>>
>> The simple fix is to change ServerLoop() so that it periodically calls
>> PostmasterStateMachine() while shutdown is running.
>
> One idea I had was to have a backend that changes state from regular
> backend to walsender kick the postmaster in some way - for example by
> writing to a socket the other end of which the postmaster is holding
> open.  Florian suggested that  might be useful anyway as a means of
> detecting when the postmaster has gone belly-up, so maybe we could
> kill two birds with one stone.  That seems like too much rejiggering
> to do this late in the release cycle, though.  But I don't think the
> idea of calling PostmasterStateMachine() periodically is very
> appealing either - that's a significant change in how that code is
> being used now, and even if it doesn't break anything else, it'll
> allow for hangs of up to 60 seconds, which doesn't sound exciting
> either.
>
> The root of this problem in some sense is that we don't distinguish
> between regular backends and backends that haven't yet decided whether
> they are regular backends or walsenders.  But even if we created such
> a distinction it won't fix the problem unless the postmaster somehow
> gets notified of the state change.  And if we have that, then we're
> back to not needing to distinguish.
>
> Anyone have a good idea?

Another simple fix is to make walsender send SIGUSR1 to postmaster
so that it calls PostmasterStateMachine() in sigusr1_handler(), when it
marks itself as walsender. The attached patch does this. Thought?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Вложения

Re: corner case about replication and shutdown

От
Robert Haas
Дата:
On Thu, Mar 31, 2011 at 11:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> Another simple fix is to make walsender send SIGUSR1 to postmaster
> so that it calls PostmasterStateMachine() in sigusr1_handler(), when it
> marks itself as walsender. The attached patch does this. Thought?

That looks OK to me.  Have you tested it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: corner case about replication and shutdown

От
Fujii Masao
Дата:
On Fri, Apr 1, 2011 at 11:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 31, 2011 at 11:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> Another simple fix is to make walsender send SIGUSR1 to postmaster
>> so that it calls PostmasterStateMachine() in sigusr1_handler(), when it
>> marks itself as walsender. The attached patch does this. Thought?
>
> That looks OK to me.  Have you tested it?

Yes. I added the sleep just before MarkPostmasterChildWalSender() in
walsender.c,
compiled, started replication, and then requested smart shutdown as soon as
walsender was forked (i.e., during the sleep). Without the patch, the server got
stuck infinitely. With the patch, smart shutdown worked as expected.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


Re: corner case about replication and shutdown

От
Robert Haas
Дата:
On Sat, Apr 2, 2011 at 4:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Apr 1, 2011 at 11:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Mar 31, 2011 at 11:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> Another simple fix is to make walsender send SIGUSR1 to postmaster
>>> so that it calls PostmasterStateMachine() in sigusr1_handler(), when it
>>> marks itself as walsender. The attached patch does this. Thought?
>>
>> That looks OK to me.  Have you tested it?
>
> Yes. I added the sleep just before MarkPostmasterChildWalSender() in
> walsender.c,
> compiled, started replication, and then requested smart shutdown as soon as
> walsender was forked (i.e., during the sleep). Without the patch, the server got
> stuck infinitely. With the patch, smart shutdown worked as expected.

Cool.  Committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company