Re: [HACKERS] logical replication and PANIC during shutdowncheckpoint in publisher

Поиск
Список
Период
Сортировка
От Michael Paquier
Тема Re: [HACKERS] logical replication and PANIC during shutdowncheckpoint in publisher
Дата
Msg-id CAB7nPqS8ndjKwgekZG8dgfffOVJNcLw6bviPZ3+ShexK5E=ukg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] logical replication and PANIC during shutdowncheckpoint in publisher  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Ответы Re: [HACKERS] logical replication and PANIC during shutdowncheckpoint in publisher
Re: [HACKERS] logical replication and PANIC during shutdowncheckpoint in publisher
Список pgsql-hackers
On Fri, Apr 21, 2017 at 12:29 AM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 4/20/17 07:52, Petr Jelinek wrote:
>> On 20/04/17 05:57, Michael Paquier wrote:
>>> 2nd thoughts here... Ah now I see your point. True that there is no
>>> way to ensure that an unwanted command is not running when SIGUSR2 is
>>> received as the shutdown checkpoint may have already begun. Here is an
>>> idea: add a new state in WalSndState, say WALSNDSTATE_STOPPING, and
>>> the shutdown checkpoint does not run as long as all WAL senders still
>>> running do not reach such a state.
>>
>> +1 to this solution
>
> Michael, can you attempt to supply a patch?

Hmm. I have been actually looking at this solution and I am having
doubts regarding its robustness. In short this would need to be
roughly a two-step process:
- In PostmasterStateMachine(), SIGUSR2 is sent to the checkpoint to
make it call ShutdownXLOG(). Prior doing that, a first signal should
be sent to all the WAL senders with
SignalSomeChildren(BACKEND_TYPE_WALSND). SIGUSR2 or SIGINT could be
used.
- At reception of this signal, all WAL senders switch to a stopping
state, refusing commands that can generate WAL.
- Checkpointer looks at the state of all WAL senders, looping with a
sleep call of a couple of ms, refusing to launch the shutdown
checkpoint as long as all WAL senders have not switched to the
stopping state.
- In reaper(), once checkpointer is confirmed as stopped, signal again
the WAL senders, and tell them to perform the last loop.

After that, I got a second, more simple idea.
CheckpointerShmem->ckpt_flags holds the information about checkpoints
currently running, so we could have the WAL senders look at this data
and prevent any commands generating WAL. The checkpointer may be
already stopped at the moment the WAL senders finish their loop, so we
need also to check if the checkpointer is running or not on those code
paths. Such safeguards may actually be enough for the problem of this
thread. Thoughts?
-- 
Michael



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Noah Misch
Дата:
Сообщение: Re: [HACKERS] Quorum commit for multiple synchronous replication.
Следующее
От: Masahiko Sawada
Дата:
Сообщение: Re: [HACKERS] Quorum commit for multiple synchronous replication.