Re: "wal receiver" process hang in syslog() while exiting after receiving SIGTERM while the postgres has been promoted.

From: Fujii Masao
Subject: Re: "wal receiver" process hang in syslog() while exiting after receiving SIGTERM while the postgres has been promoted.
Msg-id: CAHGQGwERcjEORNh7NmdC8Theg+GCziUcFvi6nZACW9PK-JadhQ@mail.gmail.com
In reply to: RE: "wal receiver" process hang in syslog() while exiting after receiving SIGTERM while the postgres has been promoted.  ("Chen, Yan-Jack (NSB - CN/Hangzhou)" <yan-jack.chen@nokia-sbell.com>)
List: pgsql-hackers
>>   We encountered a problem while trying to promote a standby
>> postgres (version 9.6.9) server to active. From the trace (we triggered a
>> process abort), we can see that the process hung in syslog() while handling
>> SIGTERM. According to the article below, it looks risky to write to syslog
>> in a signal handler. Any idea how to avoid it?

ISTM that this issue can happen if ereport() is called before the
WalRcvImmediateInterruptOK flag is disabled, as in the code below.
In that case, if SIGTERM is sent while the log message is being written,
the signal handler calls another ereport() because the
WalRcvImmediateInterruptOK flag is still enabled.
Then walreceiver gets stuck...

      EnableWalRcvImmediateExit();
      wrconn = walrcv_connect(conninfo, false, "walreceiver", &err);
      if (!wrconn)
          ereport(ERROR,
                  (errmsg("could not connect to the primary server: %s", err)));
      DisableWalRcvImmediateExit();
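
To make the scenario concrete, here is a minimal standalone sketch of it.
This is not PostgreSQL code: every name in it is hypothetical, and the
"logger" is only a flag standing in for whatever internal state syslog()
holds while it writes a message. The signal is raised synchronously so the
timing is deterministic.

```c
#include <signal.h>

/* All names below are hypothetical; this is not PostgreSQL code. */
static volatile sig_atomic_t in_logger = 0;              /* toy "syslog is busy" state */
static volatile sig_atomic_t immediate_interrupt_ok = 0; /* like WalRcvImmediateInterruptOK */
static volatile sig_atomic_t reentered = 0;

static void toy_sigterm_handler(int signo)
{
    (void) signo;
    /* The handler logs only while immediate interrupts are allowed. */
    if (immediate_interrupt_ok && in_logger)
        reentered = 1;          /* a real syslog() call could block here */
}

/* Toy ereport(): SIGTERM is raised while the message is being written. */
static void toy_ereport(void)
{
    in_logger = 1;
    raise(SIGTERM);             /* handler runs before raise() returns */
    in_logger = 0;
}

/* Returns 1 if the handler would have re-entered the logger, else 0. */
int demo_reentrancy(int flag_enabled_during_log)
{
    reentered = 0;
    signal(SIGTERM, toy_sigterm_handler);
    immediate_interrupt_ok = flag_enabled_during_log;
    toy_ereport();
    immediate_interrupt_ok = 0;
    signal(SIGTERM, SIG_DFL);
    return reentered;
}
```

Calling demo_reentrancy(1) models the snippet above, where the error is
reported while the flag is still enabled; demo_reentrancy(0) models
reporting it only after the flag has been disabled.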

On Tue, Jun 26, 2018 at 5:12 PM, Chen, Yan-Jack (NSB - CN/Hangzhou)
<yan-jack.chen@nokia-sbell.com> wrote:
> Hi,
>   Well, do you agree that we should not write logs in a signal handling function under any circumstances?  I see that in many cases in
> postgresql the signal handling function just sets one flag, which triggers the main process to do the actual handling.
>
>   How about simply removing these code lines?
>
> --- walreceiver_old.c
> +++ walreceiver.c
> @@ -816,10 +816,6 @@
>
>         SetLatch(&WalRcv->latch);
>
> -       /* Don't joggle the elbow of proc_exit */
> -       if (!proc_exit_inprogress && WalRcvImmediateInterruptOK)
> -               ProcessWalRcvInterrupts();
> -
>         errno = save_errno;
>  }

This change seems to cause another hang. Please imagine the case
where SIGTERM is sent while libpqrcv_connect() is waiting on the latch
(i.e., WaitLatchOrSocket()). In this case, SIGTERM causes libpqrcv_connect()
to wake up, call ResetLatch() and CHECK_FOR_INTERRUPTS(), and then
restart waiting on the latch. That is, walreceiver can get stuck on
libpqrcv_connect() in this case.

One idea to fix the above problem is to change CHECK_FOR_INTERRUPTS()
so that, if WalRcvImmediateInterruptOK is true, it calls
ProcessWalRcvInterrupts() and then ereport(FATAL) immediately.
This can cause walreceiver to ereport(FATAL) immediately after
libpqrcv_connect() wakes up from the wait on the latch.
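
A rough standalone model of both behaviours, just to show the control flow.
Again, this is only a sketch: the struct, the function names and the
"patched" switch are all hypothetical, and the toy loop assumes the remote
side never answers, so only an interrupt can end the wait.

```c
#include <stdbool.h>

/* Hypothetical toy state; none of these are the real PostgreSQL symbols. */
typedef struct ToyWalRcv
{
    bool latch_set;              /* stands in for the process latch */
    bool got_sigterm;            /* set by the signal handler */
    bool immediate_interrupt_ok; /* like WalRcvImmediateInterruptOK */
    bool exited;                 /* true once the toy "ereport(FATAL)" ran */
} ToyWalRcv;

/* Handler with the ProcessWalRcvInterrupts() call removed: flag + latch only. */
static void toy_sigterm_handler(ToyWalRcv *w)
{
    w->got_sigterm = true;
    w->latch_set = true;
}

/* Toy CHECK_FOR_INTERRUPTS(); 'patched' selects the proposed behaviour. */
static void toy_check_for_interrupts(ToyWalRcv *w, bool patched)
{
    if (patched && w->immediate_interrupt_ok && w->got_sigterm)
        w->exited = true;        /* real code would ereport(FATAL) here */
}

/*
 * Toy connect loop. Returns true if the loop exited, false if it went back
 * to sleep on the latch (which in a real process would be the hang).
 */
bool toy_connect(bool patched)
{
    ToyWalRcv w = { false, false, true, false };

    toy_sigterm_handler(&w);             /* SIGTERM arrives during the wait */

    while (w.latch_set)                  /* WaitLatchOrSocket() returns */
    {
        w.latch_set = false;             /* ResetLatch() */
        toy_check_for_interrupts(&w, patched);
        if (w.exited)
            return true;
    }
    return false;                        /* latch clear: would block forever */
}
```

With patched = false the loop consumes the wakeup and goes back to waiting,
which is the hang described above; with patched = true it exits on the first
wakeup.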

Regards,

-- 
Fujii Masao

