Re: Assertion failure in WaitForWALToBecomeAvailable state machine

From: Bharath Rupireddy
Subject: Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Msg-id: CALj2ACUoBWbaFo_t0gew+x6n0V+mpvB_23HLvsVD9abgCShV5A@mail.gmail.com
In response to: Assertion failure in WaitForWALToBecomeAvailable state machine (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: Assertion failure in WaitForWALToBecomeAvailable state machine (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers

On Fri, Feb 11, 2022 at 3:33 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Hi,
>
> The problem is that whenever we are going for streaming we always set
> XLogCtl->InstallXLogFileSegmentActive to true, but while switching
> from streaming to archive we do not always reset it so it hits
> assertion in some cases. Basically we reset it inside
> XLogShutdownWalRcv() but while switching from the streaming mode we
> only call it conditionally if WalRcvStreaming().  But it is very much
> possible that even before we call WalRcvStreaming() the walreceiver
> might have set walrcv->walRcvState to WALRCV_STOPPED.  So now
> WalRcvStreaming() will return false.  So I agree now we do not want to
> really shut down the walreceiver but who will reset the flag?
>
> I just ran some tests on primary and attached the walreceiver to gdb
> and waited for it to exit with timeout and then the recovery process
> hit the assertion.
>
> 2022-02-11 14:33:56.976 IST [60978] FATAL:  terminating walreceiver
> due to timeout
> cp: cannot stat
> ‘/home/dilipkumar/work/PG/install/bin/wal_archive/00000002.history’:
> No such file or directory
> 2022-02-11 14:33:57.002 IST [60973] LOG:  restored log file
> "000000010000000000000003" from archive
> TRAP: FailedAssertion("!XLogCtl->InstallXLogFileSegmentActive", File:
> "xlog.c", Line: 3823, PID: 60973)
>
> I have just applied a quick fix and that solved the issue, basically
> if the last failed source was streaming and the WalRcvStreaming() is
> false then just reset this flag.

IIUC, the issue can happen when the walreceiver has failed to get WAL
from the primary for whatever reason, but its status is not yet
WALRCV_STOPPING or WALRCV_STOPPED, and the startup process has moved
ahead in WaitForWALToBecomeAvailable to read from the archive, which
ends up in this assertion failure. ISTM this is a rare scenario, and it
depends on what the walreceiver does between failing to get WAL from
the primary and updating its status to WALRCV_STOPPING or
WALRCV_STOPPED.

If the above race condition is considered a problem at all, let alone a
serious one, it needs to be fixed. I don't think just setting
InstallXLogFileSegmentActive to false is enough. Going by the comment
[1], it doesn't make sense to move ahead with restoring from the
archive location unless the WAL receiver has fully stopped. IMO, the
real fix is to just remove the WalRcvStreaming() check and call
XLogShutdownWalRcv() unconditionally. Anyway, we have the
Assert(!WalRcvStreaming()); further down. I don't think it will create
any problems.

[1]
                    /*
                     * Before we leave XLOG_FROM_STREAM state, make sure that
                     * walreceiver is not active, so that it won't overwrite
                     * WAL that we restore from archive.
                     */
                    if (WalRcvStreaming())
                        XLogShutdownWalRcv();
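
To make the difference concrete, here is a minimal toy model of the two
variants. This is not PostgreSQL code: ToyShared and every toy_* function
are made up for illustration, and the only behavior assumed is what was
described upthread (switching to streaming sets the flag,
XLogShutdownWalRcv() resets it, and a walreceiver that exits on its own
does not).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these toy_* names exist in PostgreSQL. */
typedef enum { TOY_WALRCV_STOPPED, TOY_WALRCV_STREAMING } ToyWalRcvState;

typedef struct
{
    ToyWalRcvState walRcvState;          /* stands in for walrcv->walRcvState */
    bool InstallXLogFileSegmentActive;   /* stands in for the XLogCtl flag */
} ToyShared;

bool toy_WalRcvStreaming(const ToyShared *s)
{
    return s->walRcvState == TOY_WALRCV_STREAMING;
}

/* Startup process switches to streaming: the flag is always set. */
void toy_start_streaming(ToyShared *s)
{
    s->walRcvState = TOY_WALRCV_STREAMING;
    s->InstallXLogFileSegmentActive = true;
}

/* Walreceiver exits on its own (e.g. timeout); nobody resets the flag. */
void toy_walreceiver_exits(ToyShared *s)
{
    s->walRcvState = TOY_WALRCV_STOPPED;
}

/* As described upthread: shutting down the walreceiver resets the flag. */
void toy_XLogShutdownWalRcv(ToyShared *s)
{
    s->walRcvState = TOY_WALRCV_STOPPED;
    s->InstallXLogFileSegmentActive = false;
}

/* Current behavior. Returns true if the flag is clear afterwards. */
bool toy_leave_stream_conditional(ToyShared *s)
{
    if (toy_WalRcvStreaming(s))
        toy_XLogShutdownWalRcv(s);
    return !s->InstallXLogFileSegmentActive;
}

/* Proposed behavior: always shut down, then check as the existing Assert does. */
bool toy_leave_stream_unconditional(ToyShared *s)
{
    toy_XLogShutdownWalRcv(s);
    assert(!toy_WalRcvStreaming(s));
    return !s->InstallXLogFileSegmentActive;
}
```

In this model, calling toy_start_streaming() and then
toy_walreceiver_exits() leaves the flag set, so
toy_leave_stream_conditional() returns false (the case that would trip
the assertion), while toy_leave_stream_unconditional() returns true.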

Regards,
Bharath Rupireddy.


