Re: [BUG] non archived WAL removed during production crash recovery

Поиск
Список
Период
Сортировка
От Fujii Masao
Тема Re: [BUG] non archived WAL removed during production crash recovery
Дата
Msg-id d0ef12c5-7e36-b9b3-c8ea-2e98d69d1317@oss.nttdata.com
обсуждение исходный текст
Ответ на [BUG] non archived WAL removed during production crash recovery  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Ответы Re: [BUG] non archived WAL removed during production crash recovery
Список pgsql-bugs

On 2020/04/01 0:22, Jehan-Guillaume de Rorthais wrote:
> Hello,
> 
> A colleague of mine reported an expected behavior.
> 
> On production cluster is in crash recovery, eg. after killing a backend, the
> WALs ready to be archived are removed before being archived.
> 
> See in attachment the reproduction script "non-arch-wal-on-recovery.bash".
> 
> This behavior has been introduced in 78ea8b5daab9237fd42d7a8a836c1c451765499f.
> Function XLogArchiveCheckDone() badly consider the in crashed recovery
> production cluster as a standby without archive_mode=always. So the check
> conclude the WAL can be removed safely.

Thanks for the report! Yeah, this seems a bug.

>    bool inRecovery = RecoveryInProgress();
>    
>    /*
>     * The file is always deletable if archive_mode is "off".  On standbys
>     * archiving is disabled if archive_mode is "on", and enabled with
>     * "always".  On a primary, archiving is enabled if archive_mode is "on"
>     * or "always".
>     */
>    if (!((XLogArchivingActive() && !inRecovery) ||
>          (XLogArchivingAlways() && inRecovery)))
>        return true;
> 
> Please find in attachment a patch that fix this issue using the following test
> instead:
> 
>    if (!((XLogArchivingActive() && !StandbyModeRequested) ||
>          (XLogArchivingAlways() && inRecovery)))
>        return true;
> 
> I'm not sure if we should rely on StandbyModeRequested for the second part of
> the test as well thought. What was the point to rely on RecoveryInProgress() to
> get the recovery status from shared mem?

Since StandbyModeRequested is the startup process-local variable,
it basically cannot be used in XLogArchiveCheckDone() that other process
may call. So another approach would be necessary... One straight idea is to
add new shmem flag indicating whether we are in standby mode or not.
Another is to make the startup process remove .ready file if necessary.

If it's not easy to fix the issue, we might need to just revert the commit
that introduced the issue at first.

Regards,

-- 
Fujii Masao
NTT DATA CORPORATION
Advanced Platform Technology Group
Research and Development Headquarters



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: BUG #16325: Assert failure on partitioning by int for a textvalue with a collation
Следующее
От: PG Bug reporting form
Дата:
Сообщение: BUG #16331: segfault in checkpointer with full disk