Re: [BUG] non archived WAL removed during production crash recovery

Поиск
Список
Период
Сортировка
От Fujii Masao
Тема Re: [BUG] non archived WAL removed during production crash recovery
Дата
Msg-id 4d48dbdb-e1db-a2d6-2fb2-c2a66efa7818@oss.nttdata.com
обсуждение исходный текст
Ответ на Re: [BUG] non archived WAL removed during production crash recovery  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Ответы Re: [BUG] non archived WAL removed during production crash recovery  (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
Список pgsql-bugs

On 2020/04/04 1:26, Jehan-Guillaume de Rorthais wrote:
> On Thu, 2 Apr 2020 23:55:46 +0900
> Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> 
>> On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote:
>>> On Thu, 2 Apr 2020 19:38:59 +0900
>>> Fujii Masao <masao.fujii@oss.nttdata.com> wrote:
> [...]
>>>> That is, WAL files with .ready files are removed when either
>>>> archive_mode!=always in standby mode or archive_mode=off.
>>>
>>> sounds fine to me.
> 
> To some extends, it appears to me this sentence was relatively close to my
> previous patch, as far as you exclude IN_CRASH_RECOVERY from the shortcut:
> 
> XLogArchiveCheckDone(const char *xlog)
> {
>      [...]
>      if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
>            (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) ||
>            (inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
>          return true;
> 
> Which means that only .done cleanup would occurs during CRASH_RECOVERY
> and .ready files might be created if no .done exists. No matter the futur status
> of the cluster: primary or standby. Normal shortcut will apply during first
> checkpoint after the crash recovery step.
> 
> This should handle the case where a backup without backup_label (taken from a
> snapshot or after a shutdown with immediate) is restored to build a standby.
> 
> Please, find in attachment a third version of my patch
> "0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch".

Thanks for updating the patch! Here are my review comments:

-    bool        SharedRecoveryInProgress;
+    RecoveryState    SharedRecoveryInProgress;

Since the patch changes the meaning of this variable, the name of
the variable should be changed? Otherwise, the current name seems
confusing.

+            SpinLockAcquire(&XLogCtl->info_lck);
+            XLogCtl->SharedRecoveryInProgress = IN_CRASH_RECOVERY;
+            SpinLockRelease(&XLogCtl->info_lck);

As I explained upthread, this code can be reached and IN_CRASH_RECOVERY
can be set even in standby or archive recovery. Is this right behavior that
you're expecting?

Even in crash recovery case, GetRecoveryState() returns IN_ARCHIVE_RECOVERY
until this code is reached. Also when WAL replay is not necessary
(e.g., restart of the server shutdowed cleanly before), GetRecoveryState()
returns IN_ARCHIVE_RECOVERY because this code is not reached. Aren't
these fragile? If XLogArchiveCheckDone() is only user of GetRecoveryState(),
they would be ok. But if another user will appear in the future, it seems
very easy to mistake. At least those behaviors should be commented in
GetRecoveryState().

-    if (!((XLogArchivingActive() && !inRecovery) ||
-          (XLogArchivingAlways() && inRecovery)))
+    if ( (inRecoveryState != IN_CRASH_RECOVERY) && (
+          (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) &&
+          (inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways())))
        return true;

The last condition seems to cause XLogArchiveCheckDone() to return
true in archive recovery mode with archive_mode=on, then cause
unarchived WAL files with .ready to be removed. Is my understanding right?
If yes, that behavior doesn't seem to match with our consensus, i.e.,
WAL files with .ready should not be removed in that case.

+/* Recovery state */
+typedef enum RecoveryState
+{
+    NOT_IN_RECOVERY = 0,
+    IN_CRASH_RECOVERY,
+    IN_ARCHIVE_RECOVERY
+} RecoveryState;

Isn't it better to add more comments here? For example, what does
"Recovery state" mean? Which state is used in standby mode? Why? ,etc.

Is it really ok not to have the value indicating standby mode?

These enum values names are confusing because the variables with
similar names already exist. For example, IN_CRASH_RECOVERY vs.
DB_IN_CRASH_RECOVERY. So IMO it seems better to rename them,
.e.g., by adding the prefix.

> 
> "0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left
> untouched. But I'm considering adding some more tests relative to this
> discussion.
> 
>> BTW, now I'm thinking that the flag in shmem should be updated when
>> the startup process sets StandbyModeRequested to true at the beginning
>> of the recovery. That is,
>>
>> - Add something like SharedStandbyModeRequested into XLogCtl. This field
>>      should be initialized with false;
>> - Set XLogCtl->SharedStandbyModeRequested to true when the startup
>>      process detects the standby.signal file and sets the local variable
>>      StandbyModeRequested to true.
>> - Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested
>>      to know whether the server is in standby mode or not.
>>
>> Thought?
> 
> I try to avoid a new flag in memory with the proposal in attachment of this
> email. It seems to me various combinaisons of booleans with subtle differences
> around the same subject makes it a bit trappy and complicated to understand.

Ok, so firstly I try to review your patch!

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Jehan-Guillaume de Rorthais
Дата:
Сообщение: Re: [BUG] non archived WAL removed during production crash recovery
Следующее
От: PG Bug reporting form
Дата:
Сообщение: BUG #16342: CREATE TABLE LIKE INCLUDING GENERATED column order issue