Re: [BUG] non archived WAL removed during production crash recovery
От | Fujii Masao |
---|---|
Тема | Re: [BUG] non archived WAL removed during production crash recovery |
Дата | |
Msg-id | 4d48dbdb-e1db-a2d6-2fb2-c2a66efa7818@oss.nttdata.com обсуждение исходный текст |
Ответ на | Re: [BUG] non archived WAL removed during production crash recovery (Jehan-Guillaume de Rorthais <jgdr@dalibo.com>) |
Ответы |
Re: [BUG] non archived WAL removed during production crash recovery
(Jehan-Guillaume de Rorthais <jgdr@dalibo.com>)
|
Список | pgsql-bugs |
On 2020/04/04 1:26, Jehan-Guillaume de Rorthais wrote: > On Thu, 2 Apr 2020 23:55:46 +0900 > Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > >> On 2020/04/02 22:49, Jehan-Guillaume de Rorthais wrote: >>> On Thu, 2 Apr 2020 19:38:59 +0900 >>> Fujii Masao <masao.fujii@oss.nttdata.com> wrote: > [...] >>>> That is, WAL files with .ready files are removed when either >>>> archive_mode!=always in standby mode or archive_mode=off. >>> >>> sounds fine to me. > > To some extends, it appears to me this sentence was relatively close to my > previous patch, as far as you exclude IN_CRASH_RECOVERY from the shortcut: > > XLogArchiveCheckDone(const char *xlog) > { > [...] > if ( (inRecoveryState != IN_CRASH_RECOVERY) && ( > (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) || > (inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways()))) > return true; > > Which means that only .done cleanup would occurs during CRASH_RECOVERY > and .ready files might be created if no .done exists. No matter the futur status > of the cluster: primary or standby. Normal shortcut will apply during first > checkpoint after the crash recovery step. > > This should handle the case where a backup without backup_label (taken from a > snapshot or after a shutdown with immediate) is restored to build a standby. > > Please, find in attachment a third version of my patch > "0001-v3-Fix-WAL-retention-during-production-crash-recovery.patch". Thanks for updating the patch! Here are my review comments: - bool SharedRecoveryInProgress; + RecoveryState SharedRecoveryInProgress; Since the patch changes the meaning of this variable, the name of the variable should be changed? Otherwise, the current name seems confusing. + SpinLockAcquire(&XLogCtl->info_lck); + XLogCtl->SharedRecoveryInProgress = IN_CRASH_RECOVERY; + SpinLockRelease(&XLogCtl->info_lck); As I explained upthread, this code can be reached and IN_CRASH_RECOVERY can be set even in standby or archive recovery. Is this right behavior that you're expecting? Even in crash recovery case, GetRecoveryState() returns IN_ARCHIVE_RECOVERY until this code is reached. Also when WAL replay is not necessary (e.g., restart of the server shutdowed cleanly before), GetRecoveryState() returns IN_ARCHIVE_RECOVERY because this code is not reached. Aren't these fragile? If XLogArchiveCheckDone() is only user of GetRecoveryState(), they would be ok. But if another user will appear in the future, it seems very easy to mistake. At least those behaviors should be commented in GetRecoveryState(). - if (!((XLogArchivingActive() && !inRecovery) || - (XLogArchivingAlways() && inRecovery))) + if ( (inRecoveryState != IN_CRASH_RECOVERY) && ( + (inRecoveryState == NOT_IN_RECOVERY && !XLogArchivingActive()) && + (inRecoveryState == IN_ARCHIVE_RECOVERY && !XLogArchivingAlways()))) return true; The last condition seems to cause XLogArchiveCheckDone() to return true in archive recovery mode with archive_mode=on, then cause unarchived WAL files with .ready to be removed. Is my understanding right? If yes, that behavior doesn't seem to match with our consensus, i.e., WAL files with .ready should not be removed in that case. +/* Recovery state */ +typedef enum RecoveryState +{ + NOT_IN_RECOVERY = 0, + IN_CRASH_RECOVERY, + IN_ARCHIVE_RECOVERY +} RecoveryState; Isn't it better to add more comments here? For example, what does "Recovery state" mean? Which state is used in standby mode? Why? ,etc. Is it really ok not to have the value indicating standby mode? These enum values names are confusing because the variables with similar names already exist. For example, IN_CRASH_RECOVERY vs. DB_IN_CRASH_RECOVERY. So IMO it seems better to rename them, .e.g., by adding the prefix. > > "0002-v1-Add-test-on-non-archived-WAL-during-crash-recovery.patch" is left > untouched. But I'm considering adding some more tests relative to this > discussion. > >> BTW, now I'm thinking that the flag in shmem should be updated when >> the startup process sets StandbyModeRequested to true at the beginning >> of the recovery. That is, >> >> - Add something like SharedStandbyModeRequested into XLogCtl. This field >> should be initialized with false; >> - Set XLogCtl->SharedStandbyModeRequested to true when the startup >> process detects the standby.signal file and sets the local variable >> StandbyModeRequested to true. >> - Make XLogArchiveCheckDone() use XLogCtl->SharedStandbyModeRequested >> to know whether the server is in standby mode or not. >> >> Thought? > > I try to avoid a new flag in memory with the proposal in attachment of this > email. It seems to me various combinaisons of booleans with subtle differences > around the same subject makes it a bit trappy and complicated to understand. Ok, so firstly I try to review your patch! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
В списке pgsql-bugs по дате отправления:
Предыдущее
От: Jehan-Guillaume de RorthaisДата:
Сообщение: Re: [BUG] non archived WAL removed during production crash recovery
Следующее
От: PG Bug reporting formДата:
Сообщение: BUG #16342: CREATE TABLE LIKE INCLUDING GENERATED column order issue