Detecting some cases of missing backup_label
From | Andres Freund |
---|---|
Subject | Detecting some cases of missing backup_label |
Date | |
Msg-id | 20231130205605.slaaw2ny5sjmukn3@awork3.anarazel.de |
Responses | Re: Detecting some cases of missing backup_label (Stephen Frost <sfrost@snowman.net>) |
List | pgsql-hackers |
Hi,

I recently mentioned to Robert (and also Heikki earlier) that I think I see a way to detect an omitted backup_label in a relevant subset of the cases (it'd apply to the pg_control as well, if we moved to that). Robert encouraged me to share the idea, even though it does not provide complete protection.

The subset I think we can address is the following:

a) An omitted backup_label would lead to corruption, i.e. without the backup_label we won't start recovery at the right position. Obviously it'd be better to also catch a wrong procedure when it'd not cause corruption - perhaps my idea can be extended to handle that, with a small bit of overhead.

b) The backup has been taken from a primary. Unfortunately the standby case probably can't be addressed - but the vast majority of backups are taken from a primary, so I think it's still a worthwhile protection.

Here's my approach:

1) We add a XLOG_BACKUP_START WAL record when starting a base backup on a primary, emitted just *after* the checkpoint completed.

2) When replaying a base backup start record, we create a state file that includes the corresponding LSN in the filename.

3) On the primary, the state file for XLOG_BACKUP_START is *not* created at that time. Instead the state file is created during pg_backup_stop().

4) When replaying a XLOG_BACKUP_END record, we verify that the state file created by XLOG_BACKUP_START is present, and error out if not. Backups that started before the redo LSN from backup_label are ignored (necessitates remembering that LSN, but we've been discussing that anyway).

Because the backup state file on the primary is only created during pg_backup_stop(), a copy of the data directory taken between pg_backup_start() and pg_backup_stop() does *not* contain the corresponding "backup state file". Because of this, an omitted backup_label is detected if recovery does not start early enough - recovery won't encounter the XLOG_BACKUP_START record and thus would not create the state file, leading to an error in 4).

It is not a problem that the primary does not create the state file before pg_backup_stop() - if the primary crashes before pg_backup_stop(), there is no XLOG_BACKUP_END and thus no error will be raised.

It's a bit odd that the sequence differs between normal processing and recovery, but I think that's nothing a good comment couldn't explain.

I haven't worked out the details, but I think we might be able to extend this to catch errors even if there is no checkpoint during the base backup, by emitting the WAL record *before* the RequestCheckpoint(), and creating the corresponding state file during backup_label processing at the start of recovery. That'd probably make the logic for when we can remove the backup state files a bit more complicated, but I think we could deal with that.

Comments? Swear words?

Greetings,

Andres Freund
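
The replay-side checks in steps 2) and 4) can be illustrated with a toy model. This is only a sketch under stated assumptions, not PostgreSQL code: WAL is modelled as a list of records with integer LSNs, the `Record` type, the `replay()` function, the `MissingBackupLabel` exception and the `backup_state.<LSN>` file name are all hypothetical, and "recovery without backup_label" is modelled as starting from the latest checkpoint that made it into the copied pg_control.

```python
from dataclasses import dataclass


class MissingBackupLabel(Exception):
    """Raised when a BACKUP_END record is replayed without its state file."""


@dataclass
class Record:
    lsn: int
    kind: str            # "CHECKPOINT", "DATA", "BACKUP_START" or "BACKUP_END"
    start_lsn: int = 0   # for BACKUP_END: LSN of the matching BACKUP_START


def state_file(start_lsn: int) -> str:
    # Hypothetical naming scheme; the mail only says the LSN is in the filename.
    return f"backup_state.{start_lsn:08X}"


def replay(wal, start_at, files, label_redo=None):
    """Replay records with lsn >= start_at and return the resulting state files.

    files      -- state files present in the restored data directory
    label_redo -- redo LSN taken from backup_label, or None if it was omitted
    """
    files = set(files)
    for rec in wal:
        if rec.lsn < start_at:
            continue
        if rec.kind == "BACKUP_START":
            # Step 2: replaying the start record creates the state file.
            files.add(state_file(rec.lsn))
        elif rec.kind == "BACKUP_END":
            # Backups that started before the backup_label redo LSN are ignored.
            if label_redo is not None and rec.start_lsn < label_redo:
                continue
            # Step 4: the state file must exist, otherwise error out.
            if state_file(rec.start_lsn) not in files:
                raise MissingBackupLabel(
                    f"BACKUP_END at LSN {rec.lsn}: {state_file(rec.start_lsn)} missing"
                )
    return files


# One base backup with an ordinary checkpoint happening while files are copied.
WAL = [
    Record(1, "CHECKPOINT"),
    Record(2, "DATA"),
    Record(3, "CHECKPOINT"),              # checkpoint of pg_backup_start(), redo LSN
    Record(4, "BACKUP_START"),            # step 1: emitted just after the checkpoint
    Record(5, "DATA"),
    Record(6, "CHECKPOINT"),              # ends up in the copied pg_control
    Record(7, "DATA"),
    Record(8, "BACKUP_END", start_lsn=4),
]

if __name__ == "__main__":
    # Step 3: the copied data directory contains no state file.
    print(replay(WAL, start_at=3, files=set(), label_redo=3))
    try:
        replay(WAL, start_at=6, files=set())   # backup_label omitted
    except MissingBackupLabel as exc:
        print("detected:", exc)
```

In this model, starting at LSN 3 (the backup_label redo position) replays the start record and passes the check, while starting at LSN 6 skips the start record and fails at the end record. If no checkpoint happens during the backup, both cases start at LSN 3 and nothing is reported, which matches the limitation described in a).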