On Sat, Nov 18, 2023 at 01:49:15PM -0800, Andres Freund wrote:
> Note that the LSN in the "continuing" case is the one the backup started at,
> not where recovery will start.
>
> I've wondered whether it's worth also adding an explicit message just after
> ReachedEndOfBackup(), but it seems far less urgent due to the existing
> "consistent recovery state reached at %X/%X" message.
Upgrading the surrounding DEBUG1 to a LOG is another option, but I
agree that I've seen less that as being an actual problem in the field
compared to the famous I-removed-a-backup-label-and-I-m-still-up,
until this user sees signs of corruption after recovery was finished,
sometimes days after putting back an instance online.
> Playing around with this a bit, I'm wondering if we instead should remove that
> message, and emit something more informative earlier on. If there's a problem,
> you kinda want the information before we finally get to the loop in
> PerformWalLRecovery(). If e.g. there's no WAL you'll only get
> LOG: invalid checkpoint record
> PANIC: could not locate a valid checkpoint record
I was looking at this code a few weeks ago and have on my stack of
list to do an item about sending a patch to make this exact message
PANIC more talkative as there are a lot of instances with
log_min_messages > log.
> which is delightfully lacking in details.
With a user panicking as much as the server itself, that's even more
tasty.
+ if (ControlFile->backupStartPoint != InvalidXLogRecPtr)
+ ereport(LOG,
+ (errmsg("continuing to start from base backup with redo LSN %X/%X",
+ LSN_FORMAT_ARGS(ControlFile->backupStartPoint))));
"Continuing to start" sounds a bit weird to me, though, considering
that there are a few LOGs that say "starting" when there is a signal
file, but I don't have a better idea on top of my mind. So that
sounds OK here.
--
Michael