Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles
From | Nitin Motiani |
---|---|
Subject | Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles |
Date | |
Msg-id | CAH5HC95Gmmb=f3LxHLw-x5wSae7A-9uNasUL0wU8V0HaKmfGfw@mail.gmail.com |
In reply to | Re: [PATCH] Accept connections post recovery without waiting for RemoveOldXlogFiles (Fujii Masao <masao.fujii@gmail.com>) |
List | pgsql-hackers |
On Tue, Sep 9, 2025 at 1:23 PM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Mon, Sep 8, 2025 at 6:33 PM Nitin Motiani <nitinmotiani@google.com> wrote:
> >
> > Hi Hackers,
> >
> > I'd like to propose a patch to allow accepting connections post recovery without waiting for the removal of old xlogfiles.
>
> As another idea, could crash recovery avoid waiting for the end-of-recovery
> checkpoint itself to finish, similar to archive recovery? In other words,
> crash recovery would write the end-of-recovery WAL record and request
> a checkpoint, but not block until it completes. Thought?
>

Thanks for the feedback, Fujii. I'll look into this, although based on Dilip's reply it is probably not feasible.

> One concern, though: in your case, the first checkpoint after crash recovery
> could take a very long time, since it needs to remove a large number of
> WAL files. This could delay subsequent checkpoints beyond checkpoint_timeout.
> If so, perhaps we'd need to limit how many WAL files a single checkpoint
> can remove.
>

Should the number of WAL files removed be limited only for this checkpoint, or for all checkpoints in general? A couple of thoughts on these options:

1. If we do it only for the checkpoint that follows end-of-recovery, we will have to add special handling for that case, which may reduce the simplicity of this approach. Also, if we limit only the first checkpoint after recovery, a later checkpoint might again spend a lot of time removing these files and delay the checkpoints after it.

2. We can do it for all checkpoints, but that can make the bloat persist for much longer. One alternative might be to provide a GUC to set the number or total size of WAL files removed per checkpoint, or a timeout for this step, which would require some tuning from users.

Also, what do you think of the simple method of skipping the removal of files at recovery time and letting future checkpoints take care of it?
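A per-checkpoint cap like the one in option 2 could look roughly like the following. This is a hypothetical sketch, not PostgreSQL code: the function remove_old_segments_bounded and its max_removals parameter (standing in for the suggested GUC) are invented names, segment numbers stand in for WAL file names, and copying into the removed array stands in for actually unlinking or recycling a file. Segments that are not removed stay queued for the next checkpoint.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch only; not PostgreSQL's RemoveOldXlogFiles().
 *
 * pending:      segment numbers old enough to be removed
 * npending:     in/out count of entries in pending
 * max_removals: per-checkpoint cap (stand-in for a GUC); negative = no cap
 * removed:      output array receiving the segments removed by this call
 *
 * Returns how many segments were removed. Leftover entries are shifted to
 * the front of pending so a later checkpoint can continue from them.
 */
static size_t
remove_old_segments_bounded(unsigned long *pending, size_t *npending,
                            long max_removals, unsigned long *removed)
{
    assert(pending != NULL && npending != NULL && removed != NULL);

    size_t n = *npending;
    size_t limit = (max_removals < 0 || (size_t) max_removals > n)
        ? n : (size_t) max_removals;

    /* "Remove" the oldest segments, up to the cap. */
    for (size_t i = 0; i < limit; i++)
        removed[i] = pending[i];

    /* Defer the rest to a future checkpoint. */
    for (size_t i = limit; i < n; i++)
        pending[i - limit] = pending[i];
    *npending = n - limit;

    return limit;
}
```

With a cap of 2 and five pending segments, the first checkpoint removes two and leaves three for the next one; a negative cap removes everything in one pass, which corresponds to having no limit at all. The trade-off from option 2 shows up directly: a smaller cap bounds the time each checkpoint spends on removal but keeps the old files around for more checkpoint cycles.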
One reason I went with the approach in this patch over the others is that, currently, the system is down for the entire time it takes to remove the files. With this patch, the only thing that might be delayed is the checkpoint, which seems like an improvement. Still, it would be great to get your thoughts on this and the other alternatives.

Thanks & Regards,
Nitin Motiani
Google