Re: Race between KeepFileRestoredFromArchive() and restartpoint

From Kyotaro Horiguchi
Subject Re: Race between KeepFileRestoredFromArchive() and restartpoint
Msg-id 20220803.112417.1896039828355025207.horikyota.ntt@gmail.com
In response to Re: Race between KeepFileRestoredFromArchive() and restartpoint  (Don Seiler <don@seiler.us>)
Responses Re: Race between KeepFileRestoredFromArchive() and restartpoint
List pgsql-hackers
At Tue, 2 Aug 2022 16:03:42 -0500, Don Seiler <don@seiler.us> wrote in 
> On Tue, Aug 2, 2022 at 10:01 AM David Steele <david@pgmasters.net> wrote:
> 
> >
> > > That makes sense.  Each iteration of the restartpoint recycle loop has a
> > > 1/N chance of failing.  Recovery adds >N files between restartpoints.
> > > Hence, the WAL directory grows without bound.  Is that roughly the theory
> > > in mind?
> >
> > Yes, though you have formulated it better than I had in my mind.
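
A back-of-envelope reading of the theory quoted above, as a minimal sketch. It assumes two things the thread does not state: that failures are independent with probability 1/N per iteration, and that the recycle loop stops at its first failed iteration. Under those assumptions the number of files recycled before the first failure is geometrically distributed with mean N - 1, which is below the >N files recovery adds, so the expected net change per cycle is positive. The function names are hypothetical.

```python
from fractions import Fraction


def expected_recycled(n):
    """Expected successful recycle iterations before the first failure.

    Assumption (not from the thread): each iteration fails independently
    with probability 1/n and the loop stops at the first failure.
    """
    p = Fraction(1, n)
    # Mean number of successes before the first failure: (1 - p) / p = n - 1
    return (1 - p) / p


def expected_net_growth(n, files_added):
    """Expected change in WAL file count over one restartpoint cycle."""
    return files_added - expected_recycled(n)
```

For example, with N = 10 and 11 files added per cycle, `expected_net_growth(10, 11)` gives 2. The real loop is also bounded by the number of removable files, which can only lower the recycled count further.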

I'm not sure I understand it correctly, but isn't the issue in the
other thread caused by skipping many checkpoint records within
checkpoint_timeout?  I remember proposing a GUC variable to disable
that checkpoint skipping.  As another measure for that issue, we could
force replaying a checkpoint if max_wal_size is already filled up or
is known to be filled within the next checkpoint cycle.
If this is correct, this patch is irrelevant to the issue.

> > Let's see if Don can confirm that he is seeing the "could not link file"
> > messages.
> 
> 
> During my latest incident, there was only one occurrence:
> 
> could not link file "pg_wal/xlogtemp.18799" to
> "pg_wal/000000010000D45300000010": File exists

(I noticed that the patch in the other thread is broken:()

Hmm.  It seems like a race condition between StartupXLOG() and
RemoveXlogFile(). We need a wider extent of ControlFileLock. Concretely,
taking ControlFileLock before deciding the target xlog file name in
RemoveXlogFile() seems to prevent this from happening. (If this is
correct, this is a live issue on the master branch.)

> WAL restore/recovery seemed to continue on just fine then. And it would
> continue on until the pg_wal volume ran out of space unless I was manually
> rm'ing already-recovered WAL files from the side.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


