Обсуждение: InstallXLogFileSegment() vs concurrent WAL flush
Hi, New WAL space is created by renaming a file into place. Either a newly created file with a temporary name or, ideally, a recyclable old file with a name derived from an old LSN. I think there is a data loss window between rename() and fsync(parent_directory). A concurrent backend might open(new_name), write(), fdatasync(), and then we might lose power before the rename hits the disk. The data itself would survive the crash, but recovery wouldn't be able to find and replay it. That might break the log-before-data rule or forget a transaction that has been reported as committed to a client. Actual breakage would presumably require really bad luck, and I haven't seen this happen or anything, it just occurred to me while reading code, and I can't see any existing defences. One simple way to address that would be to make XLogFileInitInternal() wait for InstallXLogFileSegment() to finish. It's a little pessimistic to do that unconditionally, though, as then you have to wait even for rename operations for segment files later than the one you're opening, so I thought about how to skip waiting in that case -- see 0002. I'm not sure if it's worth worrying about or not.
Вложения
On Fri, 2 Feb 2024 11:18:18 +0100 Thomas Munro <thomas.munro@gmail.com> wrote: > Hi, > > New WAL space is created by renaming a file into place. Either a > newly created file with a temporary name or, ideally, a recyclable old > file with a name derived from an old LSN. I think there is a data > loss window between rename() and fsync(parent_directory). A > concurrent backend might open(new_name), write(), fdatasync(), and > then we might lose power before the rename hits the disk. The data > itself would survive the crash, but recovery wouldn't be able to find > and replay it. That might break the log-before-data rule or forget a > transaction that has been reported as committed to a client. > > Actual breakage would presumably require really bad luck, and I > haven't seen this happen or anything, it just occurred to me while > reading code, and I can't see any existing defences. > > One simple way to address that would be to make XLogFileInitInternal() > wait for InstallXLogFileSegment() to finish. It's a little Or, can we make sure the rename is durable by calling fsync before returning the fd, as a patch attached here? Regards, Yugo Nagata > pessimistic to do that unconditionally, though, as then you have to > wait even for rename operations for segment files later than the one > you're opening, so I thought about how to skip waiting in that case -- > see 0002. I'm not sure if it's worth worrying about or not. -- Yugo NAGATA <nagata@sraoss.co.jp>
Вложения
On Fri, Feb 2, 2024 at 12:56 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote: > On Fri, 2 Feb 2024 11:18:18 +0100 > Thomas Munro <thomas.munro@gmail.com> wrote: > > One simple way to address that would be to make XLogFileInitInternal() > > wait for InstallXLogFileSegment() to finish. It's a little > > Or, can we make sure the rename is durable by calling fsync before > returning the fd, as a patch attached here? Right, yeah, that works too. I'm not sure which way is better.
At Fri, 2 Feb 2024 14:42:46 +0100, Thomas Munro <thomas.munro@gmail.com> wrote in > On Fri, Feb 2, 2024 at 12:56 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote: > > On Fri, 2 Feb 2024 11:18:18 +0100 > > Thomas Munro <thomas.munro@gmail.com> wrote: > > > One simple way to address that would be to make XLogFileInitInternal() > > > wait for InstallXLogFileSegment() to finish. It's a little > > > > Or, can we make sure the rename is durable by calling fsync before > > returning the fd, as a patch attached here? > > Right, yeah, that works too. I'm not sure which way is better. I'm not sure I like issuing spurious syncs unconditionally. Therefore, I prefer Thomas' approach in that regard. 0002 would be beneficial, considering the case of a very large max_wal_size, and the code seems to be the minimal required. I don't think it matters that the lock attempts occur uselessly until the first segment installation. That being said, we could avoid it by initializing last_known_installed_segno properly. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi all, On Tue, Jan 13, 2026 at 3:11 PM Soumyadeep Chakraborty <soumyadeep2007@gmail.com> wrote: > > On Mon, Feb 5, 2024 at 11:58 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > At Fri, 2 Feb 2024 14:42:46 +0100, Thomas Munro <thomas.munro@gmail.com> wrote in > > > On Fri, Feb 2, 2024 at 12:56 PM Yugo NAGATA <nagata@sraoss.co.jp> wrote: > > > > On Fri, 2 Feb 2024 11:18:18 +0100 > > > > Thomas Munro <thomas.munro@gmail.com> wrote: > > > > > One simple way to address that would be to make XLogFileInitInternal() > > > > > wait for InstallXLogFileSegment() to finish. It's a little > > > > > > > > Or, can we make sure the rename is durable by calling fsync before > > > > returning the fd, as a patch attached here? > > > > > > Right, yeah, that works too. I'm not sure which way is better. > > > > I'm not sure I like issuing spurious syncs unconditionally. Therefore, > > I prefer Thomas' approach in that regard. 0002 would be beneficial, > > considering the case of a very large max_wal_size, and the code seems > > to be the minimal required. I don't think it matters that the lock > > attempts occur uselessly until the first segment installation. That > > being said, we could avoid it by initializing > > last_known_installed_segno properly. > > I agree with avoiding unconditional fsyncs as well. 0002 seems safer in terms > of avoiding contention. I attached a patch initing the > last_known_installed_segno to the segno of the starting checkpoint. I added > some more commentary too. > > I added a comment that I'm not 100% sure about for durable_rename(). It was my > attempt at making other users of durable_rename() aware of a generalization of > the issue we are trying to tackle here. > > Are there other uses of durable_rename() that can suffer a similar issue? For > instance, relmapper seems safe because we rely on RelationMappingLock. However, > does the wal summarizer? > > Generally speaking, it's bad that durable_rename() has this window that makes > it less atomic than it seems. Ideally, we could have held some generic lock to > prevent that perhaps. Like have some file API within fd.c to grab a lock to > open a rename-able file, with the same LWLock being held during a > durable_rename()? Guess it would invite more contention though, and surrounding > code paths are already holding other LWLocks, which we may not be able to get > rid of. > I reviewed the patch and applied it on my current branch tree with assertions enabled and ran the standard regression tests. All tests passed successfully. I went through the ordering between InstallXLogFileSegment() and XLogFileInitInternal(). According to me, It seems where a WAL is written and fdatasync’d before the corresponding WAL segment name now becomes durable. Holding ControlFileLock until the durable rename is completed and waiting on the same lock before allowing WAL writes, appears sufficient to close the rename/fsync window described earlier. And also I reviewed the use of last_known_installed_segno to avoid unnecessary locking, initializing it from the checkpoint segment provides a conservative lower bound that behaves safely across startup and restart. Overall, this approach seems to close the recovery visibility gap and preserves the WAL before data guarantee under crash scenarios. Regards, Soumya