> I see where you're going, and you could possibly make it work, but
> there are a bunch of problems. One objection is that kernel FDs
> are a very finite resource on a lot of platforms --- you don't really
> want to tie up one FD for every dirty buffer, and you *certainly*
> don't want to get into a situation where you can't release kernel
> FDs until end of xact. You might be able to get around that by
> associating the fsync-needed bit with VFDs instead of FDs.
This reminds me of how useful some kind of tablespace storage manager would be. It might not save us a single
byte on disk, and might even cost us some extra. But it would save file descriptors.
And if this storage manager worked with some number of preallocated blocks, it would be perfectly happy
with an fdatasync() instead of an fsync(). Some per-tablespace configurable options like initial number
of blocks, next extent size and percentage increase would be fine.
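
To make those three options concrete, here is a rough sketch of how extent sizes could be derived from them.
The struct, its field names and the growth rule (each extent after the first "next" one grows by the given
percentage) are my own assumptions for illustration, not an existing interface.

```c
#include <assert.h>

/* Hypothetical per-tablespace options; all names are illustrative only. */
typedef struct
{
    long initial_blocks;    /* blocks preallocated when the file is created */
    long next_blocks;       /* size of the first extent added after that */
    int  pct_increase;      /* growth of each further extent, in percent */
} TablespaceOpts;

/*
 * Size in blocks of extent number n, where extent 0 is the initial
 * allocation.  Assumed rule: extent 1 has next_blocks, and every later
 * extent is pct_increase percent larger than the one before it
 * (integer arithmetic, rounded down).
 */
long
extent_blocks(const TablespaceOpts *opts, int n)
{
    long size;
    int  i;

    assert(n >= 0);
    if (n == 0)
        return opts->initial_blocks;

    size = opts->next_blocks;
    for (i = 1; i < n; i++)
        size += size * opts->pct_increase / 100;
    return size;
}
```

With initial = 100, next = 50 and a 50 percent increase, the file would grow in steps of 100, 50 and 75 blocks.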
Before someone asks: the difference between fdatasync() and fsync() is that the first only forces modified
data blocks to be flushed to disk. An fsync() causes the inode to be flushed too, because at the very least
it has a new modification time. In our case, where writes to files can cause block allocations, it is a
requirement to flush the inode on modifications. But when dealing with a file whose blocks are already
allocated (no sparse holes and no writes past EOF), it is not that important. The only difference you might
see after a crash is a slightly different last modification time, and that really doesn't count.
The result of that difference is that a write()+fsync() nearly always causes head seeks on the disk
(unless the inode and the dirty blocks are on the same cylinder). In contrast, a series of write()+fdatasync()
calls for one and the same file, with all blocks close together, wouldn't. And isn't that what our backends
usually do?
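
As a minimal sketch of that pattern, assuming a POSIX system and an 8K block size: the file is filled with
zero blocks and fsync()ed once, so later overwrites of existing blocks change no allocation information and
can get away with fdatasync(). The function names and the BLCKSZ value here are made up for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L     /* for pwrite() and fdatasync() */
#endif

#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192                 /* assumed block size */

/*
 * Create the file with nblocks zero-filled blocks and flush data plus
 * inode once.  Returns an open file descriptor, or -1 on error.
 */
int
prealloc_file(const char *path, int nblocks)
{
    char zero[BLCKSZ];
    int  fd;
    int  i;

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    memset(zero, 0, sizeof(zero));
    for (i = 0; i < nblocks; i++)
    {
        if (write(fd, zero, BLCKSZ) != BLCKSZ)
        {
            close(fd);
            return -1;
        }
    }

    /* Blocks were allocated, so the inode has to reach the disk too. */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Overwrite one already allocated block.  The file size does not change,
 * so flushing only the data is enough.  Returns 0 on success, -1 on error.
 */
int
overwrite_block(int fd, int blkno, const char *buf)
{
    assert(blkno >= 0);
    if (pwrite(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
        return -1;
    return fdatasync(fd);
}
```

The caller has to make sure blkno stays inside the preallocated range. A write past EOF would extend the
file, and then we are back to needing the inode flushed.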
With immediate SCSI error reporting enabled on the disks, such a burst of write()+fdatasync() calls wouldn't
have such a big performance impact any more. In that case the fdatasync() call returns as soon as the
flushed blocks have reached the on-disk cache, without waiting until they are burned into the surface.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#========================================= wieck@debis.com (Jan Wieck) #