On Thu, 2 Apr 2009, Scott Carey wrote:
> The big one, is this quote from the linux kernel list:
> " Right now, if you want a reliable database on Linux, you _cannot_
> properly depend on fsync() or fdatasync(). Considering how much Linux
> is used for critical databases, using these functions, this amazes me.
> "
Things aren't as bad as that out of context quote makes them seem. There
are two main problem situations here:
1) You cannot trust Linux to flush data to a hard drive's write cache.
Solution: turn off the write cache. Given the general poor state of
targeted fsync on Linux (quoting from a downthread comment by David Lang:
"in data=ordered mode, the default for most distros, ext3 can end up
having to write all pending data when you do a fsync on one file"), those
fsyncs were likely to blow out the drive cache anyway.
2) There are no hard guarantees about write ordering at the disk level; if
you write blocks ABC and then fsync, you might actually get, say, only B
written before power goes out. I don't believe the PostgreSQL WAL design
will be corrupted by this particular situation, because until that fsync
comes back saying all 3 are done none of them are relied upon.
> Interestingly, postgres would be safer on linux if it used
> sync_file_range instead of fsync() but that has other drawbacks and
> limitations
I have thought about whether it would be possible to add a Linux-specific
improvement here into the code path that does something custom in this
area for Windows/Mac OS X when you use fsync_method=fsync_writethrough
We really should update the documentation in this area before 8.4 ships.
I'm looking into moving the "Tuning PostgreSQL WAL Synchronization" paper
I wrote onto the wiki and then fleshing it out with all this
filesystem-specific trivia.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD