Re: fsync reliability

From Greg Smith
Subject Re: fsync reliability
Date
Msg-id 4DB0FB55.5010102@2ndQuadrant.com
In response to fsync reliability  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: fsync reliability  (Simon Riggs <simon@2ndQuadrant.com>)
Re: fsync reliability  (Daniel Farina <daniel@heroku.com>)
List pgsql-hackers
On 04/21/2011 04:26 AM, Simon Riggs wrote:
> However, that begs the question of what happens with WAL. At present,
> we do nothing to ensure that "the entry in the directory containing
> the file has also reached disk".
>    


Well, we do, but it's not obvious why that is unless you've stared at 
this for far too many hours.  A clear description of the possible issue 
you and Dan are raising showed up on LKML a few years ago:  
http://lwn.net/Articles/270891/

Here's the most relevant part, which directly addresses the WAL case:

"[fsync] is unsafe for write-ahead logging, because it doesn't really 
guarantee any _ordering_ for the writes at the hard storage level.  So 
aside from losing committed data, it can also corrupt structural 
metadata.  With ext3 it's quite easy to verify that fsync/fdatasync 
don't always write a journal entry.  (Apart from looking at the kernel 
code :-)

Just write some data, fsync(), and observe the number of writes in 
/proc/diskstats.  If the current mtime second _hasn't_ changed, the 
inode isn't written.  If you write data, say, 10 times a second to the 
same place followed by fsync(), you'll see a little more than 10 write 
I/Os, and less than 20."
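
For anyone who wants to repeat the experiment quoted above, here is a
rough sketch of it.  The device name and file path are placeholders
you would supply, the helper names are made up for illustration, and
any other activity on the same device will inflate the count, so
treat the result as indicative only.  The "writes completed" counter
is taken to be the eighth whitespace-separated field of each
/proc/diskstats line.

```python
import os
import time


def writes_completed(diskstats_text, device):
    """Return the 'writes completed' counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major minor name reads_completed reads_merged
        #         sectors_read ms_reading writes_completed ...
        if len(fields) >= 8 and fields[2] == device:
            return int(fields[7])
    raise KeyError(device)


def count_fsync_writes(path, device, rounds=10, interval=0.1):
    """Rewrite the same 512 bytes `rounds` times, fsync after each
    write, and return how many write I/Os the device reported."""
    with open("/proc/diskstats") as f:
        before = writes_completed(f.read(), device)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(rounds):
            os.pwrite(fd, b"x" * 512, 0)  # same place every time
            os.fsync(fd)
            time.sleep(interval)
    finally:
        os.close(fd)
    with open("/proc/diskstats") as f:
        after = writes_completed(f.read(), device)
    return after - before
```

Calling count_fsync_writes() with a file on an otherwise idle ext3
volume and the matching device name is the setup the quoted post
describes.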

There's a terrible hack suggested where you run fchmod to force the 
journal out in the next fsync that makes me want to track the poster 
down and shoot him, but this part raises a reasonable question.

The main issue he's complaining about here is a moot one for 
PostgreSQL.  If the WAL rewrites have been reordered but have not 
completed, the minute WAL replay hits the spot with a missing block the 
CRC32 will be busted and replay is finished.  The fact that he's 
assuming a database would have such a naive WAL implementation that it 
would corrupt the database if blocks are written out of order between 
fsync calls returning is one of the reasons this whole idea never got 
more traction--hard to get excited about a proposal whose fundamentals 
rest on an assumption that doesn't turn out to be true on real databases.
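
To make the CRC argument concrete, here is a toy log format, not
PostgreSQL's actual WAL record layout.  Every name in it is invented
for illustration.  Each record carries a length and a CRC32 of its
payload, and replay stops at the first record that fails the check,
so a block that never reached disk ends replay instead of feeding
garbage into the database.

```python
import struct
import zlib

# Toy header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def append_record(log, payload):
    """Return the log bytes with one more checksummed record."""
    return log + HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log):
    """Return the payloads that verify, stopping at the first bad one."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        if length == 0:
            break  # zero length marks unwritten space in this toy format
        start = offset + HEADER.size
        payload = log[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or missing block: replay ends here
        records.append(payload)
        offset = start + length
    return records
```

If the middle of such a log is zeroed out, as if that block had been
reordered and lost, replay() returns only the records before the
hole, even when later records are intact.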

There's still the "fsync'd a data block but not the directory entry yet" 
issue as fall-out from this too.  Why doesn't PostgreSQL run into this 
problem?  Because the exact code sequence used is this one:

open
write
fsync
close

And Linux shouldn't ever screw that up, or the similar rename path.  
Here's what the close man page says, from 
http://linux.die.net/man/2/close :

"A successful close does not guarantee that the data has been 
successfully saved to disk, as the kernel defers writes. It is not 
common for a filesystem to flush the buffers when the stream is closed. 
If you need to be sure that the data is physically stored use fsync(2). 
(It will depend on the disk hardware at this point.)"

What this is alluding to is that if you fsync before closing, the close 
will write all the metadata out too.  You're busted if your write cache 
lies, but we already know all about that issue.
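
The open, write, fsync, close sequence named above can be sketched
like this.  It is an illustration using Python's thin wrappers around
the system calls, not PostgreSQL's code, and write_durably is a
made-up name.  Whether the directory entry is on disk once it returns
is exactly the filesystem behavior under discussion here, so the
sketch shows the call order and nothing more.

```python
import os


def write_durably(path, data):
    """open, write, fsync, close, in that order."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than asked; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # flush the file's data before the descriptor goes away
    finally:
        os.close(fd)
```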

There was a discussion of issues around this on LKML a few years ago, 
with Alan Cox getting the good pull quote at 
http://lkml.org/lkml/2009/3/27/268 : "fsync/close() as a pair allows the 
user to correctly indicate their requirements."  While fsync doesn't 
guarantee that metadata is written out, and neither does close, kernel 
developers seem to all agree that fsync-before-close means you want 
everything on disk.  Filesystems that don't honor that will break all 
sorts of software.

It is of course possible there are bugs in some part of this code path, 
where a clever enough test case might expose a window of strange 
file/metadata ordering.  I think it's too weak of a theorized problem to 
go specifically chasing after though.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us



