Tatsuo Ishii <t-ishii@sra.co.jp> writes:
>> possibly fix #2 by having transaction commit invoke the pg_fsync_pending
>> scan before it updates pg_log (and then fsyncing pg_log itself again
>> after).
> I do not understand #2. I call pg_fsync_pending twice in
> RecordTransactionCommit: once after FlushBufferPool, and again
> after TransactionIdCommit and FlushBufferPool. Or am I missing
> something?
Oh, OK. That's what I meant. The snippet you posted didn't show where
you were calling the fsync routine from.
> I thought about that too. If the ordering was that important, a
> database managed by backends with -F on could be seriously
> corrupted. I've never heard of such disasters caused by -F.
This is why I think that fsync actually offers very little extra
protection ;-)
> BTW, Hiroshi has pointed out an excellent point to me, #3:
>> This backend has to force the flush of a free buffer
>> page. Unfortunately the page was dirtied by the
>> above operation of Session-1, so this backend calls pg_fsync()
>> for table A. However, that fsync() is postponed until
>> this backend commits.
>>
>> Session-1
>> commit;
>> There's no dirty buffer page for the table A.
>> So pg_fsync() isn't called for the table A.
Oooh, right. Backend A dirties the page, but leaves it sitting in
a shared buffer. Backend B needs the buffer space, so it does the
write of the page. Now if backend A wants to commit, it can fsync
everything it's written --- but does that guarantee the page that
was actually written by B will get flushed to disk? Not sure.
If the pending-fsync logic is based on either physical fds or vfds
then it definitely *won't* work; A might have found the desired page
sitting in buffer cache to begin with, and never have opened the
underlying file at all!
So it seems you would need to keep a list of all the relation files (and
segments) you've written to in the current xact, and open and fsync each
one just before writing/fsyncing pg_log. Even then, you're assuming
that fsync applied to a file via an fd belonging to one backend will
flush disk buffers written to the same file via *other* fds belonging
to *other* processes. I'm not sure that that is true on all Unixes...
heck, I'm not sure it's true on any. The fsync(2) man page here isn't
real specific.
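A minimal sketch of the bookkeeping described above, assuming a small
fixed-size per-backend table of file paths. The names remember_dirty_file
and fsync_pending_files are hypothetical, not existing backend routines.
The sketch also relies on exactly the assumption questioned above: that
fsync through a freshly opened fd flushes blocks written to the same file
by other processes.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define MAX_PENDING  64     /* illustrative limit, not a real backend constant */
#define MAX_PATH_LEN 256

/* Files this transaction has dirtied, tracked by path rather than by fd. */
static char pending_paths[MAX_PENDING][MAX_PATH_LEN];
static int  pending_count = 0;

/*
 * Remember that the current transaction dirtied the file at 'path'.
 * Duplicates are ignored. Returns 0 on success, -1 if the table is
 * full or the path is too long.
 */
int
remember_dirty_file(const char *path)
{
    int i;

    assert(path != NULL);
    for (i = 0; i < pending_count; i++)
        if (strcmp(pending_paths[i], path) == 0)
            return 0;
    if (pending_count >= MAX_PENDING || strlen(path) >= MAX_PATH_LEN)
        return -1;
    strcpy(pending_paths[pending_count++], path);
    return 0;
}

/*
 * Open and fsync every remembered file, meant to run just before the
 * commit record is written. Returns the number of files synced, or -1
 * on the first failure (the list is then left intact for the caller).
 */
int
fsync_pending_files(void)
{
    int i, synced = 0;

    for (i = 0; i < pending_count; i++)
    {
        int fd = open(pending_paths[i], O_RDWR);

        if (fd < 0)
            return -1;
        if (fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        close(fd);
        synced++;
    }
    pending_count = 0;
    return synced;
}
```

Tracking by path rather than by fd or vfd sidesteps the case where A never
opened the file at all. It does nothing for the cross-process question.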
regards, tom lane