> > A short test shows that opening the file with O_SYNC, and thus avoiding fsync(),
> > would cut the effective time needed to sync-write the xlog by more than half.
> > Of course, we would need to buffer >= 1 xlog page before each write (or commit)
> > to gain the full advantage.
>
> > prewrite 0 + write and fsync: 60.4 sec
> > sparse file + write with O_SYNC: 37.5 sec
> > no prewrite + write with O_SYNC: 36.8 sec
> > prewrite 0 + write with O_SYNC: 24.0 sec
>
> This seems odd. As near as I can tell, O_SYNC is simply a command to do
> fsync implicitly during each write call. It cannot save any I/O unless
> I'm missing something significant. Where is the performance difference
> coming from?
Yes, odd, but certainly very reproducible here.
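For reference, the two variants being compared look roughly like the sketch below. This is not the test program behind the timings above, only a minimal illustration assuming a POSIX system; the function names (write_all, log_write_fsync, log_write_osync) are made up for this example, and the prewrite/sparse-file variations are left out.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes, retrying on short writes. Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0)
            return -1;
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Variant A: plain open(), then an explicit fsync() after every block. */
int log_write_fsync(const char *path, const char *block, size_t len, int nblocks)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    for (int i = 0; i < nblocks; i++) {
        if (write_all(fd, block, len) != 0 || fsync(fd) != 0) {
            close(fd);
            return -1;
        }
    }
    return close(fd);
}

/* Variant B: open() with O_SYNC, so each write() is itself synchronous
 * and no separate fsync() call is made. */
int log_write_osync(const char *path, const char *block, size_t len, int nblocks)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
    if (fd < 0)
        return -1;
    for (int i = 0; i < nblocks; i++) {
        if (write_all(fd, block, len) != 0) {
            close(fd);
            return -1;
        }
    }
    return close(fd);
}
```

Both variants issue the same number of write() calls; the only difference is whether the sync is requested per call via the open flag or by a separate system call.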
> The reason I'm inclined to question this is that what we want is not an
> fsync per write but an fsync per transaction, and we can't easily buffer
> all of a transaction's XLOG writes...
Yes, that is something to consider, but it would probably be sufficient to buffer
1-3 optimal I/O blocks (32-256k here).
I assumed that with a few busy clients the amount written per fsync would come
close to one xlog page, but that is probably too little.
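To make the buffering idea concrete, here is a rough sketch of what I mean: accumulate log records in memory and hand them to the O_SYNC file descriptor in one larger write. Everything here is hypothetical (the LogBuffer struct, log_append, log_flush, and the 32k buffer size picked from the low end of the range above); it is not existing xlog code, and a real version would also have to flush at commit.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define LOG_BUF_SIZE (32 * 1024)   /* assumed size of one "optimal I/O block" */

typedef struct {
    int    fd;                  /* file descriptor opened with O_SYNC by the caller */
    size_t used;                /* bytes currently buffered */
    int    flushes;             /* number of flushes that actually wrote data */
    char   data[LOG_BUF_SIZE];
} LogBuffer;

/* Write out everything buffered so far. Returns 0 on success, -1 on error. */
int log_flush(LogBuffer *lb)
{
    size_t off = 0;
    while (off < lb->used) {
        ssize_t n = write(lb->fd, lb->data + off, lb->used - off);
        if (n < 0)
            return -1;
        off += (size_t) n;
    }
    if (lb->used > 0)
        lb->flushes++;
    lb->used = 0;
    return 0;
}

/* Append one record; flush first if it would not fit in the buffer. */
int log_append(LogBuffer *lb, const void *rec, size_t len)
{
    if (len > LOG_BUF_SIZE)
        return -1;              /* record larger than the buffer: not handled here */
    if (lb->used + len > LOG_BUF_SIZE && log_flush(lb) != 0)
        return -1;
    memcpy(lb->data + lb->used, rec, len);
    lb->used += len;
    return 0;
}
```

A commit would call log_flush() explicitly, so the number of synchronous writes is bounded by the number of commits plus the number of times the buffer fills.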
Andreas