Discussion: Re: [HACKERS] TODO item

Re: [HACKERS] TODO item

From
wieck@debis.com (Jan Wieck)
Date:
> I see where you're going, and you could possibly make it work, but
> there are a bunch of problems.  One objection is that kernel FDs
> are a very finite resource on a lot of platforms --- you don't really
> want to tie up one FD for every dirty buffer, and you *certainly*
> don't want to get into a situation where you can't release kernel
> FDs until end of xact.  You might be able to get around that by
> associating the fsync-needed bit with VFDs instead of FDs.
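
The VFD idea from the quoted text could look roughly like the sketch below. It is only an illustration, not the backend's actual fd.c code: the `Vfd` struct and the `vfd_write`, `vfd_release` and `vfd_sync_all` functions are hypothetical names. It also assumes that an fsync() on a reopened descriptor flushes data written earlier through a descriptor that has since been closed, which should be checked per platform.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical virtual file descriptor: the path outlives the kernel FD. */
typedef struct {
    const char *path;      /* file name, kept so the file can be reopened */
    int         fd;        /* kernel FD, or -1 while it is released */
    int         fsync_needed; /* set once data was written through this VFD */
} Vfd;

/* Append through the VFD and remember that it needs an fsync later. */
ssize_t vfd_write(Vfd *v, const void *buf, size_t len)
{
    ssize_t n;

    assert(v != NULL && v->path != NULL);
    if (v->fd < 0) {
        v->fd = open(v->path, O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (v->fd < 0)
            return -1;
    }
    n = write(v->fd, buf, len);
    if (n >= 0)
        v->fsync_needed = 1;
    return n;
}

/* Give the kernel FD back; the fsync-needed bit stays with the VFD. */
void vfd_release(Vfd *v)
{
    if (v->fd >= 0) {
        close(v->fd);
        v->fd = -1;
    }
}

/* At end of transaction: fsync every flagged VFD, reopening it if needed.
 * Returns the number of files synced, or -1 on error. */
int vfd_sync_all(Vfd *tab, int n)
{
    int i, synced = 0;

    for (i = 0; i < n; i++) {
        Vfd *v = &tab[i];

        if (!v->fsync_needed)
            continue;
        if (v->fd < 0 && (v->fd = open(v->path, O_WRONLY)) < 0)
            return -1;
        if (fsync(v->fd) != 0)
            return -1;
        v->fsync_needed = 0;
        synced++;
    }
    return synced;
}
```

With this layout a kernel FD can be released at any time, and only one fsync() per dirty file is issued at commit, however many buffers were written.
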
    That reminds me of the usefulness of some kind of tablespace
    storage manager. It might not save us a single byte on disk,
    and might even cost us some extra. But it would save file
    descriptors.
 
    And if this storage manager worked with some amount of
    preallocated blocks, it would be perfectly happy with an
    fdatasync() instead of an fsync(). Some per-tablespace
    configurable options, like initial number of blocks, next
    extent size and percentage increase, would be fine.
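
A minimal sketch of what such options and preallocation might look like. Everything here is an assumption made for illustration: the `TablespaceOptions` struct, the function names, the 8192-byte `BLOCK_SIZE`, and the reading of "percentage increase" as growth applied to each extent after the second.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 8192   /* assumed block size in bytes */

/* Hypothetical per-tablespace options, named after the ones listed above. */
typedef struct {
    unsigned initial_blocks;  /* size of the first extent, in blocks */
    unsigned next_blocks;     /* size of the second extent, in blocks */
    unsigned pct_increase;    /* growth of each later extent, in percent */
} TablespaceOptions;

/* Size in blocks of extent number n (0 is the initial extent). */
unsigned extent_blocks(const TablespaceOptions *opt, unsigned n)
{
    unsigned size, i;

    if (n == 0)
        return opt->initial_blocks;
    size = opt->next_blocks;
    for (i = 1; i < n; i++)
        size += size * opt->pct_increase / 100;
    return size;
}

/* Create a file of nblocks zero-filled blocks and fsync() it once, so the
 * block allocations and the file size are on disk before normal use.
 * Returns an open descriptor, or -1 on error. */
int create_preallocated(const char *path, unsigned nblocks)
{
    char     zeros[BLOCK_SIZE];
    unsigned i;
    int      fd;

    assert(nblocks > 0);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    memset(zeros, 0, sizeof zeros);
    for (i = 0; i < nblocks; i++) {
        if (write(fd, zeros, sizeof zeros) != (ssize_t) sizeof zeros) {
            close(fd);
            return -1;
        }
    }
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Overwrite one already-allocated block and flush only the data. */
int rewrite_block(int fd, unsigned blockno, const void *buf)
{
    off_t off = (off_t) blockno * BLOCK_SIZE;

    if (pwrite(fd, buf, BLOCK_SIZE, off) != (ssize_t) BLOCK_SIZE)
        return -1;
    return fdatasync(fd);
}
```

The single fsync() happens when the extent is created; later overwrites inside the extent call only fdatasync().
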
 
    Before someone asks: the difference between an fdatasync() and
    an fsync() is that the former only forces modified data blocks
    to be flushed to disk. An fsync() causes the inode to be
    flushed too, because at the very least it has a new
    modification time. In our case, where writes to files can cause
    block allocations, flushing the inode on modification is a
    requirement. But when dealing with a file whose blocks are
    already allocated (no null faking or writing beyond EOF), it is
    not that important. The only difference you might see after a
    crash is a slightly different last modification time, and that
    really doesn't count.
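
The rule described above can be sketched as a helper that picks the sync call depending on whether a write extends the file. This is an illustration under assumptions: `write_grows_file` and `write_and_flush` are hypothetical names, and the file is assumed to contain no holes, so a write inside the current size allocates no new blocks.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>      /* open(), for callers of these helpers */
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if writing len bytes at offset would extend a file that is
 * currently file_size bytes long, 0 otherwise. */
int write_grows_file(off_t file_size, off_t offset, size_t len)
{
    return offset + (off_t) len > file_size;
}

/* Write the buffer, then flush it:
 *   - with fsync() if the file grew (the inode must reach disk too),
 *   - with fdatasync() if only existing blocks were overwritten.
 * Returns 1 if fsync() was used, 0 if fdatasync() was used, -1 on error. */
int write_and_flush(int fd, const void *buf, size_t len, off_t offset)
{
    struct stat st;
    int         grows;

    assert(fd >= 0);
    if (fstat(fd, &st) != 0)
        return -1;
    grows = write_grows_file(st.st_size, offset, len);
    if (pwrite(fd, buf, len, offset) != (ssize_t) len)
        return -1;
    if (grows)
        return fsync(fd) == 0 ? 1 : -1;
    return fdatasync(fd) == 0 ? 0 : -1;
}
```
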
 
    The result of that difference is that a write()+fsync() nearly
    always causes head seeks on the disk (unless the inode and the
    dirty blocks are on the same cylinder). In contrast, a series
    of write()+fdatasync() calls for one and the same file, with
    all blocks close together, wouldn't. And isn't that what our
    backends usually do?
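
Whether this makes a difference on a given platform can be measured. The sketch below times a burst of block rewrites, each followed by either fsync() or fdatasync(), on a preallocated file. The function name `time_sync_loop` and the 8192-byte block size are assumptions, and the result depends on the kernel, filesystem and disk.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE 8192   /* assumed block size in bytes */

/* Preallocate nblocks blocks in path, then rewrite them one by one,
 * calling fdatasync() (use_fdatasync != 0) or fsync() after each write.
 * Returns the elapsed seconds of the rewrite loop, or -1.0 on error.
 * The file is removed afterwards. */
double time_sync_loop(const char *path, unsigned nblocks, int use_fdatasync)
{
    char            block[BLOCK_SIZE];
    struct timespec t0, t1;
    unsigned        i;
    int             fd, rc = 0;

    assert(nblocks > 0);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;

    /* Preallocation: all blocks exist before the timed part starts. */
    memset(block, 0, sizeof block);
    for (i = 0; i < nblocks && rc == 0; i++)
        if (write(fd, block, sizeof block) != (ssize_t) sizeof block)
            rc = -1;
    if (rc == 0)
        rc = fsync(fd);

    /* Timed part: overwrite adjacent blocks, syncing after each one. */
    memset(block, 'x', sizeof block);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < nblocks && rc == 0; i++) {
        off_t off = (off_t) i * BLOCK_SIZE;

        if (pwrite(fd, block, sizeof block, off) != (ssize_t) sizeof block)
            rc = -1;
        else
            rc = use_fdatasync ? fdatasync(fd) : fsync(fd);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    close(fd);
    unlink(path);
    if (rc != 0)
        return -1.0;
    return (double) (t1.tv_sec - t0.tv_sec)
         + (double) (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling it once with `use_fdatasync` set to 0 and once with 1, on the same filesystem, gives the two numbers to compare.
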
 
    With immediate SCSI error reporting enabled on the disks, such
    a burst of write()+fdatasync() calls wouldn't have such a big
    performance impact any more. In that case, the fdatasync() call
    returns as soon as the flushed blocks have reached the on-disk
    cache, without waiting until they are burned into the surface.
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #



Re: [HACKERS] TODO item

From
Peter Eisentraut
Date:
On Tue, 8 Feb 2000, Jan Wieck wrote:

>     And if this storage manager would work with  some  amount  of
>     preallocated  blocks,  it  would  be  totally  happy  with  a
>     fdatasync()  instead  of  a  fsync().   Some  per  tablespace
>     configurable  options  like  initial  number  of blocks, next
>     extent size and percentage increase would be fine.

On Linux, fdatasync() does exactly the same as fsync(). On FreeBSD (3.4),
fdatasync() isn't even documented and I can't find it in any of the
include files either. What I'm saying is that for the vast majority of our
users this would most likely buy exactly nothing. I just wanted to point
that out, not dismiss the idea.
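
Since fdatasync() is not available everywhere, any use of it would need a fallback. A minimal sketch, assuming a build-time macro that is defined only where the call exists: `HAVE_FDATASYNC`, `data_sync` and `overwrite_in_place` are names chosen for this example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Flush file data. HAVE_FDATASYNC is assumed to be set by the build
 * system on platforms that provide fdatasync(); elsewhere fall back
 * to fsync(), which is always at least as strong. */
int data_sync(int fd)
{
    assert(fd >= 0);
#ifdef HAVE_FDATASYNC
    return fdatasync(fd);
#else
    return fsync(fd);
#endif
}

/* Overwrite the start of an existing file and flush the data.
 * The file is not created or extended here, so no new blocks are
 * allocated as long as len does not exceed the current file size.
 * Returns 0 on success, -1 on error. */
int overwrite_in_place(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = (pwrite(fd, buf, len, 0) == (ssize_t) len) ? data_sync(fd) : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

On a platform where the two calls behave identically, the wrapper costs nothing and gains nothing, which matches the point made above.
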


-- 
Peter Eisentraut                  Sernanders vaeg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden