fsync alternatives (was: Re: [HACKERS] TODO item)

From Alfred Perlstein
Subject fsync alternatives (was: Re: [HACKERS] TODO item)
Date
Msg-id 20000207103646.A25520@fw.wintelcom.net
In reply to Re: [HACKERS] TODO item  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: fsync alternatives (was: Re: [HACKERS] TODO item)  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
* Bruce Momjian <pgman@candle.pha.pa.us> [000207 10:14] wrote:
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > > Don't tell me we fsync on every buffer write, and not just at
> > > transaction commit?  That is terrible.
> > 
> > If you don't have -F set, yup.  Why did you think fsync mode was
> > so slow?
> > 
> > > What if we set a flag on the file descriptor stating we dirtied/wrote
> > > one of its buffers during the transaction, and cycle through the file
> > > descriptors on buffer commit and fsync all involved in the transaction. 
> > 
> > That's exactly what Tatsuo was describing, I believe.  I think Hiroshi
> > has pointed out a serious problem that would make it unreliable when
> > multiple backends are running: if some *other* backend fwrites the page
> > instead of your backend, and it doesn't fsync until *its* transaction is
> > done (possibly long after yours), then you lose the ordering guarantee
> > that is the point of the whole exercise...
> 
> OK, I understand now.  You are asking: if my backend dirties a buffer,
> but another backend does the write, would my backend's fsync() flush
> the buffer that the other backend wrote?
> 
> I can't imagine how fsync could flush _only_ the file descriptor buffers
> modified by the current process.  It would have to affect all buffers
> for the file descriptor.
> 
> BSDI says:
> 
>      Fsync() causes all modified data and attributes of fd to be moved to a
>      permanent storage device.  This normally results in all in-core modified
>      copies of buffers for the associated file to be written to a disk.
> 
> Looking at the BSDI kernel, there is a user-mode file descriptor table,
> which maps to a kernel file descriptor table.  A kernel table entry can
> be shared, so one open file can be referenced by several descriptors, as
> after a fork() call.  The kernel table maps to the inode/vnode of the file.
> The only thing that is kept in the file descriptor table is the current
> offset in the file (struct file in BSD).  There is no mapping of who
> wrote which blocks.
> 
> In fact, I would suggest that any kernel implementation that could track
> such things would be pretty broken.  I can imagine some cases the use of
> that mapping of blocks to file descriptors would cause compatibility
> problems.  Those buffers have to be shared by all processes.
> 
> So, I think we are safe if we can either keep that file descriptor open
> until commit, or re-open it and fsync it on commit.  That assumes a
> re-open hits the same file.  My opinion is that we should just
> fsync it on close and not worry about a reopen.

I'm pretty sure that the standard is that a close on a file _should_
fsync it.

In re the fsync problems...

While investigating how to implement a range fsync() for FreeBSD, I came
across an open flag that may help here: 'O_FSYNC'/'O_SYNC'.

Why not keep 2 file descriptors open for each datafile, one opened
with O_FSYNC (exists but not documented in FreeBSD) and one normal?
The flag guarantees synchronous writes for all write operations on that fd.

Most Unices offer an open flag for this type of access, although the name
varies (Linux/Solaris use O_SYNC afaik).

When a sync write is needed, use that file descriptor to do the writing,
and use the normal one for non-sync writes.

This would fix the problem where another backend causes an out-of-order
or unsafe fsync to occur.
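
A minimal sketch of the two-descriptor idea, assuming a POSIX system that
provides O_SYNC (with O_FSYNC as the BSD spelling). The DualFd type and the
dualfd_* helpers are hypothetical names made up for illustration; nothing
here is taken from the backend code.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Some BSDs spell the flag O_FSYNC; fall back to it if O_SYNC is missing. */
#ifndef O_SYNC
#define O_SYNC O_FSYNC
#endif

/* Hypothetical helper type: two descriptors on the same datafile. */
typedef struct {
    int fd_async;   /* ordinary descriptor: writes may linger in the cache */
    int fd_sync;    /* opened with O_SYNC: write returns once data is stable */
} DualFd;

/* Open the datafile twice, once normally and once with O_SYNC. */
static int dualfd_open(DualFd *d, const char *path)
{
    assert(d != NULL);
    d->fd_async = open(path, O_RDWR | O_CREAT, 0600);
    if (d->fd_async < 0)
        return -1;
    d->fd_sync = open(path, O_RDWR | O_SYNC);
    if (d->fd_sync < 0) {
        close(d->fd_async);
        return -1;
    }
    return 0;
}

/* Write through the sync descriptor only when the caller asks for it. */
static ssize_t dualfd_write(DualFd *d, const void *buf, size_t len,
                            off_t off, int must_sync)
{
    return pwrite(must_sync ? d->fd_sync : d->fd_async, buf, len, off);
}

static void dualfd_close(DualFd *d)
{
    close(d->fd_sync);
    close(d->fd_async);
}
```

A caller would pass must_sync = 1 for writes that have to be on disk before
it proceeds and 0 for everything else. Note that the sync write only covers
the data written through that call, not pages another process left dirty.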

Another option is using mmap() and msync() to achieve the same effect. The
only problem with mmap() is that under most i386 systems you are limited
to a < 4gig (2gig with FreeBSD) mapping that would have to be 'windowed'
over the datafiles. However, depending on the locality of accesses, this
may be much more efficient than read/write semantics.
Not to mention that a lot of Unices have broken mmap() implementations
and problems with merged vm/buffercache.
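
A rough sketch of the windowed mmap()/msync() approach, assuming a POSIX
system and a file that already extends past the region being written
(touching a mapping beyond end-of-file raises SIGBUS). The function name
window_write_sync is hypothetical, and a real implementation would likely
keep windows mapped rather than map and unmap on every write.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy len bytes to file offset off through a
 * temporary mapped window, then force them to disk with msync().
 * The caller must make sure the file is at least off + len bytes long.
 */
static int window_write_sync(int fd, off_t off, const void *src, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    if (off < 0 || len == 0)
        return -1;

    /* mmap() offsets must be page-aligned, so start the window early. */
    off_t win_start = off - (off % page);
    size_t lead = (size_t)(off - win_start);
    size_t win_len = lead + len;

    char *win = mmap(NULL, win_len, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, win_start);
    if (win == MAP_FAILED)
        return -1;

    memcpy(win + lead, src, len);

    /* MS_SYNC blocks until the dirty pages in the window are written. */
    int rc = msync(win, win_len, MS_SYNC);
    if (munmap(win, win_len) != 0)
        rc = -1;
    return rc;
}
```

Only the pages inside the window are flushed by the msync() call, which is
what makes this resemble a range fsync().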

Yes, I haven't looked at the backend code; I'm just hoping to offer some
useful suggestions.

-Alfred

