Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Anthony Iliopoulos
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:
Msg-id: 20180402230543.GO11627@technoir
In reply to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Stephen Frost <sfrost@snowman.net>)
Responses: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Andres Freund <andres@anarazel.de>)
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Craig Ringer <craig@2ndquadrant.com>)
List: pgsql-hackers
Hi Stephen,

On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:
>
> fsync() doesn't reflect the status of given pages, however, it reflects
> the status of the file descriptor, and as such the file, on which it's
> called.  This notion that fsync() is actually only responsible for the
> changes which were made to a file since the last fsync() call is pure
> foolishness.  If we were able to pass a list of pages or data ranges to
> fsync() for it to verify they're on disk then perhaps things would be
> different, but we can't, all we can do is ask to "please flush all the
> dirty pages associated with this file descriptor, which represents this
> file we opened, to disk, and let us know if you were successful."
>
> Give us a way to ask "are these specific pages written out to persistent
> storage?" and we would certainly be happy to use it, and to repeatedly
> try to flush out pages which weren't synced to disk due to some
> transient error, and to track those cases and make sure that we don't
> incorrectly assume that they've been transferred to persistent storage.

Indeed, fsync() is simply a rather blunt instrument and a narrow legacy
interface, but further changing its established semantics (no matter how
unreasonable they may be) is probably not the way to go.

Would using sync_file_range() be helpful? Potential errors would only
apply to pages that cover the requested file ranges. There are a few
caveats though (a sketch follows the list):

(a) it still messes with the top-level error reporting, so mixing it
with callers that use fsync() and do care about errors will produce
the same issue (the error status gets cleared).

(b) the error-reporting granularity is coarse (failure reporting applies
to the entire requested range, so you still don't know which particular
pages or file sub-ranges failed writeback).

(c) the same "report and forget" semantics apply to repeated invocations
of the sync_file_range() call, so again action will need to be taken
upon the first error encountered for the particular ranges.
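
To make this concrete, below is a minimal sketch (in C, Linux-only) of
a range-limited flush; sync_range() is a made-up wrapper rather than an
existing API, and all the caveats above still apply:

    #define _GNU_SOURCE             /* for sync_file_range() on Linux */
    #include <fcntl.h>
    #include <sys/types.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Initiate and wait for writeback of a specific file range.
     * WAIT_BEFORE | WRITE | WAIT_AFTER means: wait for any writeback
     * already in flight, start writeback for the range, then wait for
     * it to complete, so errors surface here. Note that this flushes
     * neither file metadata nor the disk write cache, so it is not a
     * durability guarantee on its own.
     */
    static int
    sync_range(int fd, off64_t offset, off64_t nbytes)
    {
        if (sync_file_range(fd, offset, nbytes,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        {
            /*
             * Caveats (b) and (c): the failure applies to the whole
             * range and is cleared once reported, so it must be acted
             * upon here rather than on a retry.
             */
            fprintf(stderr, "sync_file_range: %s\n", strerror(errno));
            return -1;
        }
        return 0;
    }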

> > The application will need to deal with that first error irrespective of
> > subsequent return codes from fsync(). Conceptually every fsync() invocation
> > demarcates an epoch for which it reports potential errors, so the caller
> > needs to take responsibility for that particular epoch.
> 
> We do deal with that error- by realizing that it failed and later
> *retrying* the fsync(), which is when we get back an "all good!
> everything with this file descriptor you've opened is sync'd!" and
> happily expect that to be truth, when, in reality, it's an unfortunate
> lie and there are still pages associated with that file descriptor which
> are, in reality, dirty and not sync'd to disk.

It really turns out that this is not how the fsync() semantics work,
though, exactly because of the nature of the errors: even if the kernel
retained the dirty bits on the failed pages, retrying to persist them
to the same disk location would simply fail again. Instead, the kernel
opts for marking those pages clean (since there is no other recovery
strategy) and reporting the failure once to the caller, who can
potentially deal with it in some manner. It is, sadly, a bad and
undocumented convention.
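
As a minimal sketch of the consequence (assuming the report-and-clear
behaviour just described), a caller has to act on the first failure
instead of retrying; fsync_or_die() here is a hypothetical wrapper:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * The first failed fsync() is the only reliable report for the
     * current epoch: the kernel has already marked the failed pages
     * clean, so a retry would likely return 0 without persisting
     * anything. Escalate instead of retrying, and recover from the
     * last known-durable state (e.g. by crash recovery from a WAL).
     */
    static void
    fsync_or_die(int fd)
    {
        if (fsync(fd) != 0)
        {
            fprintf(stderr, "fsync: %s -- treating epoch as lost\n",
                    strerror(errno));
            abort();
        }
    }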

> Consider two independent programs where the first one writes to a file
> and then calls the second one whose job it is to go out and fsync(),
> perhaps async from the first, those files.  Is the second program
> supposed to go write to each page that the first one wrote to, in order
> to ensure that all the dirty bits are set so that the fsync() will
> actually return if all the dirty pages are written?

I think what you have in mind are the semantics of sync() rather
than fsync(), but as long as an application needs to ensure data
are persisted to storage, it needs to retain those data in its heap
until fsync() is successful, instead of discarding them after write()
and relying on the kernel. The pattern should be roughly
write() -> fsync() -> free(), rather than write() -> free() -> fsync().
For example, if a partition gets full upon fsync(), the application
still has a chance to persist the data in a different location, while
the kernel cannot possibly make this decision and recover on its own.
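
As a minimal sketch of that pattern (persist() and its error handling
are hypothetical, not an existing API):

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * Write a buffer and make it durable before the caller lets go of
     * it. On any failure the caller still owns buf and can persist it
     * to a different location (e.g. after ENOSPC on this partition).
     */
    static int
    persist(const char *path, const void *buf, size_t len)
    {
        int     fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return -1;
        /* short-write handling elided for brevity */
        if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
        {
            int     saved_errno = errno;

            close(fd);
            errno = saved_errno;
            return -1;          /* caller retries elsewhere, buf intact */
        }
        return close(fd);       /* only now may the caller free buf */
    }

A caller could then try persist(primary_path, buf, len) and, on
failure, persist(fallback_path, buf, len) (both paths being
placeholders), freeing the buffer only after one of them succeeds.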

> > Callers that are not affected by the potential outcome of fsync() and
> > do not react on errors, have no reason for calling it in the first place
> > (and thus masking failure from subsequent callers that may indeed care).
> 
> Reacting to an error from an fsync() call could, based on how it's
> documented and actually implemented in other OS's, mean "run another
> fsync() to see if the error has resolved itself."  Requiring that to
> mean "you have to go dirty all of the pages you previously dirtied to
> actually get a subsequent fsync() to do anything" is really just not
> reasonable- a given program may have no idea what was written to
> previously nor any particular reason to need to know, on the expectation
> that the fsync() call will flush any dirty pages, as it's documented to
> do.

I think we are conflating a few issues here. Having the OS kernel be
responsible for error recovery (so that a subsequent fsync() would fix
the problem) is one; this is clearly a design that most kernels have
not adopted, for the reasons outlined above (although having the FS
layer recover from hard errors transparently does seem to be open for
discussion [1]). Then there is the issue of the granularity of error
reporting: userspace could benefit from a fine-grained indication of
which pages (or file ranges) failed writeback. Another issue is that of
the reporting semantics (report and clear), which is also a design
choice, made to avoid higher-resolution error tracking and the
corresponding memory overheads [1].

Best regards,
Anthony

[1] https://lwn.net/Articles/718734/

