Postgres, fsync, and OSs (specifically linux)

From: Andres Freund
Subject: Postgres, fsync, and OSs (specifically linux)
Msg-id: 20180427222842.in2e4mibx45zdth5@alap3.anarazel.de
Responses:
  Re: Postgres, fsync, and OSs (specifically linux)  (Bruce Momjian <bruce@momjian.us>)
  Re: Postgres, fsync, and OSs (specifically linux)  (Craig Ringer <craig@2ndquadrant.com>)
  Re: Postgres, fsync, and OSs (specifically linux)  (Simon Riggs <simon@2ndquadrant.com>)
  Re: Postgres, fsync, and OSs (specifically linux)  (Catalin Iacob <iacobcatalin@gmail.com>)
  Re: Postgres, fsync, and OSs (specifically linux)  (Andres Freund <andres@anarazel.de>)
  Re: Postgres, fsync, and OSs (specifically linux)  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,

I thought I'd send this separately from [0] as the issue has become more
general than what was mentioned in that thread, and it went off into
various weeds.

I went to LSF/MM 2018 to discuss [0] and related issues. Overall I'd say
it was a very productive discussion.  I'll first try to recap the
current situation, updated with knowledge I gained. Secondly I'll try to
discuss the kernel changes that seem to have been agreed upon. Thirdly
I'll try to sum up what postgres needs to change.

== Current Situation ==

The fundamental problem is that postgres assumed that any IO error would
be reported at fsync time, and that the error would keep being reported
until resolved. That's not true in several operating systems, linux
included.

There are various judgement calls leading to the current OS behaviour
(specifically linux's, but the concerns are similar in other OSs):

- By the time IO errors are treated as fatal, it's unlikely that plain
  retries attempting to write exactly the same data are going to
  succeed. There are retries on several layers. Some cases would be
  resolved by overwriting a larger amount (so device level remapping
  functionality can mask dead areas), but plain retries aren't going to
  succeed if they didn't the first time round.
- Retaining all the data necessary for retries would make it quite
  possible to turn IO errors on some device into out of memory
  errors. This is true to a far lesser degree if only enough information
  were to be retained to (re-)report an error, rather than actually
  retry the write.
- Continuing to re-report an error after one fsync() failed would make
  it hard to recover from that fact. There'd need to be a way to "clear"
  a persistent error bit, and that'd obviously be outside of posix.
- Some other databases use direct-IO and thus these paths haven't been
  exercised under fire that much.
- Actually marking files as persistently failed would require filesystem
  changes and filesystem metadata IO, which is far from guaranteed to
  succeed in failure scenarios.

Before linux v4.13 errors in kernel writeback would be reported at most
once, without a guarantee that that'd happen (IIUC memory pressure could
lead to the relevant information being evicted) - but it was pretty
likely.  After v4.13 (see https://lwn.net/Articles/724307/) errors are
reported exactly once to all open file descriptors for a file with an
error - but never on file descriptors that were opened after the error
occurred.
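
To make the file-descriptor-scoped nature of that behaviour concrete,
here's a minimal sketch of the call sequence involved (the file name and
the use of an error-injecting device such as dm-error are assumptions of
the sketch; the error injection itself isn't shown):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch only: assumes "data" lives on a device where writeback can be
 * made to fail (e.g. via a dm-error/dm-flakey target). */
int main(void)
{
    int fd1 = open("data", O_WRONLY | O_CREAT, 0600);

    write(fd1, "x", 1);                 /* dirties the page cache */

    /* ... writeback of that page fails here ... */

    int fd2 = open("data", O_WRONLY);   /* opened *after* the failure */

    if (fsync(fd1) < 0)                 /* fd1 was open at the time: sees the error */
        printf("fd1: %s\n", strerror(errno));

    if (fsync(fd2) < 0)
        printf("fd2: %s\n", strerror(errno));
    else                                /* >= v4.13: error not reported here */
        printf("fd2: fsync succeeded despite the lost write\n");

    close(fd1);
    close(fd2);
    return 0;
}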

It's worth noting that on linux it's not well defined what contents one
would read after a writeback error. IIUC xfs will mark the pagecache
contents that triggered an error as invalid, triggering a re-read from
the underlying storage (thus either failing or returning old but
persistent contents). Whereas some other filesystems (among them ext4 I
believe) retain the modified contents of the page cache, but mark them
as clean (thereby returning new contents until the page cache contents
are evicted).

Some filesystems (prominently NFS in many configurations) perform an
implicit fsync when closing the file. While postgres checks for an error
of close() and reports it, we don't treat it as fatal. It's worth noting
that by my reading this means that an fsync error at close() will *not*
be re-reported by the time an explicit fsync() is issued. It also means
that we'll not react properly to the possible ENOSPC errors that may be
reported at close() for NFS.  At least the latter is not specific to
linux.
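
In other words the close() return value is the only place such an error
can be caught. A minimal sketch of doing so in plain C (the helper name
and the exit-on-error policy are assumptions of the sketch, not existing
postgres code):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* On NFS and similar filesystems close() may flush dirty data and report
 * ENOSPC/EIO, and that error may never show up at a later fsync(), so the
 * return value has to be acted upon right here. */
void
close_checked(int fd, const char *path)
{
    if (close(fd) < 0)
    {
        fprintf(stderr, "could not close \"%s\": %s\n", path, strerror(errno));
        exit(EXIT_FAILURE);     /* data may not be durable */
    }
}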

Proposals for how postgres could deal with this included using syncfs(2)
- but that turns out not to work at all currently, because syncfs()
basically doesn't return any file-level errors. It'd also imply
superfluously flushing temporary files etc.
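
For reference, this is roughly what the syncfs() approach would look
like; the sketch compiles and runs, but as said above the return value
today doesn't reflect writeback errors of individual files:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: flush the whole filesystem containing the given directory. */
int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY);

    if (fd < 0)
        return 1;
    if (syncfs(fd) < 0)         /* currently won't report file-level errors */
        perror("syncfs");
    close(fd);
    return 0;
}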

The second major type of proposal was using direct-IO. That'd generally
be a desirable feature, but a) would require some significant changes to
postgres to be performant, b) isn't really applicable for the large
percentage of installations that aren't tuned reasonably well, because
at the moment the OS page cache functions as a memory-pressure aware
extension of postgres' page cache.


Another topic brought up in this thread was the handling of ENOSPC
errors that aren't triggered on a filesystem level, but rather are
triggered by thin provisioning. On linux that currently apparently leads
to page cache contents being lost (and errors "eaten") in a lot of
places, including just when doing a write(). In a lot of cases it's
pretty much expected that the file system will just hang or react
unpredictably upon space exhaustion.  My reading is that the block-layer
thin provisioning code is still pretty fresh, and should only be used
with great care. The only way to use it halfway reliably appears to be
changing the configuration so that space exhaustion blocks until admin
intervention (at least dm-thinp allows that).


There's some clear need to automate some more testing in this area so
that future behaviour changes don't surprise us.


== Proposed Linux Changes ==

- Matthew Wilcox proposed (and posted a patch) that'd partially revert
  behaviour to the pre v4.13 world, by *also* reporting errors to
  "newer" file-descriptors if the error hasn't previously been
  reported. That'd still not guarantee that the error is reported
  (memory pressure could evict information without open fd), but in most
  situations we'll again get the error in the checkpointer.

  This seems to be largely agreed upon. It's unclear whether it'll go
  into the stable backports for still-maintained >= v4.13 kernels.

- syncfs() will be fixed so it reports errors properly - that'll likely
  require passing it an O_PATH file descriptor to have space to store the
  errseq_t value that allows discerning already reported and new errors.

  No patch has appeared yet, but the behaviour seems largely agreed
  upon.

- Make per-filesystem error counts available in a uniform (i.e. same for
  every supporting fs) manner. Right now it's very hard to figure out
  whether errors occurred. There seemed general agreement that exporting
  knowledge about such errors is desirable. Quite possibly the syncfs()
  fix above will provide the necessary infrastructure. It's unclear as
  of yet how the value would be exposed. Per-fs /sys/ entries and an
  ioctl on O_PATH fds have been mentioned.

  These error counts would not vanish due to memory pressure, and they
  can be checked even without knowing which files in a specific
  filesystem have been touched (e.g. when just untar-ing something).

  There seemed to be fairly widespread agreement that this'd be a good
  idea. It's much less clear whether somebody will do the work.

- Provide config knobs that allow defining the FS error behaviour in a
  consistent way across supported filesystems. XFS currently has various
  knobs controlling what happens in case of metadata errors [1] (retry
  forever, timeout, return up). It was proposed that this interface be
  extended to also deal with data errors, and moved into generic support
  code.

  While the timeline is unclear, there seemed to be widespread support
  for the idea. I believe Dave Chinner indicated that he at least has
  plans to generalize the code.

- Stop inodes with unreported errors from being evicted. This will
  guarantee that a later fsync (without an open FD) will see the
  error. The memory pressure concerns here are lower than with keeping
  all the failed pages in memory, and it could be optimized further.

  I read some tentative agreement behind this idea, but I think it's by
  far the most controversial one.


== Potential Postgres Changes ==

Several operating systems / file systems behave differently than we
expected (see e.g. [2], thanks Thomas). Even the discussed changes to
e.g. linux don't get us to where we thought we were. There's obviously also
the question of how to deal with kernels / OSs that have not been
updated.

Changes that appear to be necessary, even for kernels with the issues
addressed:

- Clearly we need to treat fsync() EIO, ENOSPC errors as a PANIC and
  retry recovery. While ENODEV (underlying device went away) will be
  persistent, it probably makes sense to treat it the same or even just
  give up and shut down.  One question I see here is whether we just
  want to continue crash-recovery cycles, or whether we want to limit
  that.

- We need more aggressive error checking on close(), for ENOSPC and
  EIO. In both cases afaics we'll have to trigger a crash recovery
  cycle. It's entirely possible to end up in a loop on NFS etc, but I
  don't think there's a way around that.

  Robert, on IM, wondered whether there'd be a race between some backend
  doing a close(), triggering a PANIC, and a checkpoint succeeding.  I
  don't *think* so, because the error will only happen if there's
  outstanding dirty data, and the checkpoint would have flushed that out
  if it belonged to the current checkpointing cycle.

- The outstanding fsync request queue isn't persisted properly [3]. This
  means that even if the kernel behaved the way we'd expected, we'd not
  fail a second checkpoint :(. It's possible that we don't need to deal
  with this because we'll henceforth PANIC, but I'd argue we should fix
  that regardless. Seems like a time-bomb otherwise (e.g. after moving
  to DIO somebody might want to relax the PANIC...).

- It might be a good idea to whitelist expected return codes for write()
  and PANIC on ones that we did not expect. E.g. when hitting an EIO we
  should probably PANIC, to get back to a known good state. Even though
  it's likely that we'd see that error again at fsync(). (A sketch of
  promoting such write()/close()/fsync() errors to PANIC follows after
  this list.)

- Docs.
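
As referenced in the write() item above, here's a minimal sketch of what
promoting such errors to PANIC could look like. The helper names and the
blanket PANIC-on-any-failure policy are assumptions of this sketch, not
an actual patch; a real version would have to live in fd.c/md.c and make
per-call-site decisions:

#include "postgres.h"

#include <errno.h>
#include <unistd.h>

#include "storage/fd.h"

/* Promote data-flush errors to PANIC so crash recovery gets us back to a
 * known-good state instead of trusting possibly-lost writes. */
void
fsync_or_panic(int fd, const char *path)
{
    if (pg_fsync(fd) != 0)
        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not fsync file \"%s\": %m", path)));
}

/* close() can report deferred write errors (e.g. ENOSPC/EIO on NFS), and
 * they may never be reported again, so treat them as fatal too. */
void
close_or_panic(int fd, const char *path)
{
    if (close(fd) != 0)
        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not close file \"%s\": %m", path)));
}

/* Whitelist "expected" write() failures (here, as an assumption: ENOSPC
 * and EINTR, which callers can retry or report less drastically); anything
 * else, e.g. EIO, leaves the page cache in an unknown state, so PANIC. */
void
write_or_panic(int fd, const void *buf, size_t len, const char *path)
{
    if (write(fd, buf, len) < 0 &&
        errno != ENOSPC && errno != EINTR)
        ereport(PANIC,
                (errcode_for_file_access(),
                 errmsg("could not write file \"%s\": %m", path)));
}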

I think we also need to audit a few codepaths. I'd be surprised if we
PANICed appropriately on all fsync()s, particularly around the SLRUs.  I
think we need to be particularly careful around the WAL handling: it
seems fairly likely that there are cases where we write out WAL in one
backend and then fsync() it in another backend, using a file descriptor
that was only opened *after* the write occurred, which means we might
miss the error entirely.


Then there's the question of how we want to deal with kernels that
haven't been updated with the aforementioned changes. We could say that
we expect decent OS support and declare that we just can't handle this -
given that at least various linux versions, netbsd, openbsd and MacOS
just silently drop errors, and that we'd need different approaches for
dealing with each of them, that doesn't seem like an insane approach.

What we could do:

- forward file descriptors from backends to checkpointer (using
  SCM_RIGHTS) when marking a segment dirty. That'd require some
  optimizations (see [4]) to avoid doing so repeatedly.  That'd
  guarantee correct behaviour in all linux kernels >= 4.13 (possibly
  backported by distributions?), and I think it'd also make it vastly
  more likely that errors are reported in earlier kernels. (A minimal
  fd-passing sketch follows at the end of this list.)

  This should be doable without a noticeable performance impact, I
  believe.  I don't think it'd be that hard either, but it'd be a bit of
  a pain to backport to all postgres versions, and somewhat invasive for
  a backpatch.

  The infrastructure this'd likely end up building (a hashtable of open
  relfilenodes) would also be useful for further things (like caching
  file sizes).

- Add a pre-checkpoint hook that checks for filesystem errors *after*
  fsyncing all the files, but *before* logging the checkpoint completion
  record. Operating systems, filesystems, etc. all log errors in
  different formats, but for larger installations it'd not be too hard to
  write code that checks their specific configuration.

  While I'm a bit concerned about running user code before a checkpoint,
  if we did it as a shell command it seems pretty reasonable. And useful
  even without concern for the fsync issue itself. Checking for IO
  errors could e.g. also include checking for read errors - it'd not be
  unreasonable to not want to complete a checkpoint if there'd been any
  media errors.

- Use direct IO. Due to architectural performance issues in PG and the
  fact that it'd not be applicable for all installations I don't think
  this is a reasonable fix for the issue presented here. Although it's
  independently something we should work on.  It might be worthwhile to
  provide a configuration option that allows forcing DIO to be enabled
  for WAL even if replication is turned on.

- magic
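
As referenced in the first item, here's a minimal sketch of the sending
side of SCM_RIGHTS fd passing (the function name is made up; socket
setup, the receiving side and any postgres integration are omitted, and
sock is assumed to be an already-connected AF_UNIX socket):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hand a file descriptor to another process (e.g. backend -> checkpointer)
 * over a Unix-domain socket.  Returns 0 on success, -1 on failure. */
int
send_fd(int sock, int fd_to_send)
{
    struct msghdr   msg;
    struct iovec    iov;
    char            dummy = 'F';        /* at least one data byte is needed */
    union
    {
        char            buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr  align;          /* ensure proper alignment */
    }               u;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));

    iov.iov_base = &dummy;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}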

Greetings,

Andres Freund

[0] https://archives.postgresql.org/message-id/CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com

[1]
static const struct xfs_error_init xfs_error_meta_init[XFS_ERR_ERRNO_MAX] = {
    { .name = "default",
      .max_retries = XFS_ERR_RETRY_FOREVER,
      .retry_timeout = XFS_ERR_RETRY_FOREVER,
    },
    { .name = "EIO",
      .max_retries = XFS_ERR_RETRY_FOREVER,
      .retry_timeout = XFS_ERR_RETRY_FOREVER,
    },
    { .name = "ENOSPC",
      .max_retries = XFS_ERR_RETRY_FOREVER,
      .retry_timeout = XFS_ERR_RETRY_FOREVER,
    },
    { .name = "ENODEV",
      .max_retries = 0,    /* We can't recover from devices disappearing */
      .retry_timeout = 0,
    },
};

[2] https://wiki.postgresql.org/wiki/Fsync_Errors
[3] https://archives.postgresql.org/message-id/87y3i1ia4w.fsf%40news-spur.riddles.org.uk
[4] https://archives.postgresql.org/message-id/20180424180054.inih6bxfspgowjuc@alap3.anarazel.de

