Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From Craig Ringer
Subject Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date
Msg-id CAMsr+YHQ3Evsdc5QMwZ6=rbg0Tdwgv+2DCfLnjV+4VJwNTocgQ@mail.gmail.com
In reply to Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Christophe Pettus <xof@thebuild.com>)
Responses Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-hackers
On 9 April 2018 at 05:28, Christophe Pettus <xof@thebuild.com> wrote:

>> On Apr 8, 2018, at 14:23, Greg Stark <stark@mit.edu> wrote:
>>
>> They consider dirty filesystem buffers when there's
>> hardware failure preventing them from being written "a memory leak".
>
> That's not an irrational position.  File system buffers are *not* dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment.  If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory.  I guess in theory that they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me.

Yep.

Another example is a write to an NFS or iSCSI volume that goes away forever. What if the app keeps write()ing in the hopes it'll come back, and by the time the kernel starts reporting EIO for write(), it's already saddled with a huge volume of dirty writeback buffers it can't get rid of because someone, one day, might want to know about them?
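
To illustrate, here's a minimal C sketch of how buffered I/O defers errors. The path is made up; assume the volume vanishes after open(). write() just copies into dirty page-cache buffers and reports success, and the failure only surfaces later, at fsync() time, once the kernel actually attempts writeback:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/gone/data", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "supposedly durable data\n";
    /* Usually succeeds: the data only lands in dirty page-cache
     * buffers, no device I/O has happened yet. */
    if (write(fd, buf, sizeof(buf) - 1) < 0)
        perror("write");

    /* If the volume went away in the meantime, the error is
     * reported here, long after write() claimed success. */
    if (fsync(fd) < 0)
        fprintf(stderr, "fsync: %s\n", strerror(errno));

    close(fd);
    return 0;
}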

You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that OK? What if it's remounted again? That'd be really bad too, for someone expecting write reliability.

You can coarsen from dirty buffer tracking to marking the FD(s) bad, but what if there's no FD to mark because the file isn't open at the moment?

You can mark the inode cache entry and pin it, I guess. But what if your app triggered I/O errors over vast numbers of small files? Again, the kernel's left holding the ball.
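
Concretely, that failure mode looks like the sketch below (hypothetical path; the write-then-close / reopen-then-fsync split mirrors how our backends and checkpointer divide the work). The EIO falls into the gap where nothing is open:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Writer: dirty some pages, then close. No fd remains open. */
    int fd = open("/mnt/vol/file", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;
    write(fd, "data\n", 5);
    close(fd);                /* writeback hasn't happened yet */

    /* ... writeback fails with an I/O error around here; with no
     * open fd (and perhaps no surviving inode cache entry) there
     * is nothing left to mark and no one to tell ... */

    /* Checkpointer: reopen and fsync. On the kernels under
     * discussion this can return 0, silently losing the error. */
    fd = open("/mnt/vol/file", O_WRONLY);
    if (fd >= 0 && fsync(fd) == 0)
        puts("fsync reported success; the write may still be lost");
    close(fd);
    return 0;
}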

It doesn't know if/when an app will return to check. It doesn't know how long to remember the failure for. It doesn't know when all interested clients have been informed and it can treat the fault as cleared/repaired, either, so it'd have to *keep on reporting EIO for PostgreSQL's own write()s and fsync()s indefinitely*, even once we do recovery.
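
That's why, on our side, the only defensible reaction is to treat the first fsync() failure as fatal rather than retrying. A sketch of that stance (the PANIC-and-recover policy is mine to illustrate, not current code):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Treat any fsync() failure as fatal. Retrying is unsafe: the
 * kernel may report the error once, then mark the dirty pages
 * clean, so a retried fsync() can return 0 without the data ever
 * having reached disk. */
static void fsync_or_panic(int fd)
{
    if (fsync(fd) < 0)
    {
        fprintf(stderr, "fsync failed: %s -- aborting to force recovery\n",
                strerror(errno));
        abort();   /* i.e. PANIC and replay WAL from a known-good point */
    }
}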

The only way it could avoid that would be to keep the dirty writeback pages around and flagged bad, then clear the flag when a new write() replaces the same file range. I can't imagine that being practical.

Blaming the kernel for this sure is the easy way out.

But IMO we cannot rationally expect the kernel to remember error state forever for us, then forget it when we expect, all without actually telling it anything about our activities or even that we still exist and are still interested in the files/writes. We've closed the files and gone away.

Whatever we do, it's likely going to have to involve not doing that anymore.
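
Roughly, the pattern would have to become something like this sketch: hold the same fd open from the first write() through the final fsync(), so the kernel always has a live fd on which to report writeback errors. (Names and error handling here are illustrative only.)

#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int write_durably(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    ssize_t nwritten = write(fd, data, len);
    if (nwritten < 0 || (size_t) nwritten != len || fsync(fd) < 0)
    {
        int saved_errno = errno;
        close(fd);          /* fd was held open until after fsync */
        errno = saved_errno;
        return -1;
    }
    return close(fd);
}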

Even if we can somehow convince the kernel folks to add a new interface for us that reports I/O errors to some listener, like an inotify/fnotify/dnotify/whatever-it-is-today-notify extension reporting errors in buffered async writes, we won't be able to rely on having it for 5-10 years, and even then only on Linux.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
