Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

Поиск
Список
Период
Сортировка
От Craig Ringer
Тема Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Дата
Msg-id CAMsr+YFFeQo_nsfQff_tWxxvwg7gV88mVfFku4GQJYSRktnU3A@mail.gmail.com
обсуждение исходный текст
Ответ на Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Bruce Momjian <bruce@momjian.us>)
Ответы Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Bruce Momjian <bruce@momjian.us>)
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Andrew Gierth <andrew@tao11.riddles.org.uk>)
Список pgsql-hackers
On 4 April 2018 at 21:49, Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Apr  4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:
> On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> > On 4 April 2018 at 13:29, Thomas Munro <thomas.munro@enterprisedb.com>
> > wrote:
> >> /* Ensure that we skip any errors that predate opening of the file */
> >> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> >>
> >> [...]
> >
> > Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel
> > will deliberately hide writeback errors that predate our fsync() call from
> > us?
>
> Predates the opening of the file by the process that calls fsync().
> Yeah, it sure looks that way based on the above code fragment.  Does
> anyone know better?

Uh, just to clarify, what is new here is that it is ignoring any
_errors_ that happened before the open().  It is not ignoring write()'s
that happened but have not been written to storage before the open().

FYI, pg_test_fsync has always tested the ability to fsync() writes()
from from other processes:

        Test if fsync on non-write file descriptor is honored:
        (If the times are similar, fsync() can sync data written on a different
        descriptor.)
                write, fsync, close                5360.341 ops/sec     187 usecs/op
                write, close, fsync                4785.240 ops/sec     209 usecs/op

Those two numbers should be similar.  I added this as a check to make
sure the behavior we were relying on was working.  I never tested sync
errors though.

I think the fundamental issue is that we always assumed that writes to
the kernel that could not be written to storage would remain in the
kernel until they succeeded, and that fsync() would report their
existence.

I can understand why kernel developers don't want to keep failed sync
buffers in memory, and once they are gone we lose reporting of their
failure.  Also, if the kernel is going to not retry the syncs, how long
should it keep reporting the sync failure?

Ideally until the app tells it not to.

But there's no standard API for that.

The obvious answer seems to be "until the FD is closed". But we just discussed how Pg relies on being able to open and close files freely. That may not be as reasonable a thing to do as we thought it was when you consider error reporting. What's the kernel meant to do? How long should it remember "I had an error while doing writeback on this file"? Should it flag the file metadata and remember across reboots? Obviously not, but where does it stop? Tell the next program that does an fsync() and forget? How could it associate a dirty buffer on a file with no open FDs with any particular program at all? And what if the app did a write then closed the file and went away, never to bother to check the file again, like most apps do?

Some I/O errors are transient (network issue, etc). Some are recoverable with some sort of action, like disk space issues, but may take a long time before an admin steps in. Some are entirely unrecoverable (disk 1 in striped array is on fire) and there's no possible recovery. Currently we kind of hope the kernel will deal with figuring out which is which and retrying. Turns out it doesn't do that so much, and I don't think the reasons for that are wholly unreasonable. We may have been asking too much.

That does leave us in a pickle when it comes to the checkpointer and opening/closing FDs. I don't know what the "right" thing for the kernel to do from our perspective even is here, but the best I can come up with is actually pretty close to what it does now. Report the fsync() error to the first process that does an fsync() since the writeback error if one has occurred, then forget about it. Ideally I'd have liked it to mark all FDs pointing to the file with a flag to report EIO on next fsync too, but it turns out that won't even help us due to our opening and closing behaviour, so we're going to have to take responsibility for handling and communicating that ourselves, preventing checkpoint completion if any backend gets an fsync error. Probably by PANICing. Some extra work may be needed to ensure reliable ordering and stop checkpoints completing if their fsync() succeeds due to a recent failed fsync() on a normal backend that hasn't PANICed or where the postmaster hasn't noticed yet.

Our only option might be to tell administrators to closely watch for
kernel write failure messages, and then restore or failover.  :-(

Speaking of, there's not necessarily any lost page write error in the logs AFAICS. My tests often just show "Buffer I/O error on device dm-0, logical block 59393" or the like.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Antonis Iliopoulos
Дата:
Сообщение: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Следующее
От: Jesper Pedersen
Дата:
Сообщение: Re: [HACKERS] path toward faster partition pruning