Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:
Msg-id: CAMsr+YHSj6g3bqzDH90-O8wHd9Eu=+1efjeA13RD+W4JMUOuBw@mail.gmail.com
In response to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 3 April 2018 at 10:54, Robert Haas <robertmhaas@gmail.com> wrote:
 
> I think it's always unreasonable to
> throw away the user's data.

Well, we do that. If a txn aborts, all writes in the txn are discarded.

I think that's perfectly reasonable. We promise an all-or-nothing effect, though we make exceptions even there.

The FS doesn't offer transactional semantics, but the fsync behaviour can be interpreted kind of similarly.

I don't *agree* with it, but I don't think it's as wholly unreasonable as all that. I think leaving it undocumented is absolutely gobsmacking, and it's dubious at best, but it's not totally insane.
 
> If the writes are going to fail, then let
> them keep on failing every time.

Like we do, where we require an explicit rollback.

But POSIX may pose issues there: it doesn't really define any interface for that, AFAIK. Unless you expect the app to close() and re-open() the file. Replacing one nonstandard behaviour with another may not be a win.
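
To make it concrete, here's a toy (you'd need fault injection, e.g. a dm-error device-mapper target, to make the first fsync() actually fail; without it both calls just succeed):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_WRONLY | O_CREAT, 0644);
    const char buf[] = "important data";

    if (fd < 0 || write(fd, buf, sizeof buf) != sizeof buf)
        return 1;

    /* Suppose the device fails here: fsync() reports EIO once, and the
     * kernel marks the dirty pages clean without them reaching disk. */
    if (fsync(fd) != 0)
        fprintf(stderr, "fsync: %s\n", strerror(errno));

    /* Retry: nothing is dirty any more, so this returns 0. A naive
     * retry loop now believes the data is durable. It is not. */
    if (fsync(fd) == 0)
        fprintf(stderr, "retry \"succeeded\"; data may be gone\n");

    close(fd);
    return 0;
}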
 
> *That* wouldn't cause any data loss,
> because we'd never be able to checkpoint, and eventually the user
> would have to kill the server uncleanly, and that would trigger
> recovery.

Yep. That's what I expected to happen on unrecoverable I/O errors. Because, y'know, unrecoverable.

I was stunned to learn it's not so. And I'm even more amazed to learn that ext4's errors=remount-ro apparently doesn't concern itself with mere user data, and may exhibit the same behaviour - I need to rerun my test case on it tomorrow.
 
> Also, this really does make it impossible to write reliable programs.

In the presence of multiple apps interacting on the same file, yes. I think that's a little bit of a stretch though.

For a single app, you can recover by remembering and redoing all the writes you did.
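
Something like this sketch, say (toy code, not PostgreSQL code; fixed-size buffers and bounds checks omitted, and durable_write/durable_sync are names I just made up):

/* Retain each write until fsync() succeeds; on fsync() failure,
 * replay the retained writes from our own copy and sync again. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

struct pending { off_t off; size_t len; char data[8192]; };

static struct pending pend[1024];
static int npend;

static int durable_write(int fd, off_t off, const void *data, size_t len)
{
    if (pwrite(fd, data, len, off) != (ssize_t) len)
        return -1;
    /* Remember the write until it is known to be on disk. */
    pend[npend].off = off;
    pend[npend].len = len;
    memcpy(pend[npend].data, data, len);
    npend++;
    return 0;
}

static int durable_sync(int fd)
{
    int tries = 0;

    while (fsync(fd) != 0)
    {
        if (++tries > 3)
            return -1;          /* give up: surface the failure */
        fprintf(stderr, "fsync: %s; replaying %d writes\n",
                strerror(errno), npend);
        /* The kernel may have dropped the dirty pages, so re-issue
         * everything we haven't yet confirmed, then try again. */
        for (int i = 0; i < npend; i++)
            if (pwrite(fd, pend[i].data, pend[i].len, pend[i].off)
                != (ssize_t) pend[i].len)
                return -1;
    }
    npend = 0;                  /* confirmed durable */
    return 0;
}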

Sucks if your app wants to have multiple processes working together on a file without some kind of journal or WAL, relying on fsync() alone, mind you. But at least we have WAL.

Hrm. I wonder how this interacts with wal_level=minimal.
 
> Even leaving that aside, a PANIC means a prolonged outage on a
> production system - it could easily take tens of minutes or longer to
> run recovery.  So saying "oh, just do that" is not really an answer.
> Sure, we can do it, but it's like trying to lose weight by
> intentionally eating a tapeworm.  Now, it's possible to shorten the
> checkpoint_timeout so that recovery runs faster, but then performance
> drops because data has to be fsync()'d more often instead of getting
> buffered in the OS cache for the maximum possible time.

It's also spikier. Users have more latency issues with short, frequent checkpoints.
 
> We could also
> dodge this issue in another way: suppose that when we write a page
> out, we don't consider it really written until fsync() succeeds.  Then
> we wouldn't need to PANIC if an fsync() fails; we could just re-write
> the page.  Unfortunately, this would also be terrible for performance,
> for pretty much the same reasons: letting the OS cache absorb lots of
> dirty blocks and do write-combining is *necessary* for good
> performance.

Our double-caching is already bad enough as it is.

(Ideally I want to be able to swap buffers between shared_buffers and the OS buffer-cache. Almost like a 2nd level of buffer pinning. When we write out a block, we *transfer* ownership to the OS.  Yeah, I'm dreaming. But we'd sure need to be able to trust the OS not to just forget the block then!)
 
>> The error reporting is thus consistent with the intended semantics (which
>> are sadly not properly documented). Repeated calls to fsync() simply do not
>> imply that the kernel will retry to writeback the previously-failed pages,
>> so the application needs to be aware of that. Persisting the error at the
>> fsync() level would essentially mean moving application policy into the
>> kernel.

> I might accept this argument if I accepted that it was OK to decide
> that an fsync() failure means you can forget that the write() ever
> happened in the first place, but it's hard to imagine an application
> that wants that behavior.  If the application didn't care about
> whether the bytes really got to disk or not, it would not have called
> fsync() in the first place.  If it does care, reporting the error only
> once is never an improvement.

Many RDBMSes do just that. It's hardly behaviour unique to the kernel. They report an ERROR on a statement in a txn then go on with life, merrily forgetting that anything was ever wrong.

I agree with PostgreSQL's stance that this is wrong. We require an explicit rollback (or ROLLBACK TO SAVEPOINT) to restore the session to a usable state. This is good.

But we're the odd one out there. Almost everyone else does much what fsync() does on Linux: report the error and forget it.

In any case, we're not going to get anyone to backpatch a fix for this into all kernels, so we're stuck working around it.
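
For us the workaround presumably means promoting fsync() failure on a data file to PANIC, so we get WAL replay instead of trusting a retried fsync(). Roughly (just a sketch; pg_fsync_or_panic is a made-up name, not a proposed API):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Stand-in for elog(PANIC, ...): log and abort the whole server, so
 * crash recovery replays from WAL instead of trusting the kernel. */
static void fsync_panic(const char *path)
{
    fprintf(stderr, "PANIC: could not fsync file \"%s\": %s\n",
            path, strerror(errno));
    abort();
}

static void pg_fsync_or_panic(int fd, const char *path)
{
    if (fsync(fd) != 0)
        fsync_panic(path);
}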

I'll do some testing with ENOSPC tomorrow, propose a patch, report back.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
