Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:
Msg-id: CAMsr+YFsrjzj8oisCcrTo3RB35D_kAmdd0VOOUQwqxtQw6LS_w@mail.gmail.com
In reply to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Andres Freund <andres@anarazel.de>)
Responses: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Andres Freund <andres@anarazel.de>)
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
On 9 April 2018 at 07:16, Andres Freund <andres@anarazel.de> wrote:
 

> I think the danger presented here is far smaller than some of the
> statements in this thread might make one think.

Clearly it's not happening a huge amount or we'd have a lot of noise about Pg eating people's data, people shouting about how unreliable it is, etc. We don't. So it's not some earth shattering imminent threat to everyone's data. It's gone unnoticed, or the root cause unidentified, for a long time.

I suspect we've written off a fair few issues in the past as "it's bad hardware" when actually the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.

I was already very surprised when I learned that PostgreSQL completely ignores wholly absent relfilenodes: if you unlink() a relation's backing relfilenode while Pg is down and that file has writes pending in the WAL, we merrily re-create it with uninitialized pages and go on our way. As Andres pointed out in an off-list discussion, redo isn't a consistency check, and it's not obliged to fail in such cases. We can say "well, don't do that then" and define file losses from FS corruption etc. away as not our problem: the lower levels we expect to take care of this have failed.
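
To make that concrete, here's a rough standalone sketch of the effect (this is not Pg's actual md.c code, and the file name is made up): an open that's allowed to create the file hides its absence, where a plain open would have surfaced ENOENT.

/* Sketch only: why a missing data file can go unnoticed during redo.
 * Opening with O_CREAT quietly re-creates the file empty, so later
 * reads of the target block see zero pages rather than an error.
 * A strict open without O_CREAT would have reported ENOENT.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *path = "missing-segment";       /* hypothetical relation segment */

    int fd = open(path, O_RDWR);                /* strict: missing file is an error */
    if (fd < 0)
        fprintf(stderr, "strict open: %s\n", strerror(errno));

    fd = open(path, O_RDWR | O_CREAT, 0600);    /* redo-style: quietly re-created */
    if (fd >= 0)
        printf("re-created %s as an empty file\n", path);

    return 0;
}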

We have to look at what checkpoints are and are not supposed to promise, and whether this is a problem we just define away as "not our problem, the lower level failed, we're not obliged to detect this and fail gracefully."

We can choose to say that checkpoints are required to guarantee crash/power loss safety ONLY and do not attempt to protect against I/O errors of any sort. In fact, I think we should likely amend the documentation for release versions to say just that.

> In all likelihood, once
> you've got an IO error that kernel level retries don't fix, your
> database is screwed.

Your database is going to be down or have interrupted service. You may have some unreadable data, which could result in localised damage to one or more relations. That could affect FK relationships, indexes, all sorts. If you're really unlucky you might lose something critical like pg_clog/ contents.

But in general your DB should be repairable/recoverable even in those cases.

And in many failure modes there's no reason to expect any data loss at all, like:

* Local disk fills up (seems to be safe already due to space reservation at write() time)
* Thin-provisioned storage backing a local volume, iSCSI, or paravirt block device fills up
* NFS volume fills up
* Multipath I/O error
* Interruption of connectivity to network block device
* Disk develops a localized bad sector where we haven't previously written data

Except for the ENOSPC on NFS, all the rest of the cases can be handled by expecting the kernel to retry forever and not return until the block is written or we reach the heat death of the universe. And NFS, well...
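
For what it's worth, here's a trivial sketch of where the error can surface (the file name is arbitrary): on most local filesystems the buffered write() itself reserves the blocks and reports ENOSPC, which we already handle; on NFS the allocation can be deferred to writeback, so the same condition may only show up at the later fsync().

/* Sketch: where ENOSPC can appear for a buffered write. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char block[8192];
    memset(block, 0, sizeof(block));

    int fd = open("datafile", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
        perror("write");    /* local FS full: usually reported here */

    if (fsync(fd) != 0)
        perror("fsync");    /* NFS: ENOSPC may only be reported here */

    close(fd);
    return 0;
}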

Part of the trouble is that the kernel *won't* retry forever in all these cases, and doesn't always seem to offer a way to ask it to.

And if the user hasn't configured the kernel and storage stack for the right behaviour in terms of I/O error resilience, we don't find out about it.

So it's not the end of the world, but it'd sure be nice to fix.

> Whether fsync reports that or not is really
> somewhat besides the point. We don't panic that way when getting IO
> errors during reads either, and they're more likely to be persistent
> than errors during writes (because remapping on storage layer can fix
> issues, but not during reads).

That's because reads don't make promises about what's committed and synced. I think that's quite different.
 
> We should fix things so that reported errors are treated with crash
> recovery, and for the rest I think there's very fair arguments to be
> made that that's far outside postgres's remit.

Certainly for current versions.

I think we need to think about a more robust path in future. But it's certainly not "stop the world" territory.
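
To spell out what "treated with crash recovery" could look like, here's a hedged sketch; the function names are illustrative, not Pg's actual API. The key point is that we must not retry the fsync() and trust a later success, because the kernel may already have dropped the dirty pages and cleared the error; instead we PANIC and let WAL replay rewrite the affected pages.

/* Sketch only: promote a reported fsync() failure to a PANIC that
 * forces crash recovery, instead of retrying and assuming a later
 * successful fsync() means the data reached stable storage.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void
panic_and_recover(const char *msg)
{
    /* In the real server this would be a PANIC ereport(), after which
     * startup performs crash recovery from the last good checkpoint. */
    fprintf(stderr, "PANIC: %s\n", msg);
    abort();
}

static void
checkpoint_fsync(int fd, const char *path)      /* hypothetical helper */
{
    if (fsync(fd) != 0)
    {
        char msg[256];

        snprintf(msg, sizeof(msg), "could not fsync file \"%s\": %s",
                 path, strerror(errno));
        /* Do NOT loop and retry: the first failure may be the only
         * notification the kernel ever gives us. */
        panic_and_recover(msg);
    }
}

int
main(void)
{
    int fd = open("datafile", O_RDWR | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }
    checkpoint_fsync(fd, "datafile");
    close(fd);
    return 0;
}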

The docs need an update to indicate that we explicitly disclaim responsibility for I/O errors on async writes, and that the kernel and I/O stack must be configured never to give up on buffered writes. If they do give up, that's not our problem anymore.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
