Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From Craig Ringer
Subject Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date
Msg-id CAMsr+YErjSKktTQgG0EF9FusDLnvsE32knFWrzFkKSxHyPy9gw@mail.gmail.com
Whole thread Raw
In response to Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Craig Ringer <craig@2ndquadrant.com>)
Responses Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Craig Ringer <craig@2ndquadrant.com>)
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On 4 April 2018 at 15:51, Craig Ringer <craig@2ndquadrant.com> wrote:
On 4 April 2018 at 14:00, Craig Ringer <craig@2ndquadrant.com> wrote:
On 4 April 2018 at 13:29, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian <bruce@momjian.us> wrote:
>> Uh, are you sure it fixes our use-case?  From the email description it
>> sounded like it only reported fsync errors for every open file
>> descriptor at the time of the failure, but the checkpoint process might
>> open the file _after_ the failure and try to fsync a write that happened
>> _before_ the failure.
>
> I'm not sure of anything.  I can see that it's designed to report
> errors since the last fsync() of the *file* (presumably via any fd),
> which sounds like the desired behaviour:
>
> [..]

Scratch that.  Whenever you open a file descriptor you can't see any
preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen
files at will from any process, and in particular to fsync() from a
different process that didn't inherit the fd but instead opened it
later.  But it looks like that might be able to eat errors that
occurred during asynchronous writeback (when there was nobody to
report them to), before you opened the file?

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.
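To make that concrete, the core of what the expanded testcase has to exercise is roughly the sketch below (the path is made up and there's no real error injection here; in the actual test something like dm-error or dm-flakey sits underneath the filesystem so the asynchronous writeback genuinely fails while nobody holds an fd):

/*
 * Minimal sketch of the pattern under discussion: the "backend" dirties the
 * page cache and closes the file without fsyncing; the "checkpointer" opens
 * a new fd later and fsyncs.  On kernels that sample the writeback error
 * state at open(), that later fsync() can report success even though the
 * earlier writeback was lost.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "dirty data";

    /* "backend": dirty the page cache, then close without fsyncing */
    int wfd = open("/mnt/test/datafile", O_WRONLY | O_CREAT, 0644);
    if (wfd < 0)
    {
        perror("open for write");
        exit(1);
    }
    if (write(wfd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
    {
        perror("write");
        exit(1);
    }
    close(wfd);

    /* ... asynchronous writeback fails here, with no fd open ... */

    /* "checkpointer": open the same file later and fsync it */
    int cfd = open("/mnt/test/datafile", O_WRONLY);
    if (cfd < 0)
    {
        perror("open for fsync");
        exit(1);
    }
    if (fsync(cfd) == 0)
        printf("fsync reported success; the earlier writeback error is invisible to this fd\n");
    else
        perror("fsync");
    close(cfd);
    return 0;
}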




Update. Now supports multiple FSes.

I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO. Didn't try zfs-on-linux or other platforms yet.

Still working on getting ENOSPC on fsync() rather than write(). Kernel code reading suggests this is possible, but all the above FSes reserve space eagerly on write() even if they do delayed allocation of the actual storage, so it doesn't seem to happen, at least in my simple single-process test.
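For reference, the single-process probe is shaped roughly like this (mount point and block size are placeholders; the only interesting part is which call ends up reporting the failure):

/*
 * Rough shape of the single-process ENOSPC probe: keep appending to a file
 * on a deliberately tiny filesystem and see whether the failure surfaces at
 * write() time or only at fsync() time.  Mount point and block size are
 * placeholders.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char block[8192];
    memset(block, 'x', sizeof(block));

    int fd = open("/mnt/tinyfs/fillfile", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    for (;;)
    {
        if (write(fd, block, sizeof(block)) < 0)
        {
            /*
             * On the filesystems tried above this is where ENOSPC shows up,
             * because space is reserved eagerly at write() time even with
             * delayed allocation.
             */
            fprintf(stderr, "write failed: %s\n", strerror(errno));
            break;
        }
    }

    if (fsync(fd) < 0)
        fprintf(stderr, "fsync failed: %s\n", strerror(errno));
    else
        fprintf(stderr, "fsync succeeded after the write-time failure\n");

    close(fd);
    return 0;
}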

I'm not overly inclined to complain about an fsync() succeeding after a write() error. That seems reasonable enough: the kernel told the app at the time of the failure. What else is it going to do? I wouldn't personally even object hugely to the current fsync() behaviour if it were, say, DOCUMENTED and conformant to the relevant standards, though the lack of any sane way to find out the affected file ranges makes it drastically harder to recover sensibly.

But what's come out since on this thread, namely that we cannot even rely on fsync() giving us an EIO *once* when it loses our data, because:

- all currently widely deployed kernels can fail to deliver the error at all due to a recently fixed limitation; and
- the kernel deliberately hides errors from us if they relate to writes that occurred before we opened the FD (?)

... that's really troubling. I thought we could at least fix this by PANICing on EIO, and was mostly worried about ENOSPC. But now it seems we can't even do that and expect reliability. So what the @#$ are we meant to do?
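For clarity, what I had in mind by "PANICing on EIO" is nothing more sophisticated than promoting fsync() failure in the checkpointer from a retryable error to PANIC, so we redo from WAL rather than trusting a later fsync(); roughly this kind of fragment (illustrative only, not a patch; fd and path stand in for the real context):

/*
 * Illustrative fragment only, not a patch: treat checkpoint fsync() failure
 * as unrecoverable instead of retrying it at the next checkpoint.
 * pg_fsync(), ereport(), errcode_for_file_access() and errmsg() are the
 * usual PostgreSQL calls; fd and path are placeholders.
 */
if (pg_fsync(fd) < 0)
    ereport(PANIC,
            (errcode_for_file_access(),
             errmsg("could not fsync file \"%s\": %m", path)));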

It's the error-reporting issues around closing and reopening files with outstanding buffered I/O that are really going to hurt us here. I'll be expanding my test case to cover that shortly.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
