Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Tomas Vondra
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:
Msg-id: 63e55a27-a6a4-e7eb-d74f-78a5d0840bd1@2ndquadrant.com
In reply to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Anthony Iliopoulos <ailiop@altatus.com>)
Responses: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Anthony Iliopoulos <ailiop@altatus.com>)
List: pgsql-hackers
On 04/09/2018 02:31 PM, Anthony Iliopoulos wrote:
> On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:
>> On 9 April 2018 at 11:50, Anthony Iliopoulos <ailiop@altatus.com> wrote:
>>
>>> What you seem to be asking for is the capability of dropping
>>> buffers over the (kernel) fence and indemnifying the application
>>> from any further responsibility, i.e. a hard assurance
>>> that either the kernel will persist the pages or it will
>>> keep them around till the application recovers them
>>> asynchronously, the filesystem is unmounted, or the system
>>> is rebooted.
>>>
>>
>> That seems like a perfectly reasonable position to take, frankly.
> 
> Indeed, as long as you are willing to ignore the consequences of
> this design decision: mainly, how you would recover memory when no
> application is interested in clearing the error. At which point
> other applications with different priorities will find this position
> rather unreasonable since there can be no way out of it for them.

Sure, but the question is whether the system can reasonably operate
after some of the writes failed and the data got lost. Because if it
can't, then recovering the memory is rather useless. It might be better
to stop the system in that case, forcing the system administrator to
resolve the issue somehow (fail-over to a replica, perform recovery from
the last checkpoint, ...).

We already have dirty_bytes and dirty_background_bytes, for example. I
don't see why there couldn't be another limit defining how much dirty
data to allow before blocking writes altogether. I'm sure it's not that
simple, but you get the general idea - do not allow using all available
memory because of writeback issues, but don't throw the data away in
case it's just a temporary issue.

> Good luck convincing any OS kernel upstream to go with this design.
> 

Well, there seem to be kernels that do exactly that already. At least
that's how I understand what this thread says about FreeBSD and
Illumos, for example. So it's not an entirely insane design, apparently.

The question is whether the current design makes it any easier for
user-space developers to build reliable systems. We have tried using it,
and unfortunately the answer seems to be "no" and "use direct I/O and
manage everything on your own!"

>> The whole _point_ of an Operating System should be that you can do exactly
>> that. As a developer I should be able to call write() and fsync() and know
>> that if both calls have succeeded then the result is on disk, no matter
>> what another application has done in the meantime. If that's a "difficult"
>> problem then that's the OS's problem, not mine. If the OS doesn't do that,
>> it's _not_doing_its_job_.
> 
> No OS kernel that I know of provides any promises for atomicity of a
> write()+fsync() sequence, unless one is using O_SYNC. It doesn't
> provide you with isolation either, as this is delegated to userspace,
> where processes that share a file should coordinate accordingly.
> 

We can (and do) take care of the atomicity and isolation. Implementation
of those parts is obviously very application-specific, and we have WAL
and locks for that purpose. I/O on the other hand seems to be a generic
service provided by the OS - at least that's how we saw it until now.

> It's not a difficult problem, but rather the kernels provide a common
> denominator of possible interfaces and designs that could accommodate
> a wider range of potential application scenarios for which the kernel
> cannot possibly anticipate requirements. There have been plenty of
> experimental works for providing a transactional (ACID) filesystem
> interface to applications. On the opposite end, there have been quite
> a few commercial databases that completely bypass the kernel storage
> stack. But I would assume it is reasonable to figure out something
> between those two extremes that can work in a "portable" fashion.
> 

Users ask us about this quite often, actually. The question is usually
about "RAW devices" and performance, but ultimately it boils down to
buffered vs. direct I/O. So far our answer has been that we rely on the
kernel to do this reliably, because kernel developers know how to do it
correctly and we simply don't have the manpower to implement it
ourselves (portably, reliably, handling different types of storage, ...).

One has to wonder how many applications actually use this correctly,
considering PostgreSQL cares about data durability/consistency so much
and yet we've been misunderstanding how it works for 20+ years.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

