Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Mark Dilger
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:
Msg-id: 4E7A12F7-16DC-4B5B-8925-11ED785523CF@gmail.com
In reply to: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
> On Apr 9, 2018, at 1:43 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>
>
> On 04/09/2018 10:25 PM, Mark Dilger wrote:
>>
>>> On Apr 9, 2018, at 12:13 PM, Andres Freund <andres@anarazel.de> wrote:
>>>
>>> Hi,
>>>
>>> On 2018-04-09 15:02:11 -0400, Robert Haas wrote:
>>>> I think the simplest technological solution to this problem is to
>>>> rewrite the entire backend and all supporting processes to use
>>>> O_DIRECT everywhere.  To maintain adequate performance, we'll have to
>>>> write a complete I/O scheduling system inside PostgreSQL.  Also, since
>>>> we'll now have to make shared_buffers much larger -- since we'll no
>>>> longer be benefiting from the OS cache -- we'll need to replace the
>>>> use of malloc() with an allocator that pulls from shared_buffers.
>>>> Plus, as noted, we'll need to totally rearchitect several of our
>>>> critical frontend tools.  Let's freeze all other development for the
>>>> next year while we work on that, and put out a notice that Linux is no
>>>> longer a supported platform for any existing release.  Before we do
>>>> that, we might want to check whether fsync() actually writes the data
>>>> to disk in a usable way even with O_DIRECT.  If not, we should just
>>>> de-support Linux entirely as a hopelessly broken and unsupportable
>>>> platform.
>>>
>>> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
>>> absurd, as are some of the proposed ways this is all supposed to
>>> work. But I think the case we're discussing is much closer to a
>>> near-irresolvable corner case than anything else.
>>>
>>> We're talking about the storage layer returning an irresolvable
>>> error. You're hosed even if we report it properly.  Yes, it'd be nice if
>>> we could report it reliably.  But that doesn't change the fact that what
>>> we're doing is ensuring that data is safely fsynced unless storage
>>> fails, in which case it's not safely fsynced anyway.
>>
>> I was reading this thread up until now as meaning that the standby could
>> receive corrupt WAL data and become corrupted.  That seems a much bigger
>> problem than merely having the master become corrupted in some unrecoverable
>> way.  It is a long standing expectation that serious hardware problems on
>> the master can result in the master needing to be replaced.  But there has
>> not been an expectation that the one or more standby servers would be taken
>> down along with the master, leaving all copies of the database unusable.
>> If this bug corrupts the standby servers, too, then it is a whole different
>> class of problem than the one folks have come to expect.
>>
>> Your comment reads as if this is a problem isolated to whichever server has
>> the problem, and will not get propagated to other servers.  Am I reading
>> that right?
>>
>> Can anybody clarify this for non-core-hacker folks following along at home?
>>
>
> That's a good question. I don't see any guarantee it'd be isolated to
> the master node. Consider this example:
>
> (0) checkpoint happens on the primary
>
> (1) a page gets modified, a full-page gets written to WAL
>
> (2) the page is written out to page cache
>
> (3) writeback of that page fails (and gets discarded)
>
> (4) we attempt to modify the page again, but we read the stale version
>
> (5) we modify the stale version, writing the change to WAL
>
>
> The standby will get the full-page image, and then WAL generated from
> the stale page version. That doesn't seem like a story with a happy
> end, I guess. But I might be easily missing some protection built into
> the WAL ...
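
To make steps (3) and (4) of the scenario above concrete, here is a minimal
sketch (not a definitive test) of the Linux fsync() error-reporting behaviour
being discussed: the first fsync() after a failed writeback may report EIO,
but on the kernels in question the error state is then cleared, so a retry
can report success even though nothing was persisted. The file path is only
an assumption; actually observing the failure requires injecting write errors
underneath the file (e.g. with dm-error or dm-flakey), which is not shown.

    /*
     * Sketch only: probes the fsync() error-reporting behaviour discussed
     * above.  On a healthy disk nothing interesting happens; to observe
     * the problem the file has to sit on a device that injects write
     * errors (dm-error / dm-flakey setup not shown), and the path below
     * is purely an assumption for illustration.
     */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        char    buf[8192];
        int     fd = open("/mnt/flaky/testfile", O_RDWR | O_CREAT, 0600);

        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        memset(buf, 'x', sizeof(buf));
        if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf))
            perror("write");

        /* If writeback of the dirty page failed, EIO shows up here ... */
        if (fsync(fd) < 0)
            fprintf(stderr, "first fsync: %s\n", strerror(errno));

        /*
         * ... but the error state is cleared once reported, so a retry can
         * return success even though the data never reached stable storage,
         * and the page cache may now hold a "clean" page that no longer
         * matches what a re-read from disk would return.
         */
        if (fsync(fd) == 0)
            fprintf(stderr, "second fsync reported success\n");

        close(fd);
        return 0;
    }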

I can also imagine a master and standby that are similarly provisioned,
and thus hit an out-of-disk-space error at around the same time, resulting
in corruption on both, even if not the same corruption.  When choosing to
have one standby, or two standbys, or ten standbys, one needs to be able
to assume a certain amount of statistical independence between failures
on one server and failures on another.  If the failures are tightly
correlated, then the conclusion that the probability of all nodes failing
simultaneously is vanishingly small becomes invalid.
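
As a back-of-the-envelope illustration of that independence argument (the
numbers are assumed, not measured): if each node fails with probability 0.01
in a given interval and the failures are independent, both fail with
probability 0.0001; if the failures are perfectly correlated -- say both
nodes fill identically sized disks at the same moment -- the joint
probability climbs back to 0.01. A small sketch using the standard
correlated-Bernoulli formula:

    /*
     * Joint failure probability of two nodes under an assumed per-node
     * failure probability p and an assumed correlation rho between the
     * two failure events (illustrative numbers only).
     */
    #include <stdio.h>

    int
    main(void)
    {
        double  p = 0.01;       /* assumed single-node failure probability */
        double  rho;

        for (rho = 0.0; rho <= 1.0; rho += 0.5)
        {
            /* P(both) = p^2 + rho * p * (1 - p) for equal-p Bernoulli events */
            double  joint = p * p + rho * p * (1.0 - p);

            printf("rho = %.1f  ->  P(both nodes fail) = %.6f\n", rho, joint);
        }
        return 0;
    }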

mark
