Discussion: Fail hard if xlogreader.c fails on out-of-memory


Fail hard if xlogreader.c fails on out-of-memory

From: Michael Paquier
Date:
Hi all,
(Thomas in CC.)

Now that becfbdd6c1c9 has improved the situation to detect the
difference between out-of-memory and invalid data in WAL, I guess
that it is time to tackle the problem of what we should do when
reading a WAL record fails on out-of-memory.

To summarize, the WAL reader APIs currently fail the same way whether
we detect an incorrect WAL record or a memory allocation fails: an
error is generated and returned back to the caller to consume.  For
WAL replay, not being able to tell the difference between an OOM and
the end-of-WAL is a problem in some cases.  For example, in crash
recovery, a failed internal allocation will be treated as the
end-of-WAL, causing recovery to stop prematurely.  In the worst cases,
this silently corrupts clusters because not all the records generated
in the local pg_wal/ have been replayed.  Oops.

When in standby mode, things are a bit better, because we'd just loop
and wait for the next record.  But, even in this case, if the startup
process does crash recovery while standby mode is set up, we may end
up attempting recovery from a different source than the local pg_wal/.
Not strictly critical, but less optimal in some cases as we could
switch to archive recovery earlier than necessary.

In a different thread, I have proposed to extend the WAL reader
facility so that an error code is returned, distinguishing an OOM
from the end of WAL caused by an incorrect record:
https://www.postgresql.org/message-id/ZRJ-p1dLUY0uoChc%40paquier.xyz

However, this requires some ABI changes, so that's not backpatchable.

This leaves open the question of what we can do for the existing
back-branches, and one option is to do the simplest thing I can think of: if an
allocation fails, just fail *hard*.  The allocations of the WAL reader
rely on palloc_extended(), so I'd like to suggest that we switch to
palloc() instead.  If we do so, an ERROR is promoted to a FATAL during
WAL replay, which makes sure that we will never stop recovery earlier
than we should, FATAL-ing before things go wrong.

Note that the WAL prefetching already relies on a palloc() on HEAD in
XLogReadRecordAlloc(), which would fail hard the same way on OOM.

So, attached is a proposed patch to do that, down to 12.

Thoughts and/or comments are welcome.
--
Michael

Attachments

Re: Fail hard if xlogreader.c fails on out-of-memory

From: Thomas Munro
Date:
On Tue, Sep 26, 2023 at 8:38 PM Michael Paquier <michael@paquier.xyz> wrote:
> Thoughts and/or comments are welcome.

I don't have an opinion yet on your other thread about making this
stuff configurable for replicas, but for the simple crash recovery
case shown here, hard failure makes sense to me.

Here are some interesting points in the history of this topic:

1999 30659d43: xl_len is 16 bit, fixed size buffer in later commits
2001 7d4d5c00: WAL files recycled, xlp_pageaddr added
2004 0ffe11ab: xl_len is 32 bit, dynamic buffer, malloc() failure ends redo
2005 21fda22e: xl_tot_len and xl_len co-exist
2014 2c03216d: xl_tot_len fully replaces xl_len
2018 70b4f82a: xl_tot_len > 1GB ends redo
2023 8fcb32db: don't let xl_tot_len > 1GB be logged!
2023 bae868ca: check next xlp_pageaddr, xlp_rem_len before allocating

Recycled pages can't fool us into making a huge allocation any more.
If xl_tot_len implies more than one page but the next page's
xlp_pageaddr is too low, then either the xl_tot_len you read was
recycled garbage bits, or it was legitimate but the overwrite of the
following page didn't make it to disk; either way, we don't have a
record, so we have an end-of-wal condition.  The xlp_rem_len check
defends against the second page making it to disk while the first one
still contains recycled garbage where the xl_tot_len should be*.

What Michael wants to do now is remove the 2004-era assumption that
malloc failure implies bogus data.  It must be pretty unlikely in a 64
bit world with overcommitted virtual memory, but a legitimate
xl_tot_len can falsely end recovery and lose data, as reported from a
production case analysed by his colleagues.  In other words, we can
actually distinguish between lack of resources and recycled bogus
data, so why treat them the same?

For comparison, if you run out of disk space during recovery we don't
say "oh well, that's enough redoing for today, the computer is full,
let's forget about the rest of the WAL and start accepting new
transactions!".  The machine running recovery has certain resource
requirements relative to the machine that generated the WAL, and if
they're not met, it just can't do it.  It's the same if various
other allocations fail.  The new situation is that we are now
verifying that xl_tot_len was actually logged by PostgreSQL, so if we
can't allocate space for it, we can't replay the WAL.

*A more detailed analysis would talk about sectors (page header is
atomic), and consider whether we're only trying to defend ourselves
against recycled pages written by PostgreSQL (yes), arbitrary random
data (no, but it's probably still pretty good) or someone trying to
trick us (no, and we don't stand a chance).



Re: Fail hard if xlogreader.c fails on out-of-memory

From: Michael Paquier
Date:
On Wed, Sep 27, 2023 at 11:06:37AM +1300, Thomas Munro wrote:
> I don't have an opinion yet on your other thread about making this
> stuff configurable for replicas, but for the simple crash recovery
> case shown here, hard failure makes sense to me.

Also, if we conclude that we're OK with just failing hard all the time
for crash recovery and archive recovery on OOM, the other patch is not
really required.  That would be disruptive for standbys in some cases,
still perhaps OK in the long term.  I am wondering if people have
actually lost data because of this problem on production systems.  It
would not be possible to know that it happened until you see a page on
disk that has a somewhat valid LSN, still an LSN older than the
position currently being inserted, and that could show up in various
forms.  Even that could get hidden quickly if WAL is written at a fast
pace after a crash recovery.  A standby promotion at an older LSN
would be unlikely, as monitoring solutions discard standbys lagging
behind by N bytes.

> *A more detailed analysis would talk about sectors (page header is
> atomic), and consider whether we're only trying to defend ourselves
> against recycled pages written by PostgreSQL (yes), arbitrary random
> data (no, but it's probably still pretty good) or someone trying to
> trick us (no, and we don't stand a chance).

WAL would not be the only part of the system that would get borked if
arbitrary bytes can be inserted into what's read from disk, random or
not.
--
Michael

Attachments

Re: Fail hard if xlogreader.c fails on out-of-memory

From: Noah Misch
Date:
On Wed, Sep 27, 2023 at 11:06:37AM +1300, Thomas Munro wrote:
> On Tue, Sep 26, 2023 at 8:38 PM Michael Paquier <michael@paquier.xyz> wrote:
> > Thoughts and/or comments are welcome.
> 
> I don't have an opinion yet on your other thread about making this
> stuff configurable for replicas, but for the simple crash recovery
> case shown here, hard failure makes sense to me.

> Recycled pages can't fool us into making a huge allocation any more.
> If xl_tot_len implies more than one page but the next page's
> xlp_pageaddr is too low, then either the xl_tot_len you read was
> recycled garbage bits, or it was legitimate but the overwrite of the
> following page didn't make it to disk; either way, we don't have a
> record, so we have an end-of-wal condition.  The xlp_rem_len check
> defends against the second page making it to disk while the first one
> still contains recycled garbage where the xl_tot_len should be*.
> 
> What Michael wants to do now is remove the 2004-era assumption that
> malloc failure implies bogus data.  It must be pretty unlikely in a 64
> bit world with overcommitted virtual memory, but a legitimate
> xl_tot_len can falsely end recovery and lose data, as reported from a
> production case analysed by his colleagues.  In other words, we can
> actually distinguish between lack of resources and recycled bogus
> data, so why treat them the same?

Indeed.  Hard failure is fine, and ENOMEM=end-of-WAL definitely isn't.

> *A more detailed analysis would talk about sectors (page header is
> atomic)

I think the page header is atomic on POSIX-compliant filesystems but not
atomic on ext4.  That doesn't change the conclusion on $SUBJECT.



Re: Fail hard if xlogreader.c fails on out-of-memory

From: Michael Paquier
Date:
On Tue, Sep 26, 2023 at 06:28:30PM -0700, Noah Misch wrote:
> On Wed, Sep 27, 2023 at 11:06:37AM +1300, Thomas Munro wrote:
>> What Michael wants to do now is remove the 2004-era assumption that
>> malloc failure implies bogus data.  It must be pretty unlikely in a 64
>> bit world with overcommitted virtual memory, but a legitimate
>> xl_tot_len can falsely end recovery and lose data, as reported from a
>> production case analysed by his colleagues.  In other words, we can
>> actually distinguish between lack of resources and recycled bogus
>> data, so why treat them the same?
>
> Indeed.  Hard failure is fine, and ENOMEM=end-of-WAL definitely isn't.

Are there any more comments and/or suggestions here?

If none, I propose to apply the patch to switch to palloc() instead of
palloc_extended(NO_OOM) in this code around the beginning of next
week, down to 12.
--
Michael

Attachments

Re: Fail hard if xlogreader.c fails on out-of-memory

From: Michael Paquier
Date:
On Thu, Sep 28, 2023 at 09:36:37AM +0900, Michael Paquier wrote:
> If none, I propose to apply the patch to switch to palloc() instead of
> palloc_extended(NO_OOM) in this code around the beginning of next
> week, down to 12.

Done down to 12 as of 6b18b3fe2c2f, then.
--
Michael

Attachments