Discussion: crash recovery vs partially written WAL

crash recovery vs partially written WAL

From
Andres Freund
Date:
Hi,

A question from a colleague made me wonder if there are scenarios where
two subsequent crashes could lead to wrong WAL to be applied.

Imagine the following scenario
[ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
    ^flush                                        ^write ^insert

if the machine crashes in this moment, we could end up with a situation
where pages 1, 3 and 4 made it out to disk, but page 2 didn't.

That in itself is not a problem: when we perform crash recovery, we'll
detect the end of WAL. We'll zero out the invalid parts of page 2, and
log an end-of-recovery checkpoint (which has to fit either onto page 2 or
3).

What I am concerned about is what happens if after crash recovery we
fill up page 3 with new valid records, ending exactly at the page
boundary (i.e. .

[ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
                                             ^(flush,write)
                                              ^insert

if we crash now, we'll perform recovery from the end-of-recovery record
somewhere on page 2 or 3, and replay the rest of page 3.


That's where I see/wonder about a problem: What guarantees that we find
the contents of xlog page 4 to be invalid? The page header will have the
appropriate xlp_pageaddr/tli/info, and because the last record on page 3
ended precisely at the page boundary, there'll not be an xlp_rem_len
allowing us to detect this either.

While we zero out WAL pages in-memory before using them, this won't help
in this instance because a) nothing was inserted into page 4 b) page 4
was never written out.


WAL segment recycling doesn't cause similar problems because xlp_pageaddr
protects us against related issues.
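
To make the concern concrete, here is a minimal sketch of the header checks
discussed above. It is a toy model, not PostgreSQL's code: PageHeader,
header_looks_valid and SEGMENT_BASE are hypothetical names, the info bits
are left out, and record-level validation is ignored entirely. When no
continuation record is pending, the model has no xlp_rem_len to compare
against, so that check is skipped.

```python
from dataclasses import dataclass

XLOG_BLCKSZ = 8192  # default WAL block size in PostgreSQL


@dataclass
class PageHeader:
    xlp_pageaddr: int  # WAL address the page claims to hold
    xlp_tli: int       # timeline the page was written on
    xlp_rem_len: int   # bytes of a record continued from the previous page


def header_looks_valid(hdr, expected_addr, expected_tli, expected_rem_len=None):
    """Hypothetical stand-in for the page-header checks (toy model only)."""
    if hdr.xlp_pageaddr != expected_addr or hdr.xlp_tli != expected_tli:
        return False
    # Only compare xlp_rem_len when the previous page ended mid-record.
    if expected_rem_len is not None and hdr.xlp_rem_len != expected_rem_len:
        return False
    return True


SEGMENT_BASE = 64 * XLOG_BLCKSZ  # hypothetical start address of the segment
page4_addr = SEGMENT_BASE + 3 * XLOG_BLCKSZ

# Page 4 as written before the first crash: right address, same timeline.
stale_page4 = PageHeader(page4_addr, xlp_tli=1, xlp_rem_len=0)

# A page left over from a recycled segment still carries its old address.
recycled_page = PageHeader(page4_addr - SEGMENT_BASE, xlp_tli=1, xlp_rem_len=0)

# The last record on page 3 ended at the boundary: no continuation expected.
stale_accepted = header_looks_valid(stale_page4, page4_addr, 1)
recycled_accepted = header_looks_valid(recycled_page, page4_addr, 1)
print(stale_accepted, recycled_accepted)  # True False
```

In this model the recycled page is rejected by the address comparison
alone, while the stale page 4 passes every header-level check.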


Replaying the old records from page 4 is obviously wrong, since they may
rely on modifications the "old" records on page 2/3 would have performed
(but which got lost).



I don't immediately see a good fix for this. The most obvious thing
would be to explicitly zero-out all WAL files beyond the end-of-recovery
point that have a "correct" xlp_pageaddr, but that may require reading a
lot of WAL due to WAL file recycling.
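
A rough sketch of that idea, under the assumption that each page can be
reduced to the xlp_pageaddr stored in its header (0 meaning "zeroed").
The names zero_stale_pages, expected_addr and FIRST_PAGE are hypothetical
and exist only for this illustration; it is not a proposed patch.

```python
XLOG_BLCKSZ = 8192
FIRST_PAGE = 100  # hypothetical page number of the first modelled page


def expected_addr(i):
    """Address a page at list index i should carry if it belongs there."""
    return (FIRST_PAGE + i) * XLOG_BLCKSZ


def zero_stale_pages(stored_addrs, end_idx):
    """Zero every page past end_idx whose stored address looks "correct".

    Returns (pages_read, zeroed_indexes). Every later page has to be read
    to learn whether it needs zeroing, which is the cost concern above.
    """
    pages_read = 0
    zeroed = []
    for i in range(end_idx + 1, len(stored_addrs)):
        pages_read += 1
        if stored_addrs[i] == expected_addr(i):
            stored_addrs[i] = 0
            zeroed.append(i)
    return pages_read, zeroed


# Pages 1..4 of the scenario at indexes 0..3, followed by four recycled
# pages that still carry addresses from an older segment.
wal = [expected_addr(i) for i in range(4)]
wal += [expected_addr(i) - 50 * XLOG_BLCKSZ for i in range(4, 8)]

# Assume the end-of-recovery record landed on page 3 (index 2).
pages_read, zeroed = zero_stale_pages(wal, end_idx=2)
print(pages_read, zeroed)  # 5 [3]
```

Only one page needed zeroing here, yet all five trailing pages had to be
read to find that out; with many recycled segments that read volume grows
accordingly.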


I hope I am missing some crosscheck making this a non-issue?


Greetings,

Andres Freund



Re: crash recovery vs partially written WAL

From
Bruce Momjian
Date:
On Wed, Dec 30, 2020 at 12:52:46PM -0800, Andres Freund wrote:
> Hi,
> 
> A question from a colleague made me wonder if there are scenarios where
> two subsequent crashes could lead to wrong WAL to be applied.
> 
> Imagine the following scenario
> [ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
>     ^flush                                        ^write ^insert
> 
> if the machine crashes in this moment, we could end up with a situation
> where page 1, 3, 4 made it out to disk, but page 2 wasn't.

I don't see any flaw in your logic.  Seems we have to zero out all
future WAL files, not just to the end of the current one, or at least
clear xlp_pageaddr on each future page.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: crash recovery vs partially written WAL

From
Stephen Frost
Date:
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Wed, Dec 30, 2020 at 12:52:46PM -0800, Andres Freund wrote:
> > A question from a colleague made me wonder if there are scenarios where
> > two subsequent crashes could lead to wrong WAL to be applied.
> >
> > Imagine the following scenario
> > [ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
> >     ^flush                                        ^write ^insert
> >
> > if the machine crashes in this moment, we could end up with a situation
> > where page 1, 3, 4 made it out to disk, but page 2 wasn't.
>
> I don't see any flaw in your logic.  Seems we have to zero out all
> future WAL files, not just to the end of the current one, or at least
> clear xlp_pageaddr on each future page.

I've wondered before if we should be doing a timeline switch at the end
of crash recovery...
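
One way to read that suggestion, sketched as a toy rule and not as
PostgreSQL's actual behavior: if WAL written after crash recovery carried a
new timeline ID, a stale page still stamped with the old one would stand
out. The function page_tli_acceptable and the rule "a page may not carry an
older timeline than the WAL already replayed" are assumptions made for this
illustration.

```python
def page_tli_acceptable(page_tli, latest_replayed_tli):
    """Hypothetical rule: timelines must not go backwards during replay."""
    return page_tli >= latest_replayed_tli


STALE_PAGE4_TLI = 1  # page 4 was written before the first crash

# No timeline switch: post-recovery WAL on page 3 is still on timeline 1.
without_switch = page_tli_acceptable(STALE_PAGE4_TLI, latest_replayed_tli=1)

# Timeline switch at end of recovery: page 3 was refilled on timeline 2.
with_switch = page_tli_acceptable(STALE_PAGE4_TLI, latest_replayed_tli=2)

print(without_switch, with_switch)  # True False
```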

Thanks,

Stephen

Re: crash recovery vs partially written WAL

From
Bruce Momjian
Date:
On Thu, Dec 31, 2020 at 02:27:44PM -0500, Stephen Frost wrote:
> Greetings,
> 
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Wed, Dec 30, 2020 at 12:52:46PM -0800, Andres Freund wrote:
> > > A question from a colleague made me wonder if there are scenarios where
> > > two subsequent crashes could lead to wrong WAL to be applied.
> > > 
> > > Imagine the following scenario
> > > [ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
> > >     ^flush                                        ^write ^insert
> > > 
> > > if the machine crashes in this moment, we could end up with a situation
> > > where page 1, 3, 4 made it out to disk, but page 2 wasn't.
> > 
> > I don't see any flaw in your logic.  Seems we have to zero out all
> > future WAL files, not just to the end of the current one, or at least
> > clear xlp_pageaddr on each future page.
> 
> I've wondered before if we should be doing a timeline switch at the end
> of crash recovery...

For a while we had trouble tracking timeline switches, but I think we
might be fine on that now.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee