Discussion: crash recovery vs partially written WAL
Hi,

A question from a colleague made me wonder if there are scenarios where
two subsequent crashes could lead to the wrong WAL being applied.

Imagine the following scenario:

[ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
^flush ^write ^insert

If the machine crashes at this moment, we could end up with a situation
where pages 1, 3 and 4 made it out to disk, but page 2 didn't.

That in itself is not a problem: when we perform crash recovery, we'll
detect the end of WAL. We'll zero out the invalid parts of page 2, and
log an end-of-recovery checkpoint (which has to fit either onto page 2
or 3).

What I am concerned about is what happens if, after crash recovery, we
fill up page 3 with new valid records, ending exactly at the page
boundary (i.e. .

[ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
^(flush,write) ^insert

If we crash now, we'll perform recovery from the end-of-recovery record
somewhere on page 2 or 3, and replay the rest of page 3.

That's where I see/wonder about a problem: what guarantees that we find
the contents of xlog page 4 to be invalid? The page header will have the
appropriate xlp_pageaddr/tli/info, and because the last record on page 3
ended precisely at the page boundary, there won't be an xlp_rem_len
allowing us to detect this either.

While we zero out WAL pages in memory before using them, this won't help
in this instance because
a) nothing was inserted into page 4
b) page 4 was never written out

WAL segment recycling doesn't cause similar problems because
xlp_pageaddr protects us against related issues.

Replaying the old records from page 4 is obviously wrong, since they may
rely on modifications the "old" records on page 2/3 would have performed
(but which got lost).

I don't immediately see a good fix for this. The most obvious thing
would be to explicitly zero out all WAL files beyond the end-of-recovery
point that have a "correct" xlp_pageaddr, but that may require reading a
lot of WAL due to WAL file recycling.

I hope I am missing some crosscheck making this a non-issue?

Greetings,

Andres Freund
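
The double-crash sequence described above can be walked through with a small toy model. This is only a sketch under stated assumptions: `Page`, `header_ok`, `replay` and `simulate_double_crash` are hypothetical names, the only validity check modeled is the page-address comparison that the message attributes to xlp_pageaddr, and the model makes no claim about any other checks real recovery performs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

PAGE_SIZE = 8192  # toy page size (assumption for this sketch)


@dataclass
class Page:
    pageaddr: int  # stands in for xlp_pageaddr
    records: List[str] = field(default_factory=list)


def header_ok(page: Optional[Page], pageno: int) -> bool:
    """Toy check: accept a page if its stored address matches its position."""
    return page is not None and page.pageaddr == pageno * PAGE_SIZE


def replay(disk: Dict[int, Optional[Page]], start: int) -> List[str]:
    """Apply records page by page until a page fails the header check."""
    applied: List[str] = []
    pageno = start
    while header_ok(disk.get(pageno), pageno):
        applied.extend(disk[pageno].records)
        pageno += 1
    return applied


def simulate_double_crash() -> List[str]:
    # First crash: pages 1, 3 and 4 reached disk, page 2 did not.
    disk: Dict[int, Optional[Page]] = {
        1: Page(1 * PAGE_SIZE, ["old-1"]),
        2: None,
        3: Page(3 * PAGE_SIZE, ["old-3"]),
        4: Page(4 * PAGE_SIZE, ["old-4"]),
    }
    # First recovery stops at the missing page 2, as expected.
    assert replay(disk, 1) == ["old-1"]
    # Page 2 is rewritten and receives the end-of-recovery checkpoint.
    disk[2] = Page(2 * PAGE_SIZE, ["end-of-recovery checkpoint"])
    # New records fill page 3 exactly; page 4 is never touched or rewritten.
    disk[3] = Page(3 * PAGE_SIZE, ["new-3"])
    # Second crash: recovery starts from the checkpoint on page 2.
    return replay(disk, 2)


if __name__ == "__main__":
    print(simulate_double_crash())
```

In this model the second recovery runs past the end of page 3 and also applies the stale record on page 4, because nothing in the page header distinguishes the stale page from a freshly written one.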
On Wed, Dec 30, 2020 at 12:52:46PM -0800, Andres Freund wrote:
> Hi,
>
> A question from a colleague made me wonder if there are scenarios where
> two subsequent crashes could lead to the wrong WAL being applied.
>
> Imagine the following scenario:
> [ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
> ^flush ^write ^insert
>
> If the machine crashes at this moment, we could end up with a situation
> where pages 1, 3 and 4 made it out to disk, but page 2 didn't.

I don't see any flaw in your logic. Seems we have to zero out all
future WAL files, not just to the end of the current one, or at least
clear xlp_pageaddr on each future page.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee
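
The "clear xlp_pageaddr on each future page" idea can be sketched as follows. This is a hedged illustration, not PostgreSQL code: `invalidate_future_pages` is a hypothetical helper, WAL is reduced to a mapping from page number to stored page address, and a cleared address is represented by 0.

```python
from typing import Dict, List

PAGE_SIZE = 8192  # toy page size (assumption for this sketch)


def invalidate_future_pages(pageaddrs: Dict[int, int], end_page: int) -> List[int]:
    """Clear the stored address of every page after end_page that still
    carries a "correct" address for its position.

    Returns the page numbers that were cleared. Pages whose address already
    mismatches (for example, leftovers in a recycled segment) are left alone,
    since the address check rejects them anyway.
    """
    cleared: List[int] = []
    for pageno in sorted(pageaddrs):
        if pageno > end_page and pageaddrs[pageno] == pageno * PAGE_SIZE:
            pageaddrs[pageno] = 0  # 0 models a cleared xlp_pageaddr
            cleared.append(pageno)
    return cleared


if __name__ == "__main__":
    # Pages 1-4 carry matching addresses; page 5 holds a mismatched address.
    wal = {n: n * PAGE_SIZE for n in range(1, 5)}
    wal[5] = 123 * PAGE_SIZE
    print(invalidate_future_pages(wal, end_page=2))
    print(wal)
```

Note that even this cheaper variant has to inspect the header of every future page, which is the read cost raised in the original message for recycled WAL files.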
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Wed, Dec 30, 2020 at 12:52:46PM -0800, Andres Freund wrote:
> > A question from a colleague made me wonder if there are scenarios where
> > two subsequent crashes could lead to the wrong WAL being applied.
> >
> > Imagine the following scenario:
> > [ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
> > ^flush ^write ^insert
> >
> > If the machine crashes at this moment, we could end up with a situation
> > where pages 1, 3 and 4 made it out to disk, but page 2 didn't.
>
> I don't see any flaw in your logic. Seems we have to zero out all
> future WAL files, not just to the end of the current one, or at least
> clear xlp_pageaddr on each future page.

I've wondered before if we should be doing a timeline switch at the end
of crash recovery...

Thanks,

Stephen
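
To illustrate why a timeline switch might help, here is a toy sketch. It assumes, purely for illustration, that replay rejects a page whose timeline ID is lower than that of the page before it; `replay_with_tli` is a hypothetical function and the rule is an assumption of this sketch, not a description of the actual recovery code.

```python
from typing import List, Tuple


def replay_with_tli(pages: List[Tuple[int, str]]) -> List[str]:
    """pages holds (timeline_id, payload) pairs in WAL order.

    Replay stops at the first page whose timeline ID goes backwards
    relative to the previous page (assumed rule for this sketch).
    """
    applied: List[str] = []
    last_tli = 0
    for tli, payload in pages:
        if tli < last_tli:
            break
        applied.append(payload)
        last_tli = tli
    return applied


if __name__ == "__main__":
    # No timeline switch: the stale page 4 shares timeline 1 and is replayed.
    print(replay_with_tli([(1, "eor checkpoint"), (1, "new-3"), (1, "old-4")]))
    # Switch to timeline 2 after crash recovery: stale page 4 is rejected.
    print(replay_with_tli([(2, "eor checkpoint"), (2, "new-3"), (1, "old-4")]))
```

Under that assumption, WAL written after crash recovery carries a higher timeline ID than any stale page left over from before the crash, so the stale page no longer looks like a valid continuation.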
On Thu, Dec 31, 2020 at 02:27:44PM -0500, Stephen Frost wrote:
> Greetings,
>
> * Bruce Momjian (bruce@momjian.us) wrote:
> > On Wed, Dec 30, 2020 at 12:52:46PM -0800, Andres Freund wrote:
> > > A question from a colleague made me wonder if there are scenarios where
> > > two subsequent crashes could lead to the wrong WAL being applied.
> > >
> > > Imagine the following scenario:
> > > [ xlog page 1 ][ xlog page 2 ][ xlog page 3 ][ xlog page 4 ]
> > > ^flush ^write ^insert
> > >
> > > If the machine crashes at this moment, we could end up with a situation
> > > where pages 1, 3 and 4 made it out to disk, but page 2 didn't.
> >
> > I don't see any flaw in your logic. Seems we have to zero out all
> > future WAL files, not just to the end of the current one, or at least
> > clear xlp_pageaddr on each future page.
>
> I've wondered before if we should be doing a timeline switch at the end
> of crash recovery...

For a while we had trouble tracking timeline switches, but I think we
might be fine on that now.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EnterpriseDB https://enterprisedb.com

The usefulness of a cup is in its emptiness, Bruce Lee