Discussion: Fix crash during recovery when redo segment is missing
Hi,

In [1], Andres reported a bug where PostgreSQL crashes during recovery if the segment containing the redo pointer does not exist. I have attempted to address this issue and am sharing a patch for it.

The problem was that PostgreSQL did not PANIC when the redo LSN and checkpoint LSN were in separate segments and the file containing the redo LSN was missing, leading to a crash. Andres has provided a detailed analysis of the behavior across different settings and versions; please refer to [1] for more information. In short, the issue arises because PostgreSQL does not PANIC up front.

The issue is resolved by ensuring that the REDO location exists once we successfully read the checkpoint record in InitWalRecovery(). This prevents control from reaching PerformWalRecovery() unless the WAL file containing the redo record exists. A new test script, 044_redo_segment_missing.pl, has been added to validate this. To get the redo record into a different WAL file from the one holding the checkpoint record, the test waits for the checkpoint start message and then issues pg_switch_wal(), which should happen before the checkpoint completes. It then crashes the server, and on restart the server should log an appropriate error indicating that it could not find the redo location. Please let me know if there is a better way to reproduce this behavior. I have tested and verified this with the various scenarios Andres pointed out in [1].

Please note that this patch does not address error checking in StartupXLOG(), CreateCheckPoint(), etc., nor does it focus on cleaning up existing code.

Attaching the patch. Please review and share your feedback. Thanks to Andres for spotting the bug and providing the detailed report [1].

[1]: https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
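P.S. For readers skimming the thread, here is a rough sketch of the shape of the check being proposed. It is not the exact patch hunk: it mirrors the existing backup_label check in InitWalRecovery(), and the helper and variable names (XLogPrefetcherBeginRead(), ReadRecord(), LSN_FORMAT_ARGS(), xlogprefetcher, checkPoint, CheckPointLoc) follow my reading of xlogrecovery.c, so treat them as approximations.

    /*
     * Sketch only: after the checkpoint record has been read successfully
     * in InitWalRecovery() and no backup_label is in use, verify that the
     * redo record it references can be read as well.  This mirrors the
     * check already done on the backup_label path.
     */
    if (checkPoint.redo < CheckPointLoc)
    {
        /* Position the reader at the redo LSN stored in the checkpoint. */
        XLogPrefetcherBeginRead(xlogprefetcher, checkPoint.redo);

        /*
         * If the WAL segment holding the redo record has been removed,
         * stop with a clear error instead of crashing later in
         * PerformWalRecovery().  The severity level (PANIC vs FATAL) is
         * a separate question.
         */
        if (!ReadRecord(xlogprefetcher, LOG, false, checkPoint.ThisTimeLineID))
            ereport(PANIC,
                    (errmsg("could not find redo location %X/%X referenced by checkpoint record at %X/%X",
                            LSN_FORMAT_ARGS(checkPoint.redo),
                            LSN_FORMAT_ARGS(CheckPointLoc))));
    }
    /* When redo == CheckPointLoc, the checkpoint record is the redo record. */

With this in place, a missing redo segment is reported while recovery is being initialized rather than surfacing as a crash after replay has started.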
Attachments
The patch wasn’t applying cleanly on master, so I’ve rebased it and also added it to the PG19-4 CommitFest:
https://commitfest.postgresql.org/patch/6279/

Please review and share your feedback.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Fri, Feb 21, 2025 at 4:29 PM Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:
> In [1], Andres reported a bug where PostgreSQL crashes during recovery
> if the segment containing the redo pointer does not exist. I have
> attempted to address this issue and am sharing a patch for it.
Apologies, I missed attaching the patch earlier. Please find the v2 version attached.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Thu, Dec 4, 2025 at 12:01 PM Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:
> The patch wasn’t applying cleanly on master, so I’ve rebased it and
> also added it to the PG19-4 CommitFest:
> https://commitfest.postgresql.org/patch/6279/
> Please review and share your feedback.
Attachments
On Thu, 4 Dec 2025 at 11:37, Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:
> Apologies, I missed attaching the patch earlier. Please find the v2
> version attached.

Hi!

1) Please do not top-post on these lists.

2) I did not get the exact reason this thread is different from [0].

3) Your tests use a busy loop with usleep, which is nowadays considered bad practice. There are three such places, and I think the first two of them can be replaced with an injection point wait.

> +# Wait for the checkpoint to complete
> +my $checkpoint_complete = 0;
> +foreach my $i (0 .. 10 * $PostgreSQL::Test::Utils::timeout_default) {
> +	if ($node->log_contains("checkpoint complete", $logstart)) {
> +		$checkpoint_complete = 1;
> +		last;
> +	}
> +	usleep(100_000);
> +}

4) There are comments from Robert, which are not covered [1].

[0] https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de
[1] https://www.postgresql.org/message-id/CA%2BTgmob1x7HMcWAb%3D6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg%40mail.gmail.com

--
Best regards,
Kirill Reshke
On Thu, Dec 04, 2025 at 12:00:23PM +0500, Kirill Reshke wrote:
> Your tests use a busy loop with usleep, which is nowadays considered
> bad practice. There are three such places, and I think the first two
> of them can be replaced with an injection point wait.

(This fell off my radar, apologies about that.)

The problem with this test is not related to its use of sleeps, which is perfectly fine to check the start or end timings of a checkpoint depending on the contents of the logs. It has two issues.

The first issue is that the test is unnecessarily long, taking more than 30s to finish because it relies on checkpoint_timeout to kick a checkpoint. This could use a background psql object to kick a checkpoint to accelerate the whole thing. As it stands, the test sits idle for a long time, doing nothing.

Your intuition about injection points is right, though, but it points to a second problem: the test is not reliable, because we could finish the checkpoint *before* the WAL segment is switched, while we expect the WAL segment switch to happen while the checkpoint is processing. If you want to make that deterministic, having a wait in the middle of checkpointing would make the test actually test what it should. Without that, the test would randomly die on its "redo record wal file is the same" message. That's OK to prove the point of the initial patch, but it's not acceptable for a test that could be added to the tree.

An update of src/test/recovery/meson.build to add the new test is also required.

Anyway, I think that the patch is overdoing it in issuing a PANIC in this case, going backwards with respect to the other thread [0]: a FATAL would be perfectly OK, like in the backup_label path, because your manipulations with a missing WAL segment are indeed possible, as the posted test proves. There have also been many arguments in the past about performing recovery without a backup_label. And that would make the test something we could use, because it leaves no backtrace on buildfarm hosts and no disk bloat.

If we do not re-check in InitWalRecovery() that the redo record is around, the startup process happily goes down to PerformWalRecovery() in an everything-is-fine mode, initializing a bunch of states, only to then crash due to a pointer dereference while attempting to read the record from the redo LSN, which does not exist. This is not right. So, oops. I could see an argument good enough for a backpatch when it comes to crash recovery, because we actually don't know what has happened in this case. Or maybe not, based on the lack of complaints on the matter over the years.

So I'd argue for updating the patch along these lines instead:
- Switch the PANIC to a FATAL, to inform about the pilot error. This addresses the pointer dereference with WAL replay going crazy for the redo LSN.
- Add a test, with an injection point after checkpoint startup, based on a manual checkpoint done in a background psql, without a checkpoint_timeout, to make the test faster.
- We could be careful first, based on the lack of complaints, and do that extra check only on HEAD, for v19.

> 4) There are comments from Robert, which are not covered [1].
>
> [0] https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de
> [1] https://www.postgresql.org/message-id/CA%2BTgmob1x7HMcWAb%3D6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg%40mail.gmail.com

I get the argument about the spaghetti code in general, however the code in question here deals with the WAL replay initialization, for a backup_label vs a non-backup_label path. Perhaps I'm more used to this area than others, but it's not that much pasta to me.

I don't see a point in moving the memcpy() and the wasShutdown parts as proposed in the patch, by the way. The PANIC would block things in its else block. Let's limit ourselves to the extra ReadRecord() check for the redo record when we find a checkpoint record.

I'd actually wonder if we should not lower the existing PANIC to a FATAL if ReadCheckpointRecord() fails, as well. The result would be the same for operators, without the need to deal with backtrace cleanups after a crash. And we still have the error message that tells us what's going on.

Any thoughts or opinions from others?
--
Michael
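P.S. To be concrete about that last suggestion, something like the following is what I have in mind. It is a sketch under assumptions, not a committed change: ReadCheckpointRecord(), CheckPointLoc and CheckPointTLI are how I remember the no-backup_label branch of InitWalRecovery(), and the exact message text may differ from current sources.

    /*
     * Sketch only: in the no-backup_label branch of InitWalRecovery(),
     * report a missing or invalid checkpoint record with FATAL rather
     * than PANIC.  The operator sees the same message; the difference is
     * a clean shutdown of the startup process instead of an abort with a
     * core file.
     */
    record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc, CheckPointTLI);
    if (record == NULL)
        ereport(FATAL,          /* historically PANIC on this path */
                (errmsg("could not locate a valid checkpoint record at %X/%X",
                        LSN_FORMAT_ARGS(CheckPointLoc))));

That would also make the missing-redo test usable on the buildfarm, since FATAL leaves no backtrace or core file behind.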
Attachments
Hi,

Thanks for reviewing and sharing your feedback.

> 1) Please do not top-post on these lists.

Apologies for that. I will make sure to follow this going forward.

> 2) I did not get the exact reason this thread is different from [0].

The purpose of this thread was to share the actual patch addressing the crash scenario discussed in [0]. That said, I agree I should have posted the patch in [0] itself. I will make sure to use existing threads wherever possible going forward.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
Thank you for the detailed feedback.

> > Your tests use a busy loop with usleep, which is nowadays considered
> > bad practice. There are three such places, and I think the first two
> > of them can be replaced with an injection point wait.
>
> (This fell off my radar, apologies about that.)
>
> The problem with this test is not related to its use of sleeps, which
> is perfectly fine to check the start or end timings of a checkpoint
> depending on the contents of the logs. It has two issues.
>
> The first issue is that the test is unnecessarily long, taking more
> than 30s to finish because it relies on checkpoint_timeout to kick a
> checkpoint. This could use a background psql object to kick a
> checkpoint to accelerate the whole thing. As it stands, the test sits
> idle for a long time, doing nothing.
>
> Your intuition about injection points is right, though, but it points
> to a second problem: the test is not reliable, because we could finish
> the checkpoint *before* the WAL segment is switched, while we expect
> the WAL segment switch to happen while the checkpoint is processing.
> If you want to make that deterministic, having a wait in the middle of
> checkpointing would make the test actually test what it should.
> Without that, the test would randomly die on its "redo record wal file
> is the same" message. That's OK to prove the point of the initial
> patch, but it's not acceptable for a test that could be added to the
> tree.
>
> An update of src/test/recovery/meson.build to add the new test is
> also required.

I will work on improving the test accordingly and include the changes in the next version.

> Anyway, I think that the patch is overdoing it in issuing a PANIC in
> this case, going backwards with respect to the other thread [0]: a
> FATAL would be perfectly OK, like in the backup_label path, because
> your manipulations with a missing WAL segment are indeed possible, as
> the posted test proves. There have also been many arguments in the
> past about performing recovery without a backup_label. And that would
> make the test something we could use, because it leaves no backtrace
> on buildfarm hosts and no disk bloat.

The main reason I chose PANIC is that when ReadCheckpointRecord() fails, we already use PANIC for the error message 'could not locate a valid checkpoint record…' in the no-backup_label case, whereas the similar flow with backup_label uses FATAL. I am not entirely sure why this difference exists; I looked into it but couldn't find much. If we decide to change this patch to use FATAL for 'could not find redo location…', should we also consider changing the existing PANIC to FATAL for consistency?

> > 4) There are comments from Robert, which are not covered [1].
> >
> > [0] https://www.postgresql.org/message-id/20231023232145.cmqe73stvivsmlhs%40awork3.anarazel.de
> > [1] https://www.postgresql.org/message-id/CA%2BTgmob1x7HMcWAb%3D6ep2cBuWuwpT-p9E7EmQegWFu6E8nKHeg%40mail.gmail.com
>
> I get the argument about the spaghetti code in general, however the
> code in question here deals with the WAL replay initialization, for a
> backup_label vs a non-backup_label path. Perhaps I'm more used to
> this area than others, but it's not that much pasta to me.

Yes. As noted in my initial email, this patch is focused solely on fixing the crash issue. It does not address error handling in StartupXLOG(), CreateCheckPoint(), or involve any broader code cleanup.

> I don't see a point in moving the memcpy() and the wasShutdown parts
> as proposed in the patch, by the way. The PANIC would block things
> in its else block. Let's limit ourselves to the extra ReadRecord()
> check for the redo record when we find a checkpoint record.

I agree and will take care of this in the next patch.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
On Mon, Dec 08, 2025 at 12:48:52PM +0530, Nitin Jadhav wrote:
>> An update of src/test/recovery/meson.build to add the new test is
>> also required.
>
> I will work on improving the test accordingly and include the changes
> in the next version.

Cool, thanks.

> The main reason I chose PANIC is that when ReadCheckpointRecord()
> fails, we already use PANIC for the error message 'could not locate a
> valid checkpoint record…' in the no-backup_label case, whereas the
> similar flow with backup_label uses FATAL. I am not entirely sure why
> this difference exists; I looked into it but couldn't find much. If we
> decide to change this patch to use FATAL for 'could not find redo
> location…', should we also consider changing the existing PANIC to
> FATAL for consistency?

Using PANIC here is an inherited historical artifact, introduced around 4d14fe0048cf with the introduction of WAL. There was nothing like archiving or even base backups back then. Switching the existing surrounding one to also use a FATAL is something that seems worth considering to me for the checkpoint record, at least based on the pattern that there could be an operator error even if there is no backup_label file (for example, the case of FS-level snapshots with one partition used for the data folder, no?). This offers bonus points in the shape of more tests like the one you have sent upthread.

It's not something that I would backpatch, as it is a behavior change, but I'm open to seeing it as an improvement in usability for future releases: PANIC is for cases that should never happen for internal states, due to an internal logic error or an OS going crazy. Here we have a should-not-happen case triggered by a user, and a FATAL still provides the same information about what's wrong. Let's make such changes separate patches, of course, depending on what we find on the way.

> Yes. As noted in my initial email, this patch is focused solely on
> fixing the crash issue. It does not address error handling in
> StartupXLOG(), CreateCheckPoint(), or involve any broader code
> cleanup.

That sounds fine to me.
--
Michael
Attachments
> Using PANIC here is an inherited historical artifact, introduced
> around 4d14fe0048cf with the introduction of WAL. There was nothing
> like archiving or even base backups back then. Switching the existing
> surrounding one to also use a FATAL is something that seems worth
> considering to me for the checkpoint record, at least based on the
> pattern that there could be an operator error even if there is no
> backup_label file (for example, the case of FS-level snapshots with
> one partition used for the data folder, no?).

Thanks for explaining the historical context. I agree that switching the existing PANIC to FATAL for the checkpoint record case makes sense. I will include this change in the next patch if there are no objections from others.

> This offers bonus points in the shape of more tests like the one you
> have sent upthread. It's not something that I would backpatch, as it
> is a behavior change, but I'm open to seeing it as an improvement in
> usability for future releases: PANIC is for cases that should never
> happen for internal states, due to an internal logic error or an OS
> going crazy. Here we have a should-not-happen case triggered by a
> user, and a FATAL still provides the same information about what's
> wrong. Let's make such changes separate patches, of course, depending
> on what we find on the way.

Thanks for the suggestion. I will keep that in mind and look to add more such tests in the future.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft
I have incorporated all the feedback discussed above and attached the v3 patch. Please take a look and let me know if you have any additional feedback.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft