Обсуждение: Make mesage at end-of-recovery less scary.
Hello, this is a followup thread of [1]. # I didn't noticed that the thread didn't cover -hackers.. When recovery of any type ends, we see several kinds of error messages that says "WAL is broken". > LOG: invalid record length at 0/7CB6BC8: wanted 24, got 0 > LOG: redo is not required > LOG: database system is ready to accept connections This patch reduces the scariness of such messages as the follows. > LOG: rached end of WAL at 0/1551048 on timeline 1 in pg_wal during crash recovery > DETAIL: invalid record length at 0/1551048: wanted 24, got 0 > LOG: redo is not required > LOG: database system is ready to accept connections [1] https://www.postgresql.org/message-id/20200117.172655.1384889922565817808.horikyota.ntt%40gmail.com I'll register this to the coming CF. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From f3692cb484b7f1ebc351ba8a522039c0b91bcfdb Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Make this message less scary as "reached end of WAL". --- src/backend/access/transam/xlog.c | 45 ++++++++++++++++++++++++++----- 1 file changed, 38 insertions(+), 7 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index d19408b3be..452c376f62 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4288,6 +4288,10 @@ ReadRecord(XLogReaderState *xlogreader, int emode, EndRecPtr = xlogreader->EndRecPtr; if (record == NULL) { + int actual_emode = + emode_for_corrupt_record(emode, + ReadRecPtr ? ReadRecPtr : EndRecPtr); + if (readFile >= 0) { close(readFile); @@ -4295,14 +4299,41 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * randAccess here means we are reading successive records during + * recovery. If we get here during recovery, we can assume that we + * reached the end of WAL. Otherwise something's really wrong and + * we report just only the errormsg if any. If we don't receive + * errormsg here, we already logged something. We don't emit + * "reached end of WAL" in muted messages. + * + * Note: errormsg is alreay translated. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, EndRecPtr), - (errmsg_internal("%s", errormsg) /* already translated */ )); + if (!private->randAccess && actual_emode == emode) + { + if (StandbyMode) + ereport(actual_emode, + (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during streaming replication", + (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr, + ThisTimeLineID, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + else if (InArchiveRecovery) + ereport(actual_emode, + (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during archive recovery", + (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr, + ThisTimeLineID, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + else + ereport(actual_emode, + (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery", + (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr, + ThisTimeLineID, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } + else if (errormsg) + ereport(actual_emode, (errmsg_internal("%s", errormsg))); } /* -- 2.18.2
On Fri, Feb 28, 2020 at 04:01:00PM +0900, Kyotaro Horiguchi wrote: > Hello, this is a followup thread of [1]. > > # I didn't noticed that the thread didn't cover -hackers.. > > When recovery of any type ends, we see several kinds of error messages > that says "WAL is broken". Have you considered an error context here? Your patch leads to a bit of duplication with the message a bit down of what you are changing where the end of local pg_wal is reached. > + * reached the end of WAL. Otherwise something's really wrong and > + * we report just only the errormsg if any. If we don't receive This sentence sounds strange to me. Or you meant "Something is wrong, so use errormsg as report if it is set"? > + * Note: errormsg is alreay translated. Typo here. > + if (StandbyMode) > + ereport(actual_emode, > + (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during streaming replication", StandbyMode happens also with only WAL archiving, depending on if primary_conninfo is set or not. > + (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery", FWIW, you are introducing three times the same typo, in the same word, in three different messages. -- Michael
Вложения
Thank you for the comments. At Fri, 28 Feb 2020 16:33:18 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Fri, Feb 28, 2020 at 04:01:00PM +0900, Kyotaro Horiguchi wrote: > > Hello, this is a followup thread of [1]. > > > > # I didn't noticed that the thread didn't cover -hackers.. > > > > When recovery of any type ends, we see several kinds of error messages > > that says "WAL is broken". > > Have you considered an error context here? Your patch leads to a bit > of duplication with the message a bit down of what you are changing > where the end of local pg_wal is reached. It is a DEBUG message and it is for the time moving from crash recovery to archive recovery. I could remove that but decided to leave it for tracability. > > + * reached the end of WAL. Otherwise something's really wrong and > > + * we report just only the errormsg if any. If we don't receive > > This sentence sounds strange to me. Or you meant "Something is wrong, > so use errormsg as report if it is set"? The whole comment there follows. | recovery. If we get here during recovery, we can assume that we | reached the end of WAL. Otherwise something's really wrong and | we report just only the errormsg if any. If we don't receive | errormsg here, we already logged something. We don't emit | "reached end of WAL" in muted messages. "Othhersise" means "other than the case of recovery". "Just only the errmsg" means "show the message not as a part the message "reached end of WAL". > > + * Note: errormsg is alreay translated. > > Typo here. Thanks. Will fix along with "rached". > > + if (StandbyMode) > > + ereport(actual_emode, > > + (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during streaming replication", > > StandbyMode happens also with only WAL archiving, depending on if > primary_conninfo is set or not. Right. I'll fix it. Maybe to "during standby mode". > > + (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery", > > FWIW, you are introducing three times the same typo, in the same > word, in three different messages. They're copy-pasto. I refrained from constructing an error message from multiple nonindipendent parts. Are you suggesting to do so? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hello. I changed the condition from randAccess to fetching_ckpt considering the discussion in another thread [1]. Then I moved the block that shows the new messages to more appropriate place. At Fri, 28 Feb 2020 17:28:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > > > > Have you considered an error context here? Your patch leads to a bit > > of duplication with the message a bit down of what you are changing > > where the end of local pg_wal is reached. > > It is a DEBUG message and it is for the time moving from crash > recovery to archive recovery. I could remove that but decided to leave > it for tracability. I modified the message so that it has the same look to the new messages, but I left it being DEBUG1, since it is just a intermediate state. We should finally see one of the new three messages. After the messages changed, another message from wal sender came to look redundant. | [20866] LOG: replication terminated by primary server | [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8. | [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress | [20851] LOG: reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode | [20851] DETAIL: invalid record length at 0/30001C8: wanted 24, got 0 I changed the above to the below, which looks more adequate. | [24271] LOG: replication terminated by primary server on timeline 1 at 0/3000240. | [24271] FATAL: could not send end-of-streaming message to primary: no COPY in progress | [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode | [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0 > > > + * reached the end of WAL. Otherwise something's really wrong and > > > + * we report just only the errormsg if any. If we don't receive > > > > This sentence sounds strange to me. Or you meant "Something is wrong, > > so use errormsg as report if it is set"? The message no longer exists. > > > + (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery", > > > > FWIW, you are introducing three times the same typo, in the same > > word, in three different messages. > > They're copy-pasto. I refrained from constructing an error message > from multiple nonindipendent parts. Are you suggesting to do so? The tree times repetition of almost same phrases is very unreadable. I rewrote it in more simple shape. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 17ee82e5d44dd5a932ed69b8a1ea91a23d170952 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v2] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Make this message less scary as "reached end of WAL". --- src/backend/access/transam/xlog.c | 72 ++++++++++++++++++++------- src/backend/replication/walreceiver.c | 3 +- 2 files changed, 55 insertions(+), 20 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index d19408b3be..849cf6fe6b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4282,12 +4282,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); ReadRecPtr = xlogreader->ReadRecPtr; EndRecPtr = xlogreader->EndRecPtr; if (record == NULL) { + ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr; + if (readFile >= 0) { close(readFile); @@ -4295,14 +4298,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we are fetching checkpoint, we emit the error message right + * now. Otherwise the error is regarded as "end of WAL" and the + * message if any is shown as a part of the end-of-WAL message + * below. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, EndRecPtr), + if (fetching_ckpt && errormsg) + { + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); + } } /* @@ -4332,11 +4337,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else - { - /* No valid record available from this source */ - lastSourceFailed = true; + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) + { /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4349,11 +4355,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, enteringarchive recovery", + (uint32) (ErrRecPtr >> 32), + (uint32) ErrRecPtr, + ThisTimeLineID, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4391,12 +4404,35 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * We reached the end of WAL, show the messages just once at the + * same LSN. + */ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg (fmt, (uint32) (EndRecPtr >> 32), + (uint32) EndRecPtr, + ThisTimeLineID, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 2ab15c3cbb..682dbb4e1f 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -478,8 +478,7 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", startpointTLI, (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write))); endofwal = true; -- 2.18.2
On 2020-03-05 08:06, Kyotaro Horiguchi wrote: > | [20866] LOG: replication terminated by primary server > | [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8. > | [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress > | [20851] LOG: reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode > | [20851] DETAIL: invalid record length at 0/30001C8: wanted 24, got 0 > > I changed the above to the below, which looks more adequate. > > | [24271] LOG: replication terminated by primary server on timeline 1 at 0/3000240. > | [24271] FATAL: could not send end-of-streaming message to primary: no COPY in progress > | [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode > | [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0 Is this the before and after? That doesn't seem like a substantial improvement to me. You still get the "scary" message at the end. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Mar 23, 2020 at 2:37 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 2020-03-05 08:06, Kyotaro Horiguchi wrote:
> | [20866] LOG: replication terminated by primary server
> | [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8.
> | [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress
> | [20851] LOG: reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode
> | [20851] DETAIL: invalid record length at 0/30001C8: wanted 24, got 0
>
> I changed the above to the below, which looks more adequate.
>
> | [24271] LOG: replication terminated by primary server on timeline 1 at 0/3000240.
> | [24271] FATAL: could not send end-of-streaming message to primary: no COPY in progress
> | [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
> | [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0
Is this the before and after? That doesn't seem like a substantial
improvement to me. You still get the "scary" message at the end.
Plus, I am hoping message will improve for pg_waldump as well?
Since it reads confusing and every-time have to explain new developer it's expected behavior which is annoying.
pg_waldump: fatal: error in WAL record at 0/1553F70: invalid record length at 0/1553FA8: wanted 24, got 0
Hi, On 2020-03-23 10:37:16 +0100, Peter Eisentraut wrote: > On 2020-03-05 08:06, Kyotaro Horiguchi wrote: > > | [20866] LOG: replication terminated by primary server > > | [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8. > > | [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress IMO it's a bug that we see this FATAL. I seem to recall that we didn't use to get that? > > | [20851] LOG: reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode > > | [20851] DETAIL: invalid record length at 0/30001C8: wanted 24, got 0 > > > > I changed the above to the below, which looks more adequate. > > > > | [24271] LOG: replication terminated by primary server on timeline 1 at 0/3000240. > > | [24271] FATAL: could not send end-of-streaming message to primary: no COPY in progress > > | [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode > > | [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0 > > Is this the before and after? That doesn't seem like a substantial > improvement to me. You still get the "scary" message at the end. It seems like a minor improvement - folding the DETAIL into the message makes sense to me here. But it indeed doesn't really address the main issue. I think we don't want to elide the information about how the end of the WAL was detected - there are some issues where I found that quite helpful. But we could reformulate it to be clearer that it's informative output, not a bug. E.g. something roughly like LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode DETAIL: End detected due to invalid record length at 0/3000240: expected 24, got 0 (I first elided the position in the DETAIL, but it could differ from the one in LOG) I don't find that very satisfying, but I can't come up with something that provides the current information, while being less scary than my suggestion? Greetings, Andres Freund
Hi, On 2020-03-23 10:43:09 -0700, Ashwin Agrawal wrote: > Plus, I am hoping message will improve for pg_waldump as well? > Since it reads confusing and every-time have to explain new developer it's > expected behavior which is annoying. > > pg_waldump: fatal: error in WAL record at 0/1553F70: invalid record length > at 0/1553FA8: wanted 24, got 0 What would you like to see here? There's inherently a lot less information about the context in waldump. We can't know whether it's to be expected that the WAL ends at that point, or whether there was corruption. Greetings, Andres Freund
At Mon, 23 Mar 2020 12:47:36 -0700, Andres Freund <andres@anarazel.de> wrote in > Hi, > > On 2020-03-23 10:37:16 +0100, Peter Eisentraut wrote: > > On 2020-03-05 08:06, Kyotaro Horiguchi wrote: > > > | [20866] LOG: replication terminated by primary server > > > | [20866] DETAIL: End of WAL reached on timeline 1 at 0/30001C8. > > > | [20866] FATAL: could not send end-of-streaming message to primary: no COPY in progress > > IMO it's a bug that we see this FATAL. I seem to recall that we didn't > use to get that? I thought that it is a convention that A auxiliary process uses ERROR (which is turned into FATAL in ereport) to exit, which I didn't like so much, but it was out of scope of this patch. As for the message bove, the FATAL is preceded by the "LOG: replication terminated by" message, that means walreceiver tries to send new data after disconnection just to fail, which is unreasonable. I think we should exit immediately after detecting disconnection. The FATAL is gone by the attached. > > > | [24267] LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode > > > | [24267] DETAIL: invalid record length at 0/3000240: wanted 24, got 0 > > > > Is this the before and after? That doesn't seem like a substantial > > improvement to me. You still get the "scary" message at the end. > > It seems like a minor improvement - folding the DETAIL into the message > makes sense to me here. But it indeed doesn't really address the main > issue. > > I think we don't want to elide the information about how the end of the > WAL was detected - there are some issues where I found that quite > helpful. But we could reformulate it to be clearer that it's informative > output, not a bug. E.g. something roughly like > > LOG: reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode > DETAIL: End detected due to invalid record length at 0/3000240: expected 24, got 0 > (I first elided the position in the DETAIL, but it could differ from the > one in LOG) > > I don't find that very satisfying, but I can't come up with something > that provides the current information, while being less scary than my > suggestion? The 0-length record is not an "invalid" state during recovery, so we can add the message for the state as "record length is 0 at %X/%X". I think if other states found there, it implies something wrong. LSN is redundantly shown but I'm not sure if it is better to remove it from either of the two lines. | LOG: reached end of WAL at 0/3000850 on timeline 1 in pg_wal during crash recovery | DETAIL: record length is 0 at 0/3000850 regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 47511afed5f8acf92abaf1cd6fcfecc1faea9c87 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v3] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Make this message less scary as "reached end of WAL". --- src/backend/access/transam/xlog.c | 69 ++++++++++++++++++------- src/backend/access/transam/xlogreader.c | 9 ++++ src/backend/replication/walreceiver.c | 11 ++-- 3 files changed, 67 insertions(+), 22 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 793c076da6..6c2924dfb7 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4283,12 +4283,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); ReadRecPtr = xlogreader->ReadRecPtr; EndRecPtr = xlogreader->EndRecPtr; if (record == NULL) { + ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr; + if (readFile >= 0) { close(readFile); @@ -4296,13 +4299,13 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we are fetching checkpoint, we emit the error message right + * now. Otherwise the error is regarded as "end of WAL" and the + * message if any is shown as a part of the end-of-WAL message + * below. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, EndRecPtr), + if (fetching_ckpt && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -4333,11 +4336,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else - { - /* No valid record available from this source */ - lastSourceFailed = true; + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) + { /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4350,11 +4354,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, enteringarchive recovery", + (uint32) (ErrRecPtr >> 32), + (uint32) ErrRecPtr, + ThisTimeLineID, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4392,12 +4403,34 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * We reached the end of WAL, show the messages just once at the + * same LSN. + */ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg(fmt, (uint32) (EndRecPtr >> 32), + (uint32) EndRecPtr, ThisTimeLineID, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 32f02256ed..9ea1305364 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -682,6 +682,15 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + + if (record->xl_tot_len == 0) + { + /* This is strictly not an invalid state, so phrase it as so. */ + report_invalid_record(state, + "record length is 0 at %X/%X", + (uint32) (RecPtr >> 32), (uint32) RecPtr); + return false; + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 25e0333c9e..da978d4047 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -479,12 +479,15 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", startpointTLI, (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write))); - endofwal = true; - break; + + /* + * we have no longer anything to do on the broken + * connection other than exiting. + */ + proc_exit(1); } len = walrcv_receive(wrconn, &buf, &wait_fd); } -- 2.18.2
On 2020-03-24 02:52, Kyotaro Horiguchi wrote: >> I don't find that very satisfying, but I can't come up with something >> that provides the current information, while being less scary than my >> suggestion? > The 0-length record is not an "invalid" state during recovery, so we > can add the message for the state as "record length is 0 at %X/%X". I > think if other states found there, it implies something wrong. > > LSN is redundantly shown but I'm not sure if it is better to remove it > from either of the two lines. > > | LOG: reached end of WAL at 0/3000850 on timeline 1 in pg_wal during crash recovery > | DETAIL: record length is 0 at 0/3000850 I'm not up to date on all these details, but my high-level idea would be some kind of hint associated with the existing error messages, like: HINT: This is to be expected if this is the end of the WAL. Otherwise, it could indicate corruption. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Mar 25, 2020 at 8:53 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > HINT: This is to be expected if this is the end of the WAL. Otherwise, > it could indicate corruption. First, I agree that this general issue is a problem, because it's come up for me in quite a number of customer situations. Either people get scared when they shouldn't, because the message is innocuous, or they don't get scared about other things that actually are scary, because if some scary-looking messages are actually innocuous, it can lead people to believe that the same is true in other cases. Second, I don't really like the particular formulation you have above, because the user still doesn't know whether or not to be scared. Can we figure that out? I think if we're in crash recovery, I think that we should not be scared, because we have no alternative to assuming that we've reached the end of WAL, so all crash recoveries will end like this. If we're in archive recovery, we should definitely be scared if we haven't yet reached the minimum recovery point, because more WAL than that should certainly exist. After that, it depends on how we got the WAL. If it's being streamed, the question is whether we've reached the end of what got streamed. If it's being copied from the archive, we ought to have the whole segment, but maybe not more. Can we get the right context to the point where the error is being reported to know whether we hit the error at the end of the WAL that was streamed? If not, can we somehow rejigger things so that we only make it sound scary if we keep getting stuck at the same point when we woud've expected to make progress meanwhile? I'm just spitballing here, but it would be really good if there's a way to know definitely whether or not you should be scared. Corrupted WAL segments are definitely a thing that happens, but retries are a lot more common. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Mar 26, 2020 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Mar 25, 2020 at 8:53 AM Peter Eisentraut > <peter.eisentraut@2ndquadrant.com> wrote: > > HINT: This is to be expected if this is the end of the WAL. Otherwise, > > it could indicate corruption. > > First, I agree that this general issue is a problem, because it's come > up for me in quite a number of customer situations. Either people get > scared when they shouldn't, because the message is innocuous, or they > don't get scared about other things that actually are scary, because > if some scary-looking messages are actually innocuous, it can lead > people to believe that the same is true in other cases. > > Second, I don't really like the particular formulation you have above, > because the user still doesn't know whether or not to be scared. Can > we figure that out? I think if we're in crash recovery, I think that > we should not be scared, because we have no alternative to assuming > that we've reached the end of WAL, so all crash recoveries will end > like this. If we're in archive recovery, we should definitely be > scared if we haven't yet reached the minimum recovery point, because > more WAL than that should certainly exist. After that, it depends on > how we got the WAL. If it's being streamed, the question is whether > we've reached the end of what got streamed. If it's being copied from > the archive, we ought to have the whole segment, but maybe not more. > Can we get the right context to the point where the error is being > reported to know whether we hit the error at the end of the WAL that > was streamed? If not, can we somehow rejigger things so that we only > make it sound scary if we keep getting stuck at the same point when we > woud've expected to make progress meanwhile? > > I'm just spitballing here, but it would be really good if there's a > way to know definitely whether or not you should be scared. Corrupted > WAL segments are definitely a thing that happens, but retries are a > lot more common. First, I agree that getting enough context to say precisely is by far the ideal. That being said, as an end user who's found this surprising -- and momentarily scary every time I initially scan it even though I *know intellectually it's not* -- I would find Peter's suggestion a significant improvement over what we have now. I'm fairly certainly my co-workers on our database team would also. Knowing that something is at least not always scary is good. Though I'll grant that this does have the negative in reverse: if it actually is a scary situation...this mutes your concern level. On the other hand, monitoring would tell us if there's a real problem (namely replication lag), so I think the trade-off is clearly worth it. How about this minor tweak: HINT: This is expected if this is the end of currently available WAL. Otherwise, it could indicate corruption. Thanks, James
On 3/4/21, 10:50 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote: > As the result, the following messages are emitted with the attached. I'd like to voice my support for this effort, and I intend to help review the patch. It looks like the latest patch no longer applies, so I've marked the commitfest entry [0] as waiting-on-author. Nathan [0] https://commitfest.postgresql.org/35/2490/
At Fri, 22 Oct 2021 17:54:40 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in > On 3/4/21, 10:50 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote: > > As the result, the following messages are emitted with the attached. > > I'd like to voice my support for this effort, and I intend to help > review the patch. It looks like the latest patch no longer applies, > so I've marked the commitfest entry [0] as waiting-on-author. > > Nathan > > [0] https://commitfest.postgresql.org/35/2490/ Sorry for being late to reply. I rebased this to the current master. - rebased - use LSN_FORMAT_ARGS instead of bare shift and mask. - v4 immediately exited walreceiver on disconnection. Maybe I wanted not to see a FATAL message on standby after primary dies. However that would be another issue and that change was plain wrong.. v5 just removes the "end-of-WAL" part from the message, which duplicate to what startup emits. - add a new error message "missing contrecord at %X/%X". Maybe this should be regarded as a leftover of the contrecord patch. In the attached patch the "%X/%X" is the LSN of the current record. The log messages look like this (026_overwrite_contrecord). LOG: redo starts at 0/1486CB8 WARNING: missing contrecord at 0/1FFC2E0 LOG: consistent recovery state reached at 0/1FFC2E0 LOG: started streaming WAL from primary at 0/2000000 on timeline 1 LOG: successfully skipped missing contrecord at 0/1FFC2E0, overwritten at 2021-11-08 14:50:11.969952+09 CONTEXT: WAL redo at 0/2000028 for XLOG/OVERWRITE_CONTRECORD: lsn 0/1FFC2E0; time 2021-11-08 14:50:11.969952+09 While checking the behavior for the case of missing-contrecord, I noticed that emode_for_corrupt_record() doesn't work as expected since readSource is reset to XLOG_FROM_ANY after a read failure. We could remember the last failed source but pg_wal should have been visited if page read error happened so I changed the function so that it treats XLOG_FROM_ANY the same way with XLOG_FROM_PG_WAL. (Otherwise we see "LOG: reached end-of-WAL at .." message after "WARNING: missing contrecord at.." message.) regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 276f59c8b37a31cb831b7753d2b107eb1d83c1fb Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v5] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlog.c | 94 +++++++++++++++++++------ src/backend/access/transam/xlogreader.c | 14 ++++ src/backend/replication/walreceiver.c | 3 +- src/include/access/xlogreader.h | 1 + 4 files changed, 87 insertions(+), 25 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 5cda30836f..623fb01d0a 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4477,6 +4477,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); ReadRecPtr = xlogreader->ReadRecPtr; @@ -4494,6 +4495,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * NULL ReadRecPtr means we could not read a record at + * beginning. In that case EndRecPtr is storing the LSN of the + * record we tried to read. + */ + ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr; } if (readFile >= 0) @@ -4503,13 +4514,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we get here for other than end-of-wal, emit the error message + * right now. Otherwise the message if any is shown as a part of + * the end-of-WAL message below. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -4540,11 +4550,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4557,11 +4568,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4609,12 +4626,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * If we haven't emit an error message, we have safely reached the + * end-of-WAL. + */ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -7544,7 +7582,7 @@ StartupXLOG(void) else { /* just have to read next record after CheckPoint */ - record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID); + record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID); } if (record != NULL) @@ -7781,7 +7819,7 @@ StartupXLOG(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID); + record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID); } while (record != NULL); /* @@ -7841,13 +7879,20 @@ StartupXLOG(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that @@ -13135,7 +13180,9 @@ XLogShutdownWalRcv(void) * reading from pg_wal, because we don't expect any invalid records in archive * or in records streamed from the primary. Files in the archive should be complete, * and we should never hit the end of WAL because we stop and wait for more WAL - * to arrive before replaying it. + * to arrive before replaying it. When we failed to read a new page, + * readSource is reset to XLOG_FROM_ANY. This indicates all sources including + * pg_wal was failed. Thus treat that the same way with XLOG_FROM_PG_WAL. * * NOTE: This function remembers the RecPtr value it was last called with, * to suppress repeated messages about the same record. Only call this when @@ -13147,7 +13194,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + if ((readSource == XLOG_FROM_PG_WAL || readSource == XLOG_FROM_ANY) + && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index f39f8044a9..df2198e862 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -588,6 +590,9 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -730,6 +735,15 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + if (record->xl_tot_len == 0) + { + /* This is strictly not an invalid state, so phrase it as so. */ + report_invalid_record(state, + "record length is 0 at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + state->EndOfWAL = true; + return false; + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 7a7eb3784e..ba3c4bd550 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -471,8 +471,7 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", startpointTLI, LSN_FORMAT_ARGS(LogstreamResult.Write)))); endofwal = true; diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index de6fd791fe..1241b85838 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* the last attempt was EOW? */ /* * Set at the end of recovery: the start point of a partial record at the -- 2.27.0
Thank you for the comments! At Tue, 9 Nov 2021 09:53:15 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Mon, Nov 08, 2021 at 02:59:46PM +0900, Kyotaro Horiguchi wrote: > > While checking the behavior for the case of missing-contrecord, I > > noticed that emode_for_corrupt_record() doesn't work as expected since > > readSource is reset to XLOG_FROM_ANY after a read failure. We could > > remember the last failed source but pg_wal should have been visited if > > page read error happened so I changed the function so that it treats > > XLOG_FROM_ANY the same way with XLOG_FROM_PG_WAL. > > FWIW, I am not much a fan of assuming that it is fine to use > XLOG_FROM_ANY as a condition here. The comments on top of > emode_for_corrupt_record() make it rather clear what the expectations > are, and this is the default readSource. The readSource is expected by the function to be the failed source but it goes back to XLOG_FROM_ANY on page read failure. So the function *is* standing on the wrong assumption. I noticed that currentSource holds the last accessed source (but forgot about that). So it is exactly what we need here. No longer need to introduce the unclear assumption by using it. > > (Otherwise we see "LOG: reached end-of-WAL at .." message after > > "WARNING: missing contrecord at.." message.) > > + /* broken record found */ > + ereport(WARNING, > + (errmsg("redo is skipped"), > + errhint("This suggests WAL file corruption. You might need to check the database."))); > This looks rather scary to me, FWIW, and this could easily be reached Yes, the message is intentionally scary, since we don't come here in the case of clean WAL:) > if one forgets about EndOfWAL while hacking on xlogreader.c. > Unlikely so, still. I don't understand. Isn't it the case of almost every feature? The patch compells hackers to maintain the condition for recovery being considered cleanly ended. If the last record doesn't meet the condition, the WAL file should be considered having a problem. However, I don't see the condition expanded to have another term in future. Even if someone including myself broke that condition, we will at worst unwantedly see a scary message. And I believe almost all hackers can easily find it a bug from the DETAILED message shown along aside. I'm not sure such bugs could be found in development phase, though.. > + report_invalid_record(state, > + "missing contrecord at %X/%X", > + LSN_FORMAT_ARGS(RecPtr)); > Isn't there a risk here to break applications checking after error > messages stored in the WAL reader after seeing a contrecord? I'm not sure you are mentioning the case where no message is stored previously, or the case where already a message is stored. The former is fine as the record is actually broken. But I was missing the latter case. In this version I avoided to overwite the error message. > + if (record->xl_tot_len == 0) > + { > + /* This is strictly not an invalid state, so phrase it as so. */ > + report_invalid_record(state, > + "record length is 0 at %X/%X", > + LSN_FORMAT_ARGS(RecPtr)); > + state->EndOfWAL = true; > + return false; > + } > This assumes that a value of 0 for xl_tot_len is a synonym of the end > of WAL, but cannot we have also a corrupted record in this case in the > shape of xl_tot_len being 0? We validate the full record after > reading the header, so it seems to me that we should not assume that > things are just ending as proposed in this patch. Yeah, it's the most serious concern to me. So I didn't hide the detailed message in the "end-of-wal reached message". > LOG: reached end of WAL at 0/512F198 on timeline 1 in pg_wal during crash recovery > DETAIL: record length is 0 at 0/512F210 I believe everyone regards zero record length as fine unless something wrong is seen afterwards. However, we can extend the check to the whole record header. I think it is by far nearer to the perfect for almost all cases. The attached emits the following message for the good (true end-of-WAL) case. > LOG: reached end of WAL at 0/512F4A0 on timeline 1 in pg_wal during crash recovery > DETAIL: empty record header found at 0/512F518 If garbage bytes are found in the header area, the following log will be left. I think we can have a better message here. > WARNING: garbage record header at 0/2095458 > LOG: redo done at 0/2095430 system usage: CPU: user: 0.03 s, system: 0.01 s, elapsed: 0.04 s This is the updated version. - emode_for_currupt_record() now uses currentSource instead of readSource. - If zero record length is faced, make sure the whole header is zeroed before deciding it as the end-of-WAL. - Do not overwrite existig message when missing contrecord is detected. The message added here is seen in the TAP test log 026_overwrite_contrecord_standby.log regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 1d5f6e707f8d67172eea79689c8a5f4d86889d3e Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v6] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlog.c | 89 +++++++++++++++++++------ src/backend/access/transam/xlogreader.c | 42 ++++++++++++ src/backend/replication/walreceiver.c | 3 +- src/include/access/xlogreader.h | 1 + 4 files changed, 111 insertions(+), 24 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 5cda30836f..e90c69810b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4477,6 +4477,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); ReadRecPtr = xlogreader->ReadRecPtr; @@ -4494,6 +4495,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * NULL ReadRecPtr means we could not read a record at + * beginning. In that case EndRecPtr is storing the LSN of the + * record we tried to read. + */ + ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr; } if (readFile >= 0) @@ -4503,13 +4514,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we get here for other than end-of-wal, emit the error message + * right now. Otherwise the message if any is shown as a part of + * the end-of-WAL message below. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -4540,11 +4550,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4557,11 +4568,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4609,12 +4626,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * If we haven't emit an error message, we have safely reached the + * end-of-WAL. + */ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -7544,7 +7582,7 @@ StartupXLOG(void) else { /* just have to read next record after CheckPoint */ - record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID); + record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID); } if (record != NULL) @@ -7781,7 +7819,7 @@ StartupXLOG(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID); + record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID); } while (record != NULL); /* @@ -7841,13 +7879,20 @@ StartupXLOG(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that @@ -13147,7 +13192,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index f39f8044a9..273b927cd9 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -588,6 +590,16 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the messages is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message at + * it should be more detailed. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -730,6 +742,36 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + if (record->xl_tot_len == 0) + { + /* + * We are almost sure reaching the end of WAL, make sure that the whole + * header is zeroed. + */ + char *p = (char *)record; + char *pe = (char *)record + SizeOfXLogRecord; + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* it is completely zeroed, call it a day */ + report_invalid_record(state, "empty record header found at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + } + else + { + /* Otherwise we found a garbage header.. */ + report_invalid_record(state, "garbage record header at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + } + + return false; + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 7a7eb3784e..ba3c4bd550 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -471,8 +471,7 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", startpointTLI, LSN_FORMAT_ARGS(LogstreamResult.Write)))); endofwal = true; diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index de6fd791fe..1241b85838 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* the last attempt was EOW? */ /* * Set at the end of recovery: the start point of a partial record at the -- 2.27.0
At Tue, 09 Nov 2021 16:27:51 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > This is the updated version. > > - emode_for_currupt_record() now uses currentSource instead of > readSource. > > - If zero record length is faced, make sure the whole header is zeroed > before deciding it as the end-of-WAL. > > - Do not overwrite existig message when missing contrecord is > detected. The message added here is seen in the TAP test log > 026_overwrite_contrecord_standby.log d2ddfa681db27a138acb63c8defa8cc6fa588922 removed global variables ReadRecPtr and EndRecPtr. This is rebased version that reads the LSNs directly from xlogreader instead of the removed global variables. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From cc521692a9f98fabde07e248b63f1222f8406de1 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v7] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlog.c | 89 +++++++++++++++++++------ src/backend/access/transam/xlogreader.c | 42 ++++++++++++ src/backend/replication/walreceiver.c | 3 +- src/include/access/xlogreader.h | 1 + 4 files changed, 112 insertions(+), 23 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index d894af310a..fa435faec4 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4469,6 +4469,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -4484,6 +4485,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * NULL ReadRecPtr means we could not read a record at + * beginning. In that case EndRecPtr is storing the LSN of the + * record we tried to read. + */ + ErrRecPtr = + xlogreader->ReadRecPtr ? + xlogreader->ReadRecPtr : xlogreader->EndRecPtr; } if (readFile >= 0) @@ -4493,12 +4506,11 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we get here for other than end-of-wal, emit the error message + * right now. Otherwise the message if any is shown as a part of + * the end-of-WAL message below. */ - if (errormsg) + if (!xlogreader->EndOfWAL && errormsg) ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -4530,11 +4542,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4547,11 +4560,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4599,12 +4618,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * If we haven't emit an error message, we have safely reached the + * end-of-WAL. + */ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -7536,7 +7576,7 @@ StartupXLOG(void) else { /* just have to read next record after CheckPoint */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -7774,7 +7814,7 @@ StartupXLOG(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -7834,13 +7874,20 @@ StartupXLOG(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that @@ -13130,7 +13177,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 3a7de02565..e16b6fe041 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -588,6 +590,16 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the messages is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message at + * it should be more detailed. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -730,6 +742,36 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + if (record->xl_tot_len == 0) + { + /* + * We are almost sure reaching the end of WAL, make sure that the whole + * header is zeroed. + */ + char *p = (char *)record; + char *pe = (char *)record + SizeOfXLogRecord; + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* it is completely zeroed, call it a day */ + report_invalid_record(state, "empty record header found at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + } + else + { + /* Otherwise we found a garbage header.. */ + report_invalid_record(state, "garbage record header at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + } + + return false; + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 7a7eb3784e..ba3c4bd550 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -471,8 +471,7 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", startpointTLI, LSN_FORMAT_ARGS(LogstreamResult.Write)))); endofwal = true; diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index de6fd791fe..1241b85838 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* the last attempt was EOW? */ /* * Set at the end of recovery: the start point of a partial record at the -- 2.27.0
d2ddfa681db27a138acb63c8defa8cc6fa588922 removed global variables
ReadRecPtr and EndRecPtr. This is rebased version that reads the LSNs
directly from xlogreader instead of the removed global variables.
Hi, hackers!
I've checked the latest version of a patch. It applies cleanly, check-world passes and CI is also in the green state.
Proposed messages seem good to me, but probably it would be better to have a test on conditions where "reached end of WAL..." emitted.
Then, I believe it can be set as 'ready for committter'.
At Mon, 24 Jan 2022 14:23:33 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in > > > > d2ddfa681db27a138acb63c8defa8cc6fa588922 removed global variables > > ReadRecPtr and EndRecPtr. This is rebased version that reads the LSNs > > directly from xlogreader instead of the removed global variables. > > > > Hi, hackers! > > I've checked the latest version of a patch. It applies cleanly, check-world > passes and CI is also in the green state. > Proposed messages seem good to me, but probably it would be better to have > a test on conditions where "reached end of WAL..." emitted. > Then, I believe it can be set as 'ready for committter'. Thanks for checking that, and the comment! I thought that we usually don't test log messages, but finally I found that I needed that. It is because I found another mode of end-of-wal and a bug that emits a spurious message on passing... This v8 is changed in... - Added tests to 011_crash_recovery.pl - Fixed a bug that server emits "end-of-wal" messages even if it have emitted an error message for the same LSN. - Changed XLogReaderValidatePageHeader() so that it recognizes an empty page as end-of-WAL. - Made pg_waldump conscious of end-of-wal. While doing the last item, I noticed that pg_waldump shows the wrong LSN as the error position. Concretely it emits the LSN of the last sound WAL record as the error position. I will post a bug-fix patch for the issue after confirmation. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 0f1024bdfba9d1926465351fa1b7125698a21e8d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v8] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlog.c | 91 +++++++++++++----- src/backend/access/transam/xlogreader.c | 64 +++++++++++++ src/backend/replication/walreceiver.c | 3 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++- 6 files changed, 254 insertions(+), 28 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 58922f7ede..c08b9554b3 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * NULL ReadRecPtr means we could not read a record at the + * beginning. In that case EndRecPtr is storing the LSN of the + * record we tried to read. + */ + ErrRecPtr = + xlogreader->ReadRecPtr ? + xlogreader->ReadRecPtr : xlogreader->EndRecPtr; } if (readFile >= 0) @@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we get here for other than end-of-wal, emit the error message + * right now. Otherwise the message if any is shown as a part of + * the end-of-WAL message below. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * If we haven't emit an error message, we have safely reached the + * end-of-WAL. + */ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -7544,7 +7584,7 @@ StartupXLOG(void) else { /* just have to read next record after CheckPoint */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -7782,7 +7822,7 @@ StartupXLOG(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -7842,13 +7882,20 @@ StartupXLOG(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that @@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..55f54cd98d 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -588,6 +590,15 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the message is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -730,6 +741,36 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + if (record->xl_tot_len == 0) + { + /* + * We are almost sure reaching the end of WAL, make sure that the whole + * header is zeroed. + */ + char *p = (char *)record; + char *pe = (char *)record + SizeOfXLogRecord; + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* it is completely zeroed, call it a day */ + report_invalid_record(state, "empty record header found at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + } + else + { + /* Otherwise the header is corrupted. */ + report_invalid_record(state, "garbage record header at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + } + + return false; + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, @@ -836,6 +877,29 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr); + StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero"); + + if (hdr->xlp_magic == 0) + { + /* Regard an empty page as End-Of-WAL */ + int i; + + for (i = 0 ; i < XLOG_BLCKSZ && phdr[i] == 0 ; i++); + if (i == XLOG_BLCKSZ) + { + char fname[MAXFNAMELEN]; + + XLogFileName(fname, state->seg.ws_tli, segno, + state->segcxt.ws_segsize); + + report_invalid_record(state, + "empty page in log segment %s, offset %u", + fname, + offset); + state->EndOfWAL = true; + return false; + } + } if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { char fname[MAXFNAMELEN]; diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index b39fce8c23..3034f8281e 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -471,8 +471,7 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", startpointTLI, LSN_FORMAT_ARGS(LogstreamResult.Write)))); endofwal = true; diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index a6251e1a96..3745e76488 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1176,9 +1176,16 @@ main(int argc, char **argv) exit(0); if (errormsg) - fatal_error("error in WAL record at %X/%X: %s", - LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), - errormsg); + { + if (xlogreader_state->EndOfWAL) + pg_log_info("end of WAL at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + else + fatal_error("error in WAL record at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + } XLogReaderFree(xlogreader_state); diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 477f0efe26..3eeba220a1 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* the last attempt was EOW? */ /* * Set at the end of recovery: the start point of a partial record at the diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl index 3892aba3e5..b793280a5c 100644 --- a/src/test/recovery/t/011_crash_recovery.pl +++ b/src/test/recovery/t/011_crash_recovery.pl @@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; use Config; +use IPC::Run; -plan tests => 3; +plan tests => 11; +my $reached_eow_pat = "reached end of WAL at "; my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init(allows_streaming => 1); $node->start; @@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), # Crash and restart the postmaster $node->stop('immediate'); +my $logstart = get_log_size($node); $node->start; +my $max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, $reached_eow_pat, $logstart)); + sleep 0.5; +} +ok ($max_attempts >= 0, "end-of-wal is logged"); # Make sure we really got a new xid cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'), @@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), $stdin .= "\\q\n"; $tx->finish; # wait for psql to quit gracefully + +my $segsize = $node->safe_psql('postgres', + qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]); + +# make sure no records afterwards go to the next segment +$node->safe_psql('postgres', qq[ + SELECT pg_switch_wal(); + CHECKPOINT; + CREATE TABLE t(); +]); +$node->stop('immediate'); + +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +my $chkptfile; +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; + +# identify the last record +my $walfile = $node->data_dir() . "/pg_wal/$chkptfile"; +$cmd = ['pg_waldump', $walfile]; +$stdout = ''; +$stderr = ''; +my $lastrec; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +foreach my $l (split(/\r?\n/, $stdout)) +{ + $lastrec = $l; +} +ok(defined $lastrec, "last WAL record is extracted"); +ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/, + "pg_waldump emits the correct ending message"); + +# read the last record LSN excluding leading zeroes +ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/, + "LSN of the last record identified"); +my $lastlsn = $1; + +# corrupt the last record +my $offset = hex($lastlsn) % $segsize; +open(my $segf, '+<', $walfile) or die "failed to open $walfile\n"; +seek($segf, $offset, 0); # halfway break the last record +print $segf "\0\0\0\0"; +close($segf); + +# pg_waldump complains about the corrupted record +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/, + "pg_waldump emits the correct error message"); + +# also server complains +$logstart = get_log_size($node); +$node->start; +$max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, "WARNING: garbage record header at 0/$lastlsn", + $logstart)); + sleep 0.5; +} +ok($max_attempts >= 0, "header error is logged at $lastlsn"); + +# and the end-of-wal messages shouldn't be seen +# the same message has been confirmed in the past +ok(!find_in_log($node, $reached_eow_pat, $logstart), + "false log message is not emitted"); + +$node->stop('immediate'); + +#### helper routines +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0
At Tue, 25 Jan 2022 17:34:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > This v8 is changed in... > > - Added tests to 011_crash_recovery.pl > > - Fixed a bug that server emits "end-of-wal" messages even if it have > emitted an error message for the same LSN. > > - Changed XLogReaderValidatePageHeader() so that it recognizes an > empty page as end-of-WAL. > > - Made pg_waldump conscious of end-of-wal. > > While doing the last item, I noticed that pg_waldump shows the wrong > LSN as the error position. Concretely it emits the LSN of the last > sound WAL record as the error position. I will post a bug-fix patch > for the issue after confirmation. I noticed that I added a useless error message "garbage record header", but it is a kind of invalid record length. So I removed the message. That change makes the logic for EOW in ValidXLogRecordHeader and XLogReaderValidatePageHeader share the same flow. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 57cb251f7cacbb96066ead4543b9f12f5b3c7062 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v9] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlog.c | 91 +++++++++++++----- src/backend/access/transam/xlogreader.c | 61 ++++++++++++ src/backend/replication/walreceiver.c | 3 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++- 6 files changed, 251 insertions(+), 28 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index dfe2a0bcce..5727e0939f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * NULL ReadRecPtr means we could not read a record at the + * beginning. In that case EndRecPtr is storing the LSN of the + * record we tried to read. + */ + ErrRecPtr = + xlogreader->ReadRecPtr ? + xlogreader->ReadRecPtr : xlogreader->EndRecPtr; } if (readFile >= 0) @@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we get here for other than end-of-wal, emit the error message + * right now. Otherwise the message if any is shown as a part of + * the end-of-WAL message below. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * If we haven't emit an error message, we have safely reached the + * end-of-WAL. + */ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -7544,7 +7584,7 @@ StartupXLOG(void) else { /* just have to read next record after CheckPoint */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -7782,7 +7822,7 @@ StartupXLOG(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -7842,13 +7882,20 @@ StartupXLOG(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that @@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..418fb66ef2 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -588,6 +590,15 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the message is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -730,6 +741,31 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + if (record->xl_tot_len == 0) + { + /* + * We are almost sure reaching the end of WAL, make sure that the whole + * header is filled with zeroes. + */ + char *p = (char *)record; + char *pe = (char *)record + SizeOfXLogRecord; + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* it is completely zeroed, call it a day */ + report_invalid_record(state, "empty record header found at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid record length */ + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, @@ -836,6 +872,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr); + StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero"); + + if (hdr->xlp_magic == 0) + { + /* Regard an empty page as End-Of-WAL */ + int i; + + for (i = 0 ; i < XLOG_BLCKSZ && phdr[i] == 0 ; i++); + if (i == XLOG_BLCKSZ) + { + char fname[MAXFNAMELEN]; + + XLogFileName(fname, state->seg.ws_tli, segno, + state->segcxt.ws_segsize); + + report_invalid_record(state, + "empty page in log segment %s, offset %u", + fname, + offset); + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid magic number */ + } if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { char fname[MAXFNAMELEN]; diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index b39fce8c23..3034f8281e 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -471,8 +471,7 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", startpointTLI, LSN_FORMAT_ARGS(LogstreamResult.Write)))); endofwal = true; diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index a6251e1a96..3745e76488 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1176,9 +1176,16 @@ main(int argc, char **argv) exit(0); if (errormsg) - fatal_error("error in WAL record at %X/%X: %s", - LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), - errormsg); + { + if (xlogreader_state->EndOfWAL) + pg_log_info("end of WAL at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + else + fatal_error("error in WAL record at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + } XLogReaderFree(xlogreader_state); diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 477f0efe26..3eeba220a1 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* the last attempt was EOW? */ /* * Set at the end of recovery: the start point of a partial record at the diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl index 3892aba3e5..67d264df26 100644 --- a/src/test/recovery/t/011_crash_recovery.pl +++ b/src/test/recovery/t/011_crash_recovery.pl @@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; use Config; +use IPC::Run; -plan tests => 3; +plan tests => 11; +my $reached_eow_pat = "reached end of WAL at "; my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init(allows_streaming => 1); $node->start; @@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), # Crash and restart the postmaster $node->stop('immediate'); +my $logstart = get_log_size($node); $node->start; +my $max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, $reached_eow_pat, $logstart)); + sleep 0.5; +} +ok ($max_attempts >= 0, "end-of-wal is logged"); # Make sure we really got a new xid cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'), @@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), $stdin .= "\\q\n"; $tx->finish; # wait for psql to quit gracefully + +my $segsize = $node->safe_psql('postgres', + qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]); + +# make sure no records afterwards go to the next segment +$node->safe_psql('postgres', qq[ + SELECT pg_switch_wal(); + CHECKPOINT; + CREATE TABLE t(); +]); +$node->stop('immediate'); + +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +my $chkptfile; +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; + +# identify the last record +my $walfile = $node->data_dir() . "/pg_wal/$chkptfile"; +$cmd = ['pg_waldump', $walfile]; +$stdout = ''; +$stderr = ''; +my $lastrec; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +foreach my $l (split(/\r?\n/, $stdout)) +{ + $lastrec = $l; +} +ok(defined $lastrec, "last WAL record is extracted"); +ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/, + "pg_waldump emits the correct ending message"); + +# read the last record LSN excluding leading zeroes +ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/, + "LSN of the last record identified"); +my $lastlsn = $1; + +# corrupt the last record +my $offset = hex($lastlsn) % $segsize; +open(my $segf, '+<', $walfile) or die "failed to open $walfile\n"; +seek($segf, $offset, 0); # halfway break the last record +print $segf "\0\0\0\0"; +close($segf); + +# pg_waldump complains about the corrupted record +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/, + "pg_waldump emits the correct error message"); + +# also server complains +$logstart = get_log_size($node); +$node->start; +$max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, "WARNING: invalid record length at 0/$lastlsn: wanted [0-9]+, got 0", + $logstart)); + sleep 0.5; +} +ok($max_attempts >= 0, "header error is logged at $lastlsn"); + +# and the end-of-wal messages shouldn't be seen +# the same message has been confirmed in the past +ok(!find_in_log($node, $reached_eow_pat, $logstart), + "false log message is not emitted"); + +$node->stop('immediate'); + +#### helper routines +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0
> This v8 is changed in...
>
> - Added tests to 011_crash_recovery.pl
>
> - Fixed a bug that server emits "end-of-wal" messages even if it have
> emitted an error message for the same LSN.
>
> - Changed XLogReaderValidatePageHeader() so that it recognizes an
> empty page as end-of-WAL.
>
> - Made pg_waldump conscious of end-of-wal.
>
> While doing the last item, I noticed that pg_waldump shows the wrong
> LSN as the error position. Concretely it emits the LSN of the last
> sound WAL record as the error position. I will post a bug-fix patch
> for the issue after confirmation.
I noticed that I added a useless error message "garbage record
header", but it is a kind of invalid record length. So I removed the
message. That change makes the logic for EOW in ValidXLogRecordHeader
and XLogReaderValidatePageHeader share the same flow.
Hi, Kyotaro!
I don't quite understand a meaning of a comment:
/* it is completely zeroed, call it a day */
Please also run pgindent on your code.
Otherwise the new patch seems ok.
--
Hi, Pavel. At Mon, 31 Jan 2022 15:17:09 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in > I don't quite understand a meaning of a comment: > /* it is completely zeroed, call it a day */ While rethinking about this comment, It came to my mind that XLogReaderValidatePageHeader is doing whole-page check. There is no clear reason for not doing at least the same check here. ValidXLogRecordHeader is changed to check all bytes in the rest of the page, instead of just the record header. While working on that, I noticed another end-of-WAL case, unexpected pageaddr. I think we can assume it safe when the pageaddr is smaller than expected (or we have no choice than assuming so). XLogReaderValidatePageHeader is changed that way. But I'm not sure others regard it as a form of safe end-of-WAL. > Please also run pgindent on your code. Hmm. I'm not sure we need to do that at this stage. pgindent makes changes on the whole file involving unrelated part from this patch. Anyway I did that then removed irrelevant edits. pgindent makes a seemingly not-great suggestion. + char *pe = + (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); I'm not sure this is intended but I split the line into two lines to define and assign. > Otherwise the new patch seems ok. Thanks! This version 10 is changed in the following points. - Rewrited the comment in ValidXLogRecordHeader. - ValidXLogRecordHeader regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 9eacdd050a8041b358df11ca3e18c1071b693d20 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v10] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlog.c | 91 +++++++++++++----- src/backend/access/transam/xlogreader.c | 77 +++++++++++++++ src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++- 6 files changed, 269 insertions(+), 30 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index dfe2a0bcce..378c13ccf7 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * NULL ReadRecPtr means we could not read a record at the + * beginning. In that case EndRecPtr is storing the LSN of the + * record we tried to read. + */ + ErrRecPtr = + xlogreader->ReadRecPtr ? + xlogreader->ReadRecPtr : xlogreader->EndRecPtr; } if (readFile >= 0) @@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we get here for other than end-of-wal, emit the error + * message right now. Otherwise the message if any is shown as a + * part of the end-of-WAL message below. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * If we haven't emit an error message, we have safely reached the + * end-of-WAL. + */ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -7544,7 +7584,7 @@ StartupXLOG(void) else { /* just have to read next record after CheckPoint */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -7782,7 +7822,7 @@ StartupXLOG(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -7842,13 +7882,20 @@ StartupXLOG(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that @@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..9bcc4a2d37 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -588,6 +590,15 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the message is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -730,6 +741,39 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + if (record->xl_tot_len == 0) + { + /* + * We are almost sure reaching the end of WAL, make sure that the + * whole page after the record is filled with zeroes. + */ + char *p = (char *) record; + char *pe; + + /* set pe to the beginning of the next page */ + pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* + * The page after the record is completely zeroed. That suggests + * we don't have a record after this point. We don't bother + * checking the pages after since they are not zeroed in the case + * of recycled segments. + */ + report_invalid_record(state, "empty record found at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid record length */ + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, @@ -836,6 +880,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr); + StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero"); + + if (hdr->xlp_magic == 0) + { + /* Regard an empty page as End-Of-WAL */ + int i; + + for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++); + if (i == XLOG_BLCKSZ) + { + char fname[MAXFNAMELEN]; + + XLogFileName(fname, state->seg.ws_tli, segno, + state->segcxt.ws_segsize); + + report_invalid_record(state, + "empty page in log segment %s, offset %u", + fname, + offset); + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid magic number */ + } if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { char fname[MAXFNAMELEN]; @@ -921,6 +990,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, LSN_FORMAT_ARGS(hdr->xlp_pageaddr), fname, offset); + + /* + * If the page address is less than expected we assume it is an unused + * page in a recycled segment. + */ + if (hdr->xlp_pageaddr < recaddr) + state->EndOfWAL = true; + return false; } diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index b39fce8c23..8e1fa32489 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -471,10 +471,9 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", - startpointTLI, - LSN_FORMAT_ARGS(LogstreamResult.Write)))); + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", + startpointTLI, + LSN_FORMAT_ARGS(LogstreamResult.Write)))); endofwal = true; break; } diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index a6251e1a96..3745e76488 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1176,9 +1176,16 @@ main(int argc, char **argv) exit(0); if (errormsg) - fatal_error("error in WAL record at %X/%X: %s", - LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), - errormsg); + { + if (xlogreader_state->EndOfWAL) + pg_log_info("end of WAL at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + else + fatal_error("error in WAL record at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + } XLogReaderFree(xlogreader_state); diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 477f0efe26..3eeba220a1 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* the last attempt was EOW? */ /* * Set at the end of recovery: the start point of a partial record at the diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl index 3892aba3e5..67d264df26 100644 --- a/src/test/recovery/t/011_crash_recovery.pl +++ b/src/test/recovery/t/011_crash_recovery.pl @@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; use Config; +use IPC::Run; -plan tests => 3; +plan tests => 11; +my $reached_eow_pat = "reached end of WAL at "; my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init(allows_streaming => 1); $node->start; @@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), # Crash and restart the postmaster $node->stop('immediate'); +my $logstart = get_log_size($node); $node->start; +my $max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, $reached_eow_pat, $logstart)); + sleep 0.5; +} +ok ($max_attempts >= 0, "end-of-wal is logged"); # Make sure we really got a new xid cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'), @@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), $stdin .= "\\q\n"; $tx->finish; # wait for psql to quit gracefully + +my $segsize = $node->safe_psql('postgres', + qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]); + +# make sure no records afterwards go to the next segment +$node->safe_psql('postgres', qq[ + SELECT pg_switch_wal(); + CHECKPOINT; + CREATE TABLE t(); +]); +$node->stop('immediate'); + +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +my $chkptfile; +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; + +# identify the last record +my $walfile = $node->data_dir() . "/pg_wal/$chkptfile"; +$cmd = ['pg_waldump', $walfile]; +$stdout = ''; +$stderr = ''; +my $lastrec; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +foreach my $l (split(/\r?\n/, $stdout)) +{ + $lastrec = $l; +} +ok(defined $lastrec, "last WAL record is extracted"); +ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/, + "pg_waldump emits the correct ending message"); + +# read the last record LSN excluding leading zeroes +ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/, + "LSN of the last record identified"); +my $lastlsn = $1; + +# corrupt the last record +my $offset = hex($lastlsn) % $segsize; +open(my $segf, '+<', $walfile) or die "failed to open $walfile\n"; +seek($segf, $offset, 0); # halfway break the last record +print $segf "\0\0\0\0"; +close($segf); + +# pg_waldump complains about the corrupted record +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/, + "pg_waldump emits the correct error message"); + +# also server complains +$logstart = get_log_size($node); +$node->start; +$max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, "WARNING: invalid record length at 0/$lastlsn: wanted [0-9]+, got 0", + $logstart)); + sleep 0.5; +} +ok($max_attempts >= 0, "header error is logged at $lastlsn"); + +# and the end-of-wal messages shouldn't be seen +# the same message has been confirmed in the past +ok(!find_in_log($node, $reached_eow_pat, $logstart), + "false log message is not emitted"); + +$node->stop('immediate'); + +#### helper routines +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0
This version 10 is changed in the following points.
- Rewrited the comment in ValidXLogRecordHeader.
- ValidXLogRecordHeader
Thanks!
Maybe it can be written little bit shorter:
pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
as
pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
?
The problem that pgindent sometimes reflow formatting of unrelated blocks is indeed existing. But I think it's right to manually leave pgindent-ed code only on what is related to the patch. The leftover is pgindent-ed in a scheduled manner sometimes, so don't need to bother.
I'd like to set v10 as RfC.
At Tue, 1 Feb 2022 12:38:01 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in > Maybe it can be written little bit shorter: > pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); > as > pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); > ? That difference would be a matter of taste, but I found it looks cleaner that definition and assignment is separated for both p and pe. Now it is like the following. > char *p; > char *pe; > > /* scan from the beginning of the record to the end of block */ > p = (char *) record; > pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); > The problem that pgindent sometimes reflow formatting of unrelated blocks > is indeed existing. But I think it's right to manually leave pgindent-ed > code only on what is related to the patch. The leftover is pgindent-ed in a > scheduled manner sometimes, so don't need to bother. Yeah, I meant that it is a bit annoying to unpginden-ting unrelated edits:p > I'd like to set v10 as RfC. Thanks! The suggested change is done in the attached v11. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 491416866920f8f9648dee9c0571022f71553879 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v11] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlog.c | 91 +++++++++++++----- src/backend/access/transam/xlogreader.c | 78 +++++++++++++++ src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++- 6 files changed, 270 insertions(+), 30 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index dfe2a0bcce..378c13ccf7 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * NULL ReadRecPtr means we could not read a record at the + * beginning. In that case EndRecPtr is storing the LSN of the + * record we tried to read. + */ + ErrRecPtr = + xlogreader->ReadRecPtr ? + xlogreader->ReadRecPtr : xlogreader->EndRecPtr; } if (readFile >= 0) @@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we get here for other than end-of-wal, emit the error + * message right now. Otherwise the message if any is shown as a + * part of the end-of-WAL message below. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * If we haven't emit an error message, we have safely reached the + * end-of-WAL. + */ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -7544,7 +7584,7 @@ StartupXLOG(void) else { /* just have to read next record after CheckPoint */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -7782,7 +7822,7 @@ StartupXLOG(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -7842,13 +7882,20 @@ StartupXLOG(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that @@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..03a8b42f15 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -588,6 +590,15 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the message is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + if (record->xl_tot_len == 0) + { + /* + * We are almost sure reaching the end of WAL, make sure that the + * whole page after the record is filled with zeroes. + */ + char *p; + char *pe; + + /* scan from the beginning of the record to the end of block */ + p = (char *) record; + pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* + * The page after the record is completely zeroed. That suggests + * we don't have a record after this point. We don't bother + * checking the pages after since they are not zeroed in the case + * of recycled segments. + */ + report_invalid_record(state, "empty record found at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid record length */ + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, @@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr); + StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero"); + + if (hdr->xlp_magic == 0) + { + /* Regard an empty page as End-Of-WAL */ + int i; + + for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++); + if (i == XLOG_BLCKSZ) + { + char fname[MAXFNAMELEN]; + + XLogFileName(fname, state->seg.ws_tli, segno, + state->segcxt.ws_segsize); + + report_invalid_record(state, + "empty page in log segment %s, offset %u", + fname, + offset); + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid magic number */ + } if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { char fname[MAXFNAMELEN]; @@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, LSN_FORMAT_ARGS(hdr->xlp_pageaddr), fname, offset); + + /* + * If the page address is less than expected we assume it is an unused + * page in a recycled segment. + */ + if (hdr->xlp_pageaddr < recaddr) + state->EndOfWAL = true; + return false; } diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index b39fce8c23..8e1fa32489 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -471,10 +471,9 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", - startpointTLI, - LSN_FORMAT_ARGS(LogstreamResult.Write)))); + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", + startpointTLI, + LSN_FORMAT_ARGS(LogstreamResult.Write)))); endofwal = true; break; } diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index a6251e1a96..3745e76488 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1176,9 +1176,16 @@ main(int argc, char **argv) exit(0); if (errormsg) - fatal_error("error in WAL record at %X/%X: %s", - LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), - errormsg); + { + if (xlogreader_state->EndOfWAL) + pg_log_info("end of WAL at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + else + fatal_error("error in WAL record at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + } XLogReaderFree(xlogreader_state); diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 477f0efe26..3eeba220a1 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* the last attempt was EOW? */ /* * Set at the end of recovery: the start point of a partial record at the diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl index 3892aba3e5..67d264df26 100644 --- a/src/test/recovery/t/011_crash_recovery.pl +++ b/src/test/recovery/t/011_crash_recovery.pl @@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; use Config; +use IPC::Run; -plan tests => 3; +plan tests => 11; +my $reached_eow_pat = "reached end of WAL at "; my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init(allows_streaming => 1); $node->start; @@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), # Crash and restart the postmaster $node->stop('immediate'); +my $logstart = get_log_size($node); $node->start; +my $max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, $reached_eow_pat, $logstart)); + sleep 0.5; +} +ok ($max_attempts >= 0, "end-of-wal is logged"); # Make sure we really got a new xid cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'), @@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), $stdin .= "\\q\n"; $tx->finish; # wait for psql to quit gracefully + +my $segsize = $node->safe_psql('postgres', + qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]); + +# make sure no records afterwards go to the next segment +$node->safe_psql('postgres', qq[ + SELECT pg_switch_wal(); + CHECKPOINT; + CREATE TABLE t(); +]); +$node->stop('immediate'); + +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +my $chkptfile; +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; + +# identify the last record +my $walfile = $node->data_dir() . "/pg_wal/$chkptfile"; +$cmd = ['pg_waldump', $walfile]; +$stdout = ''; +$stderr = ''; +my $lastrec; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +foreach my $l (split(/\r?\n/, $stdout)) +{ + $lastrec = $l; +} +ok(defined $lastrec, "last WAL record is extracted"); +ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/, + "pg_waldump emits the correct ending message"); + +# read the last record LSN excluding leading zeroes +ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/, + "LSN of the last record identified"); +my $lastlsn = $1; + +# corrupt the last record +my $offset = hex($lastlsn) % $segsize; +open(my $segf, '+<', $walfile) or die "failed to open $walfile\n"; +seek($segf, $offset, 0); # halfway break the last record +print $segf "\0\0\0\0"; +close($segf); + +# pg_waldump complains about the corrupted record +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/, + "pg_waldump emits the correct error message"); + +# also server complains +$logstart = get_log_size($node); +$node->start; +$max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, "WARNING: invalid record length at 0/$lastlsn: wanted [0-9]+, got 0", + $logstart)); + sleep 0.5; +} +ok($max_attempts >= 0, "header error is logged at $lastlsn"); + +# and the end-of-wal messages shouldn't be seen +# the same message has been confirmed in the past +ok(!find_in_log($node, $reached_eow_pat, $logstart), + "false log message is not emitted"); + +$node->stop('immediate'); + +#### helper routines +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0
Thanks! The suggested change is done in the attached v11.
Thanks! v11 is a small refactoring of v10 that doesn't change behavior, so it is RfC as well.
Hi, Here are some of my review comments on the v11 patch: - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI, + xlogSourceNames[currentSource]))); Why crash recovery? Won't this message get printed even during PITR? I just did a PITR and could see these messages in the logfile. 2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time recovery to WAL location (LSN) "0/5227790" 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; automatic recovery in progress 2022-02-08 18:00:44.369 IST [86185] LOG: redo starts at 0/14DC8D8 2022-02-08 18:00:44.978 IST [86185] DEBUG1: reached end of WAL at 0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering archive recovery == + /* + * If we haven't emit an error message, we have safely reached the + * end-of-WAL. + */ + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } Doesn't it make sense to add an assert statement inside this if-block that will check for xlogreader->EndOfWAL? == - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we get here for other than end-of-wal, emit the error + * message right now. Otherwise the message if any is shown as a + * part of the end-of-WAL message below. */ For consistency, I think we can replace "end-of-wal" with "end-of-WAL". Please note that everywhere else in the comments you have used "end-of-WAL". So why not the same here? == ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", - startpointTLI, - LSN_FORMAT_ARGS(LogstreamResult.Write)))); + (errmsg("replication terminated by primary server on timeline %u at %X/%X.", + startpointTLI, + LSN_FORMAT_ARGS(LogstreamResult.Write)))); Is this change really required? I don't see any issue with the existing error message. == Lastly, are we also planning to backport this patch? -- With Regards, Ashutosh Sharma. On Wed, Feb 2, 2022 at 11:05 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 1 Feb 2022 12:38:01 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in > > Maybe it can be written little bit shorter: > > pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); > > as > > pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); > > ? > > That difference would be a matter of taste, but I found it looks > cleaner that definition and assignment is separated for both p and pe. > Now it is like the following. > > > char *p; > > char *pe; > > > > /* scan from the beginning of the record to the end of block */ > > p = (char *) record; > > pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); > > > > The problem that pgindent sometimes reflow formatting of unrelated blocks > > is indeed existing. But I think it's right to manually leave pgindent-ed > > code only on what is related to the patch. The leftover is pgindent-ed in a > > scheduled manner sometimes, so don't need to bother. > > Yeah, I meant that it is a bit annoying to unpginden-ting unrelated > edits:p > > > I'd like to set v10 as RfC. > > Thanks! The suggested change is done in the attached v11. > > regards. > > -- > Kyotaro Horiguchi > NTT Open Source Software Center
Hi, Ashutosh. At Tue, 8 Feb 2022 18:35:34 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in > Here are some of my review comments on the v11 patch: Thank you for taking a look on this. > - (errmsg_internal("reached end of WAL in > pg_wal, entering archive recovery"))); > + (errmsg_internal("reached end of WAL at %X/%X > on timeline %u in %s during crash recovery, entering archive > recovery", > + LSN_FORMAT_ARGS(ErrRecPtr), > + replayTLI, > + xlogSourceNames[currentSource]))); > > Why crash recovery? Won't this message get printed even during PITR? It is in the if-block with the following condition. > * If archive recovery was requested, but we were still doing > * crash recovery, switch to archive recovery and retry using the > * offline archive. We have now replayed all the valid WAL in > * pg_wal, so we are presumably now consistent. ... > if (!InArchiveRecovery && ArchiveRecoveryRequested) This means archive-recovery is requested but not started yet. That is, we've just finished crash recovery. The existing comment cited together is mentioning that. At the end of PITR (or archive recovery), the other code works. > /* > * If we haven't emit an error message, we have safely reached the > * end-of-WAL. > */ > if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) > { > char *fmt; > > if (StandbyMode) > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); > else if (InArchiveRecovery) > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); > else > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); The last among the above messages is choosed when archive-recovery is not requested at all. > I just did a PITR and could see these messages in the logfile. Yeah, the log lines are describing that the server starting with crash recovery to run PITR. > 2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time > recovery to WAL location (LSN) "0/5227790" > 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not > properly shut down; automatic recovery in progress Well. I guess that the "automatic recovery" is ambiguous. Does it make sense if the second line were like the follows instead? + 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; crash recovery in progress > 2022-02-08 18:00:44.369 IST [86185] LOG: redo starts at 0/14DC8D8 > 2022-02-08 18:00:44.978 IST [86185] DEBUG1: reached end of WAL at > 0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering > archive recovery (I don't include this change in this patch since it would be another issue.) > == > > + /* > + * If we haven't emit an error message, we have safely reached the > + * end-of-WAL. > + */ > + if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) > + { > + char *fmt; > + > + if (StandbyMode) > + fmt = gettext_noop("reached end of WAL at %X/%X on > timeline %u in %s during standby mode"); > + else if (InArchiveRecovery) > + fmt = gettext_noop("reached end of WAL at %X/%X on > timeline %u in %s during archive recovery"); > + else > + fmt = gettext_noop("reached end of WAL at %X/%X on > timeline %u in %s during crash recovery"); > + > + ereport(LOG, > + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, > + xlogSourceNames[currentSource]), > + (errormsg ? errdetail_internal("%s", errormsg) : 0))); > + } > > Doesn't it make sense to add an assert statement inside this if-block > that will check for xlogreader->EndOfWAL? Good point. On second thought, the condition there is flat wrong. The message is "reached end of WAL" so the condition should be EndOfWAL. On the other hand we didn't make sure that the error message for the stop is emitted anywhere. Thus I don't particularly want to be strict on that point. I made the following change for this. - if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) + if (xlogreader->EndOfWAL) > == > > - * We only end up here without a message when XLogPageRead() > - * failed - in that case we already logged something. In > - * StandbyMode that only happens if we have been triggered, so we > - * shouldn't loop anymore in that case. > + * If we get here for other than end-of-wal, emit the error > + * message right now. Otherwise the message if any is shown as a > + * part of the end-of-WAL message below. > */ > > For consistency, I think we can replace "end-of-wal" with > "end-of-WAL". Please note that everywhere else in the comments you > have used "end-of-WAL". So why not the same here? Right. Fixed. > == > > ereport(LOG, > - (errmsg("replication terminated by > primary server"), > - errdetail("End of WAL reached on > timeline %u at %X/%X.", > - startpointTLI, > - > LSN_FORMAT_ARGS(LogstreamResult.Write)))); > + (errmsg("replication terminated by > primary server on timeline %u at %X/%X.", > + startpointTLI, > + > LSN_FORMAT_ARGS(LogstreamResult.Write)))); > > Is this change really required? I don't see any issue with the > existing error message. Without the change, we see two similar end-of-WAL messages from both walreceiver and startup. (Please don't care about the slight difference of LSNs..) [walreceiver] LOG: replication terminated by primary server [walreceiver] DETAIL: End of WAL reached on timeline 1 at 0/B0000D8. [startup] LOG: reached end of WAL at 0/B000060 on timeline 1 in archive during standby mode [startup] DETAIL: empty record found at 0/B0000D8 But what the walreceiver detected at the time is not End-of-WAL but an error on the streaming connection. Since this patch makes startup process to detect End-of-WAL, we don't need the duplicate and in-a-sense false end-of-WAL message from walreceiver. # By the way, I deliberately choosed to report the LSN of last # successfully record in the "reached end of WAL" message. On second # thought about this choice, I came to think that it is better to report # the failure LSN. I changed it to report the failure LSN. In this # case we face an ambiguity according to how we failed to read the # record but for now we have no choice than blindly choosing one of # them. I choosed EndRecPtr since I think decode error happens quite # rarely than read errors. [walreceiver] LOG: replication terminated by primary server at 0/B014228 on timeline 1. [startup] LOG: reached end of WAL at 0/B014228 on timeline 1 in archive during standby mode [startup] DETAIL: empty record found at 0/B014228 This is the reason for the change. > Lastly, are we also planning to backport this patch? This is apparent a behavioral change, not a bug fix, which I think we regard as not appropriate for back-patching. As the result, I made the following chages in the version 11. 1. Changed the condition for the "end-of-WAL" message from emode_for_corrupt_record to the EndOfWAL flag. 2. Corrected the wording of end-of-wal to end-of-WAL. 3. In the "reached end of WAL" message, report the LSN of the beginning of failed record instead of the beginning of the last-succeeded record. 4. In the changed message in walreceiver.c, I swapped LSN and timeline so that they are in the same order with other similar messages. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From e07c1501cd0020f2a817dd9544c4aa5063e29685 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v11] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlog.c | 93 +++++++++++++----- src/backend/access/transam/xlogreader.c | 78 +++++++++++++++ src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++- 6 files changed, 271 insertions(+), 31 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 958220c495..618f33d342 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * EndRecPtr is the LSN we tried to read but failed. In the + * case of decoding error, it is at the end of the failed + * record but we don't have a means for now to know EndRecPtr + * is pointing to which of the beginning or ending of the + * failed record. + */ + ErrRecPtr = xlogreader->EndRecPtr; } if (readFile >= 0) @@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * If we get here for other than end-of-WAL, emit the error + * message right now. Otherwise the message if any is shown as a + * part of the end-of-WAL message below. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * If we haven't emit an error message, we have safely reached the + * end-of-WAL. + */ + if (xlogreader->EndOfWAL) + { + char *fmt; + + if (StandbyMode) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); + else if (InArchiveRecovery) + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); + else + fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); + + ereport(LOG, + (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI, + xlogSourceNames[currentSource]), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); + } } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -7294,7 +7334,7 @@ StartupXLOG(void) { ereport(LOG, (errmsg("database system was not properly shut down; " - "automatic recovery in progress"))); + "crash recovery in progress"))); if (recoveryTargetTLI > ControlFile->checkPointCopy.ThisTimeLineID) ereport(LOG, (errmsg("crash recovery starts in timeline %u " @@ -7544,7 +7584,7 @@ StartupXLOG(void) else { /* just have to read next record after CheckPoint */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -7782,7 +7822,7 @@ StartupXLOG(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -7842,13 +7882,20 @@ StartupXLOG(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that @@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..03a8b42f15 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -588,6 +590,15 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the message is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + if (record->xl_tot_len == 0) + { + /* + * We are almost sure reaching the end of WAL, make sure that the + * whole page after the record is filled with zeroes. + */ + char *p; + char *pe; + + /* scan from the beginning of the record to the end of block */ + p = (char *) record; + pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* + * The page after the record is completely zeroed. That suggests + * we don't have a record after this point. We don't bother + * checking the pages after since they are not zeroed in the case + * of recycled segments. + */ + report_invalid_record(state, "empty record found at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid record length */ + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, @@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr); + StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero"); + + if (hdr->xlp_magic == 0) + { + /* Regard an empty page as End-Of-WAL */ + int i; + + for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++); + if (i == XLOG_BLCKSZ) + { + char fname[MAXFNAMELEN]; + + XLogFileName(fname, state->seg.ws_tli, segno, + state->segcxt.ws_segsize); + + report_invalid_record(state, + "empty page in log segment %s, offset %u", + fname, + offset); + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid magic number */ + } if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { char fname[MAXFNAMELEN]; @@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, LSN_FORMAT_ARGS(hdr->xlp_pageaddr), fname, offset); + + /* + * If the page address is less than expected we assume it is an unused + * page in a recycled segment. + */ + if (hdr->xlp_pageaddr < recaddr) + state->EndOfWAL = true; + return false; } diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index b39fce8c23..1a7a692bc0 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -471,10 +471,9 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", - startpointTLI, - LSN_FORMAT_ARGS(LogstreamResult.Write)))); + (errmsg("replication terminated by primary server at %X/%X on timeline %u.", + LSN_FORMAT_ARGS(LogstreamResult.Write), + startpointTLI))); endofwal = true; break; } diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index a6251e1a96..3745e76488 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1176,9 +1176,16 @@ main(int argc, char **argv) exit(0); if (errormsg) - fatal_error("error in WAL record at %X/%X: %s", - LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), - errormsg); + { + if (xlogreader_state->EndOfWAL) + pg_log_info("end of WAL at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + else + fatal_error("error in WAL record at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + } XLogReaderFree(xlogreader_state); diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 477f0efe26..3eeba220a1 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* the last attempt was EOW? */ /* * Set at the end of recovery: the start point of a partial record at the diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl index 3892aba3e5..67d264df26 100644 --- a/src/test/recovery/t/011_crash_recovery.pl +++ b/src/test/recovery/t/011_crash_recovery.pl @@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; use Config; +use IPC::Run; -plan tests => 3; +plan tests => 11; +my $reached_eow_pat = "reached end of WAL at "; my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init(allows_streaming => 1); $node->start; @@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), # Crash and restart the postmaster $node->stop('immediate'); +my $logstart = get_log_size($node); $node->start; +my $max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, $reached_eow_pat, $logstart)); + sleep 0.5; +} +ok ($max_attempts >= 0, "end-of-wal is logged"); # Make sure we really got a new xid cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'), @@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), $stdin .= "\\q\n"; $tx->finish; # wait for psql to quit gracefully + +my $segsize = $node->safe_psql('postgres', + qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]); + +# make sure no records afterwards go to the next segment +$node->safe_psql('postgres', qq[ + SELECT pg_switch_wal(); + CHECKPOINT; + CREATE TABLE t(); +]); +$node->stop('immediate'); + +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +my $chkptfile; +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; + +# identify the last record +my $walfile = $node->data_dir() . "/pg_wal/$chkptfile"; +$cmd = ['pg_waldump', $walfile]; +$stdout = ''; +$stderr = ''; +my $lastrec; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +foreach my $l (split(/\r?\n/, $stdout)) +{ + $lastrec = $l; +} +ok(defined $lastrec, "last WAL record is extracted"); +ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/, + "pg_waldump emits the correct ending message"); + +# read the last record LSN excluding leading zeroes +ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/, + "LSN of the last record identified"); +my $lastlsn = $1; + +# corrupt the last record +my $offset = hex($lastlsn) % $segsize; +open(my $segf, '+<', $walfile) or die "failed to open $walfile\n"; +seek($segf, $offset, 0); # halfway break the last record +print $segf "\0\0\0\0"; +close($segf); + +# pg_waldump complains about the corrupted record +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/, + "pg_waldump emits the correct error message"); + +# also server complains +$logstart = get_log_size($node); +$node->start; +$max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, "WARNING: invalid record length at 0/$lastlsn: wanted [0-9]+, got 0", + $logstart)); + sleep 0.5; +} +ok($max_attempts >= 0, "header error is logged at $lastlsn"); + +# and the end-of-wal messages shouldn't be seen +# the same message has been confirmed in the past +ok(!find_in_log($node, $reached_eow_pat, $logstart), + "false log message is not emitted"); + +$node->stop('immediate'); + +#### helper routines +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0
On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Hi, Ashutosh. > > At Tue, 8 Feb 2022 18:35:34 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in > > Here are some of my review comments on the v11 patch: > > Thank you for taking a look on this. > > > - (errmsg_internal("reached end of WAL in > > pg_wal, entering archive recovery"))); > > + (errmsg_internal("reached end of WAL at %X/%X > > on timeline %u in %s during crash recovery, entering archive > > recovery", > > + LSN_FORMAT_ARGS(ErrRecPtr), > > + replayTLI, > > + xlogSourceNames[currentSource]))); > > > > Why crash recovery? Won't this message get printed even during PITR? > > It is in the if-block with the following condition. > > > * If archive recovery was requested, but we were still doing > > * crash recovery, switch to archive recovery and retry using the > > * offline archive. We have now replayed all the valid WAL in > > * pg_wal, so we are presumably now consistent. > ... > > if (!InArchiveRecovery && ArchiveRecoveryRequested) > > This means archive-recovery is requested but not started yet. That is, > we've just finished crash recovery. The existing comment cited > together is mentioning that. > > At the end of PITR (or archive recovery), the other code works. > This is quite understandable, the point here is that the message that we are emitting says, we have just finished reading the wal files in the pg_wal directory during crash recovery and are now entering archive recovery when we are actually doing point-in-time recovery which seems a bit misleading. > > /* > > * If we haven't emit an error message, we have safely reached the > > * end-of-WAL. > > */ > > if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG) > > { > > char *fmt; > > > > if (StandbyMode) > > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode"); > > else if (InArchiveRecovery) > > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery"); > > else > > fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery"); > > The last among the above messages is choosed when archive-recovery is > not requested at all. > > > I just did a PITR and could see these messages in the logfile. > > Yeah, the log lines are describing that the server starting with crash > recovery to run PITR. > > > 2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time > > recovery to WAL location (LSN) "0/5227790" > > 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not > > properly shut down; automatic recovery in progress > > Well. I guess that the "automatic recovery" is ambiguous. Does it > make sense if the second line were like the follows instead? > > + 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; crash recovery in progress > Well, according to me the current message looks fine. > > Lastly, are we also planning to backport this patch? > > This is apparent a behavioral change, not a bug fix, which I think we > regard as not appropriate for back-patching. > > > As the result, I made the following chages in the version 11. > > 1. Changed the condition for the "end-of-WAL" message from > emode_for_corrupt_record to the EndOfWAL flag. > > 2. Corrected the wording of end-of-wal to end-of-WAL. > > 3. In the "reached end of WAL" message, report the LSN of the > beginning of failed record instead of the beginning of the > last-succeeded record. > > 4. In the changed message in walreceiver.c, I swapped LSN and timeline > so that they are in the same order with other similar messages. > Thanks for sharing this information. == Here is one more comment: One more comment: +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +my $chkptfile; +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; $chkptfile is declared twice in the same scope. We can probably remove the first one. -- With Regards, Ashutosh Sharma.
At Wed, 9 Feb 2022 17:31:02 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in > On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > This means archive-recovery is requested but not started yet. That is, > > we've just finished crash recovery. The existing comment cited > > together is mentioning that. > > > > At the end of PITR (or archive recovery), the other code works. > > > > This is quite understandable, the point here is that the message that > we are emitting says, we have just finished reading the wal files in > the pg_wal directory during crash recovery and are now entering > archive recovery when we are actually doing point-in-time recovery > which seems a bit misleading. Here is the messages. > 2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time > recovery to WAL location (LSN) "0/5227790" > 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not > properly shut down; automatic recovery in progress > 2022-02-08 18:00:44.369 IST [86185] LOG: redo starts at 0/14DC8D8 > 2022-02-08 18:00:44.978 IST [86185] DEBUG1: reached end of WAL at > 0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering > archive recovery In the first place the last DEBUG1 is not on my part, but one of the messages added by this patch says the same thing. Is your point that archive recovery is different thing from PITR? In regard to the difference, I think PITR is a form of archive recovery. That being said, after some thoughts on this, I changed my mind that we don't need to say what operation was being performed at the end-of-WAL. So in the attached the end-of-WAL message is not accompanied by the kind of recovery. > LOG: reached end of WAL at 0/3000000 on timeline 1 I removed the archive-source part along with the operation mode. Because it make the message untranslatable. It is now very simple but seems enough. While working on this, I noticed that we need to set EndOfWAL when WaitForWALToBecomeAvailable returned with failure. That means the file does not exist at all so it is a kind of end-of-WAL. In that sense the following existing comment in ReadRecord is a bit wrong. > * We only end up here without a message when XLogPageRead() > * failed - in that case we already logged something. In > * StandbyMode that only happens if we have been triggered, so we > * shouldn't loop anymore in that case. Actually there's a case we get there without a message and without logged something when a segment file is not found unless we're in standby mode. > > Well. I guess that the "automatic recovery" is ambiguous. Does it > > make sense if the second line were like the follows instead? > > > > + 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; crash recovery in progress > > > > Well, according to me the current message looks fine. Good to hear. (In the previos version I modified the message by accident..) > $chkptfile is declared twice in the same scope. We can probably remove > the first one. Ugh.. Fixed. (I wonder why Perl doesn't complain on this..) In this version 12 I made the following changes. - Rewrote (halfly reverted) a comment in ReadRecord - Simplified the "reached end of WAL" message by removing recovery mode and WAL source in ReadRecord. - XLogPageRead sets EndOfWAL flag in the ENOENT case. - Removed redundant declaration of the same variable in TAP script. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From e553164dbca709389d92b05cf8ae7a8b427e83a6 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v12] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlog.c | 92 +++++++++++++----- src/backend/access/transam/xlogreader.c | 78 ++++++++++++++++ src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 108 +++++++++++++++++++++- 6 files changed, 268 insertions(+), 31 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 958220c495..bf1d40e7cb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * EndRecPtr is the LSN we tried to read but failed. In the + * case of decoding error, it is at the end of the failed + * record but we don't have a means for now to know EndRecPtr + * is pointing to which of the beginning or ending of the + * failed record. + */ + ErrRecPtr = xlogreader->EndRecPtr; } if (readFile >= 0) @@ -4504,13 +4517,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * We only end up here without a message when XLogPageRead() failed + * in that case we already logged something, or just met end-of-WAL + * conditions. In StandbyMode that only happens if we have been + * triggered, so we shouldn't loop anymore in that case. When + * EndOfWAL is true, we don't emit that error if any immediately + * and instead will show it as a part of a decent end-of-wal + * message later. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -4541,11 +4557,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + Assert(ErrRecPtr != InvalidXLogRecPtr); + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4558,11 +4577,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI, + xlogSourceNames[currentSource]))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4610,12 +4635,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * recovery ended. + * + * Emit a decent message if we met end-of-WAL. Otherwise we should + * have already emitted an error message. + */ + if (xlogreader->EndOfWAL) + ereport(LOG, + (errmsg("reached end of WAL at %X/%X on timeline %u", + LSN_FORMAT_ARGS(ErrRecPtr), replayTLI), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -7544,7 +7581,7 @@ StartupXLOG(void) else { /* just have to read next record after CheckPoint */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -7782,7 +7819,7 @@ StartupXLOG(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -7842,13 +7879,20 @@ StartupXLOG(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that @@ -12434,12 +12478,14 @@ retry: private->replayTLI, xlogreader->EndRecPtr)) { + Assert(!StandbyMode); + if (readFile >= 0) close(readFile); readFile = -1; readLen = 0; readSource = XLOG_FROM_ANY; - + xlogreader->EndOfWAL = true; return -1; } } @@ -13097,7 +13143,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..22982c4de7 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -588,6 +590,15 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the message is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + if (record->xl_tot_len == 0) + { + /* + * We are almost sure reaching the end of WAL, make sure that the + * whole page after the record is filled with zeroes. + */ + char *p; + char *pe; + + /* scan from the beginning of the record to the end of block */ + p = (char *) record; + pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* + * The page after the record is completely zeroed. That suggests + * we don't have a record after this point. We don't bother + * checking the pages after since they are not zeroed in the case + * of recycled segments. + */ + report_invalid_record(state, "empty record at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid record length */ + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, @@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr); + StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero"); + + if (hdr->xlp_magic == 0) + { + /* Regard an empty page as End-Of-WAL */ + int i; + + for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++); + if (i == XLOG_BLCKSZ) + { + char fname[MAXFNAMELEN]; + + XLogFileName(fname, state->seg.ws_tli, segno, + state->segcxt.ws_segsize); + + report_invalid_record(state, + "empty page in log segment %s, offset %u", + fname, + offset); + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid magic number */ + } if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { char fname[MAXFNAMELEN]; @@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, LSN_FORMAT_ARGS(hdr->xlp_pageaddr), fname, offset); + + /* + * If the page address is less than expected we assume it is an unused + * page in a recycled segment. + */ + if (hdr->xlp_pageaddr < recaddr) + state->EndOfWAL = true; + return false; } diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index b39fce8c23..1a7a692bc0 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -471,10 +471,9 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", - startpointTLI, - LSN_FORMAT_ARGS(LogstreamResult.Write)))); + (errmsg("replication terminated by primary server at %X/%X on timeline %u.", + LSN_FORMAT_ARGS(LogstreamResult.Write), + startpointTLI))); endofwal = true; break; } diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index a6251e1a96..3745e76488 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1176,9 +1176,16 @@ main(int argc, char **argv) exit(0); if (errormsg) - fatal_error("error in WAL record at %X/%X: %s", - LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), - errormsg); + { + if (xlogreader_state->EndOfWAL) + pg_log_info("end of WAL at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + else + fatal_error("error in WAL record at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + } XLogReaderFree(xlogreader_state); diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 477f0efe26..7b314ef10e 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* was the last attempt EOW? */ /* * Set at the end of recovery: the start point of a partial record at the diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl index 3892aba3e5..1d7476c309 100644 --- a/src/test/recovery/t/011_crash_recovery.pl +++ b/src/test/recovery/t/011_crash_recovery.pl @@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; use Config; +use IPC::Run; -plan tests => 3; +plan tests => 11; +my $reached_eow_pat = "reached end of WAL at "; my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init(allows_streaming => 1); $node->start; @@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), # Crash and restart the postmaster $node->stop('immediate'); +my $logstart = get_log_size($node); $node->start; +my $max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, $reached_eow_pat, $logstart)); + sleep 0.5; +} +ok ($max_attempts >= 0, "end-of-wal is logged"); # Make sure we really got a new xid cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'), @@ -62,3 +72,99 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), $stdin .= "\\q\n"; $tx->finish; # wait for psql to quit gracefully + +my $segsize = $node->safe_psql('postgres', + qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]); + +# make sure no records afterwards go to the next segment +$node->safe_psql('postgres', qq[ + SELECT pg_switch_wal(); + CHECKPOINT; + CREATE TABLE t(); +]); +$node->stop('immediate'); + +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; + +# identify the last record +my $walfile = $node->data_dir() . "/pg_wal/$chkptfile"; +$cmd = ['pg_waldump', $walfile]; +$stdout = ''; +$stderr = ''; +my $lastrec; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +foreach my $l (split(/\r?\n/, $stdout)) +{ + $lastrec = $l; +} +ok(defined $lastrec, "last WAL record is extracted"); +ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/, + "pg_waldump emits the correct ending message"); + +# read the last record LSN excluding leading zeroes +ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/, + "LSN of the last record identified"); +my $lastlsn = $1; + +# corrupt the last record +my $offset = hex($lastlsn) % $segsize; +open(my $segf, '+<', $walfile) or die "failed to open $walfile\n"; +seek($segf, $offset, 0); # halfway break the last record +print $segf "\0\0\0\0"; +close($segf); + +# pg_waldump complains about the corrupted record +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/, + "pg_waldump emits the correct error message"); + +# also server complains +$logstart = get_log_size($node); +$node->start; +$max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, "WARNING: invalid record length at 0/$lastlsn: wanted [0-9]+, got 0", + $logstart)); + sleep 0.5; +} +ok($max_attempts >= 0, "header error is logged at $lastlsn"); + +# no end-of-wal message should be seen this time +ok(!find_in_log($node, $reached_eow_pat, $logstart), + "false log message is not emitted"); + +$node->stop('immediate'); + +#### helper routines +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0
Hi, On Thu, Feb 10, 2022 at 11:47 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Wed, 9 Feb 2022 17:31:02 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in > > On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi > > <horikyota.ntt@gmail.com> wrote: > > > This means archive-recovery is requested but not started yet. That is, > > > we've just finished crash recovery. The existing comment cited > > > together is mentioning that. > > > > > > At the end of PITR (or archive recovery), the other code works. > > > > > > > This is quite understandable, the point here is that the message that > > we are emitting says, we have just finished reading the wal files in > > the pg_wal directory during crash recovery and are now entering > > archive recovery when we are actually doing point-in-time recovery > > which seems a bit misleading. > > Here is the messages. > > > 2022-02-08 18:00:44.367 IST [86185] LOG: starting point-in-time > > recovery to WAL location (LSN) "0/5227790" > > 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not > > properly shut down; automatic recovery in progress > > 2022-02-08 18:00:44.369 IST [86185] LOG: redo starts at 0/14DC8D8 > > 2022-02-08 18:00:44.978 IST [86185] DEBUG1: reached end of WAL at > > 0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering > > archive recovery > > In the first place the last DEBUG1 is not on my part, but one of the > messages added by this patch says the same thing. Is your point that > archive recovery is different thing from PITR? In regard to the > difference, I think PITR is a form of archive recovery. > No, I haven't tried to compare archive recovery to PITR or vice versa, instead I was trying to compare crash recovery with PITR. The message you're emitting says just before entering into the archive recovery is - "reached end-of-WAL on ... in pg_wal *during crash recovery*, entering archive recovery". This message is static and can be emitted not only during crash recovery, but also during PITR. I think we can remove the "during crash recovery" part from this message, so "reached the end of WAL at %X/%X on timeline %u in %s, entering archive recovery". Also I don't think we need format specifier %s here, it can be hard-coded with pg_wal as in this case we can only enter archive recovery after reading wal from pg_wal, so current WAL source has to be pg_wal, isn't it? > That being said, after some thoughts on this, I changed my mind that > we don't need to say what operation was being performed at the > end-of-WAL. So in the attached the end-of-WAL message is not > accompanied by the kind of recovery. > > > LOG: reached end of WAL at 0/3000000 on timeline 1 > > I removed the archive-source part along with the operation mode. > Because it make the message untranslatable. It is now very simple but > seems enough. > > While working on this, I noticed that we need to set EndOfWAL when > WaitForWALToBecomeAvailable returned with failure. That means the > file does not exist at all so it is a kind of end-of-WAL. In that > sense the following existing comment in ReadRecord is a bit wrong. > > > * We only end up here without a message when XLogPageRead() > > * failed - in that case we already logged something. In > > * StandbyMode that only happens if we have been triggered, so we > > * shouldn't loop anymore in that case. > > Actually there's a case we get there without a message and without > logged something when a segment file is not found unless we're in > standby mode. > > > > Well. I guess that the "automatic recovery" is ambiguous. Does it > > > make sense if the second line were like the follows instead? > > > > > > + 2022-02-08 18:00:44.368 IST [86185] LOG: database system was not properly shut down; crash recovery in progress > > > > > > > Well, according to me the current message looks fine. > > Good to hear. (In the previos version I modified the message by accident..) > > > $chkptfile is declared twice in the same scope. We can probably remove > > the first one. > > Ugh.. Fixed. (I wonder why Perl doesn't complain on this..) > > > In this version 12 I made the following changes. > > - Rewrote (halfly reverted) a comment in ReadRecord > > - Simplified the "reached end of WAL" message by removing recovery > mode and WAL source in ReadRecord. > > - XLogPageRead sets EndOfWAL flag in the ENOENT case. > > - Removed redundant declaration of the same variable in TAP script. > Thanks for the changes. Please note that I am not able to apply the latest patch on HEAD. Could you please rebase it on HEAD and share the new version. Thank you. -- With Regards, Ashutosh Sharma.
At Mon, 14 Feb 2022 20:14:11 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in > No, I haven't tried to compare archive recovery to PITR or vice versa, > instead I was trying to compare crash recovery with PITR. The message > you're emitting says just before entering into the archive recovery is > - "reached end-of-WAL on ... in pg_wal *during crash recovery*, > entering archive recovery". This message is static and can be emitted > not only during crash recovery, but also during PITR. I think we can No. It is emitted *only* after crash recovery before starting archive recovery. Another message this patch adds can be emitted after PITR or archive recovery. > not only during crash recovery, but also during PITR. I think we can > remove the "during crash recovery" part from this message, so "reached > the end of WAL at %X/%X on timeline %u in %s, entering archive What makes you think it can be emitted after other than crash recovery? (Please look at the code comment just above.) > recovery". Also I don't think we need format specifier %s here, it can > be hard-coded with pg_wal as in this case we can only enter archive > recovery after reading wal from pg_wal, so current WAL source has to > be pg_wal, isn't it? You're right that it can't be other than pg_wal. It was changed just in accordance woth another message this patch adds and it would be a matter of taste. I replaced to "pg_wal" in this version. > Thanks for the changes. Please note that I am not able to apply the > latest patch on HEAD. Could you please rebase it on HEAD and share the > new version. Thank you. A change on TAP script hit this. The v13 attached is: - Rebased. - Replaced "%s" in the debug transition message from crash recovery to archive recovery. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 311e862e87dbdeb6348c6fc17063308342359c02 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v13] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlog.c | 91 ++++++++++++++----- src/backend/access/transam/xlogreader.c | 78 ++++++++++++++++ src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++++++++ 6 files changed, 266 insertions(+), 30 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 958220c495..bb7026ac77 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * EndRecPtr is the LSN we tried to read but failed. In the + * case of decoding error, it is at the end of the failed + * record but we don't have a means for now to know EndRecPtr + * is pointing to which of the beginning or ending of the + * failed record. + */ + ErrRecPtr = xlogreader->EndRecPtr; } if (readFile >= 0) @@ -4504,13 +4517,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * We only end up here without a message when XLogPageRead() failed + * in that case we already logged something, or just met end-of-WAL + * conditions. In StandbyMode that only happens if we have been + * triggered, so we shouldn't loop anymore in that case. When + * EndOfWAL is true, we don't emit that error if any immediately + * and instead will show it as a part of a decent end-of-wal + * message later. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -4541,11 +4557,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + Assert(ErrRecPtr != InvalidXLogRecPtr); + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -4558,11 +4577,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -4610,12 +4634,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * recovery ended. + * + * Emit a decent message if we met end-of-WAL. Otherwise we should + * have already emitted an error message. + */ + if (xlogreader->EndOfWAL) + ereport(LOG, + (errmsg("reached end of WAL at %X/%X on timeline %u", + LSN_FORMAT_ARGS(ErrRecPtr), replayTLI), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -7544,7 +7580,7 @@ StartupXLOG(void) else { /* just have to read next record after CheckPoint */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -7782,7 +7818,7 @@ StartupXLOG(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -7842,13 +7878,20 @@ StartupXLOG(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that @@ -12434,12 +12477,14 @@ retry: private->replayTLI, xlogreader->EndRecPtr)) { + Assert(!StandbyMode); + if (readFile >= 0) close(readFile); readFile = -1; readLen = 0; readSource = XLOG_FROM_ANY; - + xlogreader->EndOfWAL = true; return -1; } } @@ -13097,7 +13142,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..22982c4de7 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -588,6 +590,15 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the message is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + if (record->xl_tot_len == 0) + { + /* + * We are almost sure reaching the end of WAL, make sure that the + * whole page after the record is filled with zeroes. + */ + char *p; + char *pe; + + /* scan from the beginning of the record to the end of block */ + p = (char *) record; + pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* + * The page after the record is completely zeroed. That suggests + * we don't have a record after this point. We don't bother + * checking the pages after since they are not zeroed in the case + * of recycled segments. + */ + report_invalid_record(state, "empty record at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid record length */ + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, @@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr); + StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero"); + + if (hdr->xlp_magic == 0) + { + /* Regard an empty page as End-Of-WAL */ + int i; + + for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++); + if (i == XLOG_BLCKSZ) + { + char fname[MAXFNAMELEN]; + + XLogFileName(fname, state->seg.ws_tli, segno, + state->segcxt.ws_segsize); + + report_invalid_record(state, + "empty page in log segment %s, offset %u", + fname, + offset); + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid magic number */ + } if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { char fname[MAXFNAMELEN]; @@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, LSN_FORMAT_ARGS(hdr->xlp_pageaddr), fname, offset); + + /* + * If the page address is less than expected we assume it is an unused + * page in a recycled segment. + */ + if (hdr->xlp_pageaddr < recaddr) + state->EndOfWAL = true; + return false; } diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index b39fce8c23..1a7a692bc0 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -471,10 +471,9 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", - startpointTLI, - LSN_FORMAT_ARGS(LogstreamResult.Write)))); + (errmsg("replication terminated by primary server at %X/%X on timeline %u.", + LSN_FORMAT_ARGS(LogstreamResult.Write), + startpointTLI))); endofwal = true; break; } diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index a6251e1a96..3745e76488 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1176,9 +1176,16 @@ main(int argc, char **argv) exit(0); if (errormsg) - fatal_error("error in WAL record at %X/%X: %s", - LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), - errormsg); + { + if (xlogreader_state->EndOfWAL) + pg_log_info("end of WAL at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + else + fatal_error("error in WAL record at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + } XLogReaderFree(xlogreader_state); diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 477f0efe26..7b314ef10e 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* was the last attempt EOW? */ /* * Set at the end of recovery: the start point of a partial record at the diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl index 14154d1ce0..01033334d6 100644 --- a/src/test/recovery/t/011_crash_recovery.pl +++ b/src/test/recovery/t/011_crash_recovery.pl @@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; use Config; +use IPC::Run; +my $reached_eow_pat = "reached end of WAL at "; my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init(allows_streaming => 1); $node->start; @@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), # Crash and restart the postmaster $node->stop('immediate'); +my $logstart = get_log_size($node); $node->start; +my $max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, $reached_eow_pat, $logstart)); + sleep 0.5; +} +ok ($max_attempts >= 0, "end-of-wal is logged"); # Make sure we really got a new xid cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'), @@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), $stdin .= "\\q\n"; $tx->finish; # wait for psql to quit gracefully +my $segsize = $node->safe_psql('postgres', + qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]); + +# make sure no records afterwards go to the next segment +$node->safe_psql('postgres', qq[ + SELECT pg_switch_wal(); + CHECKPOINT; + CREATE TABLE t(); +]); +$node->stop('immediate'); + +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; + +# identify the last record +my $walfile = $node->data_dir() . "/pg_wal/$chkptfile"; +$cmd = ['pg_waldump', $walfile]; +$stdout = ''; +$stderr = ''; +my $lastrec; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +foreach my $l (split(/\r?\n/, $stdout)) +{ + $lastrec = $l; +} +ok(defined $lastrec, "last WAL record is extracted"); +ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/, + "pg_waldump emits the correct ending message"); + +# read the last record LSN excluding leading zeroes +ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/, + "LSN of the last record identified"); +my $lastlsn = $1; + +# corrupt the last record +my $offset = hex($lastlsn) % $segsize; +open(my $segf, '+<', $walfile) or die "failed to open $walfile\n"; +seek($segf, $offset, 0); # halfway break the last record +print $segf "\0\0\0\0"; +close($segf); + +# pg_waldump complains about the corrupted record +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/, + "pg_waldump emits the correct error message"); + +# also server complains +$logstart = get_log_size($node); +$node->start; +$max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, "WARNING: invalid record length at 0/$lastlsn: wanted [0-9]+, got 0", + $logstart)); + sleep 0.5; +} +ok($max_attempts >= 0, "header error is logged at $lastlsn"); + +# no end-of-wal message should be seen this time +ok(!find_in_log($node, $reached_eow_pat, $logstart), + "false log message is not emitted"); + +$node->stop('immediate'); + done_testing(); + +#### helper routines +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0
On Tue, Feb 15, 2022 at 7:52 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Mon, 14 Feb 2022 20:14:11 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in > > No, I haven't tried to compare archive recovery to PITR or vice versa, > > instead I was trying to compare crash recovery with PITR. The message > > you're emitting says just before entering into the archive recovery is > > - "reached end-of-WAL on ... in pg_wal *during crash recovery*, > > entering archive recovery". This message is static and can be emitted > > not only during crash recovery, but also during PITR. I think we can > > No. It is emitted *only* after crash recovery before starting archive > recovery. Another message this patch adds can be emitted after PITR > or archive recovery. > > > not only during crash recovery, but also during PITR. I think we can > > remove the "during crash recovery" part from this message, so "reached > > the end of WAL at %X/%X on timeline %u in %s, entering archive > > What makes you think it can be emitted after other than crash recovery? > (Please look at the code comment just above.) > Yep that's right. We won't be coming here in case of pitr. > > recovery". Also I don't think we need format specifier %s here, it can > > be hard-coded with pg_wal as in this case we can only enter archive > > recovery after reading wal from pg_wal, so current WAL source has to > > be pg_wal, isn't it? > > You're right that it can't be other than pg_wal. It was changed just > in accordance woth another message this patch adds and it would be a > matter of taste. I replaced to "pg_wal" in this version. > OK. I have verified the changes. > > Thanks for the changes. Please note that I am not able to apply the > > latest patch on HEAD. Could you please rebase it on HEAD and share the > > new version. Thank you. > > A change on TAP script hit this. The v13 attached is: > OK. The v13 patch looks good. I have marked it as ready to commit. Thank you for working on all my review comments. -- With Regards, Ashutosh Sharma.
At Tue, 15 Feb 2022 20:17:20 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in > OK. The v13 patch looks good. I have marked it as ready to commit. > Thank you for working on all my review comments. Thaks! But the recent xlog.c refactoring crashes into this patch. And I found a silly bug while rebasing. xlog.c:12463 / xlogrecovery.c:3168 if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen, .. { + Assert(!StandbyMode); ... + xlogreader->EndOfWAL = true; Yeah, I forgot about promotion there.. So what I should have done is setting EndOfWAL according to StandbyMode. + Assert(!StandbyMode || CheckForStandbyTrigger()); ... + /* promotion exit is not end-of-WAL */ + xlogreader->EndOfWAL = !StandbyMode; The rebased v14 is attached. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 5613ee80a4d2a9786f5ce8421dcbb560b63a13c1 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v14] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlogreader.c | 78 ++++++++++++++++ src/backend/access/transam/xlogrecovery.c | 92 ++++++++++++++----- src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++++++++ 6 files changed, 268 insertions(+), 29 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..22982c4de7 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -588,6 +590,15 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the message is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { + if (record->xl_tot_len == 0) + { + /* + * We are almost sure reaching the end of WAL, make sure that the + * whole page after the record is filled with zeroes. + */ + char *p; + char *pe; + + /* scan from the beginning of the record to the end of block */ + p = (char *) record; + pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* + * The page after the record is completely zeroed. That suggests + * we don't have a record after this point. We don't bother + * checking the pages after since they are not zeroed in the case + * of recycled segments. + */ + report_invalid_record(state, "empty record at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid record length */ + } if (record->xl_tot_len < SizeOfXLogRecord) { report_invalid_record(state, @@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr); + StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero"); + + if (hdr->xlp_magic == 0) + { + /* Regard an empty page as End-Of-WAL */ + int i; + + for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++); + if (i == XLOG_BLCKSZ) + { + char fname[MAXFNAMELEN]; + + XLogFileName(fname, state->seg.ws_tli, segno, + state->segcxt.ws_segsize); + + report_invalid_record(state, + "empty page in log segment %s, offset %u", + fname, + offset); + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid magic number */ + } if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { char fname[MAXFNAMELEN]; @@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, LSN_FORMAT_ARGS(hdr->xlp_pageaddr), fname, offset); + + /* + * If the page address is less than expected we assume it is an unused + * page in a recycled segment. + */ + if (hdr->xlp_pageaddr < recaddr) + state->EndOfWAL = true; + return false; } diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index f9f212680b..750056acaf 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -1592,7 +1592,7 @@ PerformWalRecovery(void) /* just have to read next record after CheckPoint */ Assert(xlogreader->ReadRecPtr == CheckPointLoc); replayTLI = CheckPointTLI; - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -1706,7 +1706,7 @@ PerformWalRecovery(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -1765,13 +1765,20 @@ PerformWalRecovery(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that indicate @@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * EndRecPtr is the LSN we tried to read but failed. In the + * case of decoding error, it is at the end of the failed + * record but we don't have a means for now to know EndRecPtr + * is pointing to which of the beginning or ending of the + * failed record. + */ + ErrRecPtr = xlogreader->EndRecPtr; } if (readFile >= 0) @@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * We only end up here without a message when XLogPageRead() failed + * in that case we already logged something, or just met end-of-WAL + * conditions. In StandbyMode that only happens if we have been + * triggered, so we shouldn't loop anymore in that case. When + * EndOfWAL is true, we don't emit that error if any immediately + * and instead will show it as a part of a decent end-of-wal + * message later. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + Assert(ErrRecPtr != InvalidXLogRecPtr); + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * recovery ended. + * + * Emit a decent message if we met end-of-WAL. Otherwise we should + * have already emitted an error message. + */ + if (xlogreader->EndOfWAL) + ereport(LOG, + (errmsg("reached end of WAL at %X/%X on timeline %u", + LSN_FORMAT_ARGS(ErrRecPtr), replayTLI), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -3129,12 +3172,16 @@ retry: private->replayTLI, xlogreader->EndRecPtr)) { + Assert(!StandbyMode || CheckForStandbyTrigger()); + if (readFile >= 0) close(readFile); readFile = -1; readLen = 0; readSource = XLOG_FROM_ANY; + /* promotion exit is not end-of-WAL */ + xlogreader->EndOfWAL = !StandbyMode; return -1; } } @@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + /* use currentSource as readSource is reset at failure */ + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index ceaff097b9..4f117ea4da 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -472,10 +472,9 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", - startpointTLI, - LSN_FORMAT_ARGS(LogstreamResult.Write)))); + (errmsg("replication terminated by primary server at %X/%X on timeline %u.", + LSN_FORMAT_ARGS(LogstreamResult.Write), + startpointTLI))); endofwal = true; break; } diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index a6251e1a96..3745e76488 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1176,9 +1176,16 @@ main(int argc, char **argv) exit(0); if (errormsg) - fatal_error("error in WAL record at %X/%X: %s", - LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), - errormsg); + { + if (xlogreader_state->EndOfWAL) + pg_log_info("end of WAL at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + else + fatal_error("error in WAL record at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + } XLogReaderFree(xlogreader_state); diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 477f0efe26..7b314ef10e 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* was the last attempt EOW? */ /* * Set at the end of recovery: the start point of a partial record at the diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl index 14154d1ce0..01033334d6 100644 --- a/src/test/recovery/t/011_crash_recovery.pl +++ b/src/test/recovery/t/011_crash_recovery.pl @@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; use Config; +use IPC::Run; +my $reached_eow_pat = "reached end of WAL at "; my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init(allows_streaming => 1); $node->start; @@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), # Crash and restart the postmaster $node->stop('immediate'); +my $logstart = get_log_size($node); $node->start; +my $max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, $reached_eow_pat, $logstart)); + sleep 0.5; +} +ok ($max_attempts >= 0, "end-of-wal is logged"); # Make sure we really got a new xid cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'), @@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), $stdin .= "\\q\n"; $tx->finish; # wait for psql to quit gracefully +my $segsize = $node->safe_psql('postgres', + qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]); + +# make sure no records afterwards go to the next segment +$node->safe_psql('postgres', qq[ + SELECT pg_switch_wal(); + CHECKPOINT; + CREATE TABLE t(); +]); +$node->stop('immediate'); + +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; + +# identify the last record +my $walfile = $node->data_dir() . "/pg_wal/$chkptfile"; +$cmd = ['pg_waldump', $walfile]; +$stdout = ''; +$stderr = ''; +my $lastrec; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +foreach my $l (split(/\r?\n/, $stdout)) +{ + $lastrec = $l; +} +ok(defined $lastrec, "last WAL record is extracted"); +ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/, + "pg_waldump emits the correct ending message"); + +# read the last record LSN excluding leading zeroes +ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/, + "LSN of the last record identified"); +my $lastlsn = $1; + +# corrupt the last record +my $offset = hex($lastlsn) % $segsize; +open(my $segf, '+<', $walfile) or die "failed to open $walfile\n"; +seek($segf, $offset, 0); # halfway break the last record +print $segf "\0\0\0\0"; +close($segf); + +# pg_waldump complains about the corrupted record +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/, + "pg_waldump emits the correct error message"); + +# also server complains +$logstart = get_log_size($node); +$node->start; +$max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, "WARNING: invalid record length at 0/$lastlsn: wanted [0-9]+, got 0", + $logstart)); + sleep 0.5; +} +ok($max_attempts >= 0, "header error is logged at $lastlsn"); + +# no end-of-wal message should be seen this time +ok(!find_in_log($node, $reached_eow_pat, $logstart), + "false log message is not emitted"); + +$node->stop('immediate'); + done_testing(); + +#### helper routines +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0
On Thu, Feb 17, 2022 at 1:20 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Tue, 15 Feb 2022 20:17:20 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in > > OK. The v13 patch looks good. I have marked it as ready to commit. > > Thank you for working on all my review comments. > > Thaks! But the recent xlog.c refactoring crashes into this patch. > And I found a silly bug while rebasing. > > xlog.c:12463 / xlogrecovery.c:3168 > if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen, > .. > { > + Assert(!StandbyMode); > ... > + xlogreader->EndOfWAL = true; > > Yeah, I forgot about promotion there.. Yes, we exit WaitForWALToBecomeAvailable() even in standby mode provided the user has requested for the promotion. So checking for the !StandbyMode condition alone was not enough. So what I should have done is > setting EndOfWAL according to StandbyMode. > > + Assert(!StandbyMode || CheckForStandbyTrigger()); > ... > + /* promotion exit is not end-of-WAL */ > + xlogreader->EndOfWAL = !StandbyMode; > The changes looks good. thanks.! -- With Regards, Ashutosh Sharma.
At Sat, 19 Feb 2022 09:31:33 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in > The changes looks good. thanks.! Thanks! Some recent core change changed WAL insertion speed during the TAP test and revealed one forgotton case of EndOfWAL. When a record header flows into the next page, XLogReadRecord does separate check from ValidXLogRecordHeader by itself. > * If the whole record header is on this page, validate it immediately. > * Otherwise do just a basic sanity check on xl_tot_len, and validate the > * rest of the header after reading it from the next page. The xl_tot_len > * check is necessary here to ensure that we enter the "Need to reassemble > * record" code path below; otherwise we might fail to apply > * ValidXLogRecordHeader at all. > */ > if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord) > { ... > } > else > { > /* XXX: more validation should be done here */ > if (total_len < SizeOfXLogRecord) > { I could simplly copy-in a part of ValidXLogRecordHeader there but that results in rather large duplicate code. I could have ValidXLogRecordHeader handle the partial header case but it seems to me complex. So in this version I split the xl_tot_len part of ValidXLogRecordHeader into ValidXLogRecordLength. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 01cce076d2b3ad536398cc2b716ef64ed9b2c409 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v15] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlogreader.c | 125 ++++++++++++++++++---- src/backend/access/transam/xlogrecovery.c | 92 ++++++++++++---- src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 ++- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++++ 6 files changed, 297 insertions(+), 47 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..ba1c1ece87 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -42,6 +42,8 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength); static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen); static void XLogReaderInvalReadState(XLogReaderState *state); +static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record); static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess); static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, @@ -121,6 +123,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +295,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -380,12 +384,11 @@ restart: * whole header. */ record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); - total_len = record->xl_tot_len; /* * If the whole record header is on this page, validate it immediately. - * Otherwise do just a basic sanity check on xl_tot_len, and validate the - * rest of the header after reading it from the next page. The xl_tot_len + * Otherwise do just a basic sanity check on record length, and validate + * the rest of the header after reading it from the next page. The length * check is necessary here to ensure that we enter the "Need to reassemble * record" code path below; otherwise we might fail to apply * ValidXLogRecordHeader at all. @@ -399,18 +402,13 @@ restart: } else { - /* XXX: more validation should be done here */ - if (total_len < SizeOfXLogRecord) - { - report_invalid_record(state, - "invalid record length at %X/%X: wanted %u, got %u", - LSN_FORMAT_ARGS(RecPtr), - (uint32) SizeOfXLogRecord, total_len); + if (!ValidXLogRecordLength(state, RecPtr, record)) goto err; - } + gotheader = false; } + total_len = record->xl_tot_len; len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ; if (total_len > len) { @@ -588,6 +586,15 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the message is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -719,6 +726,60 @@ XLogReaderInvalReadState(XLogReaderState *state) state->readLen = 0; } +/* + * Validate record length of an XLOG record header. + * + * This is substantially a part of ValidXLogRecordHeader. But XLogReadRecord + * needs this separate from the function in case of a partial record header. + */ +static bool +ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record) +{ + if (record->xl_tot_len == 0) + { + char *p; + char *pe; + + /* + * We are almost sure reaching the end of WAL, make sure that the + * whole page after the record is filled with zeroes. + */ + p = (char *) record; + pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* + * The page after the record is completely zeroed. That suggests + * we don't have a record after this point. We don't bother + * checking the pages after since they are not zeroed in the case + * of recycled segments. + */ + report_invalid_record(state, "empty record at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + return false; + } + } + + if (record->xl_tot_len < SizeOfXLogRecord) + { + report_invalid_record(state, + "invalid record length at %X/%X: wanted %u, got %u", + LSN_FORMAT_ARGS(RecPtr), + (uint32) SizeOfXLogRecord, record->xl_tot_len); + return false; + } + + return true; +} + /* * Validate an XLOG record header. * @@ -730,14 +791,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { - if (record->xl_tot_len < SizeOfXLogRecord) - { - report_invalid_record(state, - "invalid record length at %X/%X: wanted %u, got %u", - LSN_FORMAT_ARGS(RecPtr), - (uint32) SizeOfXLogRecord, record->xl_tot_len); + if (!ValidXLogRecordLength(state, RecPtr, record)) return false; - } + if (record->xl_rmid > RM_MAX_ID) { report_invalid_record(state, @@ -836,6 +892,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr); + StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero"); + + if (hdr->xlp_magic == 0) + { + /* Regard an empty page as End-Of-WAL */ + int i; + + for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++); + if (i == XLOG_BLCKSZ) + { + char fname[MAXFNAMELEN]; + + XLogFileName(fname, state->seg.ws_tli, segno, + state->segcxt.ws_segsize); + + report_invalid_record(state, + "empty page in log segment %s, offset %u", + fname, + offset); + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid magic number */ + } if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { char fname[MAXFNAMELEN]; @@ -921,6 +1002,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, LSN_FORMAT_ARGS(hdr->xlp_pageaddr), fname, offset); + + /* + * If the page address is less than expected we assume it is an unused + * page in a recycled segment. + */ + if (hdr->xlp_pageaddr < recaddr) + state->EndOfWAL = true; + return false; } diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index f9f212680b..750056acaf 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -1592,7 +1592,7 @@ PerformWalRecovery(void) /* just have to read next record after CheckPoint */ Assert(xlogreader->ReadRecPtr == CheckPointLoc); replayTLI = CheckPointTLI; - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -1706,7 +1706,7 @@ PerformWalRecovery(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -1765,13 +1765,20 @@ PerformWalRecovery(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that indicate @@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * EndRecPtr is the LSN we tried to read but failed. In the + * case of decoding error, it is at the end of the failed + * record but we don't have a means for now to know EndRecPtr + * is pointing to which of the beginning or ending of the + * failed record. + */ + ErrRecPtr = xlogreader->EndRecPtr; } if (readFile >= 0) @@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * We only end up here without a message when XLogPageRead() failed + * in that case we already logged something, or just met end-of-WAL + * conditions. In StandbyMode that only happens if we have been + * triggered, so we shouldn't loop anymore in that case. When + * EndOfWAL is true, we don't emit that error if any immediately + * and instead will show it as a part of a decent end-of-wal + * message later. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + Assert(ErrRecPtr != InvalidXLogRecPtr); + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * recovery ended. + * + * Emit a decent message if we met end-of-WAL. Otherwise we should + * have already emitted an error message. + */ + if (xlogreader->EndOfWAL) + ereport(LOG, + (errmsg("reached end of WAL at %X/%X on timeline %u", + LSN_FORMAT_ARGS(ErrRecPtr), replayTLI), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -3129,12 +3172,16 @@ retry: private->replayTLI, xlogreader->EndRecPtr)) { + Assert(!StandbyMode || CheckForStandbyTrigger()); + if (readFile >= 0) close(readFile); readFile = -1; readLen = 0; readSource = XLOG_FROM_ANY; + /* promotion exit is not end-of-WAL */ + xlogreader->EndOfWAL = !StandbyMode; return -1; } } @@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + /* use currentSource as readSource is reset at failure */ + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index ceaff097b9..4f117ea4da 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -472,10 +472,9 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", - startpointTLI, - LSN_FORMAT_ARGS(LogstreamResult.Write)))); + (errmsg("replication terminated by primary server at %X/%X on timeline %u.", + LSN_FORMAT_ARGS(LogstreamResult.Write), + startpointTLI))); endofwal = true; break; } diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index 2340dc247b..215abe95dc 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1173,9 +1173,16 @@ main(int argc, char **argv) exit(0); if (errormsg) - fatal_error("error in WAL record at %X/%X: %s", - LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), - errormsg); + { + if (xlogreader_state->EndOfWAL) + pg_log_info("end of WAL at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + else + fatal_error("error in WAL record at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + } XLogReaderFree(xlogreader_state); diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 477f0efe26..7b314ef10e 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* was the last attempt EOW? */ /* * Set at the end of recovery: the start point of a partial record at the diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl index 14154d1ce0..01033334d6 100644 --- a/src/test/recovery/t/011_crash_recovery.pl +++ b/src/test/recovery/t/011_crash_recovery.pl @@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; use Config; +use IPC::Run; +my $reached_eow_pat = "reached end of WAL at "; my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init(allows_streaming => 1); $node->start; @@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), # Crash and restart the postmaster $node->stop('immediate'); +my $logstart = get_log_size($node); $node->start; +my $max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, $reached_eow_pat, $logstart)); + sleep 0.5; +} +ok ($max_attempts >= 0, "end-of-wal is logged"); # Make sure we really got a new xid cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'), @@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), $stdin .= "\\q\n"; $tx->finish; # wait for psql to quit gracefully +my $segsize = $node->safe_psql('postgres', + qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]); + +# make sure no records afterwards go to the next segment +$node->safe_psql('postgres', qq[ + SELECT pg_switch_wal(); + CHECKPOINT; + CREATE TABLE t(); +]); +$node->stop('immediate'); + +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; + +# identify the last record +my $walfile = $node->data_dir() . "/pg_wal/$chkptfile"; +$cmd = ['pg_waldump', $walfile]; +$stdout = ''; +$stderr = ''; +my $lastrec; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +foreach my $l (split(/\r?\n/, $stdout)) +{ + $lastrec = $l; +} +ok(defined $lastrec, "last WAL record is extracted"); +ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/, + "pg_waldump emits the correct ending message"); + +# read the last record LSN excluding leading zeroes +ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/, + "LSN of the last record identified"); +my $lastlsn = $1; + +# corrupt the last record +my $offset = hex($lastlsn) % $segsize; +open(my $segf, '+<', $walfile) or die "failed to open $walfile\n"; +seek($segf, $offset, 0); # halfway break the last record +print $segf "\0\0\0\0"; +close($segf); + +# pg_waldump complains about the corrupted record +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/, + "pg_waldump emits the correct error message"); + +# also server complains +$logstart = get_log_size($node); +$node->start; +$max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, "WARNING: invalid record length at 0/$lastlsn: wanted [0-9]+, got 0", + $logstart)); + sleep 0.5; +} +ok($max_attempts >= 0, "header error is logged at $lastlsn"); + +# no end-of-wal message should be seen this time +ok(!find_in_log($node, $reached_eow_pat, $logstart), + "false log message is not emitted"); + +$node->stop('immediate'); + done_testing(); + +#### helper routines +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0
On Wed, Mar 2, 2022 at 7:47 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > At Sat, 19 Feb 2022 09:31:33 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in > > The changes looks good. thanks.! > > Thanks! > > Some recent core change changed WAL insertion speed during the TAP > test and revealed one forgotton case of EndOfWAL. When a record > header flows into the next page, XLogReadRecord does separate check > from ValidXLogRecordHeader by itself. > The new changes made in the patch look good. Thanks to the recent changes to speed WAL insertion that have helped us catch this bug. One small comment: record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); - total_len = record->xl_tot_len; Do you think we need to change the position of the comments written for above code that says: /* * Read the record length. * ... ... -- With Regards, Ashutosh Sharma.
At Thu, 3 Mar 2022 15:39:44 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in > The new changes made in the patch look good. Thanks to the recent > changes to speed WAL insertion that have helped us catch this bug. Thanks for the quick checking. > One small comment: > > record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); > - total_len = record->xl_tot_len; > > Do you think we need to change the position of the comments written > for above code that says: Yeah, I didn't do that since it is about header verification. But as you pointed, the result still doesn't look perfect. On second thought the two seems repeating the same things. Thus I merged the two comments together. In this verion 16 it looks like this. > /* > * Validate the record header. > * > * Even though we use an XLogRecord pointer here, the whole record header > * might not fit on this page. If the whole record header is on this page, > * validate it immediately. Even otherwise xl_tot_len must be on this page > * (it is the first field of MAXALIGNed records), but we still cannot > * access any further fields until we've verified that we got the whole > * header, so do just a basic sanity check on record length, and validate > * the rest of the header after reading it from the next page. The length > * check is necessary here to ensure that we enter the "Need to reassemble > * record" code path below; otherwise we might fail to apply > * ValidXLogRecordHeader at all. > */ > record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); > > if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord) regards. -- Kyotaro Horiguchi NTT Open Source Software Center From 00d848df6bb8b9966dfbd39c98a388fda42a3e3c Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v16] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlogreader.c | 144 +++++++++++++++++----- src/backend/access/transam/xlogrecovery.c | 92 ++++++++++---- src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 +- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++ 6 files changed, 305 insertions(+), 58 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 35029cf97d..bd0f211a23 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -42,6 +42,8 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength); static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen); static void XLogReaderInvalReadState(XLogReaderState *state); +static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record); static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess); static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, @@ -121,6 +123,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -292,6 +295,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg) /* reset error state */ *errormsg = NULL; state->errormsg_buf[0] = '\0'; + state->EndOfWAL = false; ResetDecoder(state); state->abortedRecPtr = InvalidXLogRecPtr; @@ -371,25 +375,21 @@ restart: Assert(pageHeaderSize <= readOff); /* - * Read the record length. + * Validate the record header. * - * NB: Even though we use an XLogRecord pointer here, the whole record - * header might not fit on this page. xl_tot_len is the first field of the - * struct, so it must be on this page (the records are MAXALIGNed), but we - * cannot access any other fields until we've verified that we got the - * whole header. - */ - record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); - total_len = record->xl_tot_len; - - /* - * If the whole record header is on this page, validate it immediately. - * Otherwise do just a basic sanity check on xl_tot_len, and validate the - * rest of the header after reading it from the next page. The xl_tot_len + * Even though we use an XLogRecord pointer here, the whole record header + * might not fit on this page. If the whole record header is on this page, + * validate it immediately. Even otherwise xl_tot_len must be on this page + * (it is the first field of MAXALIGNed records), but we still cannot + * access any further fields until we've verified that we got the whole + * header, so do just a basic sanity check on record length, and validate + * the rest of the header after reading it from the next page. The length * check is necessary here to ensure that we enter the "Need to reassemble * record" code path below; otherwise we might fail to apply * ValidXLogRecordHeader at all. */ + record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); + if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord) { if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record, @@ -399,18 +399,13 @@ restart: } else { - /* XXX: more validation should be done here */ - if (total_len < SizeOfXLogRecord) - { - report_invalid_record(state, - "invalid record length at %X/%X: wanted %u, got %u", - LSN_FORMAT_ARGS(RecPtr), - (uint32) SizeOfXLogRecord, total_len); + if (!ValidXLogRecordLength(state, RecPtr, record)) goto err; - } + gotheader = false; } + total_len = record->xl_tot_len; len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ; if (total_len > len) { @@ -588,6 +583,15 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the message is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } /* @@ -719,6 +723,60 @@ XLogReaderInvalReadState(XLogReaderState *state) state->readLen = 0; } +/* + * Validate record length of an XLOG record header. + * + * This is substantially a part of ValidXLogRecordHeader. But XLogReadRecord + * needs this separate from the function in case of a partial record header. + */ +static bool +ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record) +{ + if (record->xl_tot_len == 0) + { + char *p; + char *pe; + + /* + * We are almost sure reaching the end of WAL, make sure that the + * whole page after the record is filled with zeroes. + */ + p = (char *) record; + pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* + * The page after the record is completely zeroed. That suggests + * we don't have a record after this point. We don't bother + * checking the pages after since they are not zeroed in the case + * of recycled segments. + */ + report_invalid_record(state, "empty record at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + return false; + } + } + + if (record->xl_tot_len < SizeOfXLogRecord) + { + report_invalid_record(state, + "invalid record length at %X/%X: wanted %u, got %u", + LSN_FORMAT_ARGS(RecPtr), + (uint32) SizeOfXLogRecord, record->xl_tot_len); + return false; + } + + return true; +} + /* * Validate an XLOG record header. * @@ -730,14 +788,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { - if (record->xl_tot_len < SizeOfXLogRecord) - { - report_invalid_record(state, - "invalid record length at %X/%X: wanted %u, got %u", - LSN_FORMAT_ARGS(RecPtr), - (uint32) SizeOfXLogRecord, record->xl_tot_len); + if (!ValidXLogRecordLength(state, RecPtr, record)) return false; - } + if (record->xl_rmid > RM_MAX_ID) { report_invalid_record(state, @@ -836,6 +889,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr); + StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero"); + + if (hdr->xlp_magic == 0) + { + /* Regard an empty page as End-Of-WAL */ + int i; + + for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++); + if (i == XLOG_BLCKSZ) + { + char fname[MAXFNAMELEN]; + + XLogFileName(fname, state->seg.ws_tli, segno, + state->segcxt.ws_segsize); + + report_invalid_record(state, + "empty page in log segment %s, offset %u", + fname, + offset); + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid magic number */ + } if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { char fname[MAXFNAMELEN]; @@ -921,6 +999,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, LSN_FORMAT_ARGS(hdr->xlp_pageaddr), fname, offset); + + /* + * If the page address is less than expected we assume it is an unused + * page in a recycled segment. + */ + if (hdr->xlp_pageaddr < recaddr) + state->EndOfWAL = true; + return false; } diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index f9f212680b..750056acaf 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -1592,7 +1592,7 @@ PerformWalRecovery(void) /* just have to read next record after CheckPoint */ Assert(xlogreader->ReadRecPtr == CheckPointLoc); replayTLI = CheckPointTLI; - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -1706,7 +1706,7 @@ PerformWalRecovery(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -1765,13 +1765,20 @@ PerformWalRecovery(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that indicate @@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * EndRecPtr is the LSN we tried to read but failed. In the + * case of decoding error, it is at the end of the failed + * record but we don't have a means for now to know EndRecPtr + * is pointing to which of the beginning or ending of the + * failed record. + */ + ErrRecPtr = xlogreader->EndRecPtr; } if (readFile >= 0) @@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * We only end up here without a message when XLogPageRead() failed + * in that case we already logged something, or just met end-of-WAL + * conditions. In StandbyMode that only happens if we have been + * triggered, so we shouldn't loop anymore in that case. When + * EndOfWAL is true, we don't emit that error if any immediately + * and instead will show it as a part of a decent end-of-wal + * message later. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + Assert(ErrRecPtr != InvalidXLogRecPtr); + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * recovery ended. + * + * Emit a decent message if we met end-of-WAL. Otherwise we should + * have already emitted an error message. + */ + if (xlogreader->EndOfWAL) + ereport(LOG, + (errmsg("reached end of WAL at %X/%X on timeline %u", + LSN_FORMAT_ARGS(ErrRecPtr), replayTLI), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -3129,12 +3172,16 @@ retry: private->replayTLI, xlogreader->EndRecPtr)) { + Assert(!StandbyMode || CheckForStandbyTrigger()); + if (readFile >= 0) close(readFile); readFile = -1; readLen = 0; readSource = XLOG_FROM_ANY; + /* promotion exit is not end-of-WAL */ + xlogreader->EndOfWAL = !StandbyMode; return -1; } } @@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + /* use currentSource as readSource is reset at failure */ + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index ceaff097b9..4f117ea4da 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -472,10 +472,9 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", - startpointTLI, - LSN_FORMAT_ARGS(LogstreamResult.Write)))); + (errmsg("replication terminated by primary server at %X/%X on timeline %u.", + LSN_FORMAT_ARGS(LogstreamResult.Write), + startpointTLI))); endofwal = true; break; } diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index 2340dc247b..215abe95dc 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1173,9 +1173,16 @@ main(int argc, char **argv) exit(0); if (errormsg) - fatal_error("error in WAL record at %X/%X: %s", - LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), - errormsg); + { + if (xlogreader_state->EndOfWAL) + pg_log_info("end of WAL at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + else + fatal_error("error in WAL record at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + } XLogReaderFree(xlogreader_state); diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index 477f0efe26..7b314ef10e 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -174,6 +174,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* was the last attempt EOW? */ /* * Set at the end of recovery: the start point of a partial record at the diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl index 14154d1ce0..01033334d6 100644 --- a/src/test/recovery/t/011_crash_recovery.pl +++ b/src/test/recovery/t/011_crash_recovery.pl @@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; use Config; +use IPC::Run; +my $reached_eow_pat = "reached end of WAL at "; my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init(allows_streaming => 1); $node->start; @@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), # Crash and restart the postmaster $node->stop('immediate'); +my $logstart = get_log_size($node); $node->start; +my $max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, $reached_eow_pat, $logstart)); + sleep 0.5; +} +ok ($max_attempts >= 0, "end-of-wal is logged"); # Make sure we really got a new xid cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'), @@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), $stdin .= "\\q\n"; $tx->finish; # wait for psql to quit gracefully +my $segsize = $node->safe_psql('postgres', + qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]); + +# make sure no records afterwards go to the next segment +$node->safe_psql('postgres', qq[ + SELECT pg_switch_wal(); + CHECKPOINT; + CREATE TABLE t(); +]); +$node->stop('immediate'); + +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; + +# identify the last record +my $walfile = $node->data_dir() . "/pg_wal/$chkptfile"; +$cmd = ['pg_waldump', $walfile]; +$stdout = ''; +$stderr = ''; +my $lastrec; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +foreach my $l (split(/\r?\n/, $stdout)) +{ + $lastrec = $l; +} +ok(defined $lastrec, "last WAL record is extracted"); +ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/, + "pg_waldump emits the correct ending message"); + +# read the last record LSN excluding leading zeroes +ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/, + "LSN of the last record identified"); +my $lastlsn = $1; + +# corrupt the last record +my $offset = hex($lastlsn) % $segsize; +open(my $segf, '+<', $walfile) or die "failed to open $walfile\n"; +seek($segf, $offset, 0); # halfway break the last record +print $segf "\0\0\0\0"; +close($segf); + +# pg_waldump complains about the corrupted record +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/, + "pg_waldump emits the correct error message"); + +# also server complains +$logstart = get_log_size($node); +$node->start; +$max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, "WARNING: invalid record length at 0/$lastlsn: wanted [0-9]+, got 0", + $logstart)); + sleep 0.5; +} +ok($max_attempts >= 0, "header error is logged at $lastlsn"); + +# no end-of-wal message should be seen this time +ok(!find_in_log($node, $reached_eow_pat, $logstart), + "false log message is not emitted"); + +$node->stop('immediate'); + done_testing(); + +#### helper routines +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0
Hi, On 2022-03-04 09:43:59 +0900, Kyotaro Horiguchi wrote: > On second thought the two seems repeating the same things. Thus I > merged the two comments together. In this verion 16 it looks like > this. Patch currently fails to apply, needs a rebase: http://cfbot.cputube.org/patch_37_2490.log Greetings, Andres Freund
At Mon, 21 Mar 2022 17:01:19 -0700, Andres Freund <andres@anarazel.de> wrote in > Patch currently fails to apply, needs a rebase: > http://cfbot.cputube.org/patch_37_2490.log Thanks for noticing me of that. Rebased to the current HEAD. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From a7c9f36e631eaba5078398598dae5d459e79add9 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Date: Fri, 28 Feb 2020 15:52:58 +0900 Subject: [PATCH v17] Make End-Of-Recovery error less scary When recovery in any type ends, we see a bit scary error message like "invalid record length" that suggests something serious is happening. Actually if recovery meets a record with length = 0, that usually means it finished applying all available WAL records. Make this message less scary as "reached end of WAL". Instead, raise the error level for other kind of WAL failure to WARNING. --- src/backend/access/transam/xlogreader.c | 145 +++++++++++++++++----- src/backend/access/transam/xlogrecovery.c | 92 ++++++++++---- src/backend/replication/walreceiver.c | 7 +- src/bin/pg_waldump/pg_waldump.c | 13 +- src/include/access/xlogreader.h | 1 + src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++ 6 files changed, 306 insertions(+), 58 deletions(-) diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index e437c42992..0942265408 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -46,6 +46,8 @@ static int ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen); static void XLogReaderInvalReadState(XLogReaderState *state); static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking); +static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record); static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess); static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record, @@ -147,6 +149,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir, pfree(state); return NULL; } + state->EndOfWAL = false; state->errormsg_buf[0] = '\0'; /* @@ -552,6 +555,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking) /* reset error state */ state->errormsg_buf[0] = '\0'; decoded = NULL; + state->EndOfWAL = false; state->abortedRecPtr = InvalidXLogRecPtr; state->missingContrecPtr = InvalidXLogRecPtr; @@ -633,25 +637,21 @@ restart: Assert(pageHeaderSize <= readOff); /* - * Read the record length. + * Validate the record header. * - * NB: Even though we use an XLogRecord pointer here, the whole record - * header might not fit on this page. xl_tot_len is the first field of the - * struct, so it must be on this page (the records are MAXALIGNed), but we - * cannot access any other fields until we've verified that we got the - * whole header. - */ - record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); - total_len = record->xl_tot_len; - - /* - * If the whole record header is on this page, validate it immediately. - * Otherwise do just a basic sanity check on xl_tot_len, and validate the - * rest of the header after reading it from the next page. The xl_tot_len + * Even though we use an XLogRecord pointer here, the whole record header + * might not fit on this page. If the whole record header is on this page, + * validate it immediately. Even otherwise xl_tot_len must be on this page + * (it is the first field of MAXALIGNed records), but we still cannot + * access any further fields until we've verified that we got the whole + * header, so do just a basic sanity check on record length, and validate + * the rest of the header after reading it from the next page. The length * check is necessary here to ensure that we enter the "Need to reassemble * record" code path below; otherwise we might fail to apply * ValidXLogRecordHeader at all. */ + record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ); + if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord) { if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record, @@ -661,18 +661,14 @@ restart: } else { - /* XXX: more validation should be done here */ - if (total_len < SizeOfXLogRecord) - { - report_invalid_record(state, - "invalid record length at %X/%X: wanted %u, got %u", - LSN_FORMAT_ARGS(RecPtr), - (uint32) SizeOfXLogRecord, total_len); + if (!ValidXLogRecordLength(state, RecPtr, record)) goto err; - } + gotheader = false; } + total_len = record->xl_tot_len; + /* * Find space to decode this record. Don't allow oversized allocation if * the caller requested nonblocking. Otherwise, we *have* to try to @@ -904,6 +900,15 @@ err: */ state->abortedRecPtr = RecPtr; state->missingContrecPtr = targetPagePtr; + + /* + * If the message is not set yet, that means we failed to load the + * page for the record. Otherwise do not hide the existing message. + */ + if (state->errormsg_buf[0] == '\0') + report_invalid_record(state, + "missing contrecord at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); } if (decoded && decoded->oversized) @@ -1083,6 +1088,60 @@ XLogReaderInvalReadState(XLogReaderState *state) state->readLen = 0; } +/* + * Validate record length of an XLOG record header. + * + * This is substantially a part of ValidXLogRecordHeader. But XLogReadRecord + * needs this separate from the function in case of a partial record header. + */ +static bool +ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, + XLogRecord *record) +{ + if (record->xl_tot_len == 0) + { + char *p; + char *pe; + + /* + * We are almost sure reaching the end of WAL, make sure that the + * whole page after the record is filled with zeroes. + */ + p = (char *) record; + pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + + while (*p == 0 && p < pe) + p++; + + if (p == pe) + { + /* + * The page after the record is completely zeroed. That suggests + * we don't have a record after this point. We don't bother + * checking the pages after since they are not zeroed in the case + * of recycled segments. + */ + report_invalid_record(state, "empty record at %X/%X", + LSN_FORMAT_ARGS(RecPtr)); + + /* notify end-of-wal to callers */ + state->EndOfWAL = true; + return false; + } + } + + if (record->xl_tot_len < SizeOfXLogRecord) + { + report_invalid_record(state, + "invalid record length at %X/%X: wanted %u, got %u", + LSN_FORMAT_ARGS(RecPtr), + (uint32) SizeOfXLogRecord, record->xl_tot_len); + return false; + } + + return true; +} + /* * Validate an XLOG record header. * @@ -1094,14 +1153,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr, XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess) { - if (record->xl_tot_len < SizeOfXLogRecord) - { - report_invalid_record(state, - "invalid record length at %X/%X: wanted %u, got %u", - LSN_FORMAT_ARGS(RecPtr), - (uint32) SizeOfXLogRecord, record->xl_tot_len); + if (!ValidXLogRecordLength(state, RecPtr, record)) return false; - } + if (record->xl_rmid > RM_MAX_ID) { report_invalid_record(state, @@ -1200,6 +1254,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr); + StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero"); + + if (hdr->xlp_magic == 0) + { + /* Regard an empty page as End-Of-WAL */ + int i; + + for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++); + if (i == XLOG_BLCKSZ) + { + char fname[MAXFNAMELEN]; + + XLogFileName(fname, state->seg.ws_tli, segno, + state->segcxt.ws_segsize); + + report_invalid_record(state, + "empty page in log segment %s, offset %u", + fname, + offset); + state->EndOfWAL = true; + return false; + } + + /* The same condition will be caught as invalid magic number */ + } if (hdr->xlp_magic != XLOG_PAGE_MAGIC) { char fname[MAXFNAMELEN]; @@ -1285,6 +1364,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr, LSN_FORMAT_ARGS(hdr->xlp_pageaddr), fname, offset); + + /* + * If the page address is less than expected we assume it is an unused + * page in a recycled segment. + */ + if (hdr->xlp_pageaddr < recaddr) + state->EndOfWAL = true; + return false; } diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 9feea3e6ec..98382d66a4 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -1592,7 +1592,7 @@ PerformWalRecovery(void) /* just have to read next record after CheckPoint */ Assert(xlogreader->ReadRecPtr == CheckPointLoc); replayTLI = CheckPointTLI; - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } if (record != NULL) @@ -1706,7 +1706,7 @@ PerformWalRecovery(void) } /* Else, try to fetch the next WAL record */ - record = ReadRecord(xlogreader, LOG, false, replayTLI); + record = ReadRecord(xlogreader, WARNING, false, replayTLI); } while (record != NULL); /* @@ -1765,13 +1765,20 @@ PerformWalRecovery(void) InRedo = false; } - else + else if (xlogreader->EndOfWAL) { /* there are no WAL records following the checkpoint */ ereport(LOG, (errmsg("redo is not required"))); } + else + { + /* broken record found */ + ereport(WARNING, + (errmsg("redo is skipped"), + errhint("This suggests WAL file corruption. You might need to check the database."))); + } /* * This check is intentionally after the above log messages that indicate @@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode, for (;;) { char *errormsg; + XLogRecPtr ErrRecPtr = InvalidXLogRecPtr; record = XLogReadRecord(xlogreader, &errormsg); if (record == NULL) @@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode, { abortedRecPtr = xlogreader->abortedRecPtr; missingContrecPtr = xlogreader->missingContrecPtr; + ErrRecPtr = abortedRecPtr; + } + else + { + /* + * EndRecPtr is the LSN we tried to read but failed. In the + * case of decoding error, it is at the end of the failed + * record but we don't have a means for now to know EndRecPtr + * is pointing to which of the beginning or ending of the + * failed record. + */ + ErrRecPtr = xlogreader->EndRecPtr; } if (readFile >= 0) @@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, } /* - * We only end up here without a message when XLogPageRead() - * failed - in that case we already logged something. In - * StandbyMode that only happens if we have been triggered, so we - * shouldn't loop anymore in that case. + * We only end up here without a message when XLogPageRead() failed + * in that case we already logged something, or just met end-of-WAL + * conditions. In StandbyMode that only happens if we have been + * triggered, so we shouldn't loop anymore in that case. When + * EndOfWAL is true, we don't emit that error if any immediately + * and instead will show it as a part of a decent end-of-wal + * message later. */ - if (errormsg) - ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), + if (!xlogreader->EndOfWAL && errormsg) + ereport(emode_for_corrupt_record(emode, ErrRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); } @@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode, /* Great, got a record */ return record; } - else + + Assert(ErrRecPtr != InvalidXLogRecPtr); + + /* No valid record available from this source */ + lastSourceFailed = true; + + if (!fetching_ckpt) { - /* No valid record available from this source */ - lastSourceFailed = true; - /* * If archive recovery was requested, but we were still doing * crash recovery, switch to archive recovery and retry using the @@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode, * we'd have no idea how far we'd have to replay to reach * consistency. So err on the safe side and give up. */ - if (!InArchiveRecovery && ArchiveRecoveryRequested && - !fetching_ckpt) + if (!InArchiveRecovery && ArchiveRecoveryRequested) { + /* + * We don't report this as LOG, since we don't stop recovery + * here + */ ereport(DEBUG1, - (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery"))); + (errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, enteringarchive recovery", + LSN_FORMAT_ARGS(ErrRecPtr), + replayTLI))); InArchiveRecovery = true; if (StandbyModeRequested) StandbyMode = true; @@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode, continue; } - /* In standby mode, loop back to retry. Otherwise, give up. */ - if (StandbyMode && !CheckForStandbyTrigger()) - continue; - else - return NULL; + /* + * recovery ended. + * + * Emit a decent message if we met end-of-WAL. Otherwise we should + * have already emitted an error message. + */ + if (xlogreader->EndOfWAL) + ereport(LOG, + (errmsg("reached end of WAL at %X/%X on timeline %u", + LSN_FORMAT_ARGS(ErrRecPtr), replayTLI), + (errormsg ? errdetail_internal("%s", errormsg) : 0))); } + + /* In standby mode, loop back to retry. Otherwise, give up. */ + if (StandbyMode && !CheckForStandbyTrigger()) + continue; + else + return NULL; } } @@ -3129,12 +3172,16 @@ retry: private->replayTLI, xlogreader->EndRecPtr)) { + Assert(!StandbyMode || CheckForStandbyTrigger()); + if (readFile >= 0) close(readFile); readFile = -1; readLen = 0; readSource = XLOG_FROM_ANY; + /* promotion exit is not end-of-WAL */ + xlogreader->EndOfWAL = !StandbyMode; return -1; } } @@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr) { static XLogRecPtr lastComplaint = 0; - if (readSource == XLOG_FROM_PG_WAL && emode == LOG) + /* use currentSource as readSource is reset at failure */ + if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING) { if (RecPtr == lastComplaint) emode = DEBUG1; diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index ceaff097b9..4f117ea4da 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -472,10 +472,9 @@ WalReceiverMain(void) else if (len < 0) { ereport(LOG, - (errmsg("replication terminated by primary server"), - errdetail("End of WAL reached on timeline %u at %X/%X.", - startpointTLI, - LSN_FORMAT_ARGS(LogstreamResult.Write)))); + (errmsg("replication terminated by primary server at %X/%X on timeline %u.", + LSN_FORMAT_ARGS(LogstreamResult.Write), + startpointTLI))); endofwal = true; break; } diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c index fc081adfb8..9bebca8154 100644 --- a/src/bin/pg_waldump/pg_waldump.c +++ b/src/bin/pg_waldump/pg_waldump.c @@ -1174,9 +1174,16 @@ main(int argc, char **argv) exit(0); if (errormsg) - fatal_error("error in WAL record at %X/%X: %s", - LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr), - errormsg); + { + if (xlogreader_state->EndOfWAL) + pg_log_info("end of WAL at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + else + fatal_error("error in WAL record at %X/%X: %s", + LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr), + errormsg); + } XLogReaderFree(xlogreader_state); diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h index f4388cc9be..21a8f9552c 100644 --- a/src/include/access/xlogreader.h +++ b/src/include/access/xlogreader.h @@ -201,6 +201,7 @@ struct XLogReaderState */ XLogRecPtr ReadRecPtr; /* start of last record read */ XLogRecPtr EndRecPtr; /* end+1 of last record read */ + bool EndOfWAL; /* was the last attempt EOW? */ /* * Set at the end of recovery: the start point of a partial record at the diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl index 14154d1ce0..01033334d6 100644 --- a/src/test/recovery/t/011_crash_recovery.pl +++ b/src/test/recovery/t/011_crash_recovery.pl @@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster; use PostgreSQL::Test::Utils; use Test::More; use Config; +use IPC::Run; +my $reached_eow_pat = "reached end of WAL at "; my $node = PostgreSQL::Test::Cluster->new('primary'); $node->init(allows_streaming => 1); $node->start; @@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), # Crash and restart the postmaster $node->stop('immediate'); +my $logstart = get_log_size($node); $node->start; +my $max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, $reached_eow_pat, $logstart)); + sleep 0.5; +} +ok ($max_attempts >= 0, "end-of-wal is logged"); # Make sure we really got a new xid cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'), @@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]), $stdin .= "\\q\n"; $tx->finish; # wait for psql to quit gracefully +my $segsize = $node->safe_psql('postgres', + qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]); + +# make sure no records afterwards go to the next segment +$node->safe_psql('postgres', qq[ + SELECT pg_switch_wal(); + CHECKPOINT; + CREATE TABLE t(); +]); +$node->stop('immediate'); + +# identify REDO WAL file +my $cmd = "pg_controldata -D " . $node->data_dir(); +$cmd = ['pg_controldata', '-D', $node->data_dir()]; +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m, + "checkpoint file is identified"); +my $chkptfile = $1; + +# identify the last record +my $walfile = $node->data_dir() . "/pg_wal/$chkptfile"; +$cmd = ['pg_waldump', $walfile]; +$stdout = ''; +$stderr = ''; +my $lastrec; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +foreach my $l (split(/\r?\n/, $stdout)) +{ + $lastrec = $l; +} +ok(defined $lastrec, "last WAL record is extracted"); +ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/, + "pg_waldump emits the correct ending message"); + +# read the last record LSN excluding leading zeroes +ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/, + "LSN of the last record identified"); +my $lastlsn = $1; + +# corrupt the last record +my $offset = hex($lastlsn) % $segsize; +open(my $segf, '+<', $walfile) or die "failed to open $walfile\n"; +seek($segf, $offset, 0); # halfway break the last record +print $segf "\0\0\0\0"; +close($segf); + +# pg_waldump complains about the corrupted record +$stdout = ''; +$stderr = ''; +IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr; +ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/, + "pg_waldump emits the correct error message"); + +# also server complains +$logstart = get_log_size($node); +$node->start; +$max_attempts = 360; +while ($max_attempts-- >= 0) +{ + last if (find_in_log($node, "WARNING: invalid record length at 0/$lastlsn: wanted [0-9]+, got 0", + $logstart)); + sleep 0.5; +} +ok($max_attempts >= 0, "header error is logged at $lastlsn"); + +# no end-of-wal message should be seen this time +ok(!find_in_log($node, $reached_eow_pat, $logstart), + "false log message is not emitted"); + +$node->stop('immediate'); + done_testing(); + +#### helper routines +# return the size of logfile of $node in bytes +sub get_log_size +{ + my ($node) = @_; + + return (stat $node->logfile)[7]; +} + +# find $pat in logfile of $node after $off-th byte +sub find_in_log +{ + my ($node, $pat, $off) = @_; + + $off = 0 unless defined $off; + my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile); + return 0 if (length($log) <= $off); + + $log = substr($log, $off); + + return $log =~ m/$pat/; +} -- 2.27.0
me> Rebased to the current HEAD. b64c3bd62e (removal of unused "use Config") conflicted on a TAP script. Rebased. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
On Mon, Mar 28, 2022 at 11:07 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Rebased. Unfortunately this will need another rebase over latest. [CFM hat] Looking through the history here, this has been bumped to Ready for Committer a few times and then bumped back to Needs Review after a required rebase. What's the best way for us to provide support for contributors who get stuck in this loop? Maybe we can be more aggressive about automated notifications when a RfC patch goes red in the cfbot? Thanks, --Jacob
On Wed, Jul 06, 2022 at 11:05:51AM -0700, Jacob Champion wrote: > [CFM hat] Looking through the history here, this has been bumped to > Ready for Committer a few times and then bumped back to Needs Review > after a required rebase. What's the best way for us to provide support > for contributors who get stuck in this loop? Maybe we can be more > aggressive about automated notifications when a RfC patch goes red in > the cfbot? Having a better integration between the CF bot and the CF app would be great, IMO. People tend to easily forget about what they send in my experience, even if they manage a small pool of patches or a larger one. -- Michael
Вложения
At Wed, 6 Jul 2022 11:05:51 -0700, Jacob Champion <jchampion@timescale.com> wrote in > On Mon, Mar 28, 2022 at 11:07 PM Kyotaro Horiguchi > <horikyota.ntt@gmail.com> wrote: > > > > Rebased. > > Unfortunately this will need another rebase over latest. Thanks! Done. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
@cfbot: rebased over adb466150, which did the same thing as one of the hunks in xlogreader.c.
Вложения
At Fri, 16 Sep 2022 23:21:50 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in > @cfbot: rebased over adb466150, which did the same thing as one of the > hunks in xlogreader.c. Oops. Thanks! And then this gets a further conflict (param names harmonization). So further rebased. And removed an extra blank line you pointed. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
Just rebased. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
Hi, On 2022-11-18 17:25:37 +0900, Kyotaro Horiguchi wrote: > Just rebased. Fails with address sanitizer: https://cirrus-ci.com/task/5632986241564672 Unfortunately one of the failures is in pg_waldump and we don't seem to capture its output in 011_crash_recovery. So we don't see the nice formattted output... [11:07:18.868] #0 0x00007fcf43803ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6 [11:07:18.912] [11:07:18.912] Thread 1 (Thread 0x7fcf43662780 (LWP 39124)): [11:07:18.912] #0 0x00007fcf43803ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6 [11:07:18.912] No symbol table info available. [11:07:18.912] #1 0x00007fcf437ed537 in abort () from /lib/x86_64-linux-gnu/libc.so.6 [11:07:18.912] No symbol table info available. [11:07:18.912] #2 0x00007fcf43b8511b in __sanitizer::Abort () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:155 [11:07:18.912] No locals. [11:07:18.912] #3 0x00007fcf43b8fce8 in __sanitizer::Die () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58 [11:07:18.912] No locals. [11:07:18.912] #4 0x00007fcf43b7244c in __asan::ScopedInErrorReport::~ScopedInErrorReport (this=0x7ffd4fde18e6, __in_chrg=<optimizedout>) at ../../../../src/libsanitizer/asan/asan_report.cpp:186 [11:07:18.912] buffer_copy = {<__sanitizer::InternalMmapVectorNoCtor<char>> = {data_ = 0x7fcf40350000 '=' <repeats65 times>, "\n==39124==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x625000002100 at pc 0x55c36c21e315bp 0x7ffd4fde2550 sp 0x7ffd4fde2"..., capacity_bytes_ = 65536, size_ = <optimized out>}, <No data fields>} ... [11:07:18.912] #6 0x00007fcf43b72788 in __asan::__asan_report_load1 (addr=<optimized out>) at ../../../../src/libsanitizer/asan/asan_rtl.cpp:117 [11:07:18.912] bp = 140725943412048 [11:07:18.912] pc = <optimized out> [11:07:18.912] local_stack = 140528180793728 [11:07:18.912] sp = 140725943412040 [11:07:18.912] #7 0x000055c36c21e315 in ValidXLogRecordLength (state=state@entry=0x61a000000680, RecPtr=RecPtr@entry=33655480,record=record@entry=0x625000000bb8) at xlogreader.c:1126 [11:07:18.912] p = <optimized out> [11:07:18.912] pe = 0x625000002100 "" [11:07:18.912] #8 0x000055c36c21e3b1 in ValidXLogRecordHeader (state=state@entry=0x61a000000680, RecPtr=RecPtr@entry=33655480,PrevRecPtr=33655104, record=record@entry=0x625000000bb8, randAccess=randAccess@entry=false)at xlogreader.c:1169 [11:07:18.912] No locals. The most important bit is "AddressSanitizer: heap-buffer-overflow on address 0x6250000\ 02100 at pc 0x55c36c21e315 bp 0x7ffd4fde2550 sp 0x7ffd4fde2" Greetings, Andres Freund
On Fri, Nov 18, 2022 at 05:25:37PM +0900, Kyotaro Horiguchi wrote: > + while (*p == 0 && p < pe) > + p++; The bug reported by Andres/cfbot/ubsan is here. Fixed in attached. I didn't try to patch the test case to output the failing stderr, but that might be good.
Вложения
At Tue, 22 Nov 2022 16:04:56 -0600, Justin Pryzby <pryzby@telsasoft.com> wrote in > On Fri, Nov 18, 2022 at 05:25:37PM +0900, Kyotaro Horiguchi wrote: > > + while (*p == 0 && p < pe) > > + p++; > > The bug reported by Andres/cfbot/ubsan is here. > > Fixed in attached. Ur..ou.. - while (*p == 0 && p < pe) + while (p < pe && *p == 0) It was an off-by-one error. Thanks! > I didn't try to patch the test case to output the failing stderr, but > that might be good. I have made use of Cluster::wait_for_log(), but still find_in_log() is there since it is used to check if a message that should not be logged is not logged. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
So this patch is now failing because it applies new tests to 011_crash_recovery.pl, which was removed recently. Can you please move them elsewhere? I think the comment for ValidXLogRecordLength should explain what the return value is. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Thanks! At Fri, 3 Feb 2023 15:16:02 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in > So this patch is now failing because it applies new tests to > 011_crash_recovery.pl, which was removed recently. Can you please move > them elsewhere? I don't find an appropriate file to move to. In the end I created a new file with the name 034_recovery.pl. I added a test for standbys, too. (which is the first objective of this patch.) > I think the comment for ValidXLogRecordLength should explain what the > return value is. Agreed. /* * Validate record length of an XLOG record header. * * This is substantially a part of ValidXLogRecordHeader. But XLogReadRecord * needs this separate from the function in case of a partial record header. + * + * Returns true if the xl_tot_len header field has a seemingly valid value, + * which means the caller can proceed reading to the following part of the + * record. */ static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr, I added a similar description to ValidXLogRecordHeader. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
It looks like this needs a rebase and at a quick glance it looks like more than a trivial conflict. I'll mark it Waiting on Author. Please update it back when it's rebased -- Gregory Stark As Commitfest Manager
At Mon, 6 Mar 2023 14:58:15 -0500, "Gregory Stark (as CFM)" <stark.cfm@gmail.com> wrote in > It looks like this needs a rebase and at a quick glance it looks like > more than a trivial conflict. I'll mark it Waiting on Author. Please > update it back when it's rebased Thanks for checking it! I think 4ac30ba4f2 is that, which changes a few error messages. Addition to rebasing, I rewrote some code comments of xlogreader.c and revised the additional test script. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
Hi, > Thanks for checking it! > > I think 4ac30ba4f2 is that, which changes a few error > messages. Addition to rebasing, I rewrote some code comments of > xlogreader.c and revised the additional test script. Thanks for working on this, it bugged me for a while. I noticed that cfbot is not happy with the patch so I rebased it. postgresql:pg_waldump test suite didn't pass after the rebase. I fixed it too. Other than that the patch LGTM so I'm not changing its status from "Ready for Committer". It looks like the patch was moved between the commitfests since 2020... If there is anything that may help merging it into PG17 please let me know. -- Best regards, Aleksander Alekseev
Вложения
At Mon, 17 Jul 2023 15:20:30 +0300, Aleksander Alekseev <aleksander@timescale.com> wrote in > Thanks for working on this, it bugged me for a while. I noticed that > cfbot is not happy with the patch so I rebased it. > postgresql:pg_waldump test suite didn't pass after the rebase. I fixed > it too. Other than that the patch LGTM so I'm not changing its status > from "Ready for Committer". Thanks for the rebasing. > It looks like the patch was moved between the commitfests since > 2020... If there is anything that may help merging it into PG17 please > let me know. This might be just too-much or there might be some doubt in this.. This change basically makes a zero-length record be considered as the normal end of WAL. The most controvorsial point I think in the design is the criteria for an error condition. The assumption is that the WAL is sound if all bytes following a complete record, up to the next page boundary, are zeroed out. This is slightly narrower than the original criteria, merely checking the next record is zero-length. Naturally, there might be instances where that page has been blown out due to device failure or some other reasons. Despite this, I believe it is preferable rather than always issuing a warning (in the LOG level, though) about a potential WAL corruption. I've adjusted the condition for muting repeated log messages at the same LSN, changing it from ==LOG to <=WARNING. This is simply a consequence of following the change of "real" warnings from LOG to WARNING. I believe this is acceptable even without considering aforementioned change, as any single retriable (<ERROR) error at an LSN should be sufficient to alert users about potential issues. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Anyway, this requires rebsaing, and done. Thanks for John (Naylor) for pointing this out. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
On Wed, 22 Nov 2023 at 13:01, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > Anyway, this requires rebsaing, and done. Few tests are failing at [1], kindly post an updated patch: /tmp/cirrus-ci-build/src/test/recovery --testgroup recovery --testname 039_end_of_wal -- /usr/local/bin/perl -I /tmp/cirrus-ci-build/src/test/perl -I /tmp/cirrus-ci-build/src/test/recovery /tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl [23:53:10.370] ――――――――――――――――――――――――――――――――――――― ✀ ――――――――――――――――――――――――――――――――――――― [23:53:10.370] stderr: [23:53:10.370] # Failed test 'xl_tot_len zero' [23:53:10.370] # at /tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 267. [23:53:10.370] # Failed test 'xlp_magic zero' [23:53:10.370] # at /tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 340. [23:53:10.370] # Failed test 'xlp_magic zero (split record header)' [23:53:10.370] # at /tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 445. [23:53:10.370] # Looks like you failed 3 tests of 14. [23:53:10.370] [23:53:10.370] (test program exited with status code 3) [23:53:10.370] ―――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――― [1] - https://cirrus-ci.com/task/5859293157654528 Regards, Vignesh
At Fri, 5 Jan 2024 16:02:24 +0530, vignesh C <vignesh21@gmail.com> wrote in > On Wed, 22 Nov 2023 at 13:01, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote: > > > > Anyway, this requires rebsaing, and done. > > Few tests are failing at [1], kindly post an updated patch: Thanks! The errors occurred in a part of the tests for end-of-WAL detection added in the master branch. These failures were primarily due to changes in the message contents introduced by this patch. During the revision, I discovered an issue with the handling of empty pages that appear in the middle of reading continuation records. In the previous version, such empty pages were mistakenly identified as indicating a clean end-of-WAL (that is a LOG). However, they should actually be handled as a WARNING, since the record curently being read is broken at the empty pages. The following changes have been made in this version: 1. Adjusting the test to align with the error message changes introduced by this patch. 2. Adding tests for the newly added messages. 3. Correcting the handling of empty pages encountered during the reading of continuation records. (XLogReaderValidatePageHeader) 4. Revising code comments. 5. Changing the term "log segment" to "WAL segment". (XLogReaderValidatePageHeader) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
Hi, > The errors occurred in a part of the tests for end-of-WAL detection > added in the master branch. These failures were primarily due to > changes in the message contents introduced by this patch. During the > revision, I discovered an issue with the handling of empty pages that > appear in the middle of reading continuation records. In the previous > version, such empty pages were mistakenly identified as indicating a > clean end-of-WAL (that is a LOG). However, they should actually be > handled as a WARNING, since the record curently being read is broken > at the empty pages. The following changes have been made in this > version: > > 1. Adjusting the test to align with the error message changes > introduced by this patch. > > 2. Adding tests for the newly added messages. > > 3. Correcting the handling of empty pages encountered during the > reading of continuation records. (XLogReaderValidatePageHeader) > > 4. Revising code comments. > > 5. Changing the term "log segment" to "WAL > segment". (XLogReaderValidatePageHeader) > > regards. Thanks for the updated patch. ``` + p = (char *) record; + pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); + + while (p < pe && *p == 0) + p++; + + if (p == pe) ``` Just as a random thought: perhaps we should make this a separate function, as a part of src/port/. It seems to me that this code could benefit from using vector instructions some day, similarly to memcmp(), memset() etc. Surprisingly there doesn't seem to be a standard C function for this. Alternatively one could argue that one cycle doesn't make much code to reuse and that the C compiler will place SIMD instructions for us. However a counter-counter argument would be that we could use a macro or even better an inline function and have the same effect except getting a slightly more readable code. ``` - * This is just a convenience subroutine to avoid duplicated code in + * This is just a convenience subroutine to avoid duplicate code in ``` This change doesn't seem to be related to the patch. Personally I don't mind it though. All in all I find v28 somewhat scary. It does much more than "making one message less scary" as it was initially intended and what bugged me personally, and accordingly touches many more places including xlogreader.c, xlogrecovery.c, etc. Particularly I have mixed feeling about this: ``` + /* + * Consider it as end-of-WAL if all subsequent bytes of this page + * are zero. We don't bother checking the subsequent pages since + * they are not zeroed in the case of recycled segments. + */ ``` If I understand correctly, if somehow several FS blocks end up being zeroed (due to OS bug, bit rot, restoring from a corrupted for whatever reason backup, hardware failures, ...) there is non-zero chance that PG will interpret this as a normal situation. To my knowledge this is not what we typically do - typically PG would report an error and ask a human to figure out what happened. Of course the possibility of such a scenario is small, but I don't think that as DBMS developers we can ignore it. Does anyone agree or maybe I'm making things up? -- Best regards, Aleksander Alekseev
Thank you for the comments. At Fri, 12 Jan 2024 15:03:26 +0300, Aleksander Alekseev <aleksander@timescale.com> wrote in > ``` > + p = (char *) record; > + pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)); > + > + while (p < pe && *p == 0) > + p++; > + > + if (p == pe) > ``` > > Just as a random thought: perhaps we should make this a separate > function, as a part of src/port/. It seems to me that this code could > benefit from using vector instructions some day, similarly to > memcmp(), memset() etc. Surprisingly there doesn't seem to be a > standard C function for this. Alternatively one could argue that one > cycle doesn't make much code to reuse and that the C compiler will > place SIMD instructions for us. However a counter-counter argument > would be that we could use a macro or even better an inline function > and have the same effect except getting a slightly more readable code. Creating a function with a name like memcmp_byte() should be straightforward, but implementing it with SIMD right away seems a bit challenging. Similar operations are already being performed elsewhere in the code, probably within the stats collector, where memcmp is used with a statically allocated area that's filled with zeros. If we can achieve a performance equivalent to memcmp with this new function, then it definitely seems worth pursuing. > ``` > - * This is just a convenience subroutine to avoid duplicated code in > + * This is just a convenience subroutine to avoid duplicate code in > ``` > > This change doesn't seem to be related to the patch. Personally I > don't mind it though. Ah, I'm sorry. That was something I mistakenly thought I had written at the last moment and made modifications to. > All in all I find v28 somewhat scary. It does much more than "making > one message less scary" as it was initially intended and what bugged > me personally, and accordingly touches many more places including > xlogreader.c, xlogrecovery.c, etc. > > Particularly I have mixed feeling about this: > > ``` > + /* > + * Consider it as end-of-WAL if all subsequent bytes of this page > + * are zero. We don't bother checking the subsequent pages since > + * they are not zeroed in the case of recycled segments. > + */ > ``` > > If I understand correctly, if somehow several FS blocks end up being > zeroed (due to OS bug, bit rot, restoring from a corrupted for > whatever reason backup, hardware failures, ...) there is non-zero > chance that PG will interpret this as a normal situation. To my > knowledge this is not what we typically do - typically PG would report > an error and ask a human to figure out what happened. Of course the > possibility of such a scenario is small, but I don't think that as > DBMS developers we can ignore it. For now, let me explain the basis for this patch. The fundamental issue is that these warnings that always appear are, in practice, not a problem in almost all cases. Some of those who encounter them for the first time may feel uneasy and reach out with inquiries. On the other hand, those familiar with these warnings tend to ignore them and only pay attention to details when actual issues arise. Therefore, the intention of this patch is to label them as "no issue" unless a problem is blatantly evident, in order to prevent unnecessary concern. > Does anyone agree or maybe I'm making things up? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Hi, > > If I understand correctly, if somehow several FS blocks end up being > > zeroed (due to OS bug, bit rot, restoring from a corrupted for > > whatever reason backup, hardware failures, ...) there is non-zero > > chance that PG will interpret this as a normal situation. To my > > knowledge this is not what we typically do - typically PG would report > > an error and ask a human to figure out what happened. Of course the > > possibility of such a scenario is small, but I don't think that as > > DBMS developers we can ignore it. > > For now, let me explain the basis for this patch. The fundamental > issue is that these warnings that always appear are, in practice, not > a problem in almost all cases. Some of those who encounter them for > the first time may feel uneasy and reach out with inquiries. On the > other hand, those familiar with these warnings tend to ignore them and > only pay attention to details when actual issues arise. Therefore, the > intention of this patch is to label them as "no issue" unless a > problem is blatantly evident, in order to prevent unnecessary concern. I agree and don't mind affecting the error message per se. However I see that the actual logic of how WAL is processed is being changed. If we do this, at very least it requires thorough thinking. I strongly suspect that the proposed code is wrong and/or not safe and/or less safe than it is now for the reasons named above. -- Best regards, Aleksander Alekseev
On Tue, Jan 16, 2024 at 02:46:02PM +0300, Aleksander Alekseev wrote: >> For now, let me explain the basis for this patch. The fundamental >> issue is that these warnings that always appear are, in practice, not >> a problem in almost all cases. Some of those who encounter them for >> the first time may feel uneasy and reach out with inquiries. On the >> other hand, those familiar with these warnings tend to ignore them and >> only pay attention to details when actual issues arise. Therefore, the >> intention of this patch is to label them as "no issue" unless a >> problem is blatantly evident, in order to prevent unnecessary concern. > > I agree and don't mind affecting the error message per se. > > However I see that the actual logic of how WAL is processed is being > changed. If we do this, at very least it requires thorough thinking. I > strongly suspect that the proposed code is wrong and/or not safe > and/or less safe than it is now for the reasons named above. FWIW, that pretty much sums up my feeling regarding this patch, because an error, basically any error, would hurt back very badly. Sure, the error messages we generate now when reaching the end of WAL can sound scary, and they are (I suspect that's not really the case for anybody who has history doing support with PostgreSQL because a bunch of these messages are old enough to vote, but I can understand that anybody would freak out the first time they see that). However, per the recent issues we've had in this area, like cd7f19da3468 but I'm more thinking about 6b18b3fe2c2f and bae868caf222, I am of the opinion that the header validation, the empty page case in XLogReaderValidatePageHeader() and the record read changes are risky enough that I am not convinced that the gains are worth the risks taken. The error stack in the WAL reader is complicated enough that making it more complicated as the patch proposes does not sound like not a good tradeoff to me to make the reports related to the end of WAL cleaner for the end-user. I agree that we should do something, but the patch does not seem like a good step towards this goal. Perhaps somebody would be more excited about this proposal than I am, of course. -- Michael
Вложения
2024-01 Commitfest. Hi, This patch has a CF status of "Needs Review" [1], but it seems there were CFbot test failures last time it was run [2]. Please have a look and post an updated version if necessary. ====== [1] https://commitfest.postgresql.org/46/2490/ [2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/2490 Kind Regards, Peter Smith.
At Mon, 22 Jan 2024 16:09:28 +1100, Peter Smith <smithpb2250@gmail.com> wrote in > 2024-01 Commitfest. > > Hi, This patch has a CF status of "Needs Review" [1], but it seems > there were CFbot test failures last time it was run [2]. Please have a > look and post an updated version if necessary. > > ====== > [1] https://commitfest.postgresql.org/46/2490/ > [2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/2490 Thanks for noticing of that. Will repost a new version. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
At Wed, 17 Jan 2024 14:32:00 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Tue, Jan 16, 2024 at 02:46:02PM +0300, Aleksander Alekseev wrote: > >> For now, let me explain the basis for this patch. The fundamental > >> issue is that these warnings that always appear are, in practice, not > >> a problem in almost all cases. Some of those who encounter them for > >> the first time may feel uneasy and reach out with inquiries. On the > >> other hand, those familiar with these warnings tend to ignore them and > >> only pay attention to details when actual issues arise. Therefore, the > >> intention of this patch is to label them as "no issue" unless a > >> problem is blatantly evident, in order to prevent unnecessary concern. > > > > I agree and don't mind affecting the error message per se. > > > > However I see that the actual logic of how WAL is processed is being > > changed. If we do this, at very least it requires thorough thinking. I > > strongly suspect that the proposed code is wrong and/or not safe > > and/or less safe than it is now for the reasons named above. > > FWIW, that pretty much sums up my feeling regarding this patch, > because an error, basically any error, would hurt back very badly. > Sure, the error messages we generate now when reaching the end of WAL > can sound scary, and they are (I suspect that's not really the case > for anybody who has history doing support with PostgreSQL because a > bunch of these messages are old enough to vote, but I can understand > that anybody would freak out the first time they see that). > > However, per the recent issues we've had in this area, like > cd7f19da3468 but I'm more thinking about 6b18b3fe2c2f and > bae868caf222, I am of the opinion that the header validation, the > empty page case in XLogReaderValidatePageHeader() and the record read > changes are risky enough that I am not convinced that the gains are > worth the risks taken. > > The error stack in the WAL reader is complicated enough that making it > more complicated as the patch proposes does not sound like not a good > tradeoff to me to make the reports related to the end of WAL cleaner > for the end-user. I agree that we should do something, but the patch > does not seem like a good step towards this goal. Perhaps somebody > would be more excited about this proposal than I am, of course. Thank you both for the comments. The criticism seems valid. The approach to identifying the end-of-WAL state in this patch is quite heuristic, and its validity or safety can certainly be contested. On the other hand, if we seek perfection in this area of judgment, we may need to have the WAL format itself more robust. In any case, since the majority of the feedback on this patch seems to be negative, I am going to withdraw it if no supportive opinions emerge during this commit-fest. The attached patch addresses the errors reported by CF-bot. regards. -- Kyotaro Horiguchi NTT Open Source Software Center