Thread: Make mesage at end-of-recovery less scary.

Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
Hello, this is a followup thread of [1].

# I didn't notice that the thread didn't cover -hackers..

When recovery of any type ends, we see several kinds of error messages
that say "WAL is broken".

> LOG:  invalid record length at 0/7CB6BC8: wanted 24, got 0
> LOG:  redo is not required
> LOG:  database system is ready to accept connections

This patch reduces the scariness of such messages as follows.

> LOG:  rached end of WAL at 0/1551048 on timeline 1 in pg_wal during crash recovery
> DETAIL:  invalid record length at 0/1551048: wanted 24, got 0
> LOG:  redo is not required
> LOG:  database system is ready to accept connections
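
For readers unfamiliar with positions such as 0/1551048: a WAL location is a 64-bit value printed as two 32-bit hexadecimal halves, which is what the %X/%X format and the `(uint32) (EndRecPtr >> 32), (uint32) EndRecPtr` arguments in the patch below produce. A minimal standalone sketch, where `format_lsn` is a hypothetical helper name and not a PostgreSQL function:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of PostgreSQL): write a 64-bit WAL
 * position into buf as "HI/LO", both halves in uppercase hexadecimal.
 * Returns the value returned by snprintf().
 */
int
format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
    uint32_t hi = (uint32_t) (lsn >> 32);   /* upper 32 bits */
    uint32_t lo = (uint32_t) lsn;           /* lower 32 bits */

    return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32, hi, lo);
}
```

Calling `format_lsn(0x1551048, buf, sizeof(buf))` fills `buf` with the same `0/1551048` text that appears in the log lines above.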

[1] https://www.postgresql.org/message-id/20200117.172655.1384889922565817808.horikyota.ntt%40gmail.com

I'll register this to the coming CF.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From f3692cb484b7f1ebc351ba8a522039c0b91bcfdb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is happening.
Make this message less scary as "reached end of WAL".
---
 src/backend/access/transam/xlog.c | 45 ++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d19408b3be..452c376f62 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4288,6 +4288,10 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
         EndRecPtr = xlogreader->EndRecPtr;
         if (record == NULL)
         {
+            int actual_emode =
+                emode_for_corrupt_record(emode,
+                                         ReadRecPtr ? ReadRecPtr : EndRecPtr);
+
             if (readFile >= 0)
             {
                 close(readFile);
@@ -4295,14 +4299,41 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * randAccess here means we are reading successive records during
+             * recovery. If we get here during recovery, we can assume that we
+             * reached the end of WAL.  Otherwise something's really wrong and
+             * we report just only the errormsg if any. If we don't receive
+             * errormsg here, we already logged something.  We don't emit
+             * "reached end of WAL" in muted messages.
+             *
+             * Note: errormsg is alreay translated.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, EndRecPtr),
-                        (errmsg_internal("%s", errormsg) /* already translated */ ));
+            if (!private->randAccess && actual_emode == emode)
+            {
+                if (StandbyMode)
+                    ereport(actual_emode,
+                            (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during streaming replication",
+                                     (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+                                     ThisTimeLineID,
+                                     xlogSourceNames[currentSource]),
+                             (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+                else if (InArchiveRecovery)
+                    ereport(actual_emode,
+                            (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during archive recovery",
+                                     (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+                                     ThisTimeLineID,
+                                     xlogSourceNames[currentSource]),
+                             (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+                else
+                    ereport(actual_emode,
+                            (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery",
+                                     (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+                                     ThisTimeLineID,
+                                     xlogSourceNames[currentSource]),
+                             (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
+            else if (errormsg)
+                ereport(actual_emode, (errmsg_internal("%s", errormsg)));
         }
 
         /*
-- 
2.18.2


Re: Make mesage at end-of-recovery less scary.

From
Michael Paquier
Date:
On Fri, Feb 28, 2020 at 04:01:00PM +0900, Kyotaro Horiguchi wrote:
> Hello, this is a followup thread of [1].
>
> # I didn't notice that the thread didn't cover -hackers..
>
> When recovery of any type ends, we see several kinds of error messages
> that say "WAL is broken".

Have you considered an error context here?  Your patch leads to a bit
of duplication with the message a bit down of what you are changing
where the end of local pg_wal is reached.

> +    * reached the end of WAL.  Otherwise something's really wrong and
> +    * we report just only the errormsg if any. If we don't receive

This sentence sounds strange to me.  Or you meant "Something is wrong,
so use errormsg as report if it is set"?

> +             * Note: errormsg is alreay translated.

Typo here.

> +    if (StandbyMode)
> +        ereport(actual_emode,
> +            (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during streaming replication",

StandbyMode happens also with only WAL archiving, depending on if
primary_conninfo is set or not.

> +    (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery",

FWIW, you are introducing three times the same typo, in the same
word, in three different messages.
--
Michael

Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
Thank you for the comments.

At Fri, 28 Feb 2020 16:33:18 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Fri, Feb 28, 2020 at 04:01:00PM +0900, Kyotaro Horiguchi wrote:
> > Hello, this is a followup thread of [1].
> > 
> > # I didn't notice that the thread didn't cover -hackers..
> > 
> > When recovery of any type ends, we see several kinds of error messages
> > that say "WAL is broken".
> 
> Have you considered an error context here?  Your patch leads to a bit
> of duplication with the message a bit down of what you are changing
> where the end of local pg_wal is reached.

It is a DEBUG message and it is for the time moving from crash
recovery to archive recovery. I could remove that but decided to leave
it for traceability.

> > +    * reached the end of WAL.  Otherwise something's really wrong and
> > +    * we report just only the errormsg if any. If we don't receive
> 
> This sentence sounds strange to me.  Or you meant "Something is wrong,
> so use errormsg as report if it is set"?

The whole comment there follows.
| recovery. If we get here during recovery, we can assume that we
| reached the end of WAL.  Otherwise something's really wrong and
| we report just only the errormsg if any. If we don't receive
| errormsg here, we already logged something.  We don't emit
| "reached end of WAL" in muted messages.

"Otherwise" means "other than the case of recovery".  "Just only the
errormsg" means "show the message by itself, not as a part of the
'reached end of WAL' message".

> > +             * Note: errormsg is alreay translated.
> 
> Typo here.

Thanks. Will fix along with "rached".

> > +    if (StandbyMode)
> > +        ereport(actual_emode,
> > +            (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during streaming replication",
> 
> StandbyMode happens also with only WAL archiving, depending on if
> primary_conninfo is set or not.

Right. I'll fix it. Maybe to "during standby mode".

> > +    (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery",
> 
> FWIW, you are introducing three times the same typo, in the same
> word, in three different messages.

They're copy-pastos.  I refrained from constructing an error message
from multiple non-independent parts.  Are you suggesting to do so?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
Hello.

I changed the condition from randAccess to fetching_ckpt considering
the discussion in another thread [1]. Then I moved the block that
shows the new messages to a more appropriate place.

At Fri, 28 Feb 2020 17:28:06 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> > 
> > Have you considered an error context here?  Your patch leads to a bit
> > of duplication with the message a bit down of what you are changing
> > where the end of local pg_wal is reached.
> 
> It is a DEBUG message and it is for the time moving from crash
> recovery to archive recovery. I could remove that but decided to leave
> it for traceability.

I modified the message so that it looks the same as the new
messages, but I left it at DEBUG1, since it is just an intermediate
state. We should finally see one of the three new messages.

After the messages changed, another message, this one from the
walreceiver, came to look redundant.

| [20866] LOG:  replication terminated by primary server
| [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
| [20866] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
| [20851] LOG:  reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode
| [20851] DETAIL:  invalid record length at 0/30001C8: wanted 24, got 0

I changed the above to the below, which looks more adequate.

| [24271]  LOG:  replication terminated by primary server on timeline 1 at 0/3000240.
| [24271]  FATAL:  could not send end-of-streaming message to primary: no COPY in progress
| [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
| [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0

> > > +    * reached the end of WAL.  Otherwise something's really wrong and
> > > +    * we report just only the errormsg if any. If we don't receive
> > 
> > This sentence sounds strange to me.  Or you meant "Something is wrong,
> > so use errormsg as report if it is set"?

That comment no longer exists.

> > > +    (errmsg ("rached end of WAL at %X/%X on timeline %u in %s during crash recovery",
> > 
> > FWIW, you are introducing three times the same typo, in the same
> > word, in three different messages.
> 
> They're copy-pastos.  I refrained from constructing an error message
> from multiple non-independent parts.  Are you suggesting to do so?

The three-times repetition of almost the same phrase is quite
unreadable. I rewrote it in a simpler shape.
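
The simpler shape can be sketched roughly as follows: pick one complete format string per recovery state, then report once, so that each message stays a whole translatable sentence instead of being assembled from fragments. This is a standalone illustration only. `end_of_wal_format` and `format_end_of_wal` are hypothetical names, the two booleans stand in for the StandbyMode and InArchiveRecovery variables, and snprintf() stands in for ereport().

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: choose a complete message format for the current
 * recovery state.  The checks are in the same order as in the patch:
 * standby mode first, then archive recovery, else crash recovery.
 */
const char *
end_of_wal_format(bool standby_mode, bool in_archive_recovery)
{
    if (standby_mode)
        return "reached end of WAL at %X/%X on timeline %u in %s during standby mode";
    if (in_archive_recovery)
        return "reached end of WAL at %X/%X on timeline %u in %s during archive recovery";
    return "reached end of WAL at %X/%X on timeline %u in %s during crash recovery";
}

/* Hypothetical stand-in for the single ereport() call: format into buf. */
int
format_end_of_wal(char *buf, size_t buflen, uint64_t lsn,
                  unsigned int timeline, const char *source,
                  bool standby_mode, bool in_archive_recovery)
{
    return snprintf(buf, buflen,
                    end_of_wal_format(standby_mode, in_archive_recovery),
                    (unsigned int) (lsn >> 32),   /* upper half of the LSN */
                    (unsigned int) lsn,           /* lower half of the LSN */
                    timeline, source);
}
```

With an LSN of 0x3000240, timeline 1, source "archive" and standby mode set, this yields the "reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode" line quoted above.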

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 17ee82e5d44dd5a932ed69b8a1ea91a23d170952 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v2] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is happening.
Make this message less scary as "reached end of WAL".
---
 src/backend/access/transam/xlog.c     | 72 ++++++++++++++++++++-------
 src/backend/replication/walreceiver.c |  3 +-
 2 files changed, 55 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d19408b3be..849cf6fe6b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4282,12 +4282,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         ReadRecPtr = xlogreader->ReadRecPtr;
         EndRecPtr = xlogreader->EndRecPtr;
         if (record == NULL)
         {
+            ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
+
             if (readFile >= 0)
             {
                 close(readFile);
@@ -4295,14 +4298,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we are fetching checkpoint, we emit the error message right
+             * now. Otherwise the error is regarded as "end of WAL" and the
+             * message if any is shown as a part of the end-of-WAL message
+             * below.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, EndRecPtr),
+            if (fetching_ckpt && errormsg)
+            {
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
+            }
         }
 
         /*
@@ -4332,11 +4337,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
-        {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
 
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
+        {
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4349,11 +4355,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                         (uint32) (ErrRecPtr >> 32),
+                                         (uint32) ErrRecPtr,
+                                         ThisTimeLineID,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4391,12 +4404,35 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             *  We reached the end of WAL, show the messages just once at the
+             *  same LSN.
+             */
+            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            {
+                char *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg (fmt, (uint32) (EndRecPtr >> 32),
+                                 (uint32) EndRecPtr,
+                                 ThisTimeLineID,
+                                 xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2ab15c3cbb..682dbb4e1f 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -478,8 +478,7 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
+                                    (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
                                                startpointTLI,
                                                (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
                             endofwal = true;
-- 
2.18.2


Re: Make mesage at end-of-recovery less scary.

From
Peter Eisentraut
Date:
On 2020-03-05 08:06, Kyotaro Horiguchi wrote:
> | [20866] LOG:  replication terminated by primary server
> | [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
> | [20866] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> | [20851] LOG:  reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode
> | [20851] DETAIL:  invalid record length at 0/30001C8: wanted 24, got 0
> 
> I changed the above to the below, which looks more adequate.
> 
> | [24271]  LOG:  replication terminated by primary server on timeline 1 at 0/3000240.
> | [24271]  FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> | [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
> | [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0

Is this the before and after?  That doesn't seem like a substantial 
improvement to me.  You still get the "scary" message at the end.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Make mesage at end-of-recovery less scary.

From
Ashwin Agrawal
Date:
On Mon, Mar 23, 2020 at 2:37 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 2020-03-05 08:06, Kyotaro Horiguchi wrote:
> > | [20866] LOG:  replication terminated by primary server
> > | [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
> > | [20866] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> > | [20851] LOG:  reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode
> > | [20851] DETAIL:  invalid record length at 0/30001C8: wanted 24, got 0
> >
> > I changed the above to the below, which looks more adequate.
> >
> > | [24271]  LOG:  replication terminated by primary server on timeline 1 at 0/3000240.
> > | [24271]  FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> > | [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
> > | [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0
>
> Is this the before and after?  That doesn't seem like a substantial
> improvement to me.  You still get the "scary" message at the end.

+1. I agree it still reads as scary and doesn't seem like an improvement.

Plus, I am hoping the message will improve for pg_waldump as well?
It reads as confusing, and every time I have to explain to a new developer that it's expected behavior, which is annoying.

pg_waldump: fatal: error in WAL record at 0/1553F70: invalid record length at 0/1553FA8: wanted 24, got 0

Re: Make mesage at end-of-recovery less scary.

From
Andres Freund
Date:
Hi,

On 2020-03-23 10:37:16 +0100, Peter Eisentraut wrote:
> On 2020-03-05 08:06, Kyotaro Horiguchi wrote:
> > | [20866] LOG:  replication terminated by primary server
> > | [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
> > | [20866] FATAL:  could not send end-of-streaming message to primary: no COPY in progress

IMO it's a bug that we see this FATAL. I seem to recall that we didn't
use to get that?


> > | [20851] LOG:  reached end of WAL at 0/30001C8 on timeline 1 in archive during standby mode
> > | [20851] DETAIL:  invalid record length at 0/30001C8: wanted 24, got 0
> > 
> > I changed the above to the below, which looks more adequate.
> > 
> > | [24271]  LOG:  replication terminated by primary server on timeline 1 at 0/3000240.
> > | [24271]  FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> > | [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
> > | [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0
> 
> Is this the before and after?  That doesn't seem like a substantial
> improvement to me.  You still get the "scary" message at the end.

It seems like a minor improvement - folding the DETAIL into the message
makes sense to me here. But it indeed doesn't really address the main
issue.

I think we don't want to elide the information about how the end of the
WAL was detected - there are some issues where I found that quite
helpful. But we could reformulate it to be clearer that it's informative
output, not a bug.  E.g. something roughly like

LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
DETAIL: End detected due to invalid record length at 0/3000240: expected 24, got 0
(I first elided the position in the DETAIL, but it could differ from the
one in LOG)

I don't find that very satisfying, but I can't come up with something
that provides the current information, while being less scary than my
suggestion?

Greetings,

Andres Freund



Re: Make mesage at end-of-recovery less scary.

From
Andres Freund
Date:
Hi,

On 2020-03-23 10:43:09 -0700, Ashwin Agrawal wrote:
> Plus, I am hoping the message will improve for pg_waldump as well?
> It reads as confusing, and every time I have to explain to a new developer
> that it's expected behavior, which is annoying.
> 
> pg_waldump: fatal: error in WAL record at 0/1553F70: invalid record length
> at 0/1553FA8: wanted 24, got 0

What would you like to see here? There's inherently a lot less
information about the context in waldump. We can't know whether it's to
be expected that the WAL ends at that point, or whether there was
corruption.

Greetings,

Andres Freund



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Mon, 23 Mar 2020 12:47:36 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Hi,
> 
> On 2020-03-23 10:37:16 +0100, Peter Eisentraut wrote:
> > On 2020-03-05 08:06, Kyotaro Horiguchi wrote:
> > > | [20866] LOG:  replication terminated by primary server
> > > | [20866] DETAIL:  End of WAL reached on timeline 1 at 0/30001C8.
> > > | [20866] FATAL:  could not send end-of-streaming message to primary: no COPY in progress
> 
> IMO it's a bug that we see this FATAL. I seem to recall that we didn't
> use to get that?

I thought that it is a convention that an auxiliary process uses ERROR
(which is turned into FATAL in ereport) to exit, which I didn't like
so much, but it was out of scope of this patch.

As for the message above, the FATAL is preceded by the "LOG:
replication terminated by" message, which means walreceiver tries to
send new data after disconnection just to fail, which is
unreasonable. I think we should exit immediately after detecting
disconnection. The FATAL is gone with the attached patch.

> > > | [24267]  LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
> > > | [24267]  DETAIL:  invalid record length at 0/3000240: wanted 24, got 0
> > 
> > Is this the before and after?  That doesn't seem like a substantial
> > improvement to me.  You still get the "scary" message at the end.
> 
> It seems like a minor improvement - folding the DETAIL into the message
> makes sense to me here. But it indeed doesn't really address the main
> issue.
> 
> I think we don't want to elide the information about how the end of the
> WAL was detected - there are some issues where I found that quite
> helpful. But we could reformulate it to be clearer that it's informative
> output, not a bug.  E.g. something roughly like
> 
> LOG:  reached end of WAL at 0/3000240 on timeline 1 in archive during standby mode
> DETAIL: End detected due to invalid record length at 0/3000240: expected 24, got 0
> (I first elided the position in the DETAIL, but it could differ from the
> one in LOG)
> 
> I don't find that very satisfying, but I can't come up with something
> that provides the current information, while being less scary than my
> suggestion?

The 0-length record is not an "invalid" state during recovery, so we
can add the message for the state as "record length is 0 at %X/%X". I
think that if other states are found there, it implies something is wrong.

LSN is redundantly shown but I'm not sure if it is better to remove it
from either of the two lines.

| LOG:  reached end of WAL at 0/3000850 on timeline 1 in pg_wal during crash recovery
| DETAIL:  record length is 0 at 0/3000850
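
The distinction proposed above can be sketched as a three-way classification of the record's total-length field. This is an illustration only: the 24-byte header size is taken from the "wanted 24" text in the log lines quoted in this thread, and the enum and function names are hypothetical, not those used in xlogreader.c.

```c
#include <stdint.h>

/* Assumption: header size inferred from "wanted 24" in the quoted logs. */
#define RECORD_HEADER_SIZE 24

/* Hypothetical result codes for the length check. */
typedef enum
{
    REC_LEN_OK,         /* long enough to hold a record header */
    REC_LEN_ZERO,       /* zero length: treated as the normal end of WAL */
    REC_LEN_TOO_SHORT   /* nonzero but shorter than a header: suspicious */
} RecLenStatus;

/*
 * Classify a record's total length.  Zero is tested first so that it is
 * reported separately from the "invalid record length" case.
 */
RecLenStatus
classify_record_length(uint32_t tot_len)
{
    if (tot_len == 0)
        return REC_LEN_ZERO;
    if (tot_len < RECORD_HEADER_SIZE)
        return REC_LEN_TOO_SHORT;
    return REC_LEN_OK;
}
```

Under this split, only REC_LEN_TOO_SHORT would keep the "invalid record length" wording, while REC_LEN_ZERO maps to the milder "record length is 0" detail.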

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 47511afed5f8acf92abaf1cd6fcfecc1faea9c87 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v3] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is happening.
Make this message less scary as "reached end of WAL".
---
 src/backend/access/transam/xlog.c       | 69 ++++++++++++++++++-------
 src/backend/access/transam/xlogreader.c |  9 ++++
 src/backend/replication/walreceiver.c   | 11 ++--
 3 files changed, 67 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 793c076da6..6c2924dfb7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4283,12 +4283,15 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         ReadRecPtr = xlogreader->ReadRecPtr;
         EndRecPtr = xlogreader->EndRecPtr;
         if (record == NULL)
         {
+            ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
+
             if (readFile >= 0)
             {
                 close(readFile);
@@ -4296,13 +4299,13 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we are fetching checkpoint, we emit the error message right
+             * now. Otherwise the error is regarded as "end of WAL" and the
+             * message if any is shown as a part of the end-of-WAL message
+             * below.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, EndRecPtr),
+            if (fetching_ckpt && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -4333,11 +4336,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
-        {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
 
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
+        {
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4350,11 +4354,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                         (uint32) (ErrRecPtr >> 32),
+                                         (uint32) ErrRecPtr,
+                                         ThisTimeLineID,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4392,12 +4403,34 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             *  We reached the end of WAL, show the messages just once at the
+             *  same LSN.
+             */
+            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            {
+                char *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg(fmt, (uint32) (EndRecPtr >> 32),
+                                (uint32) EndRecPtr,    ThisTimeLineID,
+                                xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 32f02256ed..9ea1305364 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -682,6 +682,15 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+
+    if (record->xl_tot_len == 0)
+    {
+        /* This is strictly not an invalid state, so phrase it as so. */
+        report_invalid_record(state,
+                              "record length is 0 at %X/%X",
+                              (uint32) (RecPtr >> 32), (uint32) RecPtr);
+        return false;
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 25e0333c9e..da978d4047 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -479,12 +479,15 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
+                                    (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
                                                startpointTLI,
                                                (uint32) (LogstreamResult.Write >> 32), (uint32) LogstreamResult.Write)));
-                            endofwal = true;
-                            break;
+
+                            /*
+                             * we have no longer anything to do on the broken
+                             * connection other than exiting.
+                             */
+                            proc_exit(1);
                         }
                         len = walrcv_receive(wrconn, &buf, &wait_fd);
                     }
-- 
2.18.2


Re: Make mesage at end-of-recovery less scary.

From
Peter Eisentraut
Date:
On 2020-03-24 02:52, Kyotaro Horiguchi wrote:
>> I don't find that very satisfying, but I can't come up with something
>> that provides the current information, while being less scary than my
>> suggestion?
> The 0-length record is not an "invalid" state during recovery, so we
> can add the message for the state as "record length is 0 at %X/%X". I
> think if other states found there, it implies something wrong.
> 
> LSN is redundantly shown but I'm not sure if it is better to remove it
> from either of the two lines.
> 
> | LOG:  reached end of WAL at 0/3000850 on timeline 1 in pg_wal during crash recovery
> | DETAIL:  record length is 0 at 0/3000850

I'm not up to date on all these details, but my high-level idea would be 
some kind of hint associated with the existing error messages, like:

HINT:  This is to be expected if this is the end of the WAL.  Otherwise, 
it could indicate corruption.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Make mesage at end-of-recovery less scary.

From
Robert Haas
Date:
On Wed, Mar 25, 2020 at 8:53 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> HINT:  This is to be expected if this is the end of the WAL.  Otherwise,
> it could indicate corruption.

First, I agree that this general issue is a problem, because it's come
up for me in quite a number of customer situations. Either people get
scared when they shouldn't, because the message is innocuous, or they
don't get scared about other things that actually are scary, because
if some scary-looking messages are actually innocuous, it can lead
people to believe that the same is true in other cases.

Second, I don't really like the particular formulation you have above,
because the user still doesn't know whether or not to be scared. Can
we figure that out? I think that if we're in crash recovery,
we should not be scared, because we have no alternative to assuming
that we've reached the end of WAL, so all crash recoveries will end
like this. If we're in archive recovery, we should definitely be
scared if we haven't yet reached the minimum recovery point, because
more WAL than that should certainly exist. After that, it depends on
how we got the WAL. If it's being streamed, the question is whether
we've reached the end of what got streamed. If it's being copied from
the archive, we ought to have the whole segment, but maybe not more.
Can we get the right context to the point where the error is being
reported to know whether we hit the error at the end of the WAL that
was streamed? If not, can we somehow rejigger things so that we only
make it sound scary if we keep getting stuck at the same point when we
would've expected to make progress meanwhile?

I'm just spitballing here, but it would be really good if there's a
way to know definitely whether or not you should be scared. Corrupted
WAL segments are definitely a thing that happens, but retries are a
lot more common.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
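
The heuristics in the message above can be written down as a small decision function. This is only a sketch of one possible reading, not anything from the patch: every type, field, and function name here is hypothetical, and the archive case (treating a failure exactly on a segment boundary as expected) is an assumption drawn from "we ought to have the whole segment, but maybe not more".

```c
#include <stdint.h>

/* All names below are hypothetical; none of them exist in PostgreSQL. */
typedef enum { SKETCH_CRASH_RECOVERY, SKETCH_ARCHIVE_RECOVERY } SketchRecoveryKind;
typedef enum { SKETCH_FROM_ARCHIVE, SKETCH_FROM_STREAM } SketchWalSource;
typedef enum { SKETCH_EXPECTED, SKETCH_SCARY } SketchVerdict;

typedef struct
{
    SketchRecoveryKind kind;
    SketchWalSource source;     /* only consulted in archive recovery */
    uint64_t    fail_lsn;       /* where the record could not be read */
    uint64_t    min_recovery_lsn;   /* minimum recovery point */
    uint64_t    streamed_upto;  /* end of the WAL received by streaming */
    uint64_t    segment_size;   /* WAL segment size in bytes, must be > 0 */
} SketchReadFailure;

SketchVerdict
sketch_classify(const SketchReadFailure *f)
{
    /* Crash recovery always ends this way, so there is nothing to fear. */
    if (f->kind == SKETCH_CRASH_RECOVERY)
        return SKETCH_EXPECTED;

    /* WAL up to the minimum recovery point must exist. */
    if (f->fail_lsn < f->min_recovery_lsn)
        return SKETCH_SCARY;

    /* Streaming: fine only if we consumed everything that was streamed. */
    if (f->source == SKETCH_FROM_STREAM)
        return (f->fail_lsn >= f->streamed_upto) ? SKETCH_EXPECTED : SKETCH_SCARY;

    /* Archive (assumption): segments are whole, so only a boundary is fine. */
    return (f->fail_lsn % f->segment_size == 0) ? SKETCH_EXPECTED : SKETCH_SCARY;
}
```

Whether the startup process can actually obtain all of these inputs at the point where the error is reported is exactly the open question raised above.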



Re: Make mesage at end-of-recovery less scary.

From
James Coleman
Date:
On Thu, Mar 26, 2020 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Mar 25, 2020 at 8:53 AM Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
> > HINT:  This is to be expected if this is the end of the WAL.  Otherwise,
> > it could indicate corruption.
>
> First, I agree that this general issue is a problem, because it's come
> up for me in quite a number of customer situations. Either people get
> scared when they shouldn't, because the message is innocuous, or they
> don't get scared about other things that actually are scary, because
> if some scary-looking messages are actually innocuous, it can lead
> people to believe that the same is true in other cases.
>
> Second, I don't really like the particular formulation you have above,
> because the user still doesn't know whether or not to be scared. Can
> we figure that out? I think that if we're in crash recovery,
> we should not be scared, because we have no alternative to assuming
> that we've reached the end of WAL, so all crash recoveries will end
> like this. If we're in archive recovery, we should definitely be
> scared if we haven't yet reached the minimum recovery point, because
> more WAL than that should certainly exist. After that, it depends on
> how we got the WAL. If it's being streamed, the question is whether
> we've reached the end of what got streamed. If it's being copied from
> the archive, we ought to have the whole segment, but maybe not more.
> Can we get the right context to the point where the error is being
> reported to know whether we hit the error at the end of the WAL that
> was streamed? If not, can we somehow rejigger things so that we only
> make it sound scary if we keep getting stuck at the same point when we
> would've expected to make progress meanwhile?
>
> I'm just spitballing here, but it would be really good if there's a
> way to know definitely whether or not you should be scared. Corrupted
> WAL segments are definitely a thing that happens, but retries are a
> lot more common.

First, I agree that getting enough context to say precisely is by far the ideal.

That being said, as an end user who's found this surprising -- and
momentarily scary every time I initially scan it even though I *know
intellectually it's not* -- I would find Peter's suggestion a
significant improvement over what we have now. I'm fairly certain my
co-workers on our database team would also. Knowing that something is
at least not always scary is good. Though I'll grant that this does
have the negative in reverse: if it actually is a scary
situation...this mutes your concern level. On the other hand,
monitoring would tell us if there's a real problem (namely replication
lag), so I think the trade-off is clearly worth it.

How about this minor tweak:
HINT:  This is expected if this is the end of currently available WAL.
Otherwise, it could indicate corruption.

Thanks,
James



Re: Make mesage at end-of-recovery less scary.

From
"Bossart, Nathan"
Date:
On 3/4/21, 10:50 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> As the result, the following messages are emitted with the attached.

I'd like to voice my support for this effort, and I intend to help
review the patch.  It looks like the latest patch no longer applies,
so I've marked the commitfest entry [0] as waiting-on-author.

Nathan

[0] https://commitfest.postgresql.org/35/2490/


Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Fri, 22 Oct 2021 17:54:40 +0000, "Bossart, Nathan" <bossartn@amazon.com> wrote in 
> On 3/4/21, 10:50 PM, "Kyotaro Horiguchi" <horikyota.ntt@gmail.com> wrote:
> > As the result, the following messages are emitted with the attached.
> 
> I'd like to voice my support for this effort, and I intend to help
> review the patch.  It looks like the latest patch no longer applies,
> so I've marked the commitfest entry [0] as waiting-on-author.
> 
> Nathan
> 
> [0] https://commitfest.postgresql.org/35/2490/

Sorry for being late to reply.  I rebased this to the current master.

- rebased

- use LSN_FORMAT_ARGS instead of bare shift and mask.

- v4 made walreceiver exit immediately on disconnection. I probably
  wanted to avoid a FATAL message on the standby after the primary
  dies. However, that is a separate issue and the change was plain
  wrong.  v5 just removes the "end-of-WAL" part from the message, which
  duplicates what the startup process emits.

- add a new error message "missing contrecord at %X/%X".  Maybe this
  should be regarded as a leftover of the contrecord patch. In the
  attached patch the "%X/%X" is the LSN of the current record. The log
  messages look like this (026_overwrite_contrecord).

LOG:  redo starts at 0/1486CB8
WARNING:  missing contrecord at 0/1FFC2E0
LOG:  consistent recovery state reached at 0/1FFC2E0
LOG:  started streaming WAL from primary at 0/2000000 on timeline 1
LOG:  successfully skipped missing contrecord at 0/1FFC2E0, overwritten at 2021-11-08 14:50:11.969952+09
CONTEXT:  WAL redo at 0/2000028 for XLOG/OVERWRITE_CONTRECORD: lsn 0/1FFC2E0; time 2021-11-08 14:50:11.969952+09
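
The LSN_FORMAT_ARGS item in the list above replaces hand-written shift-and-cast pairs with a single macro. A minimal stand-alone sketch of the idea follows. It assumes a simplified macro and a stand-in XLogRecPtr typedef (PostgreSQL's own definitions live in its headers and may differ in detail), and format_lsn is a hypothetical helper used only for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's 64-bit WAL position type. */
typedef uint64_t XLogRecPtr;

/* Simplified version of the macro: expands to the two "%X/%X" arguments. */
#define LSN_FORMAT_ARGS(lsn) ((uint32_t) ((lsn) >> 32)), ((uint32_t) (lsn))

/* Hypothetical helper: render an LSN in the usual high/low hex form. */
int
format_lsn(char *buf, size_t buflen, XLogRecPtr lsn)
{
    return snprintf(buf, buflen, "%X/%X", LSN_FORMAT_ARGS(lsn));
}
```

Keeping both halves in one macro makes it harder to pass different variables for the high and low words by mistake.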

While checking the behavior for the missing-contrecord case, I
noticed that emode_for_corrupt_record() doesn't work as expected, since
readSource is reset to XLOG_FROM_ANY after a read failure. We could
remember the last failed source, but pg_wal must already have been
visited if a page read error happened, so I changed the function to
treat XLOG_FROM_ANY the same way as XLOG_FROM_PG_WAL.

(Otherwise we see "LOG: reached end-of-WAL at .." message after
 "WARNING: missing contrecord at.." message.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 276f59c8b37a31cb831b7753d2b107eb1d83c1fb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v5] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c       | 94 +++++++++++++++++++------
 src/backend/access/transam/xlogreader.c | 14 ++++
 src/backend/replication/walreceiver.c   |  3 +-
 src/include/access/xlogreader.h         |  1 +
 4 files changed, 87 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cda30836f..623fb01d0a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4477,6 +4477,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         ReadRecPtr = xlogreader->ReadRecPtr;
@@ -4494,6 +4495,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * NULL ReadRecPtr means we could not read a record at
+                 * beginning. In that case EndRecPtr is storing the LSN of the
+                 * record we tried to read.
+                 */
+                ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -4503,13 +4514,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we get here for other than end-of-wal, emit the error message
+             * right now. Otherwise the message if any is shown as a part of
+             * the end-of-WAL message below.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -4540,11 +4550,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4557,11 +4568,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4609,12 +4626,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             *  If we haven't emit an error message, we have safely reached the
+             *  end-of-WAL.
+             */
+            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            {
+                char *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+                                xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -7544,7 +7582,7 @@ StartupXLOG(void)
         else
         {
             /* just have to read next record after CheckPoint */
-            record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID);
+            record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID);
         }
 
         if (record != NULL)
@@ -7781,7 +7819,7 @@ StartupXLOG(void)
                 }
 
                 /* Else, try to fetch the next WAL record */
-                record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID);
+                record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID);
             } while (record != NULL);
 
             /*
@@ -7841,13 +7879,20 @@ StartupXLOG(void)
 
             InRedo = false;
         }
-        else
+        else if (xlogreader->EndOfWAL)
         {
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
 
         }
+        else
+        {
+            /* broken record found */
+            ereport(WARNING,
+                    (errmsg("redo is skipped"),
+                     errhint("This suggests WAL file corruption. You might need to check the database.")));
+        }
 
         /*
          * This check is intentionally after the above log messages that
@@ -13135,7 +13180,9 @@ XLogShutdownWalRcv(void)
  * reading from pg_wal, because we don't expect any invalid records in archive
  * or in records streamed from the primary. Files in the archive should be complete,
  * and we should never hit the end of WAL because we stop and wait for more WAL
- * to arrive before replaying it.
+ * to arrive before replaying it.  When we failed to read a new page,
+ * readSource is reset to XLOG_FROM_ANY. This indicates all sources including
+ * pg_wal was failed. Thus treat that the same way with XLOG_FROM_PG_WAL.
  *
  * NOTE: This function remembers the RecPtr value it was last called with,
  * to suppress repeated messages about the same record. Only call this when
@@ -13147,7 +13194,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    if ((readSource == XLOG_FROM_PG_WAL || readSource == XLOG_FROM_ANY)
+        && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f39f8044a9..df2198e862 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,9 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+        report_invalid_record(state,
+                              "missing contrecord at %X/%X",
+                              LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -730,6 +735,15 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+    if (record->xl_tot_len == 0)
+    {
+        /* This is strictly not an invalid state, so phrase it as so. */
+        report_invalid_record(state,
+                              "record length is 0 at %X/%X",
+                              LSN_FORMAT_ARGS(RecPtr));
+        state->EndOfWAL = true;
+        return false;
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7a7eb3784e..ba3c4bd550 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,8 +471,7 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
+                                    (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
                                                startpointTLI,
                                                LSN_FORMAT_ARGS(LogstreamResult.Write))));
                             endofwal = true;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index de6fd791fe..1241b85838 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* the last attempt was EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
-- 
2.27.0


Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
Thank you for the comments!

At Tue, 9 Nov 2021 09:53:15 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Mon, Nov 08, 2021 at 02:59:46PM +0900, Kyotaro Horiguchi wrote:
> > While checking the behavior for the missing-contrecord case, I
> > noticed that emode_for_corrupt_record() doesn't work as expected, since
> > readSource is reset to XLOG_FROM_ANY after a read failure. We could
> > remember the last failed source, but pg_wal must already have been
> > visited if a page read error happened, so I changed the function to
> > treat XLOG_FROM_ANY the same way as XLOG_FROM_PG_WAL.
> 
> FWIW, I am not much a fan of assuming that it is fine to use
> XLOG_FROM_ANY as a condition here.  The comments on top of
> emode_for_corrupt_record() make it rather clear what the expectations
> are, and this is the default readSource.

The function expects readSource to be the failed source, but
it goes back to XLOG_FROM_ANY on page read failure.  So the function
*is* standing on a wrong assumption.  I noticed that currentSource
holds the last accessed source (I had forgotten about that), so it is
exactly what we need here.  Using it, we no longer need to introduce
the unclear assumption.
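
The behavior under discussion (complain once per LSN, then drop to a debug level) can be modelled in isolation. This is a sketch under assumptions, not the patched function itself: the level constants are stand-ins with arbitrary values, and the source is passed in as a plain boolean rather than read from currentSource.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the PostgreSQL type */

/* Stand-in log levels; only their relative order matters in this sketch. */
enum { SKETCH_DEBUG1 = 1, SKETCH_LOG = 2, SKETCH_WARNING = 3, SKETCH_ERROR = 4 };

/*
 * Return the level to report at.  When the failed read came from pg_wal and
 * the requested level is WARNING or lower, a second complaint about the same
 * LSN is demoted to DEBUG1 so that retries do not flood the log.
 */
int
sketch_emode_for_corrupt_record(int emode, bool last_source_was_pg_wal,
                                XLogRecPtr rec_ptr)
{
    static XLogRecPtr last_complaint = 0;

    if (last_source_was_pg_wal && emode <= SKETCH_WARNING)
    {
        if (rec_ptr == last_complaint)
            emode = SKETCH_DEBUG1;
        else
            last_complaint = rec_ptr;
    }
    return emode;
}
```

Because the function remembers the last LSN in a static variable, it should only be called when a message is really about to be emitted.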

> > (Otherwise we see "LOG: reached end-of-WAL at .." message after
> >  "WARNING: missing contrecord at.." message.)
> 
> +      /* broken record found */
> +      ereport(WARNING,
> +                      (errmsg("redo is skipped"),
> +                       errhint("This suggests WAL file corruption. You might need to check the database.")));
> This looks rather scary to me, FWIW, and this could easily be reached

Yes, the message is intentionally scary, since we don't come here in
the case of clean WAL:)

> if one forgets about EndOfWAL while hacking on xlogreader.c.
> Unlikely so, still.

I don't understand.  Isn't it the case of almost every feature?

The patch compels hackers to maintain the condition under which
recovery is considered to have ended cleanly.  If the last record
doesn't meet the condition, the WAL file should be considered to have a
problem. However, I don't expect the condition to gain another
term in the future.

Even if someone, including myself, broke that condition, at
worst we would see an unwanted scary message.  And I believe almost all
hackers could easily recognize it as a bug from the DETAIL message shown
alongside.  I'm not sure such bugs could be found in the development
phase, though..

> +       report_invalid_record(state,
> +                             "missing contrecord at %X/%X",
> +                             LSN_FORMAT_ARGS(RecPtr));
> Isn't there a risk here to break applications checking after error
> messages stored in the WAL reader after seeing a contrecord?

I'm not sure whether you mean the case where no message was stored
previously, or the case where a message is already stored. The former
is fine as the record is actually broken. But I was missing the latter
case.  In this version I avoid overwriting the error message.

> +   if (record->xl_tot_len == 0)
> +   {
> +       /* This is strictly not an invalid state, so phrase it as so. */
> +       report_invalid_record(state,
> +                             "record length is 0 at %X/%X",
> +                             LSN_FORMAT_ARGS(RecPtr));
> +       state->EndOfWAL = true;
> +       return false;
> +   }
> This assumes that a value of 0 for xl_tot_len is a synonym of the end
> of WAL, but cannot we have also a corrupted record in this case in the
> shape of xl_tot_len being 0?  We validate the full record after
> reading the header, so it seems to me that we should not assume that
> things are just ending as proposed in this patch.

Yeah, it's the most serious concern to me. So I didn't hide the
detailed message in the "reached end of WAL" message.

> LOG:  reached end of WAL at 0/512F198 on timeline 1 in pg_wal during crash recovery
> DETAIL:  record length is 0 at 0/512F210

I believe everyone regards zero record length as fine unless something
wrong is seen afterwards.  However, we can extend the check to the
whole record header. I think that is far closer to perfect for
almost all cases. The attached emits the following message for the
good (true end-of-WAL) case.

> LOG:  reached end of WAL at 0/512F4A0 on timeline 1 in pg_wal during crash recovery
> DETAIL:  empty record header found at 0/512F518

If garbage bytes are found in the header area, the following log will
be left. I think we can have a better message here.

> WARNING:  garbage record header at 0/2095458
> LOG:  redo done at 0/2095430 system usage: CPU: user: 0.03 s, system: 0.01 s, elapsed: 0.04 s


This is the updated version.

- emode_for_corrupt_record() now uses currentSource instead of
  readSource.

- If a zero record length is encountered, make sure the whole header is
  zeroed before treating it as the end-of-WAL.

- Do not overwrite an existing message when a missing contrecord is
  detected.  The message added here is seen in the TAP test log
  026_overwrite_contrecord_standby.log
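
The zero-header check in the second item can be sketched as follows. This is an illustration under assumptions rather than the code in the attached patch: the header is treated as a raw byte buffer whose first four bytes hold the total record length, and the enum and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical classification of the bytes found where a record should start. */
typedef enum
{
    SKETCH_HDR_EMPTY,           /* all zeros: plausibly the true end of WAL */
    SKETCH_HDR_GARBAGE,         /* zero length but stray bytes elsewhere */
    SKETCH_HDR_HAS_LENGTH       /* nonzero length: run the normal validation */
} SketchHeaderClass;

/* Assumes hdr_len >= sizeof(uint32_t), e.g. the 24-byte record header. */
SketchHeaderClass
sketch_classify_header(const unsigned char *hdr, size_t hdr_len)
{
    uint32_t    tot_len;
    size_t      i;

    /* Copy out the length field without relying on buffer alignment. */
    memcpy(&tot_len, hdr, sizeof(tot_len));
    if (tot_len != 0)
        return SKETCH_HDR_HAS_LENGTH;

    /* Length is zero: accept as end-of-WAL only if every byte is zero. */
    for (i = 0; i < hdr_len; i++)
    {
        if (hdr[i] != 0)
            return SKETCH_HDR_GARBAGE;
    }
    return SKETCH_HDR_EMPTY;
}
```

Under this scheme a zero length alone no longer counts as a clean end of WAL, which narrows (but cannot fully close) the window for mistaking corruption for end-of-WAL.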

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 1d5f6e707f8d67172eea79689c8a5f4d86889d3e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v6] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c       | 89 +++++++++++++++++++------
 src/backend/access/transam/xlogreader.c | 42 ++++++++++++
 src/backend/replication/walreceiver.c   |  3 +-
 src/include/access/xlogreader.h         |  1 +
 4 files changed, 111 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5cda30836f..e90c69810b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4477,6 +4477,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         ReadRecPtr = xlogreader->ReadRecPtr;
@@ -4494,6 +4495,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * NULL ReadRecPtr means we could not read a record at
+                 * beginning. In that case EndRecPtr is storing the LSN of the
+                 * record we tried to read.
+                 */
+                ErrRecPtr = ReadRecPtr ? ReadRecPtr : EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -4503,13 +4514,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we get here for other than end-of-wal, emit the error message
+             * right now. Otherwise the message if any is shown as a part of
+             * the end-of-WAL message below.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -4540,11 +4550,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4557,11 +4568,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4609,12 +4626,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             *  If we haven't emitted an error message, we have safely reached
+             *  the end-of-WAL.
+             */
+            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            {
+                char *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+                                xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -7544,7 +7582,7 @@ StartupXLOG(void)
         else
         {
             /* just have to read next record after CheckPoint */
-            record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID);
+            record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID);
         }
 
         if (record != NULL)
@@ -7781,7 +7819,7 @@ StartupXLOG(void)
                 }
 
                 /* Else, try to fetch the next WAL record */
-                record = ReadRecord(xlogreader, LOG, false, ThisTimeLineID);
+                record = ReadRecord(xlogreader, WARNING, false, ThisTimeLineID);
             } while (record != NULL);
 
             /*
@@ -7841,13 +7879,20 @@ StartupXLOG(void)
 
             InRedo = false;
         }
-        else
+        else if (xlogreader->EndOfWAL)
         {
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
 
         }
+        else
+        {
+            /* broken record found */
+            ereport(WARNING,
+                    (errmsg("redo is skipped"),
+                     errhint("This suggests WAL file corruption. You might need to check the database.")));
+        }
 
         /*
          * This check is intentionally after the above log messages that
@@ -13147,7 +13192,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index f39f8044a9..273b927cd9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,16 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message,
+         * as it should be more detailed.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -730,6 +742,36 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+    if (record->xl_tot_len == 0)
+    {
+        /*
+         * We have almost certainly reached the end of WAL; verify that the
+         * whole header is zeroed, checking the bound before each dereference.
+         */
+        char   *p = (char *)record;
+        char   *pe = (char *)record + SizeOfXLogRecord;
+
+        while (p < pe && *p == 0)
+            p++;
+
+        if (p == pe)
+        {
+            /* it is completely zeroed, call it a day  */
+            report_invalid_record(state, "empty record header found at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+        }
+        else
+        {
+            /* Otherwise we found a garbage header.. */
+            report_invalid_record(state, "garbage record header at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+        }
+
+        return false;
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7a7eb3784e..ba3c4bd550 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,8 +471,7 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
+                                    (errmsg("replication terminated by primary server on timeline %u at %X/%X",
                                                startpointTLI,
                                                LSN_FORMAT_ARGS(LogstreamResult.Write))));
                             endofwal = true;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index de6fd791fe..1241b85838 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* the last attempt was EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
-- 
2.27.0


Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Tue, 09 Nov 2021 16:27:51 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> This is the updated version.
> 
> - emode_for_corrupt_record() now uses currentSource instead of
>   readSource.
> 
> - If a zero record length is encountered, make sure the whole header
>   is zeroed before treating it as the end-of-WAL.
> 
> - Do not overwrite an existing message when a missing contrecord is
>   detected.  The message added here is seen in the TAP test log
>   026_overwrite_contrecord_standby.log
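
The zero-length check in the second item can be sketched in isolation as follows. This is only a minimal illustration, not the patch code: `HeaderClass`, `classify_header` and the separate `tot_len` argument are hypothetical names introduced here, and the header is treated as a plain byte buffer rather than an `XLogRecord`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a record header; not a PostgreSQL type. */
typedef enum
{
    HDR_END_OF_WAL,     /* length is zero and every header byte is zero */
    HDR_GARBAGE,        /* length is zero but some other byte is set */
    HDR_HAS_LENGTH      /* non-zero length: continue with normal validation */
} HeaderClass;

/*
 * Classify a record header given its raw bytes and its total-length field.
 * In the patch the length is a field of the header itself; it is passed
 * separately here to keep the sketch independent of the real struct layout.
 */
static HeaderClass
classify_header(const unsigned char *hdr, size_t hdr_size, uint32_t tot_len)
{
    size_t      i;

    assert(hdr != NULL);

    if (tot_len != 0)
        return HDR_HAS_LENGTH;

    /* The loop bound is tested before each read, so we never overrun. */
    for (i = 0; i < hdr_size; i++)
    {
        if (hdr[i] != 0)
            return HDR_GARBAGE;
    }

    return HDR_END_OF_WAL;
}
```

Under these assumptions, only an all-zero header is reported as end-of-WAL; a zero length combined with any non-zero byte is treated as corruption and would keep its louder message.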

d2ddfa681db27a138acb63c8defa8cc6fa588922 removed the global variables
ReadRecPtr and EndRecPtr. This is a rebased version that reads the LSNs
directly from xlogreader instead of the removed global variables.
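
As a rough sketch of how the error position can be derived from the reader state alone: the struct and function below are hypothetical stand-ins (not `XLogReaderState`), and the rule that an aborted-record pointer takes precedence is an assumption based on how the patch assigns `ErrRecPtr`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;               /* stand-in for XLogRecPtr */
#define INVALID_LSN ((Lsn) 0)

/* Hypothetical subset of the reader state used by this sketch. */
typedef struct
{
    Lsn         read_rec_ptr;       /* start of last record read, or 0 */
    Lsn         end_rec_ptr;        /* end+1 of last record read */
    Lsn         aborted_rec_ptr;    /* start of a broken contrecord, or 0 */
} ReaderState;

/*
 * Pick the LSN to report for a failed read.  An invalid read_rec_ptr means
 * no record was read at all, in which case end_rec_ptr holds the position
 * of the record we tried to read.
 */
static Lsn
error_lsn(const ReaderState *r)
{
    assert(r != NULL);

    if (r->aborted_rec_ptr != INVALID_LSN)
        return r->aborted_rec_ptr;

    return (r->read_rec_ptr != INVALID_LSN) ? r->read_rec_ptr : r->end_rec_ptr;
}
```

Keeping the choice in one helper means every message about the same failure names the same LSN.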

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From cc521692a9f98fabde07e248b63f1222f8406de1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v7] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c       | 89 +++++++++++++++++++------
 src/backend/access/transam/xlogreader.c | 42 ++++++++++++
 src/backend/replication/walreceiver.c   |  3 +-
 src/include/access/xlogreader.h         |  1 +
 4 files changed, 112 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d894af310a..fa435faec4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4469,6 +4469,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -4484,6 +4485,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * NULL ReadRecPtr means we could not read a record at
+                 * beginning. In that case EndRecPtr is storing the LSN of the
+                 * record we tried to read.
+                 */
+                ErrRecPtr =
+                    xlogreader->ReadRecPtr ?
+                    xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -4493,12 +4506,11 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we get here for a reason other than end-of-WAL, emit the
+             * error message right now.  Otherwise the message, if any, is
+             * shown as part of the end-of-WAL message below.
              */
-            if (errormsg)
+            if (!xlogreader->EndOfWAL && errormsg)
                 ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
@@ -4530,11 +4542,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4547,11 +4560,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4599,12 +4618,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             *  If we haven't emitted an error message, we have safely reached
+             *  the end-of-WAL.
+             */
+            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            {
+                char *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+                                xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -7536,7 +7576,7 @@ StartupXLOG(void)
         else
         {
             /* just have to read next record after CheckPoint */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         }
 
         if (record != NULL)
@@ -7774,7 +7814,7 @@ StartupXLOG(void)
                 }
 
                 /* Else, try to fetch the next WAL record */
-                record = ReadRecord(xlogreader, LOG, false, replayTLI);
+                record = ReadRecord(xlogreader, WARNING, false, replayTLI);
             } while (record != NULL);
 
             /*
@@ -7834,13 +7874,20 @@ StartupXLOG(void)
 
             InRedo = false;
         }
-        else
+        else if (xlogreader->EndOfWAL)
         {
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
 
         }
+        else
+        {
+            /* broken record found */
+            ereport(WARNING,
+                    (errmsg("redo is skipped"),
+                     errhint("This suggests WAL file corruption. You might need to check the database.")));
+        }
 
         /*
          * This check is intentionally after the above log messages that
@@ -13130,7 +13177,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3a7de02565..e16b6fe041 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,16 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message,
+         * as it should be more detailed.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -730,6 +742,36 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+    if (record->xl_tot_len == 0)
+    {
+        /*
+         * We have almost certainly reached the end of WAL; verify that the
+         * whole header is zeroed, checking the bound before each dereference.
+         */
+        char   *p = (char *)record;
+        char   *pe = (char *)record + SizeOfXLogRecord;
+
+        while (p < pe && *p == 0)
+            p++;
+
+        if (p == pe)
+        {
+            /* it is completely zeroed, call it a day  */
+            report_invalid_record(state, "empty record header found at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+        }
+        else
+        {
+            /* Otherwise we found a garbage header.. */
+            report_invalid_record(state, "garbage record header at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+        }
+
+        return false;
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7a7eb3784e..ba3c4bd550 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,8 +471,7 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
+                                    (errmsg("replication terminated by primary server on timeline %u at %X/%X",
                                                startpointTLI,
                                                LSN_FORMAT_ARGS(LogstreamResult.Write))));
                             endofwal = true;
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index de6fd791fe..1241b85838 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* the last attempt was EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
-- 
2.27.0


Re: Make mesage at end-of-recovery less scary.

From
Pavel Borisov
Date:
> d2ddfa681db27a138acb63c8defa8cc6fa588922 removed the global variables
> ReadRecPtr and EndRecPtr. This is a rebased version that reads the LSNs
> directly from xlogreader instead of the removed global variables.

Hi, hackers!

I've checked the latest version of the patch. It applies cleanly, check-world passes, and CI is also green.
The proposed messages seem good to me, but it would probably be better to have a test for the conditions under which "reached end of WAL..." is emitted.
After that, I believe it can be set to 'ready for committer'.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Mon, 24 Jan 2022 14:23:33 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in 
> >
> > d2ddfa681db27a138acb63c8defa8cc6fa588922 removed global variables
> > ReadRecPtr and EndRecPtr. This is rebased version that reads the LSNs
> > directly from xlogreader instead of the removed global variables.
> >
> 
> Hi, hackers!
> 
> I've checked the latest version of the patch. It applies cleanly, check-world
> passes, and CI is also green.
> The proposed messages seem good to me, but it would probably be better to have
> a test for the conditions under which "reached end of WAL..." is emitted.
> After that, I believe it can be set to 'ready for committer'.

Thanks for checking that, and the comment!

I thought that we usually don't test log messages, but in the end I
found that I needed such a test.  While writing it, I found another
mode of end-of-WAL and a bug that emits a spurious message.

The changes in v8 are:

- Added tests to 011_crash_recovery.pl

- Fixed a bug where the server emits "end-of-wal" messages even if it
  has already emitted an error message for the same LSN.

- Changed XLogReaderValidatePageHeader() so that it recognizes an
  empty page as end-of-WAL.

- Made pg_waldump aware of end-of-WAL.
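
The second item (no "end-of-wal" message once an error has already been reported for the same LSN) can be sketched as below. This is a simplified model, not the patch code: `ComplaintState`, `level_for_report` and `should_report_end_of_wal` are hypothetical names, the levels are a local enum rather than the elog levels, and it assumes that the level-selection function remembers the last LSN it complained about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;               /* stand-in for XLogRecPtr */

/* Local severity levels, ordered from quietest to loudest. */
typedef enum { LEVEL_DEBUG, LEVEL_LOG, LEVEL_WARNING } Level;

/* Remembers the last LSN for which a complaint was issued. */
typedef struct
{
    Lsn         last_complaint;
} ComplaintState;

/*
 * Return the level to use for a report at 'lsn'.  A repeated complaint about
 * the same LSN read from pg_wal is demoted to debug level.
 */
static Level
level_for_report(ComplaintState *st, Level requested, Lsn lsn, bool from_pg_wal)
{
    assert(st != NULL);

    if (from_pg_wal && requested <= LEVEL_WARNING)
    {
        if (lsn == st->last_complaint)
            return LEVEL_DEBUG;
        st->last_complaint = lsn;
    }
    return requested;
}

/* The end-of-WAL message is shown only if nothing was reported at 'lsn'. */
static bool
should_report_end_of_wal(ComplaintState *st, Lsn lsn, bool from_pg_wal)
{
    return level_for_report(st, LEVEL_LOG, lsn, from_pg_wal) == LEVEL_LOG;
}
```

With this shape, a WARNING about a broken record at some LSN suppresses a later "reached end of WAL" message for that same LSN, while a clean end-of-WAL at a fresh LSN is still logged.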

While working on the last item, I noticed that pg_waldump shows the
wrong LSN as the error position.  Specifically, it reports the LSN of
the last valid WAL record as the error position.  I will post a bug-fix
patch for the issue after confirming it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 0f1024bdfba9d1926465351fa1b7125698a21e8d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v8] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  91 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  64 +++++++++++++
 src/backend/replication/walreceiver.c     |   3 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++-
 6 files changed, 254 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 58922f7ede..c08b9554b3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * NULL ReadRecPtr means we could not read a record at the
+                 * beginning. In that case EndRecPtr is storing the LSN of the
+                 * record we tried to read.
+                 */
+                ErrRecPtr =
+                    xlogreader->ReadRecPtr ?
+                    xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we get here for a reason other than end-of-WAL, emit the
+             * error message right now.  Otherwise the message, if any, is
+             * shown as part of the end-of-WAL message below.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             *  If we haven't emitted an error message, we have safely reached
+             *  the end-of-WAL.
+             */
+            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            {
+                char *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+                                xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -7544,7 +7584,7 @@ StartupXLOG(void)
         else
         {
             /* just have to read next record after CheckPoint */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         }
 
         if (record != NULL)
@@ -7782,7 +7822,7 @@ StartupXLOG(void)
                 }
 
                 /* Else, try to fetch the next WAL record */
-                record = ReadRecord(xlogreader, LOG, false, replayTLI);
+                record = ReadRecord(xlogreader, WARNING, false, replayTLI);
             } while (record != NULL);
 
             /*
@@ -7842,13 +7882,20 @@ StartupXLOG(void)
 
             InRedo = false;
         }
-        else
+        else if (xlogreader->EndOfWAL)
         {
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
 
         }
+        else
+        {
+            /* broken record found */
+            ereport(WARNING,
+                    (errmsg("redo is skipped"),
+                     errhint("This suggests WAL file corruption. You might need to check the database.")));
+        }
 
         /*
          * This check is intentionally after the above log messages that
@@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..55f54cd98d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -730,6 +741,36 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+    if (record->xl_tot_len == 0)
+    {
+        /*
+         * We have almost certainly reached the end of WAL; verify that the
+         * whole header is zeroed, checking the bound before each dereference.
+         */
+        char   *p = (char *)record;
+        char   *pe = (char *)record + SizeOfXLogRecord;
+
+        while (p < pe && *p == 0)
+            p++;
+
+        if (p == pe)
+        {
+            /* it is completely zeroed, call it a day  */
+            report_invalid_record(state, "empty record header found at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+        }
+        else
+        {
+            /* Otherwise the header is corrupted. */
+            report_invalid_record(state, "garbage record header at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+        }
+
+        return false;
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
@@ -836,6 +877,29 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
     XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+    StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+    if (hdr->xlp_magic == 0)
+    {
+        /* Regard an empty page as End-Of-WAL */
+        int    i;
+
+        for (i = 0 ; i < XLOG_BLCKSZ && phdr[i] == 0 ; i++);
+        if (i == XLOG_BLCKSZ)
+        {
+            char        fname[MAXFNAMELEN];
+
+            XLogFileName(fname, state->seg.ws_tli, segno,
+                         state->segcxt.ws_segsize);
+
+            report_invalid_record(state,
+                                  "empty page in log segment %s, offset %u",
+                                  fname,
+                                  offset);
+            state->EndOfWAL = true;
+            return false;
+        }
+    }
     if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
     {
         char        fname[MAXFNAMELEN];
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..3034f8281e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,8 +471,7 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
+                                    (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
                                                startpointTLI,
                                                LSN_FORMAT_ARGS(LogstreamResult.Write))));
                             endofwal = true;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
         exit(0);
 
     if (errormsg)
-        fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-                    errormsg);
+    {
+        if (xlogreader_state->EndOfWAL)
+            pg_log_info("end of WAL at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+        else
+            fatal_error("error in WAL record at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+    }
 
     XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..3eeba220a1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* the last attempt was EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..b793280a5c 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, $reached_eow_pat, $logstart));
+    sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+       qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+                 SELECT pg_switch_wal();
+                 CHECKPOINT;
+                 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+    $lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+    "LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, "WARNING:  garbage record header at 0/$lastlsn",
+                         $logstart));
+    sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# and the end-of-wal messages shouldn't be seen
+# the same message has been confirmed in the past
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0


Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Tue, 25 Jan 2022 17:34:56 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> This v8 is changed in...
> 
> - Added tests to 011_crash_recovery.pl
> 
> - Fixed a bug where the server emitted "end-of-wal" messages even if
>   it had already emitted an error message for the same LSN.
> 
> - Changed XLogReaderValidatePageHeader() so that it recognizes an
>   empty page as end-of-WAL.
> 
> - Made pg_waldump conscious of end-of-wal.
> 
> While doing the last item, I noticed that pg_waldump shows the wrong
> LSN as the error position.  Concretely it emits the LSN of the last
> sound WAL record as the error position.  I will post a bug-fix patch
> for the issue after confirmation.

I noticed that the "garbage record header" error message I added is
redundant, since that case is just a kind of invalid record length.  So
I removed the message.  With that change, the end-of-WAL logic in
ValidXLogRecordHeader and XLogReaderValidatePageHeader follows the same
flow.
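
As an illustration of that shared flow (a minimal standalone sketch, not
the patch code itself): a zero total length is treated as end-of-WAL only
when the whole header is zeroed, and every other case falls through to
the ordinary length check. The names classify_header, bytes_all_zero and
HeaderVerdict are hypothetical, and hdr_size merely stands in for
SizeOfXLogRecord.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts; the real reader reports via its error buffer. */
typedef enum
{
    HDR_END_OF_WAL,         /* length 0 and the header is all zeroes */
    HDR_INVALID_LENGTH,     /* caught by the usual length check */
    HDR_LENGTH_OK
} HeaderVerdict;

/*
 * Return true if the first len bytes at buf are all zero.  The bound is
 * tested before each byte is read, so buf[len] is never touched.
 */
bool
bytes_all_zero(const char *buf, size_t len)
{
    size_t      i;

    for (i = 0; i < len; i++)
    {
        if (buf[i] != 0)
            return false;
    }
    return true;
}

/* hdr_size is an assumption standing in for SizeOfXLogRecord. */
HeaderVerdict
classify_header(const char *hdr, size_t hdr_size, uint32_t tot_len)
{
    assert(hdr != NULL);

    /* A fully zeroed header is regarded as a clean end of WAL. */
    if (tot_len == 0 && bytes_all_zero(hdr, hdr_size))
        return HDR_END_OF_WAL;

    /* Anything else that is too short is an invalid record length. */
    if (tot_len < hdr_size)
        return HDR_INVALID_LENGTH;

    return HDR_LENGTH_OK;
}
```

A header whose length field is zero but which contains other non-zero
bytes therefore needs no message of its own; it is reported through the
same "invalid record length" path as any other short record.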

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 57cb251f7cacbb96066ead4543b9f12f5b3c7062 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v9] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL". In
exchange, raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  91 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  61 ++++++++++++
 src/backend/replication/walreceiver.c     |   3 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++-
 6 files changed, 251 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dfe2a0bcce..5727e0939f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * NULL ReadRecPtr means we could not read a record at the
+                 * beginning. In that case EndRecPtr is storing the LSN of the
+                 * record we tried to read.
+                 */
+                ErrRecPtr =
+                    xlogreader->ReadRecPtr ?
+                    xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we get here for other than end-of-wal, emit the error message
+             * right now. Otherwise the message if any is shown as a part of
+             * the end-of-WAL message below.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             *  If we haven't emit an error message, we have safely reached the
+             *  end-of-WAL.
+             */
+            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            {
+                char *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+                                xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -7544,7 +7584,7 @@ StartupXLOG(void)
         else
         {
             /* just have to read next record after CheckPoint */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         }
 
         if (record != NULL)
@@ -7782,7 +7822,7 @@ StartupXLOG(void)
                 }
 
                 /* Else, try to fetch the next WAL record */
-                record = ReadRecord(xlogreader, LOG, false, replayTLI);
+                record = ReadRecord(xlogreader, WARNING, false, replayTLI);
             } while (record != NULL);
 
             /*
@@ -7842,13 +7882,20 @@ StartupXLOG(void)
 
             InRedo = false;
         }
-        else
+        else if (xlogreader->EndOfWAL)
         {
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
 
         }
+        else
+        {
+            /* broken record found */
+            ereport(WARNING,
+                    (errmsg("redo is skipped"),
+                     errhint("This suggests WAL file corruption. You might need to check the database.")));
+        }
 
         /*
          * This check is intentionally after the above log messages that
@@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..418fb66ef2 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -730,6 +741,31 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+    if (record->xl_tot_len == 0)
+    {
+        /*
+         * We are almost sure reaching the end of WAL, make sure that the whole
+         * header is filled with zeroes.
+         */
+        char   *p = (char *)record;
+        char   *pe = (char *)record + SizeOfXLogRecord;
+
+        while (*p == 0 && p < pe)
+            p++;
+
+        if (p == pe)
+        {
+            /* it is completely zeroed, call it a day  */
+            report_invalid_record(state, "empty record header found at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid record length */
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
@@ -836,6 +872,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
     XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+    StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+    if (hdr->xlp_magic == 0)
+    {
+        /* Regard an empty page as End-Of-WAL */
+        int    i;
+
+        for (i = 0 ; i < XLOG_BLCKSZ && phdr[i] == 0 ; i++);
+        if (i == XLOG_BLCKSZ)
+        {
+            char        fname[MAXFNAMELEN];
+
+            XLogFileName(fname, state->seg.ws_tli, segno,
+                         state->segcxt.ws_segsize);
+
+            report_invalid_record(state,
+                                  "empty page in log segment %s, offset %u",
+                                  fname,
+                                  offset);
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid magic number */
+    }
     if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
     {
         char        fname[MAXFNAMELEN];
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..3034f8281e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,8 +471,7 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
+                                    (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
                                                startpointTLI,
                                                LSN_FORMAT_ARGS(LogstreamResult.Write))));
                             endofwal = true;
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
         exit(0);
 
     if (errormsg)
-        fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-                    errormsg);
+    {
+        if (xlogreader_state->EndOfWAL)
+            pg_log_info("end of WAL at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+        else
+            fatal_error("error in WAL record at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+    }
 
     XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..3eeba220a1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* the last attempt was EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..67d264df26 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, $reached_eow_pat, $logstart));
+    sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+       qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+                 SELECT pg_switch_wal();
+                 CHECKPOINT;
+                 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+    $lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+    "LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+                         $logstart));
+    sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# and the end-of-wal messages shouldn't be seen
+# the same message has been confirmed in the past
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0


Re: Make mesage at end-of-recovery less scary.

From
Pavel Borisov
Date:
> This v8 is changed in...
>
> - Added tests to 011_crash_recovery.pl
>
> - Fixed a bug where the server emitted "end-of-wal" messages even if
>   it had already emitted an error message for the same LSN.
>
> - Changed XLogReaderValidatePageHeader() so that it recognizes an
>   empty page as end-of-WAL.
>
> - Made pg_waldump conscious of end-of-wal.
>
> While doing the last item, I noticed that pg_waldump shows the wrong
> LSN as the error position.  Concretely it emits the LSN of the last
> sound WAL record as the error position.  I will post a bug-fix patch
> for the issue after confirmation.

> I noticed that the "garbage record header" error message I added is
> redundant, since that case is just a kind of invalid record length.  So
> I removed the message.  With that change, the end-of-WAL logic in
> ValidXLogRecordHeader and XLogReaderValidatePageHeader follows the same
> flow.

Hi, Kyotaro!

I don't quite understand the meaning of this comment:
 /* it is completely zeroed, call it a day  */

Please also run pgindent on your code.

Otherwise the new patch seems ok.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
Hi, Pavel.

At Mon, 31 Jan 2022 15:17:09 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in 
> I don't quite understand the meaning of this comment:
>  /* it is completely zeroed, call it a day  */

While rethinking this comment, it occurred to me that
XLogReaderValidatePageHeader does a whole-page check.  There is no
clear reason not to do at least the same check here, so
ValidXLogRecordHeader is changed to check all bytes in the rest of the
page, instead of just the record header.
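
To make the "rest of the page" check concrete, here is a small standalone
sketch (not the code in the patch). It assumes a power-of-two page size;
SKETCH_BLCKSZ stands in for XLOG_BLCKSZ, and bytes_left_in_page and
rest_of_page_is_zero are hypothetical names. The end pointer is computed
the same way as the pe expression quoted further down in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stands in for XLOG_BLCKSZ, which is a power of two. */
#define SKETCH_BLCKSZ 8192

/* Number of bytes from lsn up to the end of the page that contains it. */
size_t
bytes_left_in_page(uint64_t lsn)
{
    return SKETCH_BLCKSZ - (size_t) (lsn & (SKETCH_BLCKSZ - 1));
}

/*
 * rec points at the byte for lsn inside an in-memory page image.
 * Return true if every byte from rec to the end of that page is zero.
 */
bool
rest_of_page_is_zero(const char *rec, uint64_t lsn)
{
    const char *p = rec;
    const char *pe;

    assert(rec != NULL);
    pe = rec + bytes_left_in_page(lsn);

    /* Test the bound first so that *pe is never read. */
    while (p < pe && *p == 0)
        p++;

    return p == pe;
}
```

Bytes before the record start do not matter here: only the tail of the
page, from the record's position onward, has to be zero for the position
to be regarded as end-of-WAL.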

While working on that, I noticed another end-of-WAL case: an unexpected
pageaddr.  I think we can assume it is safe when the pageaddr is smaller
than expected (or rather, we have no choice but to assume
so). XLogReaderValidatePageHeader is changed accordingly. But I'm not
sure whether others regard it as a form of safe end-of-WAL.
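
The rule proposed above can be sketched as a three-way classification.
This is only an illustration under stated assumptions, not the patch
code: classify_pageaddr and PageAddrVerdict are hypothetical names, and
found_addr and expected_addr stand for the page address stored in the
page header and the address the reader expected at that position.

```c
#include <stdint.h>

/* Hypothetical verdicts for a page-address comparison. */
typedef enum
{
    PAGE_ADDR_OK,           /* the page is the one we asked for */
    PAGE_ADDR_END_OF_WAL,   /* older page: assumed to be leftover data */
    PAGE_ADDR_CORRUPT       /* newer page than expected: not explainable */
} PageAddrVerdict;

PageAddrVerdict
classify_pageaddr(uint64_t found_addr, uint64_t expected_addr)
{
    if (found_addr == expected_addr)
        return PAGE_ADDR_OK;

    /*
     * A smaller address is what a page left over from earlier use of the
     * segment file would carry, so it is assumed to mark the end of WAL.
     */
    if (found_addr < expected_addr)
        return PAGE_ADDR_END_OF_WAL;

    return PAGE_ADDR_CORRUPT;
}
```

Whether the smaller-address case should really count as a safe
end-of-WAL is exactly the open question raised above; the sketch only
spells out the proposed behaviour.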

> Please also run pgindent on your code.

Hmm. I'm not sure we need to do that at this stage. pgindent makes
changes across the whole file, including parts unrelated to this patch.
Anyway, I ran it and then removed the irrelevant edits.

pgindent makes one seemingly not-great suggestion.

+        char       *pe =
+        (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));

I'm not sure this is intended, so I split the line in two: a declaration
and a separate assignment.

> Otherwise the new patch seems ok.

Thanks!

Version 10 changes the following points.

- Rewrote the comment in ValidXLogRecordHeader.
- ValidXLogRecordHeader

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 9eacdd050a8041b358df11ca3e18c1071b693d20 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v10] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL". In
exchange, raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  91 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  77 +++++++++++++++
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++-
 6 files changed, 269 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dfe2a0bcce..378c13ccf7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * NULL ReadRecPtr means we could not read a record at the
+                 * beginning. In that case EndRecPtr is storing the LSN of the
+                 * record we tried to read.
+                 */
+                ErrRecPtr =
+                    xlogreader->ReadRecPtr ?
+                    xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we get here for a reason other than end-of-WAL, emit the
+             * error message right now. Otherwise the message, if any, is
+             * shown as part of the end-of-WAL message below.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             * If we haven't emitted an error message, we have safely reached
+             * the end-of-WAL.
+             */
+            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            {
+                char       *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+                                xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -7544,7 +7584,7 @@ StartupXLOG(void)
         else
         {
             /* just have to read next record after CheckPoint */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         }
 
         if (record != NULL)
@@ -7782,7 +7822,7 @@ StartupXLOG(void)
                 }
 
                 /* Else, try to fetch the next WAL record */
-                record = ReadRecord(xlogreader, LOG, false, replayTLI);
+                record = ReadRecord(xlogreader, WARNING, false, replayTLI);
             } while (record != NULL);
 
             /*
@@ -7842,13 +7882,20 @@ StartupXLOG(void)
 
             InRedo = false;
         }
-        else
+        else if (xlogreader->EndOfWAL)
         {
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
 
         }
+        else
+        {
+            /* broken record found */
+            ereport(WARNING,
+                    (errmsg("redo is skipped"),
+                     errhint("This suggests WAL file corruption. You might need to check the database.")));
+        }
 
         /*
          * This check is intentionally after the above log messages that
@@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..9bcc4a2d37 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -730,6 +741,39 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+    if (record->xl_tot_len == 0)
+    {
+        /*
+         * We have almost certainly reached the end of WAL; make sure that
+         * the rest of the page after the record is filled with zeroes.
+         */
+        char       *p = (char *) record;
+        char       *pe;
+
+        /* set pe to the beginning of the next page */
+        pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (p < pe && *p == 0)
+            p++;
+
+        if (p == pe)
+        {
+            /*
+             * The page after the record is completely zeroed. That suggests
+             * we don't have a record after this point. We don't bother
+             * checking the pages after since they are not zeroed in the case
+             * of recycled segments.
+             */
+            report_invalid_record(state, "empty record found at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid record length */
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
@@ -836,6 +880,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
     XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+    StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+    if (hdr->xlp_magic == 0)
+    {
+        /* Regard an empty page as End-Of-WAL */
+        int            i;
+
+        for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+        if (i == XLOG_BLCKSZ)
+        {
+            char        fname[MAXFNAMELEN];
+
+            XLogFileName(fname, state->seg.ws_tli, segno,
+                         state->segcxt.ws_segsize);
+
+            report_invalid_record(state,
+                                  "empty page in log segment %s, offset %u",
+                                  fname,
+                                  offset);
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid magic number */
+    }
     if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
     {
         char        fname[MAXFNAMELEN];
@@ -921,6 +990,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
                               LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                               fname,
                               offset);
+
+        /*
+         * If the page address is less than expected we assume it is an unused
+         * page in a recycled segment.
+         */
+        if (hdr->xlp_pageaddr < recaddr)
+            state->EndOfWAL = true;
+
         return false;
     }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..8e1fa32489 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,10 +471,9 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                               startpointTLI,
-                                               LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                    (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
+                                            startpointTLI,
+                                            LSN_FORMAT_ARGS(LogstreamResult.Write))));
                             endofwal = true;
                             break;
                         }
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
         exit(0);
 
     if (errormsg)
-        fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-                    errormsg);
+    {
+        if (xlogreader_state->EndOfWAL)
+            pg_log_info("end of WAL at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+        else
+            fatal_error("error in WAL record at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+    }
 
     XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..3eeba220a1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* the last attempt was EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..67d264df26 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, $reached_eow_pat, $logstart));
+    sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+       qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+                 SELECT pg_switch_wal();
+                 CHECKPOINT;
+                 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+$chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+    $lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+    "LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+                         $logstart));
+    sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# and the end-of-wal messages shouldn't be seen
+# the same message has been confirmed in the past
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0
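
As a reading aid for the xlogreader.c hunks in the patch above, here is a minimal standalone sketch of how a page-header check might separate a harmless end of WAL from real corruption. It is not the patch's code: `PageHeaderSketch`, `classify_page`, `SKETCH_BLCKSZ` and `SKETCH_PAGE_MAGIC` are hypothetical stand-ins, and the real XLogReaderValidatePageHeader performs more checks than shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ     8192      /* assumed WAL block size */
#define SKETCH_PAGE_MAGIC 0xD10D    /* hypothetical; only needs to be nonzero */

typedef enum
{
    PAGE_VALID,
    PAGE_END_OF_WAL,
    PAGE_CORRUPT
} PageVerdict;

/* Hypothetical, heavily simplified page header. */
typedef struct
{
    uint16_t    magic;
    uint64_t    pageaddr;
} PageHeaderSketch;

/*
 * Classify one WAL page, following the three rules the patch adds:
 * an all-zero page is end-of-WAL, a page address lower than expected is
 * taken as a stale page in a recycled segment (also end-of-WAL), and
 * anything else that fails validation is corruption.
 */
static PageVerdict
classify_page(const unsigned char *page, uint64_t expected_addr)
{
    PageHeaderSketch hdr;
    size_t      i;

    assert(page != NULL);
    memcpy(&hdr, page, sizeof(hdr));

    if (hdr.magic == 0)
    {
        /* a zero magic is only benign if the whole page is zero */
        for (i = 0; i < SKETCH_BLCKSZ && page[i] == 0; i++)
            ;
        if (i == SKETCH_BLCKSZ)
            return PAGE_END_OF_WAL;
    }

    if (hdr.magic != SKETCH_PAGE_MAGIC)
        return PAGE_CORRUPT;

    if (hdr.pageaddr != expected_addr)
        return (hdr.pageaddr < expected_addr) ? PAGE_END_OF_WAL : PAGE_CORRUPT;

    return PAGE_VALID;
}
```

A caller in the spirit of the patched pg_waldump would report PAGE_END_OF_WAL at an informational level and treat only PAGE_CORRUPT as a fatal error.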


Re: Make mesage at end-of-recovery less scary.

From
Pavel Borisov
Date:
> This version 10 is changed in the following points.
>
> - Rewrote the comment in ValidXLogRecordHeader.
> - ValidXLogRecordHeader

Thanks!

Maybe it can be written a little bit shorter:
pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
as
pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
?


It is true that pgindent sometimes reflows the formatting of unrelated blocks. But I think it's right to manually keep only the pgindent-ed code that is related to the patch. The leftovers get pgindent-ed on a schedule anyway, so there is no need to bother.

I'd like to set v10 as RfC.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Tue, 1 Feb 2022 12:38:01 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in 
> Maybe it can be written a little bit shorter:
> pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
> as
> pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
> ?

That difference is a matter of taste, but I find it cleaner to
separate the declaration and the assignment for both p and pe.
It now looks like the following.

>    char       *p;
>    char       *pe;
>
>    /* scan from the beginning of the record to the end of block */
>    p  = (char *) record;
>    pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
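
To illustrate the pointer arithmetic quoted above, here is a minimal standalone sketch of the "is the rest of the page zero?" scan. It assumes an 8192-byte WAL block and uses a stand-in `XLogRecPtr` typedef. The helper names `bytes_to_page_end` and `rest_of_page_is_zero` are hypothetical and do not appear in the patch. The sketch tests the bound before dereferencing so that the byte at `pe` itself is never read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192            /* assumed default WAL block size */

typedef uint64_t XLogRecPtr;        /* stand-in for the PostgreSQL typedef */

/*
 * Number of bytes from RecPtr up to the end of the WAL page containing it.
 * XLOG_BLCKSZ is a power of two, so the mask yields the offset in the page.
 */
static size_t
bytes_to_page_end(XLogRecPtr RecPtr)
{
    return XLOG_BLCKSZ - (size_t) (RecPtr & (XLOG_BLCKSZ - 1));
}

/*
 * Hypothetical helper: true if every byte from the start of the record to
 * the end of its page is zero.  "record" must point at the byte that
 * corresponds to RecPtr inside a page-sized buffer.
 */
static bool
rest_of_page_is_zero(const char *record, XLogRecPtr RecPtr)
{
    const char *p;
    const char *pe;

    assert(record != NULL);

    /* scan from the beginning of the record to the end of the block */
    p = record;
    pe = p + bytes_to_page_end(RecPtr);

    /* check the bound first, then the byte */
    while (p < pe && *p == 0)
        p++;

    return p == pe;
}
```

For example, for LSN 0/1551048 the in-page offset is 0x1048 (4168), so 4024 bytes remain to be scanned; a single nonzero byte anywhere in that range makes the helper return false.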


> It is true that pgindent sometimes reflows the formatting of unrelated
> blocks. But I think it's right to manually keep only the pgindent-ed code
> that is related to the patch. The leftovers get pgindent-ed on a schedule
> anyway, so there is no need to bother.

Yeah, I meant that it is a bit annoying to un-pgindent unrelated
edits :p

> I'd like to set v10 as RfC.

Thanks!  The suggested change is done in the attached v11.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 491416866920f8f9648dee9c0571022f71553879 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v11] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
such as "invalid record length" that suggests something serious is
happening. In fact, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL". In
exchange, raise the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  91 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  78 +++++++++++++++
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++-
 6 files changed, 270 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dfe2a0bcce..378c13ccf7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * NULL ReadRecPtr means we could not read a record at the
+                 * beginning. In that case EndRecPtr is storing the LSN of the
+                 * record we tried to read.
+                 */
+                ErrRecPtr =
+                    xlogreader->ReadRecPtr ?
+                    xlogreader->ReadRecPtr : xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we get here for a reason other than end-of-WAL, emit the
+             * error message right now. Otherwise the message, if any, is
+             * shown as part of the end-of-WAL message below.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             * If we haven't emitted an error message, we have safely reached
+             * the end-of-WAL.
+             */
+            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            {
+                char       *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+                                xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -7544,7 +7584,7 @@ StartupXLOG(void)
         else
         {
             /* just have to read next record after CheckPoint */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         }
 
         if (record != NULL)
@@ -7782,7 +7822,7 @@ StartupXLOG(void)
                 }
 
                 /* Else, try to fetch the next WAL record */
-                record = ReadRecord(xlogreader, LOG, false, replayTLI);
+                record = ReadRecord(xlogreader, WARNING, false, replayTLI);
             } while (record != NULL);
 
             /*
@@ -7842,13 +7882,20 @@ StartupXLOG(void)
 
             InRedo = false;
         }
-        else
+        else if (xlogreader->EndOfWAL)
         {
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
 
         }
+        else
+        {
+            /* broken record found */
+            ereport(WARNING,
+                    (errmsg("redo is skipped"),
+                     errhint("This suggests WAL file corruption. You might need to check the database.")));
+        }
 
         /*
          * This check is intentionally after the above log messages that
@@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..03a8b42f15 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+    if (record->xl_tot_len == 0)
+    {
+        /*
+         * We have almost certainly reached the end of WAL; make sure that
+         * the rest of the page after the record is filled with zeroes.
+         */
+        char       *p;
+        char       *pe;
+
+        /* scan from the beginning of the record to the end of block */
+        p = (char *) record;
+        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (p < pe && *p == 0)
+            p++;
+
+        if (p == pe)
+        {
+            /*
+             * The page after the record is completely zeroed. That suggests
+             * we don't have a record after this point. We don't bother
+             * checking the pages after since they are not zeroed in the case
+             * of recycled segments.
+             */
+            report_invalid_record(state, "empty record found at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid record length */
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
@@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
     XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+    StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+    if (hdr->xlp_magic == 0)
+    {
+        /* Regard an empty page as End-Of-WAL */
+        int            i;
+
+        for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+        if (i == XLOG_BLCKSZ)
+        {
+            char        fname[MAXFNAMELEN];
+
+            XLogFileName(fname, state->seg.ws_tli, segno,
+                         state->segcxt.ws_segsize);
+
+            report_invalid_record(state,
+                                  "empty page in log segment %s, offset %u",
+                                  fname,
+                                  offset);
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid magic number */
+    }
     if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
     {
         char        fname[MAXFNAMELEN];
@@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
                               LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                               fname,
                               offset);
+
+        /*
+         * If the page address is less than expected we assume it is an unused
+         * page in a recycled segment.
+         */
+        if (hdr->xlp_pageaddr < recaddr)
+            state->EndOfWAL = true;
+
         return false;
     }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..8e1fa32489 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,10 +471,9 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                               startpointTLI,
-                                               LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                    (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
+                                            startpointTLI,
+                                            LSN_FORMAT_ARGS(LogstreamResult.Write))));
                             endofwal = true;
                             break;
                         }
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
         exit(0);
 
     if (errormsg)
-        fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-                    errormsg);
+    {
+        if (xlogreader_state->EndOfWAL)
+            pg_log_info("end of WAL at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+        else
+            fatal_error("error in WAL record at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+    }
 
     XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..3eeba220a1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* the last attempt was EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..67d264df26 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, $reached_eow_pat, $logstart));
+    select(undef, undef, undef, 0.5);    # built-in sleep() truncates 0.5 to 0
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+       qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+                 SELECT pg_switch_wal();
+                 CHECKPOINT;
+                 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+    $lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+    "LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+                         $logstart));
+    select(undef, undef, undef, 0.5);    # built-in sleep() truncates 0.5 to 0
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# and the end-of-wal messages shouldn't be seen
+# the same message has been confirmed in the past
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0


Re: Make mesage at end-of-recovery less scary.

From
Pavel Borisov
Date:
> Thanks!  The suggested change is done in the attached v11.

Thanks! v11 is a small refactoring of v10 that doesn't change behavior, so it is RfC as well.

--
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com

Re: Make mesage at end-of-recovery less scary.

From
Ashutosh Sharma
Date:
Hi,

Here are some of my review comments on the v11 patch:

-                       (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                       (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                        LSN_FORMAT_ARGS(ErrRecPtr),
+                                        replayTLI,
+                                        xlogSourceNames[currentSource])));

Why crash recovery? Won't this message get printed even during PITR?

I just did a PITR and could see these messages in the logfile.

2022-02-08 18:00:44.367 IST [86185] LOG:  starting point-in-time recovery to WAL location (LSN) "0/5227790"
2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not properly shut down; automatic recovery in progress
2022-02-08 18:00:44.369 IST [86185] LOG:  redo starts at 0/14DC8D8
2022-02-08 18:00:44.978 IST [86185] DEBUG1:  reached end of WAL at 0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering archive recovery

==

+           /*
+            * If we haven't emit an error message, we have safely reached the
+            * end-of-WAL.
+            */
+           if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+           {
+               char       *fmt;
+
+               if (StandbyMode)
+                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+               else if (InArchiveRecovery)
+                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+               else
+                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+               ereport(LOG,
+                       (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+                               xlogSourceNames[currentSource]),
+                        (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+           }

Doesn't it make sense to add an assert statement inside this if-block
that will check for xlogreader->EndOfWAL?

==

-            * We only end up here without a message when XLogPageRead()
-            * failed - in that case we already logged something. In
-            * StandbyMode that only happens if we have been triggered, so we
-            * shouldn't loop anymore in that case.
+            * If we get here for other than end-of-wal, emit the error
+            * message right now. Otherwise the message if any is shown as a
+            * part of the end-of-WAL message below.
             */

For consistency, I think we can replace "end-of-wal" with
"end-of-WAL". Please note that everywhere else in the comments you
have used "end-of-WAL". So why not the same here?

==

                            ereport(LOG,
-                                   (errmsg("replication terminated by primary server"),
-                                    errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                              startpointTLI,
-                                              LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                   (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
+                                           startpointTLI,
+                                           LSN_FORMAT_ARGS(LogstreamResult.Write))));

Is this change really required? I don't see any issue with the
existing error message.

==

Lastly, are we also planning to backport this patch?

--
With Regards,
Ashutosh Sharma.

On Wed, Feb 2, 2022 at 11:05 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Tue, 1 Feb 2022 12:38:01 +0400, Pavel Borisov <pashkin.elfe@gmail.com> wrote in
> > Maybe it can be written a little bit shorter:
> > pe = (char *) record + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
> > as
> > pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
> > ?
>
> That difference is a matter of taste, but I find it cleaner when the
> definition and the assignment are separated for both p and pe.
> It now looks like the following.
>
> >       char       *p;
> >       char       *pe;
> >
> >       /* scan from the beginning of the record to the end of block */
> >       p  = (char *) record;
> >       pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
>
>
> > The problem that pgindent sometimes reflows the formatting of unrelated
> > blocks does exist. But I think it is right to keep manually only the
> > pgindent changes that are related to the patch. The rest gets pgindent-ed
> > on a scheduled basis anyway, so there is no need to bother.
>
> Yeah, I meant that it is a bit annoying to un-pgindent unrelated
> edits :p
>
> > I'd like to set v10 as RfC.
>
> Thanks!  The suggested change is done in the attached v11.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
Hi, Ashutosh.

At Tue, 8 Feb 2022 18:35:34 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in 
> Here are some of my review comments on the v11 patch:

Thank you for taking a look at this.

> -                       (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
> +                       (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
> +                                        LSN_FORMAT_ARGS(ErrRecPtr),
> +                                        replayTLI,
> +                                        xlogSourceNames[currentSource])));
> 
> Why crash recovery? Won't this message get printed even during PITR?

It is in the if-block with the following condition.

>     * If archive recovery was requested, but we were still doing
>     * crash recovery, switch to archive recovery and retry using the
>     * offline archive. We have now replayed all the valid WAL in
>     * pg_wal, so we are presumably now consistent.
...
>    if (!InArchiveRecovery && ArchiveRecoveryRequested)

This means archive recovery has been requested but has not started yet.
That is, we have just finished crash recovery.  The existing comment
quoted above says the same thing.

At the end of PITR (or archive recovery), the other code path, quoted next, takes effect.

> /*
>  * If we haven't emit an error message, we have safely reached the
>  * end-of-WAL.
>  */
> if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
> {
>     char       *fmt;
> 
>     if (StandbyMode)
>         fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
>     else if (InArchiveRecovery)
>         fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
>     else
>         fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");

The last of the messages above is chosen when archive recovery is
not requested at all.
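
As a minimal standalone sketch of the selection described above: the three phase strings are taken from the quoted messages, while eow_phase_name and its parameters are hypothetical stand-ins for the StandbyMode, InArchiveRecovery and ArchiveRecoveryRequested globals, not code from the patch. The assertion encodes the reading that the crash-recovery wording is only reachable when archive recovery was never requested, because the requested-but-not-started case is handled earlier by the switch to archive recovery.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not part of the patch): return the phase named at
 * the end of the "reached end of WAL ..." message.
 */
static const char *
eow_phase_name(bool standby_mode, bool in_archive_recovery,
               bool archive_recovery_requested)
{
    if (standby_mode)
        return "standby mode";
    if (in_archive_recovery)
        return "archive recovery";

    /*
     * Assumption: the requested-but-not-yet-started case has already been
     * handled by switching to archive recovery and retrying.
     */
    assert(!archive_recovery_requested);
    return "crash recovery";
}
```

For the PITR run quoted in this thread, the DEBUG1 line comes from the earlier switch-over branch, so this helper would only be consulted later, with in_archive_recovery already true.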

> I just did a PITR and could see these messages in the logfile.

Yeah, the log lines show that the server starts with crash recovery
before running PITR.

> 2022-02-08 18:00:44.367 IST [86185] LOG:  starting point-in-time recovery to WAL location (LSN) "0/5227790"
> 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not properly shut down; automatic recovery in progress

Well, I guess that "automatic recovery" is ambiguous.  Would it make
more sense if the second line read as follows instead?

+ 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not properly shut down; crash recovery in progress

> 2022-02-08 18:00:44.369 IST [86185] LOG:  redo starts at 0/14DC8D8
> 2022-02-08 18:00:44.978 IST [86185] DEBUG1:  reached end of WAL at 0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering archive recovery

(I don't include this change in this patch since it would be another
issue.)

> ==
> 
> +           /*
> +            * If we haven't emit an error message, we have safely reached the
> +            * end-of-WAL.
> +            */
> +           if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
> +           {
> +               char       *fmt;
> +
> +               if (StandbyMode)
> +                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
> +               else if (InArchiveRecovery)
> +                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
> +               else
> +                   fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
> +
> +               ereport(LOG,
> +                       (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
> +                               xlogSourceNames[currentSource]),
> +                        (errormsg ? errdetail_internal("%s", errormsg) : 0)));
> +           }
> 
> Doesn't it make sense to add an assert statement inside this if-block
> that will check for xlogreader->EndOfWAL?

Good point.  On second thought, the condition there is flat-out wrong.
The message is "reached end of WAL", so the condition should be
EndOfWAL.  On the other hand, we never ensured that an error message
for the stop is emitted in every case, so I don't particularly want to
be strict on that point.

I made the following change for this.

-            if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
+            if (xlogreader->EndOfWAL)
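
To illustrate why the old condition was not an end-of-WAL test, here is a simplified standalone model of emode_for_corrupt_record(). It only answers whether this LSN has already been complained about; it knows nothing about why the read failed. The numeric level values, the boolean from_pg_wal parameter, the non-zero LSN precondition and the else branch (not visible in the quoted hunk) are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the PostgreSQL typedef */

/* Stand-ins for elog levels; only their relative order matters here. */
enum { DEBUG1 = 14, LOG = 15, WARNING = 19 };

/*
 * Simplified model: demote a repeated complaint about the same LSN to
 * DEBUG1. The XLOG_FROM_PG_WAL source check is reduced to a boolean.
 */
static int
emode_for_corrupt_record_model(int emode, XLogRecPtr rec_ptr, bool from_pg_wal)
{
    static XLogRecPtr last_complaint = 0;

    /* model precondition: a valid LSN, so it cannot match the initial 0 */
    assert(rec_ptr != 0);

    if (from_pg_wal && emode <= WARNING)
    {
        if (rec_ptr == last_complaint)
            emode = DEBUG1;     /* already reported once at this LSN */
        else
            last_complaint = rec_ptr;
    }
    return emode;
}
```

A first failure at a given LSN returns LOG whether the record was a zeroed tail or a corrupted header, which is why the result cannot distinguish the two cases.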



> ==
> 
> -            * We only end up here without a message when XLogPageRead()
> -            * failed - in that case we already logged something. In
> -            * StandbyMode that only happens if we have been triggered, so we
> -            * shouldn't loop anymore in that case.
> +            * If we get here for other than end-of-wal, emit the error
> +            * message right now. Otherwise the message if any is shown as a
> +            * part of the end-of-WAL message below.
>              */
> 
> For consistency, I think we can replace "end-of-wal" with
> "end-of-WAL". Please note that everywhere else in the comments you
> have used "end-of-WAL". So why not the same here?

Right.  Fixed.

> ==
> 
>                             ereport(LOG,
> -                                   (errmsg("replication terminated by primary server"),
> -                                    errdetail("End of WAL reached on timeline %u at %X/%X.",
> -                                              startpointTLI,
> -                                              LSN_FORMAT_ARGS(LogstreamResult.Write))));
> +                                   (errmsg("replication terminated by primary server on timeline %u at %X/%X.",
> +                                           startpointTLI,
> +                                           LSN_FORMAT_ARGS(LogstreamResult.Write))));
> 
> Is this change really required? I don't see any issue with the
> existing error message.

Without the change, we see two similar end-of-WAL messages, one from
walreceiver and one from startup.  (Please ignore the slight
difference in LSNs.)

[walreceiver] LOG:  replication terminated by primary server
[walreceiver] DETAIL:  End of WAL reached on timeline 1 at 0/B0000D8.
[startup] LOG:  reached end of WAL at 0/B000060 on timeline 1 in archive during standby mode
[startup] DETAIL:  empty record found at 0/B0000D8

But what the walreceiver detected at that time is not end-of-WAL but an
error on the streaming connection.  Since this patch makes the startup
process detect end-of-WAL, we don't need the duplicate and,
in a sense, false end-of-WAL message from walreceiver.

# By the way, I deliberately chose to report the LSN of the last
# successfully read record in the "reached end of WAL" message. On second
# thought, I came to think that it is better to report the failure LSN,
# so I changed it to do that.  In this case there is an ambiguity that
# depends on how we failed to read the record, but for now we have no
# choice other than blindly choosing one of the candidates.  I chose
# EndRecPtr since I think decode errors happen much more rarely than
# read errors.

[walreceiver] LOG:  replication terminated by primary server at 0/B014228 on timeline 1.
[startup] LOG:  reached end of WAL at 0/B014228 on timeline 1 in archive during standby mode
[startup] DETAIL:  empty record found at 0/B014228

This is the reason for the change.
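
The choice of the reported LSN described in the aside above can be sketched as follows. This is a simplified model: choose_err_rec_ptr is a hypothetical name, and the test on the aborted-record pointer is an assumption about the surrounding code in ReadRecord(); the patch hunk itself only shows the two assignments to ErrRecPtr.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* stand-in for the PostgreSQL typedef */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical helper: pick the LSN to show in the end-of-WAL message. */
static XLogRecPtr
choose_err_rec_ptr(XLogRecPtr aborted_rec_ptr, XLogRecPtr end_rec_ptr)
{
    XLogRecPtr  err_rec_ptr;

    if (aborted_rec_ptr != InvalidXLogRecPtr)
        err_rec_ptr = aborted_rec_ptr;  /* a continuation record was missing */
    else
        err_rec_ptr = end_rec_ptr;      /* the position we tried to read */

    /* model assumption: a failed read always has a position to report */
    assert(err_rec_ptr != InvalidXLogRecPtr);
    return err_rec_ptr;
}
```

With this rule the startup message and the walreceiver message in the example above can name the same position, 0/B014228.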


> Lastly, are we also planning to backport this patch?

This is apparently a behavioral change, not a bug fix, so I think it is
not appropriate for back-patching.


As a result, I made the following changes in version 11.

1. Changed the condition for the "end-of-WAL" message from
   emode_for_corrupt_record to the EndOfWAL flag.

2. Corrected the wording of end-of-wal to end-of-WAL.

3. In the "reached end of WAL" message, report the LSN of the
   beginning of the failed record instead of the beginning of the
   last-succeeded record.

4. In the changed message in walreceiver.c, I swapped LSN and timeline
   so that they are in the same order as in other similar messages.
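
For readers who want to see what sets the EndOfWAL flag from item 1: the patch treats a zero-length record as end-of-WAL only if the rest of its page is zero as well. Below is a minimal standalone sketch of that check, assuming the default 8 kB WAL block size; page_offset and page_tail_is_zero are illustrative names, not functions from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192        /* assumed default WAL block size */

/* Offset of an LSN within its WAL page (block size is a power of two). */
static size_t
page_offset(uint64_t rec_ptr)
{
    return (size_t) (rec_ptr & (XLOG_BLCKSZ - 1));
}

/*
 * Return true if every byte from `offset` to the end of the page is zero.
 * `page` must point to a full XLOG_BLCKSZ-byte buffer.
 */
static bool
page_tail_is_zero(const unsigned char *page, size_t offset)
{
    size_t      i;

    assert(offset < XLOG_BLCKSZ);

    /* the loop bound is checked before each read, so we stay in the page */
    for (i = offset; i < XLOG_BLCKSZ; i++)
    {
        if (page[i] != 0)
            return false;
    }
    return true;
}
```

A caller would pass the page buffer holding the record and page_offset(RecPtr); a non-zero byte anywhere in the tail means the failure is reported as an invalid record rather than as end-of-WAL.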

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From e07c1501cd0020f2a817dd9544c4aa5063e29685 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v11] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message like
"invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary: "reached end of WAL". In exchange, raise
the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  93 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  78 +++++++++++++++
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 110 +++++++++++++++++++++-
 6 files changed, 271 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 958220c495..618f33d342 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * EndRecPtr is the LSN we tried to read but failed. In the
+                 * case of a decoding error it points to the end of the
+                 * failed record, but for now we have no means of knowing
+                 * whether EndRecPtr points to the beginning or to the end
+                 * of the failed record.
+                 */
+                ErrRecPtr = xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -4504,13 +4517,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * If we get here for other than end-of-WAL, emit the error
+             * message right now. Otherwise the message if any is shown as a
+             * part of the end-of-WAL message below.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -4541,11 +4553,12 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4571,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4610,12 +4629,33 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             * If we haven't emitted an error message, we have safely reached the
+             * end-of-WAL.
+             */
+            if (xlogreader->EndOfWAL)
+            {
+                char       *fmt;
+
+                if (StandbyMode)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
+                else if (InArchiveRecovery)
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
+                else
+                    fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
+
+                ereport(LOG,
+                        (errmsg(fmt, LSN_FORMAT_ARGS(ErrRecPtr), replayTLI,
+                                xlogSourceNames[currentSource]),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -7294,7 +7334,7 @@ StartupXLOG(void)
         {
             ereport(LOG,
                     (errmsg("database system was not properly shut down; "
-                            "automatic recovery in progress")));
+                            "crash recovery in progress")));
             if (recoveryTargetTLI > ControlFile->checkPointCopy.ThisTimeLineID)
                 ereport(LOG,
                         (errmsg("crash recovery starts in timeline %u "
@@ -7544,7 +7584,7 @@ StartupXLOG(void)
         else
         {
             /* just have to read next record after CheckPoint */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         }
 
         if (record != NULL)
@@ -7782,7 +7822,7 @@ StartupXLOG(void)
                 }
 
                 /* Else, try to fetch the next WAL record */
-                record = ReadRecord(xlogreader, LOG, false, replayTLI);
+                record = ReadRecord(xlogreader, WARNING, false, replayTLI);
             } while (record != NULL);
 
             /*
@@ -7842,13 +7882,20 @@ StartupXLOG(void)
 
             InRedo = false;
         }
-        else
+        else if (xlogreader->EndOfWAL)
         {
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
 
         }
+        else
+        {
+            /* broken record found */
+            ereport(WARNING,
+                    (errmsg("redo is skipped"),
+                     errhint("This suggests WAL file corruption. You might need to check the database.")));
+        }
 
         /*
          * This check is intentionally after the above log messages that
@@ -13097,7 +13144,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..03a8b42f15 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+    if (record->xl_tot_len == 0)
+    {
+        /*
+         * We have almost certainly reached the end of WAL; make sure that
+         * the rest of the page after the record is filled with zeroes.
+         */
+        char       *p;
+        char       *pe;
+
+        /* scan from the beginning of the record to the end of block */
+        p = (char *) record;
+        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (p < pe && *p == 0)
+            p++;
+
+        if (p == pe)
+        {
+            /*
+             * The page after the record is completely zeroed. That suggests
+             * we don't have a record after this point. We don't bother
+             * checking the pages after since they are not zeroed in the case
+             * of recycled segments.
+             */
+            report_invalid_record(state, "empty record found at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-WAL to callers */
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid record length */
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
@@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
     XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+    StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+    if (hdr->xlp_magic == 0)
+    {
+        /* Regard an empty page as End-Of-WAL */
+        int            i;
+
+        for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+        if (i == XLOG_BLCKSZ)
+        {
+            char        fname[MAXFNAMELEN];
+
+            XLogFileName(fname, state->seg.ws_tli, segno,
+                         state->segcxt.ws_segsize);
+
+            report_invalid_record(state,
+                                  "empty page in log segment %s, offset %u",
+                                  fname,
+                                  offset);
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid magic number */
+    }
     if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
     {
         char        fname[MAXFNAMELEN];
@@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
                               LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                               fname,
                               offset);
+
+        /*
+         * If the page address is less than expected we assume it is an unused
+         * page in a recycled segment.
+         */
+        if (hdr->xlp_pageaddr < recaddr)
+            state->EndOfWAL = true;
+
         return false;
     }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..1a7a692bc0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,10 +471,9 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                               startpointTLI,
-                                               LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                    (errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+                                            LSN_FORMAT_ARGS(LogstreamResult.Write),
+                                            startpointTLI)));
                             endofwal = true;
                             break;
                         }
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
         exit(0);
 
     if (errormsg)
-        fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-                    errormsg);
+    {
+        if (xlogreader_state->EndOfWAL)
+            pg_log_info("end of WAL at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+        else
+            fatal_error("error in WAL record at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+    }
 
     XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..3eeba220a1 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* the last attempt was EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..67d264df26 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, $reached_eow_pat, $logstart));
+    sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,101 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+       qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+                 SELECT pg_switch_wal();
+                 CHECKPOINT;
+                 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+    $lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+    "LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+                         $logstart));
+    sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# and the end-of-wal messages shouldn't be seen
+# the same message has been confirmed in the past
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0


Re: Make mesage at end-of-recovery less scary.

From
Ashutosh Sharma
Date:
On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Hi, Ashutosh.
>
> At Tue, 8 Feb 2022 18:35:34 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > Here are some of my review comments on the v11 patch:
>
> Thank you for taking a look at this.
>
> > -                       (errmsg_internal("reached end of WAL in
> > pg_wal, entering archive recovery")));
> > +                       (errmsg_internal("reached end of WAL at %X/%X
> > on timeline %u in %s during crash recovery, entering archive
> > recovery",
> > +                                        LSN_FORMAT_ARGS(ErrRecPtr),
> > +                                        replayTLI,
> > +                                        xlogSourceNames[currentSource])));
> >
> > Why crash recovery? Won't this message get printed even during PITR?
>
> It is in the if-block with the following condition.
>
> >        * If archive recovery was requested, but we were still doing
> >        * crash recovery, switch to archive recovery and retry using the
> >        * offline archive. We have now replayed all the valid WAL in
> >        * pg_wal, so we are presumably now consistent.
> ...
> >    if (!InArchiveRecovery && ArchiveRecoveryRequested)
>
> This means archive recovery has been requested but not started yet. That is,
> we've just finished crash recovery.  The existing comment quoted
> above says exactly that.
>
> At the end of PITR (or archive recovery), the other code path applies.
>

That is understandable. My point is that the message we emit says we
have just finished reading the WAL files in the pg_wal directory
during crash recovery and are now entering archive recovery, when we
are actually doing point-in-time recovery, which seems a bit
misleading.

> > /*
> >  * If we haven't emit an error message, we have safely reached the
> >  * end-of-WAL.
> >  */
> > if (emode_for_corrupt_record(LOG, ErrRecPtr) == LOG)
> > {
> >       char       *fmt;
> >
> >       if (StandbyMode)
> >               fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during standby mode");
> >       else if (InArchiveRecovery)
> >               fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during archive recovery");
> >       else
> >               fmt = gettext_noop("reached end of WAL at %X/%X on timeline %u in %s during crash recovery");
>
> The last of the above messages is chosen when archive recovery is
> not requested at all.
>
> > I just did a PITR and could see these messages in the logfile.
>
> Yeah, the log lines say that the server starts with crash
> recovery in order to run PITR.
>
> > 2022-02-08 18:00:44.367 IST [86185] LOG:  starting point-in-time
> > recovery to WAL location (LSN) "0/5227790"
> > 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not
> > properly shut down; automatic recovery in progress
>
> Well, I guess that "automatic recovery" is ambiguous.  Would it
> make sense if the second line read as follows instead?
>
> + 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not properly shut down; crash recovery in progress
>

Well, to me the current message looks fine.

> > Lastly, are we also planning to backport this patch?
>
> This is apparently a behavioral change, not a bug fix, which I think we
> regard as not appropriate for back-patching.
>
>
> As a result, I made the following changes in version 11.
>
> 1. Changed the condition for the "end-of-WAL" message from
>    emode_for_corrupt_record to the EndOfWAL flag.
>
> 2. Corrected the wording of end-of-wal to end-of-WAL.
>
> 3. In the "reached end of WAL" message, report the LSN of the
>   beginning of the failed record instead of the beginning of the
>   last successfully read record.
>
> 4. In the changed message in walreceiver.c, I swapped LSN and timeline
>   so that they are in the same order as in other similar messages.
>

Thanks for sharing this information.

==

Here is one more comment:

+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+my $chkptfile;
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;

$chkptfile is declared twice in the same scope. We can probably remove
the first one.

--
With Regards,
Ashutosh Sharma.



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Wed, 9 Feb 2022 17:31:02 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in 
> On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > This means archive recovery has been requested but not started yet. That is,
> > we've just finished crash recovery.  The existing comment quoted
> > above says exactly that.
> >
> > At the end of PITR (or archive recovery), the other code path applies.
> >
> 
> That is understandable. My point is that the message we emit says we
> have just finished reading the WAL files in the pg_wal directory
> during crash recovery and are now entering archive recovery, when we
> are actually doing point-in-time recovery, which seems a bit
> misleading.

Here are the messages.

> 2022-02-08 18:00:44.367 IST [86185] LOG:  starting point-in-time
> recovery to WAL location (LSN) "0/5227790"
> 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not
> properly shut down; automatic recovery in progress
> 2022-02-08 18:00:44.369 IST [86185] LOG:  redo starts at 0/14DC8D8
> 2022-02-08 18:00:44.978 IST [86185] DEBUG1:  reached end of WAL at
> 0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering
> archive recovery

To begin with, the last DEBUG1 message is not something I introduced,
though one of the messages added by this patch says the same thing.
Is your point that archive recovery is a different thing from PITR?
As for the difference, I think PITR is a form of archive recovery.

That said, after some thought, I changed my mind: we don't need to
say what operation was being performed at end-of-WAL.  So in the
attached version the end-of-WAL message no longer mentions the kind
of recovery.

> LOG:  reached end of WAL at 0/3000000 on timeline 1

I removed the WAL-source part along with the operation mode, because
it made the message untranslatable.  The message is now very simple,
but that seems sufficient.
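For readers unfamiliar with the %X/%X notation, here is a minimal standalone sketch of how the simplified message is put together. It assumes the usual convention of printing an LSN as its high and low 32-bit halves in hexadecimal; the helper name format_end_of_wal is hypothetical and is not part of the patch, which uses ereport() and LSN_FORMAT_ARGS instead.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of the patch: build the simplified
 * end-of-WAL message from a 64-bit LSN and a timeline ID.
 * Returns the number of characters snprintf() wanted to write.
 */
static int
format_end_of_wal(char *buf, size_t buflen, uint64_t lsn, uint32_t tli)
{
    assert(buf != NULL && buflen > 0);

    /* The two %X fields are the high and low 32-bit halves of the LSN. */
    return snprintf(buf, buflen,
                    "reached end of WAL at %" PRIX32 "/%" PRIX32
                    " on timeline %" PRIu32,
                    (uint32_t) (lsn >> 32), (uint32_t) lsn, tli);
}
```

With lsn = 0x3000000 and tli = 1 this produces the same text as the LOG line quoted above, without the "LOG:" prefix.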

While working on this, I noticed that we need to set EndOfWAL when
WaitForWALToBecomeAvailable returns failure.  That means the
file does not exist at all, so it is a kind of end-of-WAL.  In that
sense the following existing comment in ReadRecord is a bit wrong.

>     * We only end up here without a message when XLogPageRead()
>     * failed - in that case we already logged something. In
>     * StandbyMode that only happens if we have been triggered, so we
>     * shouldn't loop anymore in that case.

Actually, there is a case where we get there without a message and
without having logged anything: when a segment file is not found and
we are not in standby mode.
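To make that case concrete, here is a toy model of the decision, not the actual ReadRecord code. ToyReader, ReadOutcome, toy_segment_missing and toy_classify are all hypothetical names; only the EndOfWAL flag and the error message string correspond to things in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the reader state. */
typedef struct ToyReader
{
    bool        EndOfWAL;   /* did the last read attempt hit end-of-WAL? */
    const char *errormsg;   /* message from the reader, or NULL */
} ToyReader;

typedef enum ReadOutcome
{
    OUTCOME_END_OF_WAL,     /* report "reached end of WAL", detail if any */
    OUTCOME_ERROR_REPORTED, /* report the reader's message as it is */
    OUTCOME_SILENT          /* neither a message nor the flag */
} ReadOutcome;

/* Model of a page read that fails because the segment file is missing. */
static void
toy_segment_missing(ToyReader *reader)
{
    assert(reader != NULL);
    reader->errormsg = NULL;    /* the reader itself produced no message */
    reader->EndOfWAL = true;    /* a missing file counts as end-of-WAL */
}

/* Decide what the caller should report for a failed read. */
static ReadOutcome
toy_classify(const ToyReader *reader)
{
    assert(reader != NULL);
    if (reader->EndOfWAL)
        return OUTCOME_END_OF_WAL;
    if (reader->errormsg != NULL)
        return OUTCOME_ERROR_REPORTED;
    return OUTCOME_SILENT;
}
```

Without setting the flag in toy_segment_missing, the missing-segment case would fall into OUTCOME_SILENT, which is the situation the old comment overlooked.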

> > Well, I guess that "automatic recovery" is ambiguous.  Would it
> > make sense if the second line read as follows instead?
> >
> > + 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not properly shut down; crash recovery in progress
> >
> 
> Well, to me the current message looks fine.

Good to hear. (In the previous version I modified the message by accident..)

> $chkptfile is declared twice in the same scope. We can probably remove
> the first one.

Ugh.. Fixed.  (I wonder why Perl doesn't complain about this..)


In version 12 I made the following changes.

- Rewrote (partly reverted) a comment in ReadRecord.

- Simplified the "reached end of WAL" message by removing the recovery
  mode and WAL source in ReadRecord.

- XLogPageRead sets the EndOfWAL flag in the ENOENT case.

- Removed the redundant declaration of the same variable in the TAP script.
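For anyone reading the xlogreader.c part of the patch below, the end-of-WAL test for a zero-length record boils down to checking that the page is zero-filled from the start of the record to the end of the block. Here is a minimal standalone sketch of that check. TOY_BLCKSZ and rest_of_page_is_zero are hypothetical names, the page size of 8192 is an assumption, and the sketch takes a pointer to the start of the page rather than to the record. It also tests the bound before dereferencing, so it never reads past the page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLCKSZ 8192     /* assumed page size; must be a power of two */

/*
 * Return true if every byte from the record start to the end of its page
 * is zero.  "page" points at the start of the TOY_BLCKSZ-byte page that
 * holds the record; "rec_offset" is the record's position in the WAL
 * stream, of which only the within-page part is used.
 */
static bool
rest_of_page_is_zero(const char *page, size_t rec_offset)
{
    const char *p;
    const char *pe;

    assert(page != NULL);

    p = page + (rec_offset & (TOY_BLCKSZ - 1));
    pe = page + TOY_BLCKSZ;

    /* Check the bound first so that *p is never read past the page. */
    while (p < pe && *p == 0)
        p++;

    return p == pe;
}
```

A true result corresponds to the "empty record" case that sets EndOfWAL; a false result falls through to the ordinary "invalid record length" error.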

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From e553164dbca709389d92b05cf8ae7a8b427e83a6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v12] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message like
"invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary as "reached end of WAL". In return, raise
the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  92 +++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  78 ++++++++++++++++
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 108 +++++++++++++++++++++-
 6 files changed, 268 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 958220c495..bf1d40e7cb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * EndRecPtr is the LSN we tried to read but failed. In the
+                 * case of decoding error, it is at the end of the failed
+                 * record but we don't have a means for now to know EndRecPtr
+                 * is pointing to which of the beginning or ending of the
+                 * failed record.
+                 */
+                ErrRecPtr = xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -4504,13 +4517,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * We only end up here without a message when XLogPageRead() failed
+             * in that case we already logged something, or just met end-of-WAL
+             * conditions. In StandbyMode that only happens if we have been
+             * triggered, so we shouldn't loop anymore in that case. When
+             * EndOfWAL is true, we don't emit that error if any immediately
+             * and instead will show it as a part of a decent end-of-wal
+             * message later.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -4541,11 +4557,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4577,17 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in %s during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI,
+                                         xlogSourceNames[currentSource])));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4610,12 +4635,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             * recovery ended.
+             *
+             * Emit a decent message if we met end-of-WAL. Otherwise we should
+             * have already emitted an error message.
+             */
+            if (xlogreader->EndOfWAL)
+                ereport(LOG,
+                        (errmsg("reached end of WAL at %X/%X on timeline %u",
+                                LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -7544,7 +7581,7 @@ StartupXLOG(void)
         else
         {
             /* just have to read next record after CheckPoint */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         }
 
         if (record != NULL)
@@ -7782,7 +7819,7 @@ StartupXLOG(void)
                 }
 
                 /* Else, try to fetch the next WAL record */
-                record = ReadRecord(xlogreader, LOG, false, replayTLI);
+                record = ReadRecord(xlogreader, WARNING, false, replayTLI);
             } while (record != NULL);
 
             /*
@@ -7842,13 +7879,20 @@ StartupXLOG(void)
 
             InRedo = false;
         }
-        else
+        else if (xlogreader->EndOfWAL)
         {
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
 
         }
+        else
+        {
+            /* broken record found */
+            ereport(WARNING,
+                    (errmsg("redo is skipped"),
+                     errhint("This suggests WAL file corruption. You might need to check the database.")));
+        }
 
         /*
          * This check is intentionally after the above log messages that
@@ -12434,12 +12478,14 @@ retry:
                                          private->replayTLI,
                                          xlogreader->EndRecPtr))
         {
+            Assert(!StandbyMode);
+
             if (readFile >= 0)
                 close(readFile);
             readFile = -1;
             readLen = 0;
             readSource = XLOG_FROM_ANY;
-
+            xlogreader->EndOfWAL = true;
             return -1;
         }
     }
@@ -13097,7 +13143,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..22982c4de7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+    if (record->xl_tot_len == 0)
+    {
+        /*
+         * We are almost sure reaching the end of WAL, make sure that the
+         * whole page after the record is filled with zeroes.
+         */
+        char       *p;
+        char       *pe;
+
+        /* scan from the beginning of the record to the end of block */
+        p = (char *) record;
+        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (p < pe && *p == 0)
+            p++;
+
+        if (p == pe)
+        {
+            /*
+             * The page after the record is completely zeroed. That suggests
+             * we don't have a record after this point. We don't bother
+             * checking the pages after since they are not zeroed in the case
+             * of recycled segments.
+             */
+            report_invalid_record(state, "empty record at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid record length */
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
@@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
     XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+    StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+    if (hdr->xlp_magic == 0)
+    {
+        /* Regard an empty page as End-Of-WAL */
+        int            i;
+
+        for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+        if (i == XLOG_BLCKSZ)
+        {
+            char        fname[MAXFNAMELEN];
+
+            XLogFileName(fname, state->seg.ws_tli, segno,
+                         state->segcxt.ws_segsize);
+
+            report_invalid_record(state,
+                                  "empty page in log segment %s, offset %u",
+                                  fname,
+                                  offset);
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid magic number */
+    }
     if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
     {
         char        fname[MAXFNAMELEN];
@@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
                               LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                               fname,
                               offset);
+
+        /*
+         * If the page address is less than expected we assume it is an unused
+         * page in a recycled segment.
+         */
+        if (hdr->xlp_pageaddr < recaddr)
+            state->EndOfWAL = true;
+
         return false;
     }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..1a7a692bc0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,10 +471,9 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                               startpointTLI,
-                                               LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                    (errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+                                            LSN_FORMAT_ARGS(LogstreamResult.Write),
+                                            startpointTLI)));
                             endofwal = true;
                             break;
                         }
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
         exit(0);
 
     if (errormsg)
-        fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-                    errormsg);
+    {
+        if (xlogreader_state->EndOfWAL)
+            pg_log_info("end of WAL at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+        else
+            fatal_error("error in WAL record at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+    }
 
     XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..7b314ef10e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* was the last attempt EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 3892aba3e5..1d7476c309 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,9 +10,11 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
-plan tests => 3;
+plan tests => 11;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -50,7 +52,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, $reached_eow_pat, $logstart));
+    sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -62,3 +72,99 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
+
+my $segsize = $node->safe_psql('postgres',
+       qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+                 SELECT pg_switch_wal();
+                 CHECKPOINT;
+                 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+    $lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+    "LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+                         $logstart));
+    sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0


Re: Make mesage at end-of-recovery less scary.

From
Ashutosh Sharma
Date:
Hi,

On Thu, Feb 10, 2022 at 11:47 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Wed, 9 Feb 2022 17:31:02 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > On Wed, Feb 9, 2022 at 1:14 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > This means archive-recovery is requested but not started yet. That is,
> > > we've just finished crash recovery.  The existing comment cited
> > > together is mentioning that.
> > >
> > > At the end of PITR (or archive recovery), the other code works.
> > >
> >
> > This is quite understandable, the point here is that the message that
> > we are emitting says, we have just finished reading the wal files in
> > the pg_wal directory during crash recovery and are now entering
> > archive recovery when we are actually doing point-in-time recovery
> > which seems a bit misleading.
>
> Here are the messages.
>
> > 2022-02-08 18:00:44.367 IST [86185] LOG:  starting point-in-time
> > recovery to WAL location (LSN) "0/5227790"
> > 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not
> > properly shut down; automatic recovery in progress
> > 2022-02-08 18:00:44.369 IST [86185] LOG:  redo starts at 0/14DC8D8
> > 2022-02-08 18:00:44.978 IST [86185] DEBUG1:  reached end of WAL at
> > 0/3FFFFD0 on timeline 1 in pg_wal during crash recovery, entering
> > archive recovery
>
> In the first place the last DEBUG1 is not on my part, but one of the
> messages added by this patch says the same thing.  Is your point that
> archive recovery is different thing from PITR?  In regard to the
> difference, I think PITR is a form of archive recovery.
>

No, I wasn't trying to compare archive recovery with PITR or vice versa;
I was trying to compare crash recovery with PITR. The message you emit
just before entering archive recovery is "reached end-of-WAL on ... in
pg_wal *during crash recovery*, entering archive recovery".  This message
is static and can be emitted not only during crash recovery, but also
during PITR. I think we can remove the "during crash recovery" part from
this message, giving "reached the end of WAL at %X/%X on timeline %u in
%s, entering archive recovery". Also, I don't think we need the format
specifier %s here; it can be hard-coded as pg_wal, since in this case we
can only enter archive recovery after reading WAL from pg_wal, so the
current WAL source has to be pg_wal, doesn't it?
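
To make the suggestion concrete, here is a minimal standalone sketch
(plain C, not PostgreSQL code) of the message with pg_wal hard-coded.
The function name format_end_of_wal and the use of snprintf are
assumptions for illustration only; in the server the text would go
through ereport(), and the two halves of the LSN would come from
LSN_FORMAT_ARGS().

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, for illustration only: build the proposed
 * message with the WAL source hard-coded as pg_wal.
 *
 * An LSN is a 64-bit position printed as "%X/%X": the high 32 bits,
 * a slash, then the low 32 bits.
 */
int
format_end_of_wal(char *buf, size_t buflen, uint64_t lsn, uint32_t tli)
{
    return snprintf(buf, buflen,
                    "reached end of WAL at %" PRIX32 "/%" PRIX32
                    " on timeline %" PRIu32
                    " in pg_wal, entering archive recovery",
                    (uint32_t) (lsn >> 32),   /* high half of the LSN */
                    (uint32_t) lsn,           /* low half of the LSN */
                    tli);
}
```

For the LSN from the log above, format_end_of_wal(buf, sizeof(buf),
0x3FFFFD0, 1) yields a message containing "0/3FFFFD0 on timeline 1".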

> That being said, after some thought, I changed my mind: we don't
> need to say what operation was being performed at the end-of-WAL.
> So in the attached patch the end-of-WAL message is no longer
> accompanied by the kind of recovery.
>
> > LOG:  reached end of WAL at 0/3000000 on timeline 1
>
> I removed the archive-source part along with the operation mode,
> because it makes the message untranslatable.  It is now very simple but
> seems enough.
>
> While working on this, I noticed that we need to set EndOfWAL when
> WaitForWALToBecomeAvailable returned with failure.  That means the
> file does not exist at all so it is a kind of end-of-WAL.  In that
> sense the following existing comment in ReadRecord is a bit wrong.
>
> >        * We only end up here without a message when XLogPageRead()
> >        * failed - in that case we already logged something. In
> >        * StandbyMode that only happens if we have been triggered, so we
> >        * shouldn't loop anymore in that case.
>
> Actually there is a case where we get there without a message and
> without having logged anything: when a segment file is not found and
> we are not in standby mode.
>
> > > Well. I guess that the "automatic recovery" is ambiguous.  Does it
> > > make sense if the second line were like the follows instead?
> > >
> > > + 2022-02-08 18:00:44.368 IST [86185] LOG:  database system was not properly shut down; crash recovery in progress
> > >
> >
> > Well, according to me the current message looks fine.
>
> Good to hear. (In the previous version I modified the message by accident.)
>
> > $chkptfile is declared twice in the same scope. We can probably remove
> > the first one.
>
> Ugh.. Fixed.  (I wonder why Perl doesn't complain on this..)
>
>
> In this version 12 I made the following changes.
>
> - Rewrote (partly reverted) a comment in ReadRecord
>
> - Simplified the "reached end of WAL" message by removing recovery
>   mode and WAL source in ReadRecord.
>
> - XLogPageRead sets EndOfWAL flag in the ENOENT case.
>
> - Removed redundant declaration of the same variable in TAP script.
>

Thanks for the changes. Please note that I am not able to apply the
latest patch on HEAD. Could you please rebase it on HEAD and share the
new version. Thank you.

--
With Regards,
Ashutosh Sharma.



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Mon, 14 Feb 2022 20:14:11 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in 
> No, I haven't tried to compare archive recovery to PITR or vice versa,
> instead I was trying to compare crash recovery with PITR. The message
> you're emitting says just before entering into the archive recovery is
> - "reached end-of-WAL on ... in pg_wal *during crash recovery*,
> entering archive recovery".  This message is static and can be emitted
> not only during crash recovery, but also during PITR. I think we can

No. It is emitted *only* after crash recovery, before archive recovery
starts.  The other message this patch adds can be emitted after PITR
or archive recovery.

> not only during crash recovery, but also during PITR. I think we can
> remove the "during crash recovery" part from this message, so "reached
> the end of WAL at %X/%X on timeline %u in %s, entering archive

What makes you think it can be emitted after anything other than crash
recovery?  (Please look at the code comment just above.)
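
For reference, the condition guarding that DEBUG1 message can be
sketched as a standalone predicate. The function and parameter names
below are made up for illustration; they stand in for the
InArchiveRecovery, ArchiveRecoveryRequested and fetching_ckpt variables
tested in ReadRecord.

```c
#include <stdbool.h>

/*
 * Hypothetical mirror of the check in ReadRecord (illustration only).
 *
 * The transition message is emitted only when archive recovery was
 * requested but has not started yet, i.e. at the end of crash
 * recovery, and never while fetching the checkpoint record.
 */
bool
switches_to_archive_recovery(bool in_archive_recovery,
                             bool archive_recovery_requested,
                             bool fetching_ckpt)
{
    return !fetching_ckpt &&
           !in_archive_recovery &&
           archive_recovery_requested;
}
```

Once InArchiveRecovery has been set, the predicate is false, so the
message cannot appear at the end of PITR or archive recovery.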

> recovery". Also I don't think we need format specifier %s here, it can
> be hard-coded with pg_wal as in this case we can only enter archive
> recovery after reading wal from pg_wal, so current WAL source has to
> be pg_wal, isn't it?

You're right that it can't be other than pg_wal.  It was changed just
to match another message this patch adds, and it would be a
matter of taste.  I replaced it with "pg_wal" in this version.

> Thanks for the changes. Please note that I am not able to apply the
> latest patch on HEAD. Could you please rebase it on HEAD and share the
> new version. Thank you.

A change to the TAP script caused the conflict.  The attached v13 is:

- Rebased.

- Replaced "%s" with "pg_wal" in the debug message for the transition
  from crash recovery to archive recovery.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 311e862e87dbdeb6348c6fc17063308342359c02 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v13] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlog.c         |  91 ++++++++++++++-----
 src/backend/access/transam/xlogreader.c   |  78 ++++++++++++++++
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++++++++
 6 files changed, 266 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 958220c495..bb7026ac77 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4480,6 +4480,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -4495,6 +4496,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * EndRecPtr is the LSN we tried to read but failed. In the
+                 * case of decoding error, it is at the end of the failed
+                 * record but we don't have a means for now to know EndRecPtr
+                 * is pointing to which of the beginning or ending of the
+                 * failed record.
+                 */
+                ErrRecPtr = xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -4504,13 +4517,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * We only end up here without a message when XLogPageRead() failed
+             * in that case we already logged something, or just met end-of-WAL
+             * conditions. In StandbyMode that only happens if we have been
+             * triggered, so we shouldn't loop anymore in that case. When
+             * EndOfWAL is true, we don't emit that error if any immediately
+             * and instead will show it as a part of a decent end-of-wal
+             * message later.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -4541,11 +4557,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -4558,11 +4577,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI)));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -4610,12 +4634,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             * recovery ended.
+             *
+             * Emit a decent message if we met end-of-WAL. Otherwise we should
+             * have already emitted an error message.
+             */
+            if (xlogreader->EndOfWAL)
+                ereport(LOG,
+                        (errmsg("reached end of WAL at %X/%X on timeline %u",
+                                LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -7544,7 +7580,7 @@ StartupXLOG(void)
         else
         {
             /* just have to read next record after CheckPoint */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         }
 
         if (record != NULL)
@@ -7782,7 +7818,7 @@ StartupXLOG(void)
                 }
 
                 /* Else, try to fetch the next WAL record */
-                record = ReadRecord(xlogreader, LOG, false, replayTLI);
+                record = ReadRecord(xlogreader, WARNING, false, replayTLI);
             } while (record != NULL);
 
             /*
@@ -7842,13 +7878,20 @@ StartupXLOG(void)
 
             InRedo = false;
         }
-        else
+        else if (xlogreader->EndOfWAL)
         {
             /* there are no WAL records following the checkpoint */
             ereport(LOG,
                     (errmsg("redo is not required")));
 
         }
+        else
+        {
+            /* broken record found */
+            ereport(WARNING,
+                    (errmsg("redo is skipped"),
+                     errhint("This suggests WAL file corruption. You might need to check the database.")));
+        }
 
         /*
          * This check is intentionally after the above log messages that
@@ -12434,12 +12477,14 @@ retry:
                                          private->replayTLI,
                                          xlogreader->EndRecPtr))
         {
+            Assert(!StandbyMode);
+
             if (readFile >= 0)
                 close(readFile);
             readFile = -1;
             readLen = 0;
             readSource = XLOG_FROM_ANY;
-
+            xlogreader->EndOfWAL = true;
             return -1;
         }
     }
@@ -13097,7 +13142,7 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..22982c4de7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+    if (record->xl_tot_len == 0)
+    {
+        /*
+         * We are almost sure reaching the end of WAL, make sure that the
+         * whole page after the record is filled with zeroes.
+         */
+        char       *p;
+        char       *pe;
+
+        /* scan from the beginning of the record to the end of block */
+        p = (char *) record;
+        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (*p == 0 && p < pe)
+            p++;
+
+        if (p == pe)
+        {
+            /*
+             * The page after the record is completely zeroed. That suggests
+             * we don't have a record after this point. We don't bother
+             * checking the pages after since they are not zeroed in the case
+             * of recycled segments.
+             */
+            report_invalid_record(state, "empty record at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid record length */
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
@@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
     XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+    StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+    if (hdr->xlp_magic == 0)
+    {
+        /* Regard an empty page as End-Of-WAL */
+        int            i;
+
+        for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+        if (i == XLOG_BLCKSZ)
+        {
+            char        fname[MAXFNAMELEN];
+
+            XLogFileName(fname, state->seg.ws_tli, segno,
+                         state->segcxt.ws_segsize);
+
+            report_invalid_record(state,
+                                  "empty page in log segment %s, offset %u",
+                                  fname,
+                                  offset);
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid magic number */
+    }
     if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
     {
         char        fname[MAXFNAMELEN];
@@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
                               LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                               fname,
                               offset);
+
+        /*
+         * If the page address is less than expected we assume it is an unused
+         * page in a recycled segment.
+         */
+        if (hdr->xlp_pageaddr < recaddr)
+            state->EndOfWAL = true;
+
         return false;
     }
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index b39fce8c23..1a7a692bc0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -471,10 +471,9 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                               startpointTLI,
-                                               LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                    (errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+                                            LSN_FORMAT_ARGS(LogstreamResult.Write),
+                                            startpointTLI)));
                             endofwal = true;
                             break;
                         }
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
         exit(0);
 
     if (errormsg)
-        fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-                    errormsg);
+    {
+        if (xlogreader_state->EndOfWAL)
+            pg_log_info("end of WAL at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+        else
+            fatal_error("error in WAL record at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+    }
 
     XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..7b314ef10e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* was the last attempt EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..01033334d6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, $reached_eow_pat, $logstart));
+    sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+       qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+                 SELECT pg_switch_wal();
+                 CHECKPOINT;
+                 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+    $lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+    "LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+                         $logstart));
+    sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0


Re: Make mesage at end-of-recovery less scary.

From
Ashutosh Sharma
Date:
On Tue, Feb 15, 2022 at 7:52 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Mon, 14 Feb 2022 20:14:11 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > No, I haven't tried to compare archive recovery to PITR or vice versa,
> > instead I was trying to compare crash recovery with PITR. The message
> > you're emitting says just before entering into the archive recovery is
> > - "reached end-of-WAL on ... in pg_wal *during crash recovery*,
> > entering archive recovery".  This message is static and can be emitted
> > not only during crash recovery, but also during PITR. I think we can
>
> No. It is emitted *only* after crash recovery before starting archive
> recovery.  Another message this patch adds can be emitted after PITR
> or archive recovery.
>
> > not only during crash recovery, but also during PITR. I think we can
> > remove the "during crash recovery" part from this message, so "reached
> > the end of WAL at %X/%X on timeline %u in %s, entering archive
>
> What makes you think it can be emitted after other than crash recovery?
> (Please look at the code comment just above.)
>

Yep, that's right. We won't get here in the case of PITR.

> > recovery". Also I don't think we need format specifier %s here, it can
> > be hard-coded with pg_wal as in this case we can only enter archive
> > recovery after reading wal from pg_wal, so current WAL source has to
> > be pg_wal, isn't it?
>
> You're right that it can't be other than pg_wal.  It was changed just
> to match another message this patch adds, and it would be a
> matter of taste.  I replaced it with "pg_wal" in this version.
>

OK. I have verified the changes.

> > Thanks for the changes. Please note that I am not able to apply the
> > latest patch on HEAD. Could you please rebase it on HEAD and share the
> > new version. Thank you.
>
> A change on TAP script hit this.  The v13 attached is:
>

OK. The v13 patch looks good. I have marked it as ready to commit.
Thank you for working on all my review comments.

--
With Regards,
Ashutosh Sharma.



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Tue, 15 Feb 2022 20:17:20 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in 
> OK. The v13 patch looks good. I have marked it as ready to commit.
> Thank you for working on all my review comments.

Thanks! But the recent xlog.c refactoring conflicts with this patch,
and I found a silly bug while rebasing.

xlog.c:12463 / xlogrecovery.c:3168
        if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
..
{
+            Assert(!StandbyMode);
...
+            xlogreader->EndOfWAL = true;

Yeah, I forgot about promotion there.  What I should have done is
set EndOfWAL according to StandbyMode:

+            Assert(!StandbyMode || CheckForStandbyTrigger());
...
+            /* promotion exit is not end-of-WAL */
+            xlogreader->EndOfWAL = !StandbyMode;
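
The intended rule can be sketched as a small standalone function. The
name and the boolean parameters are hypothetical stand-ins for
StandbyMode and the result of CheckForStandbyTrigger(); this is an
illustration of the logic, not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the branch taken when
 * WaitForWALToBecomeAvailable() fails (illustration only).
 *
 * Returns whether the failure should be treated as end-of-WAL.
 */
bool
end_of_wal_after_wait_failure(bool standby_mode, bool promotion_triggered)
{
    /* In standby mode the wait fails only when promotion was triggered. */
    assert(!standby_mode || promotion_triggered);

    /* A promotion exit is not end-of-WAL; any other failure here is. */
    return !standby_mode;
}
```

With this rule, a missing segment outside standby mode is reported as
end-of-WAL, while an exit caused by promotion is not.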

The rebased v14 is attached.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 5613ee80a4d2a9786f5ce8421dcbb560b63a13c1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v14] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   |  78 ++++++++++++++++
 src/backend/access/transam/xlogrecovery.c |  92 ++++++++++++++-----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++++++++
 6 files changed, 268 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..22982c4de7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -121,6 +121,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +293,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -588,6 +590,15 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -730,6 +741,40 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
+    if (record->xl_tot_len == 0)
+    {
+        /*
+         * We are almost sure reaching the end of WAL, make sure that the
+         * whole page after the record is filled with zeroes.
+         */
+        char       *p;
+        char       *pe;
+
+        /* scan from the beginning of the record to the end of block */
+        p = (char *) record;
+        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (*p == 0 && p < pe)
+            p++;
+
+        if (p == pe)
+        {
+            /*
+             * The page after the record is completely zeroed. That suggests
+             * we don't have a record after this point. We don't bother
+             * checking the pages after since they are not zeroed in the case
+             * of recycled segments.
+             */
+            report_invalid_record(state, "empty record at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid record length */
+    }
     if (record->xl_tot_len < SizeOfXLogRecord)
     {
         report_invalid_record(state,
@@ -836,6 +881,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
     XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+    StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+    if (hdr->xlp_magic == 0)
+    {
+        /* Regard an empty page as End-Of-WAL */
+        int            i;
+
+        for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+        if (i == XLOG_BLCKSZ)
+        {
+            char        fname[MAXFNAMELEN];
+
+            XLogFileName(fname, state->seg.ws_tli, segno,
+                         state->segcxt.ws_segsize);
+
+            report_invalid_record(state,
+                                  "empty page in log segment %s, offset %u",
+                                  fname,
+                                  offset);
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid magic number */
+    }
     if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
     {
         char        fname[MAXFNAMELEN];
@@ -921,6 +991,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
                               LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                               fname,
                               offset);
+
+        /*
+         * If the page address is less than expected we assume it is an unused
+         * page in a recycled segment.
+         */
+        if (hdr->xlp_pageaddr < recaddr)
+            state->EndOfWAL = true;
+
         return false;
     }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..750056acaf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1592,7 +1592,7 @@ PerformWalRecovery(void)
         /* just have to read next record after CheckPoint */
         Assert(xlogreader->ReadRecPtr == CheckPointLoc);
         replayTLI = CheckPointTLI;
-        record = ReadRecord(xlogreader, LOG, false, replayTLI);
+        record = ReadRecord(xlogreader, WARNING, false, replayTLI);
     }
 
     if (record != NULL)
@@ -1706,7 +1706,7 @@ PerformWalRecovery(void)
             }
 
             /* Else, try to fetch the next WAL record */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         } while (record != NULL);
 
         /*
@@ -1765,13 +1765,20 @@ PerformWalRecovery(void)
 
         InRedo = false;
     }
-    else
+    else if (xlogreader->EndOfWAL)
     {
         /* there are no WAL records following the checkpoint */
         ereport(LOG,
                 (errmsg("redo is not required")));
 
     }
+    else
+    {
+        /* broken record found */
+        ereport(WARNING,
+                (errmsg("redo is skipped"),
+                 errhint("This suggests WAL file corruption. You might need to check the database.")));
+    }
 
     /*
      * This check is intentionally after the above log messages that indicate
@@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * EndRecPtr is the LSN we tried to read but failed. In the
+                 * case of decoding error, it is at the end of the failed
+                 * record but we don't have a means for now to know EndRecPtr
+                 * is pointing to which of the beginning or ending of the
+                 * failed record.
+                 */
+                ErrRecPtr = xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * We only end up here without a message when XLogPageRead() failed
+             * in that case we already logged something, or just met end-of-WAL
+             * conditions. In StandbyMode that only happens if we have been
+             * triggered, so we shouldn't loop anymore in that case. When
+             * EndOfWAL is true, we don't emit that error if any immediately
+             * and instead will show it as a part of a decent end-of-wal
+             * message later.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI)));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             * recovery ended.
+             *
+             * Emit a decent message if we met end-of-WAL. Otherwise we should
+             * have already emitted an error message.
+             */
+            if (xlogreader->EndOfWAL)
+                ereport(LOG,
+                        (errmsg("reached end of WAL at %X/%X on timeline %u",
+                                LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -3129,12 +3172,16 @@ retry:
                                          private->replayTLI,
                                          xlogreader->EndRecPtr))
         {
+            Assert(!StandbyMode || CheckForStandbyTrigger());
+
             if (readFile >= 0)
                 close(readFile);
             readFile = -1;
             readLen = 0;
             readSource = XLOG_FROM_ANY;
 
+            /* promotion exit is not end-of-WAL */
+            xlogreader->EndOfWAL = !StandbyMode;
             return -1;
         }
     }
@@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    /* use currentSource as readSource is reset at failure */
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ceaff097b9..4f117ea4da 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                               startpointTLI,
-                                               LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                    (errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+                                            LSN_FORMAT_ARGS(LogstreamResult.Write),
+                                            startpointTLI)));
                             endofwal = true;
                             break;
                         }
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a6251e1a96..3745e76488 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1176,9 +1176,16 @@ main(int argc, char **argv)
         exit(0);
 
     if (errormsg)
-        fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-                    errormsg);
+    {
+        if (xlogreader_state->EndOfWAL)
+            pg_log_info("end of WAL at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+        else
+            fatal_error("error in WAL record at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+    }
 
     XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..7b314ef10e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* was the last attempt EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..01033334d6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, $reached_eow_pat, $logstart));
+    sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+       qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+                 SELECT pg_switch_wal();
+                 CHECKPOINT;
+                 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+    $lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+    "LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+                         $logstart));
+    sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0


Re: Make mesage at end-of-recovery less scary.

From
Ashutosh Sharma
Date:
On Thu, Feb 17, 2022 at 1:20 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Tue, 15 Feb 2022 20:17:20 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > OK. The v13 patch looks good. I have marked it as ready to commit.
> > Thank you for working on all my review comments.
>
> Thanks! But the recent xlog.c refactoring collides with this patch.
> And I found a silly bug while rebasing.
>
> xlog.c:12463 / xlogrecovery.c:3168
>                 if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
> ..
> {
> +                       Assert(!StandbyMode);
> ...
> +                       xlogreader->EndOfWAL = true;
>
> Yeah, I forgot about promotion there..

Yes, we exit WaitForWALToBecomeAvailable() even in standby mode
if the user has requested promotion. So checking the !StandbyMode
condition alone was not enough.

> So what I should have done is
> setting EndOfWAL according to StandbyMode.
>
> +                       Assert(!StandbyMode || CheckForStandbyTrigger());
> ...
> +                       /* promotion exit is not end-of-WAL */
> +                       xlogreader->EndOfWAL = !StandbyMode;
>

The changes look good. Thanks!

--
With Regards,
Ashutosh Sharma.



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Sat, 19 Feb 2022 09:31:33 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in 
> The changes look good. Thanks!

Thanks!

A recent core change altered WAL insertion speed during the TAP
test and revealed one forgotten EndOfWAL case.  When a record
header spills over into the next page, XLogReadRecord performs its
own check, separate from ValidXLogRecordHeader.

>     * If the whole record header is on this page, validate it immediately.
>     * Otherwise do just a basic sanity check on xl_tot_len, and validate the
>     * rest of the header after reading it from the next page.  The xl_tot_len
>     * check is necessary here to ensure that we enter the "Need to reassemble
>     * record" code path below; otherwise we might fail to apply
>     * ValidXLogRecordHeader at all.
>     */
>    if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
>    {
...
>    }
>    else
>    {
>        /* XXX: more validation should be done here */
>        if (total_len < SizeOfXLogRecord)
>        {
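
The branch condition quoted above can be tried in isolation with a small sketch like the following. The constants are assumptions for illustration: 8192 is the default XLOG_BLCKSZ, and 24 is the header size taken from the "wanted 24" log message. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_BLCKSZ       8192    /* assumed WAL page size */
#define SKETCH_HEADER_SIZE  24      /* from "wanted 24" in the log message */

/*
 * True when a record header starting at rec_off (offset within the page)
 * lies entirely on that page, so it can be validated immediately.
 * Otherwise only the length field is available before the next page is read.
 */
static bool
sketch_header_fits_on_page(size_t rec_off)
{
    return rec_off <= SKETCH_BLCKSZ - SKETCH_HEADER_SIZE;
}
```

With these numbers, a header starting at offset 8168 still fits, while one starting at 8176 continues on the next page and takes the else branch.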

I could simply copy part of ValidXLogRecordHeader there, but that
results in a rather large amount of duplicate code. I could have
ValidXLogRecordHeader handle the partial header case, but that seems
complex to me.

So in this version I split the xl_tot_len part of
ValidXLogRecordHeader into ValidXLogRecordLength.
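
To make the idea of the split-out length check concrete, here is a minimal standalone sketch. It is not the patch code: all names are hypothetical, the page size of 8192 is an assumption, and 24 is the minimum length from the "wanted 24" message. It also assumes the 4-byte length field always fits on the page. Unlike the loop in the patch, this sketch tests the bound before reading each byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ           8192    /* assumed WAL page size */
#define SKETCH_MIN_RECORD_LEN   24      /* from "wanted 24" in the log */

typedef enum SketchLenResult
{
    SKETCH_LEN_OK,              /* length looks plausible */
    SKETCH_LEN_END_OF_WAL,      /* zero length and rest of page is zero */
    SKETCH_LEN_INVALID          /* too short, treat as a broken record */
} SketchLenResult;

/* Scan from the record start to the end of the page for non-zero bytes. */
static bool
sketch_rest_of_page_is_zero(const unsigned char *page, size_t rec_off)
{
    size_t      i;

    for (i = rec_off; i < SKETCH_BLCKSZ; i++)
    {
        if (page[i] != 0)
            return false;
    }
    return true;
}

/* Classify the total-length field found at rec_off within the page. */
static SketchLenResult
sketch_check_record_length(const unsigned char *page, size_t rec_off)
{
    uint32_t    tot_len;

    /* assumes rec_off + sizeof(tot_len) <= SKETCH_BLCKSZ */
    memcpy(&tot_len, page + rec_off, sizeof(tot_len));

    if (tot_len == 0 && sketch_rest_of_page_is_zero(page, rec_off))
        return SKETCH_LEN_END_OF_WAL;
    if (tot_len < SKETCH_MIN_RECORD_LEN)
        return SKETCH_LEN_INVALID;
    return SKETCH_LEN_OK;
}
```

Because the check needs only the length field and the bytes of the current page, both the full-header path and the partial-header path can call it.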

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 01cce076d2b3ad536398cc2b716ef64ed9b2c409 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v15] Make End-Of-Recovery error less scary

When recovery in any type ends, we see a bit scary error message like
"invalid record length" that suggests something serious is
happening. Actually if recovery meets a record with length = 0, that
usually means it finished applying all available WAL records.

Make this message less scary as "reached end of WAL". Instead, raise
the error level for other kind of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 125 ++++++++++++++++++----
 src/backend/access/transam/xlogrecovery.c |  92 ++++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 ++-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++++
 6 files changed, 297 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..ba1c1ece87 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -42,6 +42,8 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int    ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
                              int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+                                  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                                   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -121,6 +123,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +295,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -380,12 +384,11 @@ restart:
      * whole header.
      */
     record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-    total_len = record->xl_tot_len;
 
     /*
      * If the whole record header is on this page, validate it immediately.
-     * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-     * rest of the header after reading it from the next page.  The xl_tot_len
+     * Otherwise do just a basic sanity check on record length, and validate
+     * the rest of the header after reading it from the next page.  The length
      * check is necessary here to ensure that we enter the "Need to reassemble
      * record" code path below; otherwise we might fail to apply
      * ValidXLogRecordHeader at all.
@@ -399,18 +402,13 @@ restart:
     }
     else
     {
-        /* XXX: more validation should be done here */
-        if (total_len < SizeOfXLogRecord)
-        {
-            report_invalid_record(state,
-                                  "invalid record length at %X/%X: wanted %u, got %u",
-                                  LSN_FORMAT_ARGS(RecPtr),
-                                  (uint32) SizeOfXLogRecord, total_len);
+        if (!ValidXLogRecordLength(state, RecPtr, record))
             goto err;
-        }
+
         gotheader = false;
     }
 
+    total_len = record->xl_tot_len;
     len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
     if (total_len > len)
     {
@@ -588,6 +586,15 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -719,6 +726,60 @@ XLogReaderInvalReadState(XLogReaderState *state)
     state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+                      XLogRecord *record)
+{
+    if (record->xl_tot_len == 0)
+    {
+        char       *p;
+        char       *pe;
+
+        /*
+         * We are almost sure reaching the end of WAL, make sure that the
+         * whole page after the record is filled with zeroes.
+         */
+        p = (char *) record;
+        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (*p == 0 && p < pe)
+            p++;
+
+        if (p == pe)
+        {
+            /*
+             * The page after the record is completely zeroed. That suggests
+             * we don't have a record after this point. We don't bother
+             * checking the pages after since they are not zeroed in the case
+             * of recycled segments.
+             */
+            report_invalid_record(state, "empty record at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+            return false;
+        }
+    }
+
+    if (record->xl_tot_len < SizeOfXLogRecord)
+    {
+        report_invalid_record(state,
+                              "invalid record length at %X/%X: wanted %u, got %u",
+                              LSN_FORMAT_ARGS(RecPtr),
+                              (uint32) SizeOfXLogRecord, record->xl_tot_len);
+        return false;
+    }
+
+    return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
@@ -730,14 +791,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
-    if (record->xl_tot_len < SizeOfXLogRecord)
-    {
-        report_invalid_record(state,
-                              "invalid record length at %X/%X: wanted %u, got %u",
-                              LSN_FORMAT_ARGS(RecPtr),
-                              (uint32) SizeOfXLogRecord, record->xl_tot_len);
+    if (!ValidXLogRecordLength(state, RecPtr, record))
         return false;
-    }
+
     if (record->xl_rmid > RM_MAX_ID)
     {
         report_invalid_record(state,
@@ -836,6 +892,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
     XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+    StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+    if (hdr->xlp_magic == 0)
+    {
+        /* Regard an empty page as End-Of-WAL */
+        int            i;
+
+        for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+        if (i == XLOG_BLCKSZ)
+        {
+            char        fname[MAXFNAMELEN];
+
+            XLogFileName(fname, state->seg.ws_tli, segno,
+                         state->segcxt.ws_segsize);
+
+            report_invalid_record(state,
+                                  "empty page in log segment %s, offset %u",
+                                  fname,
+                                  offset);
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid magic number */
+    }
     if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
     {
         char        fname[MAXFNAMELEN];
@@ -921,6 +1002,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
                               LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                               fname,
                               offset);
+
+        /*
+         * If the page address is less than expected we assume it is an unused
+         * page in a recycled segment.
+         */
+        if (hdr->xlp_pageaddr < recaddr)
+            state->EndOfWAL = true;
+
         return false;
     }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..750056acaf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1592,7 +1592,7 @@ PerformWalRecovery(void)
         /* just have to read next record after CheckPoint */
         Assert(xlogreader->ReadRecPtr == CheckPointLoc);
         replayTLI = CheckPointTLI;
-        record = ReadRecord(xlogreader, LOG, false, replayTLI);
+        record = ReadRecord(xlogreader, WARNING, false, replayTLI);
     }
 
     if (record != NULL)
@@ -1706,7 +1706,7 @@ PerformWalRecovery(void)
             }
 
             /* Else, try to fetch the next WAL record */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         } while (record != NULL);
 
         /*
@@ -1765,13 +1765,20 @@ PerformWalRecovery(void)
 
         InRedo = false;
     }
-    else
+    else if (xlogreader->EndOfWAL)
     {
         /* there are no WAL records following the checkpoint */
         ereport(LOG,
                 (errmsg("redo is not required")));
 
     }
+    else
+    {
+        /* broken record found */
+        ereport(WARNING,
+                (errmsg("redo is skipped"),
+                 errhint("This suggests WAL file corruption. You might need to check the database.")));
+    }
 
     /*
      * This check is intentionally after the above log messages that indicate
@@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * EndRecPtr is the LSN we tried to read but failed. In the
+                 * case of decoding error, it is at the end of the failed
+                 * record but we don't have a means for now to know EndRecPtr
+                 * is pointing to which of the beginning or ending of the
+                 * failed record.
+                 */
+                ErrRecPtr = xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * We only end up here without a message when XLogPageRead() failed
+             * in that case we already logged something, or just met end-of-WAL
+             * conditions. In StandbyMode that only happens if we have been
+             * triggered, so we shouldn't loop anymore in that case. When
+             * EndOfWAL is true, we don't emit that error if any immediately
+             * and instead will show it as a part of a decent end-of-wal
+             * message later.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI)));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             * recovery ended.
+             *
+             * Emit a decent message if we met end-of-WAL. Otherwise we should
+             * have already emitted an error message.
+             */
+            if (xlogreader->EndOfWAL)
+                ereport(LOG,
+                        (errmsg("reached end of WAL at %X/%X on timeline %u",
+                                LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -3129,12 +3172,16 @@ retry:
                                          private->replayTLI,
                                          xlogreader->EndRecPtr))
         {
+            Assert(!StandbyMode || CheckForStandbyTrigger());
+
             if (readFile >= 0)
                 close(readFile);
             readFile = -1;
             readLen = 0;
             readSource = XLOG_FROM_ANY;
 
+            /* promotion exit is not end-of-WAL */
+            xlogreader->EndOfWAL = !StandbyMode;
             return -1;
         }
     }
@@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    /* use currentSource as readSource is reset at failure */
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ceaff097b9..4f117ea4da 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                               startpointTLI,
-                                               LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                    (errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+                                            LSN_FORMAT_ARGS(LogstreamResult.Write),
+                                            startpointTLI)));
                             endofwal = true;
                             break;
                         }
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2340dc247b..215abe95dc 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1173,9 +1173,16 @@ main(int argc, char **argv)
         exit(0);
 
     if (errormsg)
-        fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-                    errormsg);
+    {
+        if (xlogreader_state->EndOfWAL)
+            pg_log_info("end of WAL at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+        else
+            fatal_error("error in WAL record at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+    }
 
     XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..7b314ef10e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* was the last attempt EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..01033334d6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, $reached_eow_pat, $logstart));
+    sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+       qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+                 SELECT pg_switch_wal();
+                 CHECKPOINT;
+                 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+    $lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+    "LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+                         $logstart));
+    sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0
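
The TAP test in the patch above finds the last record inside the segment file by taking the record's LSN modulo the WAL segment size (`hex($lastlsn) % $segsize`). A minimal C sketch of the same arithmetic follows. The names `parse_lsn` and `lsn_segment_offset` are hypothetical helpers written for this illustration, not PostgreSQL functions, and the 16 MB segment size used in the usage note is an assumption (the usual default).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse an LSN printed in the "%X/%X" form used by the
 * log messages (high 32 bits, slash, low 32 bits).  Returns false when the
 * text is not exactly in that form.
 */
bool
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;
    char        trailing;

    /* Exactly two conversions must succeed; a third means trailing junk. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &trailing) != 2)
        return false;

    *lsn = ((uint64_t) hi << 32) | lo;
    return true;
}

/*
 * Hypothetical helper: byte offset of an LSN within its WAL segment.
 * seg_size is assumed to be a power of two, as WAL segment sizes are.
 */
uint64_t
lsn_segment_offset(uint64_t lsn, uint64_t seg_size)
{
    assert(seg_size != 0 && (seg_size & (seg_size - 1)) == 0);
    return lsn & (seg_size - 1);
}
```

With a 16 MB segment, `parse_lsn("0/1551048", &lsn)` followed by `lsn_segment_offset(lsn, 16 * 1024 * 1024)` gives the position at which the test overwrites the record header with zeroes.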


Re: Make mesage at end-of-recovery less scary.

From
Ashutosh Sharma
Date:
On Wed, Mar 2, 2022 at 7:47 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Sat, 19 Feb 2022 09:31:33 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in
> > The changes look good. Thanks!
>
> Thanks!
>
> A recent core change altered WAL insertion speed during the TAP
> test and revealed one forgotten case of EndOfWAL.  When a record
> header spills over onto the next page, XLogReadRecord performs its
> own length check, separate from ValidXLogRecordHeader.
>

The new changes made in the patch look good. Thanks to the recent
changes that sped up WAL insertion, we were able to catch this bug.

One small comment:

        record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-       total_len = record->xl_tot_len;

Do you think we need to move the comment written for the above code,
the one that says:

        /*
         * Read the record length.
         *
...
...

--
With Regards,
Ashutosh Sharma.



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Thu, 3 Mar 2022 15:39:44 +0530, Ashutosh Sharma <ashu.coek88@gmail.com> wrote in 
> The new changes made in the patch look good. Thanks to the recent
> changes that sped up WAL insertion, we were able to catch this bug.

Thanks for the quick check.

> One small comment:
> 
>         record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
> -       total_len = record->xl_tot_len;
> 
> Do you think we need to move the comment written for the above code,
> the one that says:

Yeah, I didn't do that since the comment is about header verification.
But as you pointed out, the result still doesn't look perfect.

On second thought, the two comments seem to repeat the same thing, so
I merged them.  In this version 16 it looks like this.

>    /*
>     * Validate the record header.
>     *
>     * Even though we use an XLogRecord pointer here, the whole record header
>     * might not fit on this page.  If the whole record header is on this page,
>     * validate it immediately.  Even otherwise xl_tot_len must be on this page
>     * (it is the first field of MAXALIGNed records), but we still cannot
>     * access any further fields until we've verified that we got the whole
>     * header, so do just a basic sanity check on record length, and validate
>     * the rest of the header after reading it from the next page.  The length
>     * check is necessary here to ensure that we enter the "Need to reassemble
>     * record" code path below; otherwise we might fail to apply
>     * ValidXLogRecordHeader at all.
>     */
>    record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
>
>    if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
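
For readers following the thread, the merged comment describes two cases: either the whole record header fits on the current page and can be validated at once, or only the length field is known to be there and gets a basic sanity check. The sketch below models that decision, together with the zero-length end-of-WAL test that the attached patch adds, on a plain byte buffer. It is a simplified illustration under stated assumptions, not the PostgreSQL code: the `SKETCH_*` constants, `RecLengthStatus`, `header_fits_on_page` and `classify_record_length` are all hypothetical names, the page size of 8192 is the usual default, and the header size of 24 is taken from the "wanted 24" log message quoted in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ      8192u   /* assumption: default WAL page size */
#define SKETCH_REC_HDR_LEN 24u     /* assumption: record header size ("wanted 24") */

typedef enum
{
    REC_LENGTH_OK,
    REC_END_OF_WAL,
    REC_INVALID_LENGTH
} RecLengthStatus;

/* True if a record header starting at page offset rec_off lies wholly on the page. */
bool
header_fits_on_page(uint32_t rec_off)
{
    assert(rec_off < SKETCH_BLCKSZ);
    return rec_off <= SKETCH_BLCKSZ - SKETCH_REC_HDR_LEN;
}

/*
 * Classify the total-length field of the record starting at rec_off.  The
 * length is the first field of the header, so it is on this page even when
 * the rest of the header is not.
 */
RecLengthStatus
classify_record_length(const unsigned char *page, uint32_t rec_off)
{
    uint32_t    tot_len;
    uint32_t    i;

    assert(rec_off + sizeof(tot_len) <= SKETCH_BLCKSZ);
    memcpy(&tot_len, page + rec_off, sizeof(tot_len));

    if (tot_len == 0)
    {
        /* The bounds test comes first so page[i] is never read past the page. */
        for (i = rec_off; i < SKETCH_BLCKSZ && page[i] == 0; i++)
            ;

        /* Zeroes all the way to the end of the page: treat as end of WAL. */
        if (i == SKETCH_BLCKSZ)
            return REC_END_OF_WAL;
    }

    if (tot_len < SKETCH_REC_HDR_LEN)
        return REC_INVALID_LENGTH;

    return REC_LENGTH_OK;
}
```

A caller would validate the full header when `header_fits_on_page` is true and fall back to `classify_record_length` otherwise. A zero length followed by nonzero bytes is reported as an invalid length rather than end of WAL, which is the distinction the thread is about.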
From 00d848df6bb8b9966dfbd39c98a388fda42a3e3c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v16] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a somewhat scary error message
like "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Make this message less scary by reporting "reached end of WAL"
instead. In exchange, raise the error level for other kinds of WAL
failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 144 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  92 ++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++
 6 files changed, 305 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 35029cf97d..bd0f211a23 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -42,6 +42,8 @@ static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
 static int    ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
                              int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+                                  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                                   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -121,6 +123,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -292,6 +295,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     /* reset error state */
     *errormsg = NULL;
     state->errormsg_buf[0] = '\0';
+    state->EndOfWAL = false;
 
     ResetDecoder(state);
     state->abortedRecPtr = InvalidXLogRecPtr;
@@ -371,25 +375,21 @@ restart:
     Assert(pageHeaderSize <= readOff);
 
     /*
-     * Read the record length.
+     * Validate the record header.
      *
-     * NB: Even though we use an XLogRecord pointer here, the whole record
-     * header might not fit on this page. xl_tot_len is the first field of the
-     * struct, so it must be on this page (the records are MAXALIGNed), but we
-     * cannot access any other fields until we've verified that we got the
-     * whole header.
-     */
-    record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-    total_len = record->xl_tot_len;
-
-    /*
-     * If the whole record header is on this page, validate it immediately.
-     * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-     * rest of the header after reading it from the next page.  The xl_tot_len
+     * Even though we use an XLogRecord pointer here, the whole record header
+     * might not fit on this page.  If the whole record header is on this page,
+     * validate it immediately.  Even otherwise xl_tot_len must be on this page
+     * (it is the first field of MAXALIGNed records), but we still cannot
+     * access any further fields until we've verified that we got the whole
+     * header, so do just a basic sanity check on record length, and validate
+     * the rest of the header after reading it from the next page.  The length
      * check is necessary here to ensure that we enter the "Need to reassemble
      * record" code path below; otherwise we might fail to apply
      * ValidXLogRecordHeader at all.
      */
+    record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
     if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
     {
         if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
@@ -399,18 +399,13 @@ restart:
     }
     else
     {
-        /* XXX: more validation should be done here */
-        if (total_len < SizeOfXLogRecord)
-        {
-            report_invalid_record(state,
-                                  "invalid record length at %X/%X: wanted %u, got %u",
-                                  LSN_FORMAT_ARGS(RecPtr),
-                                  (uint32) SizeOfXLogRecord, total_len);
+        if (!ValidXLogRecordLength(state, RecPtr, record))
             goto err;
-        }
+
         gotheader = false;
     }
 
+    total_len = record->xl_tot_len;
     len = XLOG_BLCKSZ - RecPtr % XLOG_BLCKSZ;
     if (total_len > len)
     {
@@ -588,6 +583,15 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     /*
@@ -719,6 +723,60 @@ XLogReaderInvalReadState(XLogReaderState *state)
     state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+                      XLogRecord *record)
+{
+    if (record->xl_tot_len == 0)
+    {
+        char       *p;
+        char       *pe;
+
+        /*
+         * We are almost sure reaching the end of WAL, make sure that the
+         * whole page after the record is filled with zeroes.
+         */
+        p = (char *) record;
+        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (p < pe && *p == 0)
+            p++;
+
+        if (p == pe)
+        {
+            /*
+             * The page after the record is completely zeroed. That suggests
+             * we don't have a record after this point. We don't bother
+             * checking the pages after since they are not zeroed in the case
+             * of recycled segments.
+             */
+            report_invalid_record(state, "empty record at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+            return false;
+        }
+    }
+
+    if (record->xl_tot_len < SizeOfXLogRecord)
+    {
+        report_invalid_record(state,
+                              "invalid record length at %X/%X: wanted %u, got %u",
+                              LSN_FORMAT_ARGS(RecPtr),
+                              (uint32) SizeOfXLogRecord, record->xl_tot_len);
+        return false;
+    }
+
+    return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
@@ -730,14 +788,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
-    if (record->xl_tot_len < SizeOfXLogRecord)
-    {
-        report_invalid_record(state,
-                              "invalid record length at %X/%X: wanted %u, got %u",
-                              LSN_FORMAT_ARGS(RecPtr),
-                              (uint32) SizeOfXLogRecord, record->xl_tot_len);
+    if (!ValidXLogRecordLength(state, RecPtr, record))
         return false;
-    }
+
     if (record->xl_rmid > RM_MAX_ID)
     {
         report_invalid_record(state,
@@ -836,6 +889,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
     XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+    StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+    if (hdr->xlp_magic == 0)
+    {
+        /* Regard an empty page as End-Of-WAL */
+        int            i;
+
+        for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+        if (i == XLOG_BLCKSZ)
+        {
+            char        fname[MAXFNAMELEN];
+
+            XLogFileName(fname, state->seg.ws_tli, segno,
+                         state->segcxt.ws_segsize);
+
+            report_invalid_record(state,
+                                  "empty page in log segment %s, offset %u",
+                                  fname,
+                                  offset);
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid magic number */
+    }
     if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
     {
         char        fname[MAXFNAMELEN];
@@ -921,6 +999,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
                               LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                               fname,
                               offset);
+
+        /*
+         * If the page address is less than expected we assume it is an unused
+         * page in a recycled segment.
+         */
+        if (hdr->xlp_pageaddr < recaddr)
+            state->EndOfWAL = true;
+
         return false;
     }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f9f212680b..750056acaf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1592,7 +1592,7 @@ PerformWalRecovery(void)
         /* just have to read next record after CheckPoint */
         Assert(xlogreader->ReadRecPtr == CheckPointLoc);
         replayTLI = CheckPointTLI;
-        record = ReadRecord(xlogreader, LOG, false, replayTLI);
+        record = ReadRecord(xlogreader, WARNING, false, replayTLI);
     }
 
     if (record != NULL)
@@ -1706,7 +1706,7 @@ PerformWalRecovery(void)
             }
 
             /* Else, try to fetch the next WAL record */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         } while (record != NULL);
 
         /*
@@ -1765,13 +1765,20 @@ PerformWalRecovery(void)
 
         InRedo = false;
     }
-    else
+    else if (xlogreader->EndOfWAL)
     {
         /* there are no WAL records following the checkpoint */
         ereport(LOG,
                 (errmsg("redo is not required")));
 
     }
+    else
+    {
+        /* broken record found */
+        ereport(WARNING,
+                (errmsg("redo is skipped"),
+                 errhint("This suggests WAL file corruption. You might need to check the database.")));
+    }
 
     /*
      * This check is intentionally after the above log messages that indicate
@@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * EndRecPtr is the LSN we tried to read but failed. In the
+                 * case of decoding error, it is at the end of the failed
+                 * record but we don't have a means for now to know EndRecPtr
+                 * is pointing to which of the beginning or ending of the
+                 * failed record.
+                 */
+                ErrRecPtr = xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * We only end up here without a message when XLogPageRead() failed
+             * in that case we already logged something, or just met end-of-WAL
+             * conditions. In StandbyMode that only happens if we have been
+             * triggered, so we shouldn't loop anymore in that case. When
+             * EndOfWAL is true, we don't emit that error if any immediately
+             * and instead will show it as a part of a decent end-of-wal
+             * message later.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI)));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             * recovery ended.
+             *
+             * Emit a decent message if we met end-of-WAL. Otherwise we should
+             * have already emitted an error message.
+             */
+            if (xlogreader->EndOfWAL)
+                ereport(LOG,
+                        (errmsg("reached end of WAL at %X/%X on timeline %u",
+                                LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -3129,12 +3172,16 @@ retry:
                                          private->replayTLI,
                                          xlogreader->EndRecPtr))
         {
+            Assert(!StandbyMode || CheckForStandbyTrigger());
+
             if (readFile >= 0)
                 close(readFile);
             readFile = -1;
             readLen = 0;
             readSource = XLOG_FROM_ANY;
 
+            /* promotion exit is not end-of-WAL */
+            xlogreader->EndOfWAL = !StandbyMode;
             return -1;
         }
     }
@@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    /* use currentSource as readSource is reset at failure */
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ceaff097b9..4f117ea4da 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                               startpointTLI,
-                                               LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                    (errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+                                            LSN_FORMAT_ARGS(LogstreamResult.Write),
+                                            startpointTLI)));
                             endofwal = true;
                             break;
                         }
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2340dc247b..215abe95dc 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1173,9 +1173,16 @@ main(int argc, char **argv)
         exit(0);
 
     if (errormsg)
-        fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-                    errormsg);
+    {
+        if (xlogreader_state->EndOfWAL)
+            pg_log_info("end of WAL at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+        else
+            fatal_error("error in WAL record at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+    }
 
     XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 477f0efe26..7b314ef10e 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -174,6 +174,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* was the last attempt EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..01033334d6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, $reached_eow_pat, $logstart));
+    sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+       qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+                 SELECT pg_switch_wal();
+                 CHECKPOINT;
+                 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+    $lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+    "LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+                         $logstart));
+    sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0


Re: Make mesage at end-of-recovery less scary.

From
Andres Freund
Date:
Hi,

On 2022-03-04 09:43:59 +0900, Kyotaro Horiguchi wrote:
> On second thought, the two seem to repeat the same thing.  Thus I
> merged the two comments together.  In this version 16 it looks like
> this.

Patch currently fails to apply, needs a rebase:
http://cfbot.cputube.org/patch_37_2490.log

Greetings,

Andres Freund



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Mon, 21 Mar 2022 17:01:19 -0700, Andres Freund <andres@anarazel.de> wrote in 
> Patch currently fails to apply, needs a rebase:
> http://cfbot.cputube.org/patch_37_2490.log

Thanks for letting me know about that.

Rebased to the current HEAD.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From a7c9f36e631eaba5078398598dae5d459e79add9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 28 Feb 2020 15:52:58 +0900
Subject: [PATCH v17] Make End-Of-Recovery error less scary

When recovery of any type ends, we see a rather scary error message such
as "invalid record length" that suggests something serious is
happening. Actually, if recovery meets a record with length = 0, that
usually means it has finished applying all available WAL records.

Replace this message with the less scary "reached end of WAL", and raise
the error level for other kinds of WAL failure to WARNING.
---
 src/backend/access/transam/xlogreader.c   | 145 +++++++++++++++++-----
 src/backend/access/transam/xlogrecovery.c |  92 ++++++++++----
 src/backend/replication/walreceiver.c     |   7 +-
 src/bin/pg_waldump/pg_waldump.c           |  13 +-
 src/include/access/xlogreader.h           |   1 +
 src/test/recovery/t/011_crash_recovery.pl | 106 ++++++++++++++++
 6 files changed, 306 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e437c42992..0942265408 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -46,6 +46,8 @@ static int    ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr,
                              int reqLen);
 static void XLogReaderInvalReadState(XLogReaderState *state);
 static XLogPageReadResult XLogDecodeNextRecord(XLogReaderState *state, bool non_blocking);
+static bool ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+                                  XLogRecord *record);
 static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                                   XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);
 static bool ValidXLogRecord(XLogReaderState *state, XLogRecord *record,
@@ -147,6 +149,7 @@ XLogReaderAllocate(int wal_segment_size, const char *waldir,
         pfree(state);
         return NULL;
     }
+    state->EndOfWAL = false;
     state->errormsg_buf[0] = '\0';
 
     /*
@@ -552,6 +555,7 @@ XLogDecodeNextRecord(XLogReaderState *state, bool nonblocking)
     /* reset error state */
     state->errormsg_buf[0] = '\0';
     decoded = NULL;
+    state->EndOfWAL = false;
 
     state->abortedRecPtr = InvalidXLogRecPtr;
     state->missingContrecPtr = InvalidXLogRecPtr;
@@ -633,25 +637,21 @@ restart:
     Assert(pageHeaderSize <= readOff);
 
     /*
-     * Read the record length.
+     * Validate the record header.
      *
-     * NB: Even though we use an XLogRecord pointer here, the whole record
-     * header might not fit on this page. xl_tot_len is the first field of the
-     * struct, so it must be on this page (the records are MAXALIGNed), but we
-     * cannot access any other fields until we've verified that we got the
-     * whole header.
-     */
-    record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
-    total_len = record->xl_tot_len;
-
-    /*
-     * If the whole record header is on this page, validate it immediately.
-     * Otherwise do just a basic sanity check on xl_tot_len, and validate the
-     * rest of the header after reading it from the next page.  The xl_tot_len
+     * Even though we use an XLogRecord pointer here, the whole record header
+     * might not fit on this page.  If the whole record header is on this page,
+     * validate it immediately.  Even otherwise xl_tot_len must be on this page
+     * (it is the first field of MAXALIGNed records), but we still cannot
+     * access any further fields until we've verified that we got the whole
+     * header, so do just a basic sanity check on record length, and validate
+     * the rest of the header after reading it from the next page.  The length
      * check is necessary here to ensure that we enter the "Need to reassemble
      * record" code path below; otherwise we might fail to apply
      * ValidXLogRecordHeader at all.
      */
+    record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
+
     if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
     {
         if (!ValidXLogRecordHeader(state, RecPtr, state->DecodeRecPtr, record,
@@ -661,18 +661,14 @@ restart:
     }
     else
     {
-        /* XXX: more validation should be done here */
-        if (total_len < SizeOfXLogRecord)
-        {
-            report_invalid_record(state,
-                                  "invalid record length at %X/%X: wanted %u, got %u",
-                                  LSN_FORMAT_ARGS(RecPtr),
-                                  (uint32) SizeOfXLogRecord, total_len);
+        if (!ValidXLogRecordLength(state, RecPtr, record))
             goto err;
-        }
+
         gotheader = false;
     }
 
+    total_len = record->xl_tot_len;
+
     /*
      * Find space to decode this record.  Don't allow oversized allocation if
      * the caller requested nonblocking.  Otherwise, we *have* to try to
@@ -904,6 +900,15 @@ err:
          */
         state->abortedRecPtr = RecPtr;
         state->missingContrecPtr = targetPagePtr;
+
+        /*
+         * If the message is not set yet, that means we failed to load the
+         * page for the record.  Otherwise do not hide the existing message.
+         */
+        if (state->errormsg_buf[0] == '\0')
+            report_invalid_record(state,
+                                  "missing contrecord at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
     }
 
     if (decoded && decoded->oversized)
@@ -1083,6 +1088,60 @@ XLogReaderInvalReadState(XLogReaderState *state)
     state->readLen = 0;
 }
 
+/*
+ * Validate record length of an XLOG record header.
+ *
+ * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
+ * needs this separate from the function in case of a partial record header.
+ */
+static bool
+ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,
+                      XLogRecord *record)
+{
+    if (record->xl_tot_len == 0)
+    {
+        char       *p;
+        char       *pe;
+
+        /*
+         * We are almost sure reaching the end of WAL, make sure that the
+         * whole page after the record is filled with zeroes.
+         */
+        p = (char *) record;
+        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (*p == 0 && p < pe)
+            p++;
+
+        if (p == pe)
+        {
+            /*
+             * The page after the record is completely zeroed. That suggests
+             * we don't have a record after this point. We don't bother
+             * checking the pages after since they are not zeroed in the case
+             * of recycled segments.
+             */
+            report_invalid_record(state, "empty record at %X/%X",
+                                  LSN_FORMAT_ARGS(RecPtr));
+
+            /* notify end-of-wal to callers */
+            state->EndOfWAL = true;
+            return false;
+        }
+    }
+
+    if (record->xl_tot_len < SizeOfXLogRecord)
+    {
+        report_invalid_record(state,
+                              "invalid record length at %X/%X: wanted %u, got %u",
+                              LSN_FORMAT_ARGS(RecPtr),
+                              (uint32) SizeOfXLogRecord, record->xl_tot_len);
+        return false;
+    }
+
+    return true;
+}
+
 /*
  * Validate an XLOG record header.
  *
@@ -1094,14 +1153,9 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                       XLogRecPtr PrevRecPtr, XLogRecord *record,
                       bool randAccess)
 {
-    if (record->xl_tot_len < SizeOfXLogRecord)
-    {
-        report_invalid_record(state,
-                              "invalid record length at %X/%X: wanted %u, got %u",
-                              LSN_FORMAT_ARGS(RecPtr),
-                              (uint32) SizeOfXLogRecord, record->xl_tot_len);
+    if (!ValidXLogRecordLength(state, RecPtr, record))
         return false;
-    }
+
     if (record->xl_rmid > RM_MAX_ID)
     {
         report_invalid_record(state,
@@ -1200,6 +1254,31 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
 
     XLogSegNoOffsetToRecPtr(segno, offset, state->segcxt.ws_segsize, recaddr);
 
+    StaticAssertStmt(XLOG_PAGE_MAGIC != 0, "XLOG_PAGE_MAGIC is zero");
+
+    if (hdr->xlp_magic == 0)
+    {
+        /* Regard an empty page as End-Of-WAL */
+        int            i;
+
+        for (i = 0; i < XLOG_BLCKSZ && phdr[i] == 0; i++);
+        if (i == XLOG_BLCKSZ)
+        {
+            char        fname[MAXFNAMELEN];
+
+            XLogFileName(fname, state->seg.ws_tli, segno,
+                         state->segcxt.ws_segsize);
+
+            report_invalid_record(state,
+                                  "empty page in log segment %s, offset %u",
+                                  fname,
+                                  offset);
+            state->EndOfWAL = true;
+            return false;
+        }
+
+        /* The same condition will be caught as invalid magic number */
+    }
     if (hdr->xlp_magic != XLOG_PAGE_MAGIC)
     {
         char        fname[MAXFNAMELEN];
@@ -1285,6 +1364,14 @@ XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
                               LSN_FORMAT_ARGS(hdr->xlp_pageaddr),
                               fname,
                               offset);
+
+        /*
+         * If the page address is less than expected we assume it is an unused
+         * page in a recycled segment.
+         */
+        if (hdr->xlp_pageaddr < recaddr)
+            state->EndOfWAL = true;
+
         return false;
     }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 9feea3e6ec..98382d66a4 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1592,7 +1592,7 @@ PerformWalRecovery(void)
         /* just have to read next record after CheckPoint */
         Assert(xlogreader->ReadRecPtr == CheckPointLoc);
         replayTLI = CheckPointTLI;
-        record = ReadRecord(xlogreader, LOG, false, replayTLI);
+        record = ReadRecord(xlogreader, WARNING, false, replayTLI);
     }
 
     if (record != NULL)
@@ -1706,7 +1706,7 @@ PerformWalRecovery(void)
             }
 
             /* Else, try to fetch the next WAL record */
-            record = ReadRecord(xlogreader, LOG, false, replayTLI);
+            record = ReadRecord(xlogreader, WARNING, false, replayTLI);
         } while (record != NULL);
 
         /*
@@ -1765,13 +1765,20 @@ PerformWalRecovery(void)
 
         InRedo = false;
     }
-    else
+    else if (xlogreader->EndOfWAL)
     {
         /* there are no WAL records following the checkpoint */
         ereport(LOG,
                 (errmsg("redo is not required")));
 
     }
+    else
+    {
+        /* broken record found */
+        ereport(WARNING,
+                (errmsg("redo is skipped"),
+                 errhint("This suggests WAL file corruption. You might need to check the database.")));
+    }
 
     /*
      * This check is intentionally after the above log messages that indicate
@@ -2939,6 +2946,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
     for (;;)
     {
         char       *errormsg;
+        XLogRecPtr    ErrRecPtr = InvalidXLogRecPtr;
 
         record = XLogReadRecord(xlogreader, &errormsg);
         if (record == NULL)
@@ -2954,6 +2962,18 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             {
                 abortedRecPtr = xlogreader->abortedRecPtr;
                 missingContrecPtr = xlogreader->missingContrecPtr;
+                ErrRecPtr = abortedRecPtr;
+            }
+            else
+            {
+                /*
+                 * EndRecPtr is the LSN we tried to read but failed. In the
+                 * case of decoding error, it is at the end of the failed
+                 * record but we don't have a means for now to know EndRecPtr
+                 * is pointing to which of the beginning or ending of the
+                 * failed record.
+                 */
+                ErrRecPtr = xlogreader->EndRecPtr;
             }
 
             if (readFile >= 0)
@@ -2963,13 +2983,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * We only end up here without a message when XLogPageRead() failed
+             * in that case we already logged something, or just met end-of-WAL
+             * conditions. In StandbyMode that only happens if we have been
+             * triggered, so we shouldn't loop anymore in that case. When
+             * EndOfWAL is true, we don't emit that error if any immediately
+             * and instead will show it as a part of a decent end-of-wal
+             * message later.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr),
+            if (!xlogreader->EndOfWAL && errormsg)
+                ereport(emode_for_corrupt_record(emode, ErrRecPtr),
                         (errmsg_internal("%s", errormsg) /* already translated */ ));
         }
 
@@ -3000,11 +3023,14 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
             /* Great, got a record */
             return record;
         }
-        else
+
+        Assert(ErrRecPtr != InvalidXLogRecPtr);
+
+        /* No valid record available from this source */
+        lastSourceFailed = true;
+
+        if (!fetching_ckpt)
         {
-            /* No valid record available from this source */
-            lastSourceFailed = true;
-
             /*
              * If archive recovery was requested, but we were still doing
              * crash recovery, switch to archive recovery and retry using the
@@ -3017,11 +3043,16 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
              * we'd have no idea how far we'd have to replay to reach
              * consistency.  So err on the safe side and give up.
              */
-            if (!InArchiveRecovery && ArchiveRecoveryRequested &&
-                !fetching_ckpt)
+            if (!InArchiveRecovery && ArchiveRecoveryRequested)
             {
+                /*
+                 * We don't report this as LOG, since we don't stop recovery
+                 * here
+                 */
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("reached end of WAL at %X/%X on timeline %u in pg_wal during crash recovery, entering archive recovery",
+                                         LSN_FORMAT_ARGS(ErrRecPtr),
+                                         replayTLI)));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -3042,12 +3073,24 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
                 continue;
             }
 
-            /* In standby mode, loop back to retry. Otherwise, give up. */
-            if (StandbyMode && !CheckForStandbyTrigger())
-                continue;
-            else
-                return NULL;
+            /*
+             * recovery ended.
+             *
+             * Emit a decent message if we met end-of-WAL. Otherwise we should
+             * have already emitted an error message.
+             */
+            if (xlogreader->EndOfWAL)
+                ereport(LOG,
+                        (errmsg("reached end of WAL at %X/%X on timeline %u",
+                                LSN_FORMAT_ARGS(ErrRecPtr), replayTLI),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
         }
+
+        /* In standby mode, loop back to retry. Otherwise, give up. */
+        if (StandbyMode && !CheckForStandbyTrigger())
+            continue;
+        else
+            return NULL;
     }
 }
 
@@ -3129,12 +3172,16 @@ retry:
                                          private->replayTLI,
                                          xlogreader->EndRecPtr))
         {
+            Assert(!StandbyMode || CheckForStandbyTrigger());
+
             if (readFile >= 0)
                 close(readFile);
             readFile = -1;
             readLen = 0;
             readSource = XLOG_FROM_ANY;
 
+            /* promotion exit is not end-of-WAL */
+            xlogreader->EndOfWAL = !StandbyMode;
             return -1;
         }
     }
@@ -3767,7 +3814,8 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
     static XLogRecPtr lastComplaint = 0;
 
-    if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
+    /* use currentSource as readSource is reset at failure */
+    if (currentSource == XLOG_FROM_PG_WAL && emode <= WARNING)
     {
         if (RecPtr == lastComplaint)
             emode = DEBUG1;
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ceaff097b9..4f117ea4da 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -472,10 +472,9 @@ WalReceiverMain(void)
                         else if (len < 0)
                         {
                             ereport(LOG,
-                                    (errmsg("replication terminated by primary server"),
-                                     errdetail("End of WAL reached on timeline %u at %X/%X.",
-                                               startpointTLI,
-                                               LSN_FORMAT_ARGS(LogstreamResult.Write))));
+                                    (errmsg("replication terminated by primary server at %X/%X on timeline %u.",
+                                            LSN_FORMAT_ARGS(LogstreamResult.Write),
+                                            startpointTLI)));
                             endofwal = true;
                             break;
                         }
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index fc081adfb8..9bebca8154 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1174,9 +1174,16 @@ main(int argc, char **argv)
         exit(0);
 
     if (errormsg)
-        fatal_error("error in WAL record at %X/%X: %s",
-                    LSN_FORMAT_ARGS(xlogreader_state->ReadRecPtr),
-                    errormsg);
+    {
+        if (xlogreader_state->EndOfWAL)
+            pg_log_info("end of WAL at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+        else
+            fatal_error("error in WAL record at %X/%X: %s",
+                        LSN_FORMAT_ARGS(xlogreader_state->EndRecPtr),
+                        errormsg);
+    }
 
     XLogReaderFree(xlogreader_state);
 
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index f4388cc9be..21a8f9552c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -201,6 +201,7 @@ struct XLogReaderState
      */
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
+    bool        EndOfWAL;        /* was the last attempt EOW? */
 
     /*
      * Set at the end of recovery: the start point of a partial record at the
diff --git a/src/test/recovery/t/011_crash_recovery.pl b/src/test/recovery/t/011_crash_recovery.pl
index 14154d1ce0..01033334d6 100644
--- a/src/test/recovery/t/011_crash_recovery.pl
+++ b/src/test/recovery/t/011_crash_recovery.pl
@@ -10,7 +10,9 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 use Config;
+use IPC::Run;
 
+my $reached_eow_pat = "reached end of WAL at ";
 my $node = PostgreSQL::Test::Cluster->new('primary');
 $node->init(allows_streaming => 1);
 $node->start;
@@ -48,7 +50,15 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 
 # Crash and restart the postmaster
 $node->stop('immediate');
+my $logstart = get_log_size($node);
 $node->start;
+my $max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, $reached_eow_pat, $logstart));
+    sleep 0.5;
+}
+ok ($max_attempts >= 0, "end-of-wal is logged");
 
 # Make sure we really got a new xid
 cmp_ok($node->safe_psql('postgres', 'SELECT pg_current_xact_id()'),
@@ -61,4 +71,100 @@ is($node->safe_psql('postgres', qq[SELECT pg_xact_status('$xid');]),
 $stdin .= "\\q\n";
 $tx->finish;    # wait for psql to quit gracefully
 
+my $segsize = $node->safe_psql('postgres',
+       qq[SELECT setting FROM pg_settings WHERE name = 'wal_segment_size';]);
+
+# make sure no records afterwards go to the next segment
+$node->safe_psql('postgres', qq[
+                 SELECT pg_switch_wal();
+                 CHECKPOINT;
+                 CREATE TABLE t();
+]);
+$node->stop('immediate');
+
+# identify REDO WAL file
+my $cmd = "pg_controldata -D " . $node->data_dir();
+$cmd = ['pg_controldata', '-D', $node->data_dir()];
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stdout =~ /^Latest checkpoint's REDO WAL file:[ \t] *(.+)$/m,
+   "checkpoint file is identified");
+my $chkptfile = $1;
+
+# identify the last record
+my $walfile = $node->data_dir() . "/pg_wal/$chkptfile";
+$cmd = ['pg_waldump', $walfile];
+$stdout = '';
+$stderr = '';
+my $lastrec;
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+foreach my $l (split(/\r?\n/, $stdout))
+{
+    $lastrec = $l;
+}
+ok(defined $lastrec, "last WAL record is extracted");
+ok($stderr =~ /end of WAL at ([0-9A-F\/]+): .* at \g1/,
+   "pg_waldump emits the correct ending message");
+
+# read the last record LSN excluding leading zeroes
+ok ($lastrec =~ /, lsn: 0\/0*([1-9A-F][0-9A-F]+),/,
+    "LSN of the last record identified");
+my $lastlsn = $1;
+
+# corrupt the last record
+my $offset = hex($lastlsn) % $segsize;
+open(my $segf, '+<', $walfile) or die "failed to open $walfile\n";
+seek($segf, $offset, 0);  # halfway break the last record
+print $segf "\0\0\0\0";
+close($segf);
+
+# pg_waldump complains about the corrupted record
+$stdout = '';
+$stderr = '';
+IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+ok($stderr =~ /fatal: error in WAL record at 0\/$lastlsn: .* at 0\/$lastlsn/,
+   "pg_waldump emits the correct error message");
+
+# also server complains
+$logstart = get_log_size($node);
+$node->start;
+$max_attempts = 360;
+while ($max_attempts-- >= 0)
+{
+    last if (find_in_log($node, "WARNING:  invalid record length at 0/$lastlsn: wanted [0-9]+, got 0",
+                         $logstart));
+    sleep 0.5;
+}
+ok($max_attempts >= 0, "header error is logged at $lastlsn");
+
+# no end-of-wal message should be seen this time
+ok(!find_in_log($node, $reached_eow_pat, $logstart),
+   "false log message is not emitted");
+
+$node->stop('immediate');
+
 done_testing();
+
+#### helper routines
+# return the size of logfile of $node in bytes
+sub get_log_size
+{
+    my ($node) = @_;
+
+    return (stat $node->logfile)[7];
+}
+
+# find $pat in logfile of $node after $off-th byte
+sub find_in_log
+{
+    my ($node, $pat, $off) = @_;
+
+    $off = 0 unless defined $off;
+    my $log = PostgreSQL::Test::Utils::slurp_file($node->logfile);
+    return 0 if (length($log) <= $off);
+
+    $log = substr($log, $off);
+
+    return $log =~ m/$pat/;
+}
-- 
2.27.0

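The zero-length check that ValidXLogRecordLength() adds in the patch above can be sketched standalone as follows. This is only an illustration under stated assumptions: SKETCH_BLCKSZ (8192) and SKETCH_MIN_RECORD (24, matching the "wanted 24" in the log message) are assumed constants, and the enum and function names are hypothetical. The sketch also tests the bound before dereferencing the pointer, so it never reads past the end of the page; that ordering is a choice made in the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ     8192  /* assumed WAL page size */
#define SKETCH_MIN_RECORD 24    /* assumed minimum record header size */

typedef enum
{
    REC_OK,                 /* length is plausible, keep validating */
    REC_END_OF_WAL,         /* zero length and zeroed rest of page */
    REC_INVALID_LENGTH      /* too short, and not a clean end of WAL */
} RecLenStatus;

/* True if every byte from offset 'off' to the end of the page is zero. */
static bool
rest_of_page_is_zero(const unsigned char *page, size_t off)
{
    assert(off <= SKETCH_BLCKSZ);

    const unsigned char *p = page + off;
    const unsigned char *pe = page + SKETCH_BLCKSZ;

    /* Check the bound first, then the byte. */
    while (p < pe && *p == 0)
        p++;

    return p == pe;
}

/* Classify the total-length field of the record starting at 'off'. */
static RecLenStatus
classify_record_length(const unsigned char *page, size_t off)
{
    uint32_t    tot_len;

    assert(off + sizeof(tot_len) <= SKETCH_BLCKSZ);
    memcpy(&tot_len, page + off, sizeof(tot_len));

    if (tot_len == 0 && rest_of_page_is_zero(page, off))
        return REC_END_OF_WAL;
    if (tot_len < SKETCH_MIN_RECORD)
        return REC_INVALID_LENGTH;
    return REC_OK;
}
```

As in the patch, only the remainder of the current page is inspected; later pages are not checked because recycled segments are not zero-filled.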

Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
me> Rebased to the current HEAD.

b64c3bd62e (removal of unused "use Config") conflicted on a TAP
script.

Rebased.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

Re: Make mesage at end-of-recovery less scary.

From
Jacob Champion
Date:
On Mon, Mar 28, 2022 at 11:07 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Rebased.

Unfortunately this will need another rebase over latest.

[CFM hat] Looking through the history here, this has been bumped to
Ready for Committer a few times and then bumped back to Needs Review
after a required rebase. What's the best way for us to provide support
for contributors who get stuck in this loop? Maybe we can be more
aggressive about automated notifications when a RfC patch goes red in
the cfbot?

Thanks,
--Jacob



Re: Make mesage at end-of-recovery less scary.

From
Michael Paquier
Date:
On Wed, Jul 06, 2022 at 11:05:51AM -0700, Jacob Champion wrote:
> [CFM hat] Looking through the history here, this has been bumped to
> Ready for Committer a few times and then bumped back to Needs Review
> after a required rebase. What's the best way for us to provide support
> for contributors who get stuck in this loop? Maybe we can be more
> aggressive about automated notifications when a RfC patch goes red in
> the cfbot?

Having a better integration between the CF bot and the CF app would be
great, IMO.  People tend to easily forget about what they send in my
experience, even if they manage a small pool of patches or a larger
one.
--
Michael

Attachments

Re: Make mesage at end-of-recovery less scary.

От
Kyotaro Horiguchi
Дата:
At Wed, 6 Jul 2022 11:05:51 -0700, Jacob Champion <jchampion@timescale.com> wrote in 
> On Mon, Mar 28, 2022 at 11:07 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> >
> > Rebased.
> 
> Unfortunately this will need another rebase over latest.

Thanks! Done. 

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

Re: Make mesage at end-of-recovery less scary.

From
Justin Pryzby
Date:
@cfbot: rebased over adb466150, which did the same thing as one of the
hunks in xlogreader.c.

Attachments

Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Fri, 16 Sep 2022 23:21:50 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in 
> @cfbot: rebased over adb466150, which did the same thing as one of the
> hunks in xlogreader.c.

Oops. Thanks! This then hit a further conflict (parameter name
harmonization), so I rebased again, and removed the extra blank line
you pointed out.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

Re: Make mesage at end-of-recovery less scary.

From
Justin Pryzby
Date:

Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
Just rebased.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

Re: Make mesage at end-of-recovery less scary.

From
Andres Freund
Date:
Hi,

On 2022-11-18 17:25:37 +0900, Kyotaro Horiguchi wrote:
> Just rebased.

Fails with address sanitizer:
https://cirrus-ci.com/task/5632986241564672

Unfortunately one of the failures is in pg_waldump and we don't seem to
capture its output in 011_crash_recovery. So we don't see the nicely formatted
output...

[11:07:18.868] #0  0x00007fcf43803ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
[11:07:18.912] 
[11:07:18.912] Thread 1 (Thread 0x7fcf43662780 (LWP 39124)):
[11:07:18.912] #0  0x00007fcf43803ce1 in raise () from /lib/x86_64-linux-gnu/libc.so.6
[11:07:18.912] No symbol table info available.
[11:07:18.912] #1  0x00007fcf437ed537 in abort () from /lib/x86_64-linux-gnu/libc.so.6
[11:07:18.912] No symbol table info available.
[11:07:18.912] #2  0x00007fcf43b8511b in __sanitizer::Abort () at
../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:155
[11:07:18.912] No locals.
[11:07:18.912] #3  0x00007fcf43b8fce8 in __sanitizer::Die () at
../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
[11:07:18.912] No locals.
[11:07:18.912] #4  0x00007fcf43b7244c in __asan::ScopedInErrorReport::~ScopedInErrorReport (this=0x7ffd4fde18e6,
__in_chrg=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:186
[11:07:18.912]         buffer_copy = {<__sanitizer::InternalMmapVectorNoCtor<char>> = {data_ = 0x7fcf40350000 '='
<repeats 65 times>, "\n==39124==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x625000002100 at pc
0x55c36c21e315 bp 0x7ffd4fde2550 sp 0x7ffd4fde2"..., capacity_bytes_ = 65536, size_ = <optimized out>}, <No data
fields>}
...
[11:07:18.912] #6  0x00007fcf43b72788 in __asan::__asan_report_load1 (addr=<optimized out>) at
../../../../src/libsanitizer/asan/asan_rtl.cpp:117
[11:07:18.912]         bp = 140725943412048
[11:07:18.912]         pc = <optimized out>
[11:07:18.912]         local_stack = 140528180793728
[11:07:18.912]         sp = 140725943412040
[11:07:18.912] #7  0x000055c36c21e315 in ValidXLogRecordLength (state=state@entry=0x61a000000680,
RecPtr=RecPtr@entry=33655480, record=record@entry=0x625000000bb8) at xlogreader.c:1126
[11:07:18.912]         p = <optimized out>
[11:07:18.912]         pe = 0x625000002100 ""
[11:07:18.912] #8  0x000055c36c21e3b1 in ValidXLogRecordHeader (state=state@entry=0x61a000000680,
RecPtr=RecPtr@entry=33655480, PrevRecPtr=33655104, record=record@entry=0x625000000bb8,
randAccess=randAccess@entry=false) at xlogreader.c:1169
[11:07:18.912] No locals.

The most important bit is "AddressSanitizer: heap-buffer-overflow on address
0x625000002100 at pc 0x55c36c21e315 bp 0x7ffd4fde2550 sp 0x7ffd4fde2"

Greetings,

Andres Freund



Re: Make mesage at end-of-recovery less scary.

From
Justin Pryzby
Date:
On Fri, Nov 18, 2022 at 05:25:37PM +0900, Kyotaro Horiguchi wrote:
> +        while (*p == 0 && p < pe)
> +            p++;

The bug reported by Andres/cfbot/ubsan is here.

Fixed in attached.

I didn't try to patch the test case to output the failing stderr, but
that might be good.

Attachments

Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Tue, 22 Nov 2022 16:04:56 -0600, Justin Pryzby <pryzby@telsasoft.com> wrote in 
> On Fri, Nov 18, 2022 at 05:25:37PM +0900, Kyotaro Horiguchi wrote:
> > +        while (*p == 0 && p < pe)
> > +            p++;
> 
> The bug reported by Andres/cfbot/ubsan is here.
>
> Fixed in attached.

Ur..ou..

-        while (*p == 0 && p < pe)
+        while (p < pe && *p == 0)

It was an off-by-one error. Thanks!
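
For illustration, the corrected evaluation order can be shown in isolation.
This is a minimal standalone sketch, not the patch's code: the helper name
bytes_all_zero() is hypothetical. The point is that the bounds check comes
first in the && expression, so the byte at the end pointer is never read.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if all len bytes starting at p are zero.
 *
 * "cur < end" is evaluated before "*cur == 0". Because && short-circuits,
 * *cur is never dereferenced once cur has reached end, which is the read
 * that the address sanitizer flagged in the original loop order.
 */
static bool
bytes_all_zero(const char *p, size_t len)
{
    const char *cur = p;
    const char *end = p + len;

    while (cur < end && *cur == 0)
        cur++;

    return cur == end;
}
```

With the operands swapped, the loop would still stop at the same place, but
only after reading one byte beyond the buffer.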

> I didn't try to patch the test case to output the failing stderr, but
> that might be good.

I now use Cluster::wait_for_log(), but find_in_log() remains because
it is used to check that a message that should not be logged is
indeed absent from the log.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

Re: Make mesage at end-of-recovery less scary.

From
Alvaro Herrera
Date:
So this patch is now failing because it applies new tests to
011_crash_recovery.pl, which was removed recently.  Can you please move
them elsewhere?

I think the comment for ValidXLogRecordLength should explain what the
return value is.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
Thanks!

At Fri, 3 Feb 2023 15:16:02 +0100, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in 
> So this patch is now failing because it applies new tests to
> 011_crash_recovery.pl, which was removed recently.  Can you please move
> them elsewhere?

I couldn't find an appropriate file to move them to. In the end I
created a new file named 034_recovery.pl.  I added a test for standbys,
too (which is the primary objective of this patch).

> I think the comment for ValidXLogRecordLength should explain what the
> return value is.

Agreed.


/*
  * Validate record length of an XLOG record header.
  *
  * This is substantially a part of ValidXLogRecordHeader.  But XLogReadRecord
  * needs this separate from the function in case of a partial record header.
+ *
+ * Returns true if the xl_tot_len header field has a seemingly valid value,
+ * which means the caller can proceed reading to the following part of the
+ * record.
  */
 static bool
 ValidXLogRecordLength(XLogReaderState *state, XLogRecPtr RecPtr,

I added a similar description to ValidXLogRecordHeader.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

Re: Make mesage at end-of-recovery less scary.

From
"Gregory Stark (as CFM)"
Date:
It looks like this needs a rebase and at a quick glance it looks like
more than a trivial conflict. I'll mark it Waiting on Author. Please
update it back when it's rebased




--
Gregory Stark
As Commitfest Manager



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Mon, 6 Mar 2023 14:58:15 -0500, "Gregory Stark (as CFM)" <stark.cfm@gmail.com> wrote in 
> It looks like this needs a rebase and at a quick glance it looks like
> more than a trivial conflict. I'll mark it Waiting on Author. Please
> update it back when it's rebased

Thanks for checking it!

I think 4ac30ba4f2 is the cause; it changes a few error
messages. In addition to rebasing, I rewrote some code comments in
xlogreader.c and revised the additional test script.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

Re: Make mesage at end-of-recovery less scary.

From
Aleksander Alekseev
Date:
Hi,

> Thanks for checking it!
>
> I think 4ac30ba4f2 is the cause; it changes a few error
> messages. In addition to rebasing, I rewrote some code comments in
> xlogreader.c and revised the additional test script.

Thanks for working on this, it bugged me for a while. I noticed that
cfbot is not happy with the patch so I rebased it.
postgresql:pg_waldump test suite didn't pass after the rebase. I fixed
it too. Other than that the patch LGTM so I'm not changing its status
from "Ready for Committer".

It looks like the patch was moved between the commitfests since
2020... If there is anything that may help merging it into PG17 please
let me know.

-- 
Best regards,
Aleksander Alekseev

Attachments

Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Mon, 17 Jul 2023 15:20:30 +0300, Aleksander Alekseev <aleksander@timescale.com> wrote in 
> Thanks for working on this, it bugged me for a while. I noticed that
> cfbot is not happy with the patch so I rebased it.
> postgresql:pg_waldump test suite didn't pass after the rebase. I fixed
> it too. Other than that the patch LGTM so I'm not changing its status
> from "Ready for Committer".

Thanks for the rebasing.

> It looks like the patch was moved between the commitfests since
> 2020... If there is anything that may help merging it into PG17 please
> let me know.

This patch might simply be too much, or there might be some doubt about it.

This change basically makes a zero-length record be considered as the
normal end of WAL.

I think the most controversial point in the design is the criterion for
an error condition. The assumption is that the WAL is sound if all
bytes following a complete record, up to the next page boundary, are
zeroed out. This is slightly narrower than the original criterion,
which merely checked that the next record is zero-length.  Naturally,
there might be instances where that page has been wiped out by device
failure or for some other reason. Despite this, I believe it is
preferable to always issuing a warning (at the LOG level, though)
about potential WAL corruption.

I've adjusted the condition for muting repeated log messages at the
same LSN, changing it from ==LOG to <=WARNING. This is simply a
consequence of following the change of "real" warnings from LOG to
WARNING. I believe this is acceptable even without considering the
aforementioned change, as any single retriable (<ERROR) error at an
LSN should be sufficient to alert users about potential issues.
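
The muting rule described above can be sketched as follows. This is a
simplified illustration under stated assumptions, not the patch itself: the
function name mute_repeated_complaint(), the SketchLsn type, and the SK_*
constants are hypothetical stand-ins for the server's error levels (only
their relative order matters here), and the check on the WAL source that
the real code performs is omitted.

```c
#include <stdint.h>

/* Stand-ins for the server's elevel constants; only the ordering matters. */
enum { SK_DEBUG1 = 14, SK_LOG = 15, SK_WARNING = 19, SK_ERROR = 21 };

typedef uint64_t SketchLsn;     /* hypothetical stand-in for an LSN */

/*
 * Demote a repeated complaint at the same LSN to DEBUG1.
 *
 * Any level up to and including WARNING takes part in the muting, which
 * corresponds to changing the condition from "== LOG" to "<= WARNING".
 * Levels above WARNING are returned unchanged.
 */
static int
mute_repeated_complaint(int emode, SketchLsn lsn)
{
    static SketchLsn last_complaint = 0;

    if (emode <= SK_WARNING)
    {
        if (lsn == last_complaint)
            emode = SK_DEBUG1;      /* already reported at this LSN */
        else
            last_complaint = lsn;   /* first report at this LSN */
    }
    return emode;
}
```

In this sketch a WARNING at some LSN followed by a LOG at the same LSN
produces only the first message at its original level; the second is
demoted.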

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
Anyway, this required rebasing, which is now done.

Thanks to John (Naylor) for pointing this out.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

Re: Make mesage at end-of-recovery less scary.

From
vignesh C
Date:
On Wed, 22 Nov 2023 at 13:01, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> Anyway, this required rebasing, which is now done.

A few tests are failing at [1]; kindly post an updated patch:
/tmp/cirrus-ci-build/src/test/recovery --testgroup recovery --testname
039_end_of_wal -- /usr/local/bin/perl -I
/tmp/cirrus-ci-build/src/test/perl -I
/tmp/cirrus-ci-build/src/test/recovery
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl
[23:53:10.370] ――――――――――――――――――――――――――――――――――――― ✀
―――――――――――――――――――――――――――――――――――――
[23:53:10.370] stderr:
[23:53:10.370] # Failed test 'xl_tot_len zero'
[23:53:10.370] # at
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 267.
[23:53:10.370] # Failed test 'xlp_magic zero'
[23:53:10.370] # at
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 340.
[23:53:10.370] # Failed test 'xlp_magic zero (split record header)'
[23:53:10.370] # at
/tmp/cirrus-ci-build/src/test/recovery/t/039_end_of_wal.pl line 445.
[23:53:10.370] # Looks like you failed 3 tests of 14.
[23:53:10.370]
[23:53:10.370] (test program exited with status code 3)
[23:53:10.370] ――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――――

[1] - https://cirrus-ci.com/task/5859293157654528

Regards,
Vignesh

Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Fri, 5 Jan 2024 16:02:24 +0530, vignesh C <vignesh21@gmail.com> wrote in 
> On Wed, 22 Nov 2023 at 13:01, Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > Anyway, this required rebasing, which is now done.
> 
> A few tests are failing at [1]; kindly post an updated patch:

Thanks!

The errors occurred in a part of the tests for end-of-WAL detection
added in the master branch. These failures were primarily due to
changes in the message contents introduced by this patch. During the
revision, I discovered an issue with the handling of empty pages that
appear in the middle of reading continuation records. In the previous
version, such empty pages were mistakenly identified as indicating a
clean end-of-WAL (that is a LOG). However, they should actually be
handled as a WARNING, since the record currently being read is broken
at the empty pages. The following changes have been made in this
version:

1. Adjusting the test to align with the error message changes
  introduced by this patch.

2. Adding tests for the newly added messages.

3. Correcting the handling of empty pages encountered during the
  reading of continuation records. (XLogReaderValidatePageHeader)

4. Revising code comments.

5. Changing the term "log segment" to "WAL
  segment". (XLogReaderValidatePageHeader)
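
As a rough illustration of item 3, the decision between a clean end-of-WAL
report and a warning can be written as a small predicate. This is only a
sketch of the rule as described in this message; the ReportLevel type and
the function level_for_zeroed_page() are hypothetical and do not appear in
the patch.

```c
#include <stdbool.h>

/* Hypothetical report levels for this sketch. */
typedef enum
{
    REPORT_LOG,                 /* clean end of WAL */
    REPORT_WARNING              /* possibly broken WAL */
} ReportLevel;

/*
 * Decide how to report when reading stops at a page.
 *
 * A zero-filled rest of page counts as a clean end of WAL only when no
 * record is in progress. If a continuation record is being read, the
 * record is broken at the empty page, so a WARNING is used instead.
 */
static ReportLevel
level_for_zeroed_page(bool rest_of_page_zero, bool in_continuation)
{
    if (rest_of_page_zero && !in_continuation)
        return REPORT_LOG;
    return REPORT_WARNING;
}
```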

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments

Re: Make mesage at end-of-recovery less scary.

From
Aleksander Alekseev
Date:
Hi,

> The errors occurred in a part of the tests for end-of-WAL detection
> added in the master branch. These failures were primarily due to
> changes in the message contents introduced by this patch. During the
> revision, I discovered an issue with the handling of empty pages that
> appear in the middle of reading continuation records. In the previous
> version, such empty pages were mistakenly identified as indicating a
> clean end-of-WAL (that is a LOG). However, they should actually be
> handled as a WARNING, since the record currently being read is broken
> at the empty pages. The following changes have been made in this
> version:
>
> 1. Adjusting the test to align with the error message changes
>   introduced by this patch.
>
> 2. Adding tests for the newly added messages.
>
> 3. Correcting the handling of empty pages encountered during the
>   reading of continuation records. (XLogReaderValidatePageHeader)
>
> 4. Revising code comments.
>
> 5. Changing the term "log segment" to "WAL
>   segment". (XLogReaderValidatePageHeader)
>
> regards.

Thanks for the updated patch.

```
+        p = (char *) record;
+        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
+
+        while (p < pe && *p == 0)
+            p++;
+
+        if (p == pe)
```

Just as a random thought: perhaps we should make this a separate
function, as a part of src/port/. It seems to me that this code could
benefit from using vector instructions some day, similarly to
memcmp(), memset() etc. Surprisingly there doesn't seem to be a
standard C function for this. Alternatively one could argue that one
cycle doesn't make much code to reuse and that the C compiler will
place SIMD instructions for us. However a counter-counter argument
would be that we could use a macro or even better an inline function
and have the same effect except getting a slightly more readable code.

```
- * This is just a convenience subroutine to avoid duplicated code in
+ * This is just a convenience subroutine to avoid duplicate code in
```

This change doesn't seem to be related to the patch. Personally I
don't mind it though.

All in all I find v28 somewhat scary. It does much more than "making
one message less scary" as it was initially intended and what bugged
me personally, and accordingly touches many more places including
xlogreader.c, xlogrecovery.c, etc.

In particular, I have mixed feelings about this:

```
+            /*
+             * Consider it as end-of-WAL if all subsequent bytes of this page
+             * are zero. We don't bother checking the subsequent pages since
+             * they are not zeroed in the case of recycled segments.
+             */
```

If I understand correctly, if somehow several FS blocks end up being
zeroed (due to OS bug, bit rot, restoring from a corrupted for
whatever reason backup, hardware failures, ...) there is non-zero
chance that PG will interpret this as a normal situation. To my
knowledge this is not what we typically do - typically PG would report
an error and ask a human to figure out what happened. Of course the
possibility of such a scenario is small, but I don't think that as
DBMS developers we can ignore it.

Does anyone agree or maybe I'm making things up?

-- 
Best regards,
Aleksander Alekseev



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
Thank you for the comments.

At Fri, 12 Jan 2024 15:03:26 +0300, Aleksander Alekseev <aleksander@timescale.com> wrote in 
> ```
> +        p = (char *) record;
> +        pe = p + XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1));
> +
> +        while (p < pe && *p == 0)
> +            p++;
> +
> +        if (p == pe)
> ```
> 
> Just as a random thought: perhaps we should make this a separate
> function, as a part of src/port/. It seems to me that this code could
> benefit from using vector instructions some day, similarly to
> memcmp(), memset() etc. Surprisingly there doesn't seem to be a
> standard C function for this. Alternatively one could argue that one
> cycle doesn't make much code to reuse and that the C compiler will
> place SIMD instructions for us. However a counter-counter argument
> would be that we could use a macro or even better an inline function
> and have the same effect except getting a slightly more readable code.

Creating a function with a name like memcmp_byte() should be
straightforward, but implementing it with SIMD right away seems a bit
challenging. Similar operations are already being performed elsewhere
in the code, probably within the stats collector, where memcmp is used
with a statically allocated area that's filled with zeros. If we can
achieve a performance equivalent to memcmp with this new function,
then it definitely seems worth pursuing.
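
The memcmp-against-a-zero-area idiom mentioned above could look roughly
like this when applied to the tail of a WAL page. This is a minimal sketch,
assuming a power-of-two page size; SKETCH_BLCKSZ, zero_page,
bytes_to_page_end() and rest_of_page_is_zero() are hypothetical names, and
no claim is made about how this compares in speed with the byte loop.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ 8192      /* assumed page size; must be a power of two */

/* Objects with static storage duration are zero-initialized. */
static const char zero_page[SKETCH_BLCKSZ];

/* Number of bytes from position rec_ptr to the end of its page. */
static size_t
bytes_to_page_end(uint64_t rec_ptr)
{
    return SKETCH_BLCKSZ - (size_t) (rec_ptr & (SKETCH_BLCKSZ - 1));
}

/*
 * True if everything from rec to the end of its page is zero.
 * rec must point to at least bytes_to_page_end(rec_ptr) readable bytes.
 */
static bool
rest_of_page_is_zero(const char *rec, uint64_t rec_ptr)
{
    return memcmp(rec, zero_page, bytes_to_page_end(rec_ptr)) == 0;
}
```

The length computation is the same expression as in the quoted hunk,
XLOG_BLCKSZ - (RecPtr & (XLOG_BLCKSZ - 1)), with the page size replaced by
the sketch's own constant.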

> ```
> - * This is just a convenience subroutine to avoid duplicated code in
> + * This is just a convenience subroutine to avoid duplicate code in
> ```
> 
> This change doesn't seem to be related to the patch. Personally I
> don't mind it though.

Ah, I'm sorry. That was something I mistakenly thought I had written
at the last moment and made modifications to.

> All in all I find v28 somewhat scary. It does much more than "making
> one message less scary" as it was initially intended and what bugged
> me personally, and accordingly touches many more places including
> xlogreader.c, xlogrecovery.c, etc.
> 
> In particular, I have mixed feelings about this:
> 
> ```
> +            /*
> +             * Consider it as end-of-WAL if all subsequent bytes of this page
> +             * are zero. We don't bother checking the subsequent pages since
> +             * they are not zeroed in the case of recycled segments.
> +             */
> ```
> 
> If I understand correctly, if somehow several FS blocks end up being
> zeroed (due to OS bug, bit rot, restoring from a corrupted for
> whatever reason backup, hardware failures, ...) there is non-zero
> chance that PG will interpret this as a normal situation. To my
> knowledge this is not what we typically do - typically PG would report
> an error and ask a human to figure out what happened. Of course the
> possibility of such a scenario is small, but I don't think that as
> DBMS developers we can ignore it.

For now, let me explain the basis for this patch. The fundamental
issue is that these warnings that always appear are, in practice, not
a problem in almost all cases. Some of those who encounter them for
the first time may feel uneasy and reach out with inquiries. On the
other hand, those familiar with these warnings tend to ignore them and
only pay attention to details when actual issues arise. Therefore, the
intention of this patch is to label them as "no issue" unless a
problem is blatantly evident, in order to prevent unnecessary concern.

> Does anyone agree or maybe I'm making things up?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Make mesage at end-of-recovery less scary.

From
Aleksander Alekseev
Date:
Hi,

> > If I understand correctly, if somehow several FS blocks end up being
> > zeroed (due to OS bug, bit rot, restoring from a corrupted for
> > whatever reason backup, hardware failures, ...) there is non-zero
> > chance that PG will interpret this as a normal situation. To my
> > knowledge this is not what we typically do - typically PG would report
> > an error and ask a human to figure out what happened. Of course the
> > possibility of such a scenario is small, but I don't think that as
> > DBMS developers we can ignore it.
>
> For now, let me explain the basis for this patch. The fundamental
> issue is that these warnings that always appear are, in practice, not
> a problem in almost all cases. Some of those who encounter them for
> the first time may feel uneasy and reach out with inquiries. On the
> other hand, those familiar with these warnings tend to ignore them and
> only pay attention to details when actual issues arise. Therefore, the
> intention of this patch is to label them as "no issue" unless a
> problem is blatantly evident, in order to prevent unnecessary concern.

I agree and don't mind affecting the error message per se.

However I see that the actual logic of how WAL is processed is being
changed. If we do this, at very least it requires thorough thinking. I
strongly suspect that the proposed code is wrong and/or not safe
and/or less safe than it is now for the reasons named above.

-- 
Best regards,
Aleksander Alekseev



Re: Make mesage at end-of-recovery less scary.

From
Michael Paquier
Date:
On Tue, Jan 16, 2024 at 02:46:02PM +0300, Aleksander Alekseev wrote:
>> For now, let me explain the basis for this patch. The fundamental
>> issue is that these warnings that always appear are, in practice, not
>> a problem in almost all cases. Some of those who encounter them for
>> the first time may feel uneasy and reach out with inquiries. On the
>> other hand, those familiar with these warnings tend to ignore them and
>> only pay attention to details when actual issues arise. Therefore, the
>> intention of this patch is to label them as "no issue" unless a
>> problem is blatantly evident, in order to prevent unnecessary concern.
>
> I agree and don't mind affecting the error message per se.
>
> However I see that the actual logic of how WAL is processed is being
> changed. If we do this, at very least it requires thorough thinking. I
> strongly suspect that the proposed code is wrong and/or not safe
> and/or less safe than it is now for the reasons named above.

FWIW, that pretty much sums up my feeling regarding this patch,
because an error, basically any error, would hurt back very badly.
Sure, the error messages we generate now when reaching the end of WAL
can sound scary, and they are (I suspect that's not really the case
for anybody who has history doing support with PostgreSQL because a
bunch of these messages are old enough to vote, but I can understand
that anybody would freak out the first time they see that).

However, per the recent issues we've had in this area, like
cd7f19da3468 but I'm more thinking about 6b18b3fe2c2f and
bae868caf222, I am of the opinion that the header validation, the
empty page case in XLogReaderValidatePageHeader() and the record read
changes are risky enough that I am not convinced that the gains are
worth the risks taken.

The error stack in the WAL reader is complicated enough that making it
more complicated as the patch proposes does not sound like a good
tradeoff to me to make the reports related to the end of WAL cleaner
for the end-user.  I agree that we should do something, but the patch
does not seem like a good step towards this goal.  Perhaps somebody
would be more excited about this proposal than I am, of course.
--
Michael

Attachments

Re: Make mesage at end-of-recovery less scary.

From
Peter Smith
Date:
2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

======
[1] https://commitfest.postgresql.org/46/2490/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/2490

Kind Regards,
Peter Smith.



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Mon, 22 Jan 2024 16:09:28 +1100, Peter Smith <smithpb2250@gmail.com> wrote in 
> 2024-01 Commitfest.
> 
> Hi, This patch has a CF status of "Needs Review" [1], but it seems
> there were CFbot test failures last time it was run [2]. Please have a
> look and post an updated version if necessary.
> 
> ======
> [1] https://commitfest.postgresql.org/46/2490/
> [2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/2490

Thanks for noticing that. I will repost a new version.
regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Make mesage at end-of-recovery less scary.

From
Kyotaro Horiguchi
Date:
At Wed, 17 Jan 2024 14:32:00 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Tue, Jan 16, 2024 at 02:46:02PM +0300, Aleksander Alekseev wrote:
> >> For now, let me explain the basis for this patch. The fundamental
> >> issue is that these warnings that always appear are, in practice, not
> >> a problem in almost all cases. Some of those who encounter them for
> >> the first time may feel uneasy and reach out with inquiries. On the
> >> other hand, those familiar with these warnings tend to ignore them and
> >> only pay attention to details when actual issues arise. Therefore, the
> >> intention of this patch is to label them as "no issue" unless a
> >> problem is blatantly evident, in order to prevent unnecessary concern.
> > 
> > I agree and don't mind affecting the error message per se.
> > 
> > However I see that the actual logic of how WAL is processed is being
> > changed. If we do this, at very least it requires thorough thinking. I
> > strongly suspect that the proposed code is wrong and/or not safe
> > and/or less safe than it is now for the reasons named above.
> 
> FWIW, that pretty much sums up my feeling regarding this patch,
> because an error, basically any error, would hurt back very badly.
> Sure, the error messages we generate now when reaching the end of WAL
> can sound scary, and they are (I suspect that's not really the case
> for anybody who has history doing support with PostgreSQL because a
> bunch of these messages are old enough to vote, but I can understand
> that anybody would freak out the first time they see that).
> 
> However, per the recent issues we've had in this area, like
> cd7f19da3468 but I'm more thinking about 6b18b3fe2c2f and
> bae868caf222, I am of the opinion that the header validation, the
> empty page case in XLogReaderValidatePageHeader() and the record read
> changes are risky enough that I am not convinced that the gains are
> worth the risks taken.
> 
> The error stack in the WAL reader is complicated enough that making it
> more complicated as the patch proposes does not sound like a good
> tradeoff to me to make the reports related to the end of WAL cleaner
> for the end-user.  I agree that we should do something, but the patch
> does not seem like a good step towards this goal.  Perhaps somebody
> would be more excited about this proposal than I am, of course.

Thank you both for the comments. The criticism seems valid. The
approach to identifying the end-of-WAL state in this patch is quite
heuristic, and its validity or safety can certainly be contested. On
the other hand, if we seek perfection in this area of judgment, we may
need to make the WAL format itself more robust. In any case, since the
majority of the feedback on this patch seems to be negative, I am
going to withdraw it if no supportive opinions emerge during this
commit-fest.

The attached patch addresses the errors reported by CF-bot.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments