Обсуждение: [Bug fix]There is the case archive_timeout parameter is ignored afterrecovery works.

Поиск
Список
Период
Сортировка

[Bug fix]There is the case archive_timeout parameter is ignored afterrecovery works.

От
"higuchi.daisuke@fujitsu.com"
Дата:
Hi,

I found the bug about archive_timeout parameter.
There is the case archive_timeout parameter is ignored after recovery works.

[Problem]
When the value of archive_timeout is smaller than that of checkpoint_timeout and recovery works, archive_timeout is
ignoredin the first WAL archiving. 
Once WAL is archived, the archive_timeout seems to be valid after that.

I attached the simple script for reproducing this problem on version 12.
I also confirmed that PostgreSQL10, 11 and 12. I think other supported versions have this problem.

[Investigation]
In the CheckpointerMain(), calculate the time (cur_timeout) to wait on WaitLatch.

-----------------------------------------------------------------
now = (pg_time_t) time(NULL);
elapsed_secs = now - last_checkpoint_time;
if (elapsed_secs >= CheckPointTimeout)
    continue;           /* no sleep for us ... */
cur_timeout = CheckPointTimeout - elapsed_secs;
if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
{
    elapsed_secs = now - last_xlog_switch_time;
    if (elapsed_secs >= XLogArchiveTimeout)
        continue;       /* no sleep for us ... */
    cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
}

(void) WaitLatch(MyLatch,
                 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                 cur_timeout * 1000L /* convert to ms */ ,
                 WAIT_EVENT_CHECKPOINTER_MAIN);
-----------------------------------------------------------------

Currently, cur_timeout is set according to only checkpoint_timeout when it is during recovery.
Even during recovery, the cur_timeout should be calculated including archive_timeout as well as checkpoint_timeout, I
think.
I attached the patch to solve this problem.

Regards,
Daisuke, Higuchi

Вложения

Re: [Bug fix]There is the case archive_timeout parameter isignored after recovery works.

От
Kyotaro Horiguchi
Дата:
Hello.

At Mon, 29 Jun 2020 04:35:11 +0000, "higuchi.daisuke@fujitsu.com" <higuchi.daisuke@fujitsu.com> wrote in 
> Hi,
> 
> I found the bug about archive_timeout parameter.
> There is the case archive_timeout parameter is ignored after recovery works.
...
> [Problem]
> When the value of archive_timeout is smaller than that of checkpoint_timeout and recovery works, archive_timeout is
ignoredin the first WAL archiving.
 
> Once WAL is archived, the archive_timeout seems to be valid after that.
...
> Currently, cur_timeout is set according to only checkpoint_timeout when it is during recovery.
> Even during recovery, the cur_timeout should be calculated including archive_timeout as well as checkpoint_timeout, I
think.
> I attached the patch to solve this problem.

Unfortunately the diff command in your test script doesn't show me
anything, but I can understand what you are thinking is a problem,
maybe.  But the patch doesn't seem the fix for the issue.

Archiving works irrelevantly from that parameter. Completed WAL
segments are immediately marked as ".ready" and archiver does its task
immediately independently from checkpointer. The parameter, as
described in documentation, forces the server to switch to a new WAL
segment file periodically so that it can be archived, that is, it
works only on primary.  On the other hand on standby, a WAL segment is
not marked as ".ready" until any data for the *next* segment comes. So
the patch is not the fix for the issue.

If primary switched segment and archived it but standby didn't archive
the same immediately, you could force that by writing something on the
master.

Anyway, the attached patch would resolve your problem.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From fec10780e132c1d284c66355c5215c284c16204d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 29 Jun 2020 16:12:01 +0900
Subject: [PATCH] Let complete segment archived immediately on standy

walreceiver marks a completed segment as ".ready" after any data for
the next segment comes. So standby can archive a WAL segment later
than the primary archives the same segment.  Let walreceiver archive a
segment as soon as it is completed.
---
 src/backend/replication/walreceiver.c | 77 +++++++++++++++------------
 1 file changed, 42 insertions(+), 35 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d1ad75da87..06c1e3cbe4 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -902,44 +902,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
     {
         int            segbytes;
 
-        if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+        if (recvFile < 0)
         {
             bool        use_existent;
 
-            /*
-             * fsync() and close current file before we switch to next one. We
-             * would otherwise have to reopen this file to fsync it later
-             */
-            if (recvFile >= 0)
-            {
-                char        xlogfname[MAXFNAMELEN];
-
-                XLogWalRcvFlush(false);
-
-                XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-                /*
-                 * XLOG segment files will be re-read by recovery in startup
-                 * process soon, so we don't advise the OS to release cache
-                 * pages associated with the file like XLogFileClose() does.
-                 */
-                if (close(recvFile) != 0)
-                    ereport(PANIC,
-                            (errcode_for_file_access(),
-                             errmsg("could not close log segment %s: %m",
-                                    xlogfname)));
-
-                /*
-                 * Create .done file forcibly to prevent the streamed segment
-                 * from being archived later.
-                 */
-                if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-                    XLogArchiveForceDone(xlogfname);
-                else
-                    XLogArchiveNotify(xlogfname);
-            }
-            recvFile = -1;
-
             /* Create/use new log file */
             XLByteToSeg(recptr, recvSegNo, wal_segment_size);
             use_existent = true;
@@ -985,6 +951,47 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
         buf += byteswritten;
 
         LogstreamResult.Write = recptr;
+
+        /*
+         * Close the current WAL segment if it is completed then let the file
+         * be archived if needed.
+         */
+        if (!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+        {
+            char        xlogfname[MAXFNAMELEN];
+
+            Assert (recvFile >= 0);
+
+            /*
+             * fsync() and close current file before we switch to next one. We
+             * would otherwise have to reopen this file to fsync it later
+             */
+            XLogWalRcvFlush(false);
+
+            XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+            /*
+             * XLOG segment files will be re-read by recovery in startup
+             * process soon, so we don't advise the OS to release cache
+             * pages associated with the file like XLogFileClose() does.
+             */
+            if (close(recvFile) != 0)
+                ereport(PANIC,
+                        (errcode_for_file_access(),
+                         errmsg("could not close log segment %s: %m",
+                                xlogfname)));
+
+            /*
+             * Create .done file forcibly to prevent the streamed segment
+             * from being archived later.
+             */
+            if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+                XLogArchiveForceDone(xlogfname);
+            else
+                XLogArchiveNotify(xlogfname);
+
+            recvFile = -1;
+        }
     }
 
     /* Update shared-memory status */
-- 
2.18.4


Re: [Bug fix]There is the case archive_timeout parameter isignored after recovery works.

От
Kyotaro Horiguchi
Дата:
At Mon, 29 Jun 2020 16:41:11 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> Anyway, the attached patch would resolve your problem.

I found another issue related to my last patch.

For the current master (and older versions) if walreceiver is signaled
to exit just after a segment is completed, walreceiver exits without
marking the last segment as ".ready".  After restart, it doesn't
remember that it didn't notified the last segment and the segment is
missing in archive. I think this is really a bug.

With the patch, that failure won't happen.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [Bug fix]There is the case archive_timeout parameter is ignoredafter recovery works.

От
Fujii Masao
Дата:

On 2020/06/29 16:41, Kyotaro Horiguchi wrote:
> Hello.
> 
> At Mon, 29 Jun 2020 04:35:11 +0000, "higuchi.daisuke@fujitsu.com" <higuchi.daisuke@fujitsu.com> wrote in
>> Hi,
>>
>> I found the bug about archive_timeout parameter.
>> There is the case archive_timeout parameter is ignored after recovery works.
> ...
>> [Problem]
>> When the value of archive_timeout is smaller than that of checkpoint_timeout and recovery works, archive_timeout is
ignoredin the first WAL archiving.
 
>> Once WAL is archived, the archive_timeout seems to be valid after that.
> ...
>> Currently, cur_timeout is set according to only checkpoint_timeout when it is during recovery.
>> Even during recovery, the cur_timeout should be calculated including archive_timeout as well as checkpoint_timeout,
Ithink.
 
>> I attached the patch to solve this problem.
> 
> Unfortunately the diff command in your test script doesn't show me
> anything, but I can understand what you are thinking is a problem,
> maybe.  But the patch doesn't seem the fix for the issue.
> 
> Archiving works irrelevantly from that parameter. Completed WAL
> segments are immediately marked as ".ready" and archiver does its task
> immediately independently from checkpointer. The parameter, as
> described in documentation, forces the server to switch to a new WAL
> segment file periodically so that it can be archived, that is, it
> works only on primary.  On the other hand on standby, a WAL segment is
> not marked as ".ready" until any data for the *next* segment comes. So
> the patch is not the fix for the issue.

The problems that you're describing and Daisuke-san reported are really
the same? The reported problem seems that checkpointer can sleep on
the latch for more than archive_timeout just after recovery and cannot
switch WAL files promptly even if necessary.

The cause of this problem is that the checkpointer's sleep time is calculated
from both checkpoint_timeout and archive_timeout during normal running,
but calculated only from checkpoint_timeout during recovery. So Daisuke-san's
patch tries to change that so that it's calculated from both of them even
during recovery. No?

-               if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
+               if (XLogArchiveTimeout > 0)
                 {
                         elapsed_secs = now - last_xlog_switch_time;
-                       if (elapsed_secs >= XLogArchiveTimeout)
+                       if (elapsed_secs >= XLogArchiveTimeout && !RecoveryInProgress())
                                 continue;               /* no sleep for us ... */
                         cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);

last_xlog_switch_time is not updated during recovery. So "elapsed_secs" can be
large and cur_timeout can be negative. Isn't this problematic?

As another approach, what about waking the checkpointer up at the end of
recovery like we already do for walsenders?

Regards,


-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Bug fix]There is the case archive_timeout parameter is ignoredafter recovery works.

От
"higuchi.daisuke@fujitsu.com"
Дата:
Thank you for comments.

>Unfortunately the diff command in your test script doesn't show me
>anything, but I can understand what you are thinking is a problem,
>maybe.

I'm sorry but I might have confused you... I explain how to use my test script.
I use diff command to check if the archiver has started. diff command does not output nothing to stdout.
So, please see the time displayed by the two date command by output of my test script.
I think you can confirm that the difference between the results of date commands is not the archive_timeout setting of
10seconds. 
If my test script runs for a few minutes, it means that my problem is reproduced.

>immediately independently from checkpointer. The parameter, as
>described in documentation, forces the server to switch to a new WAL
>segment file periodically so that it can be archived, that is, it
>works only on primary.

I confirm that this problem is occurred in non-replication environment.
The problem occurs when database try to archive WAL during or after archive recovery.
So your patch may be good to solve another problem, but unfortunately it didn't fix my problem.

Regards,
Daisuke, Higuchi




RE: [Bug fix]There is the case archive_timeout parameter is ignoredafter recovery works.

От
"higuchi.daisuke@fujitsu.com"
Дата:
Fujii-san, thank you for comments.

>The cause of this problem is that the checkpointer's sleep time is calculated
>from both checkpoint_timeout and archive_timeout during normal running,
>but calculated only from checkpoint_timeout during recovery. So Daisuke-san's
>patch tries to change that so that it's calculated from both of them even
>during recovery. No?

Yes, it's exactly so.

>last_xlog_switch_time is not updated during recovery. So "elapsed_secs" can be
>large and cur_timeout can be negative. Isn't this problematic?

Yes... My patch was missing this.
How about using the original archive_timeout value for calculating cur_timeout during recovery?

                if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
                {
                        elapsed_secs = now - last_xlog_switch_time;
                        if (elapsed_secs >= XLogArchiveTimeout)
                                continue;               /* no sleep for us ... */
                        cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
                }
+               else if (XLogArchiveTimeout > 0)
+                       cur_timeout = Min(cur_timeout, XLogArchiveTimeout);

During recovery, accurate cur_timeout is not calculated because elapsed_secs is not used.
However, after recovery is complete, WAL archiving will start by the next archive_timeout is reached.
I felt it is enough to solve this problem.

>As another approach, what about waking the checkpointer up at the end of
>recovery like we already do for walsenders?

If the above solution is not good, I will consider this approach.

Regards,
Daisuke, Higuchi



Re: [Bug fix]There is the case archive_timeout parameter is ignored after recovery works.

От
Kyotaro Horiguchi
Дата:
Opps! I misunderstood that.

At Mon, 29 Jun 2020 13:00:25 +0000, "higuchi.daisuke@fujitsu.com" <higuchi.daisuke@fujitsu.com> wrote in 
> Fujii-san, thank you for comments.
> 
> >The cause of this problem is that the checkpointer's sleep time is calculated
> >from both checkpoint_timeout and archive_timeout during normal running,
> >but calculated only from checkpoint_timeout during recovery. So Daisuke-san's
> >patch tries to change that so that it's calculated from both of them even
> >during recovery. No?
> 
> Yes, it's exactly so.
> 
> >last_xlog_switch_time is not updated during recovery. So "elapsed_secs" can be
> >large and cur_timeout can be negative. Isn't this problematic?
> 
> Yes... My patch was missing this.

The patch also makes WaitLatch called with zero timeout, which causes
assertion failure.

> How about using the original archive_timeout value for calculating cur_timeout during recovery?
> 
>                 if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
>                 {
>                         elapsed_secs = now - last_xlog_switch_time;
>                         if (elapsed_secs >= XLogArchiveTimeout)
>                                 continue;               /* no sleep for us ... */
>                         cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
>                 }
> +               else if (XLogArchiveTimeout > 0)
> +                       cur_timeout = Min(cur_timeout, XLogArchiveTimeout);
> 
> During recovery, accurate cur_timeout is not calculated because elapsed_secs is not used.
> However, after recovery is complete, WAL archiving will start by the next archive_timeout is reached.
> I felt it is enough to solve this problem.

That causes unwanted change of cur_timeout during recovery.

> >As another approach, what about waking the checkpointer up at the end of
> >recovery like we already do for walsenders?

We don't want change checkpoint interval during recovery, that means
we cannot cnosider archive_timeout at the fist checkpointer after
recovery ends. So I think that the suggestion from Fujii-san is the
direction.

> If the above solution is not good, I will consider this approach.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [Bug fix]There is the case archive_timeout parameter is ignored after recovery works.

От
Fujii Masao
Дата:

On 2020/06/30 9:14, Kyotaro Horiguchi wrote:
> Opps! I misunderstood that.
> 
> At Mon, 29 Jun 2020 13:00:25 +0000, "higuchi.daisuke@fujitsu.com" <higuchi.daisuke@fujitsu.com> wrote in
>> Fujii-san, thank you for comments.
>>
>>> The cause of this problem is that the checkpointer's sleep time is calculated
>> >from both checkpoint_timeout and archive_timeout during normal running,
>>> but calculated only from checkpoint_timeout during recovery. So Daisuke-san's
>>> patch tries to change that so that it's calculated from both of them even
>>> during recovery. No?
>>
>> Yes, it's exactly so.
>>
>>> last_xlog_switch_time is not updated during recovery. So "elapsed_secs" can be
>>> large and cur_timeout can be negative. Isn't this problematic?
>>
>> Yes... My patch was missing this.
> 
> The patch also makes WaitLatch called with zero timeout, which causes
> assertion failure.
> 
>> How about using the original archive_timeout value for calculating cur_timeout during recovery?
>>
>>                  if (XLogArchiveTimeout > 0 && !RecoveryInProgress())
>>                  {
>>                          elapsed_secs = now - last_xlog_switch_time;
>>                          if (elapsed_secs >= XLogArchiveTimeout)
>>                                  continue;               /* no sleep for us ... */
>>                          cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
>>                  }
>> +               else if (XLogArchiveTimeout > 0)
>> +                       cur_timeout = Min(cur_timeout, XLogArchiveTimeout);
>>
>> During recovery, accurate cur_timeout is not calculated because elapsed_secs is not used.

Yes, that's an idea. But I'm a bit concerned about that this change makes
checkpointer wake up more frequently than necessary during recovery.
Which may increase the idle power consumption of checkpointer during
recovery. Of course, this would not be so problematic in practice because
we can expect that archive_timeout is not so small. But it seems better to
avoid unncessary wake-ups if we can easily do that.

>> However, after recovery is complete, WAL archiving will start by the next archive_timeout is reached.
>> I felt it is enough to solve this problem.
> 
> That causes unwanted change of cur_timeout during recovery.
> 
>>> As another approach, what about waking the checkpointer up at the end of
>>> recovery like we already do for walsenders?
> 
> We don't want change checkpoint interval during recovery, that means
> we cannot cnosider archive_timeout at the fist checkpointer after
> recovery ends. So I think that the suggestion from Fujii-san is the
> direction.

+1
If this idea has some problems, we can revisit Daisuke-san's idea.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



RE: [Bug fix]There is the case archive_timeout parameter is ignored after recovery works.

От
"higuchi.daisuke@fujitsu.com"
Дата:
>> We don't want change checkpoint interval during recovery, that means
>> we cannot cnosider archive_timeout at the fist checkpointer after
>> recovery ends. So I think that the suggestion from Fujii-san is the
>> direction.
>+1
>If this idea has some problems, we can revisit Daisuke-san's idea.

Thanks for your comments. 
Ok, I will work on the fix to wake the checkpointer up at the end of recovery.

Regards,
Daisuke, Higuchi