Обсуждение: Fix checkpoint skip logic on idle systems by tracking LSN progress

Поиск
Список
Период
Сортировка

Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
Hi all,

A couple of months back is has been reported to pgsql-bugs that WAL
segments were always switched with a low value of archive_timeout even
if a system is completely idle:
http://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
In short, a closer look at the problem has showed up that the logic in
charge of checking if a checkpoint should be skipped or not is
currently broken, because it completely ignores standby snapshots in
its calculation of the WAL activity. So a checkpoint always occurs
after checkpoint_timeout on an idle system since hot_standby has been
introduced as wal_level. This did not get better from 9.4, since
standby snapshots are logged every 15s by the background writer
process. In 9.6, since wal_level = 'archive' and 'hot_standby'
actually has the same meaning, the skip logic that worked with
wal_level = 'archive' does not do its job anymore.

One solution that has been discussed is to track the progress of WAL
activity when doing record insertion by being able to mark some
records as not updating the progress of WAL. Standby snapshot records
enter in this category, making the checkpoint skip logic more robust.

Attached is a patch implementing a solution for it, by adding in
WALInsertLock a new field that gets updated for each record to track
the LSN progress. This allows to reliably skip the generation of
standby snapshots in the bgwriter or checkpoints on an idle system.
Per discussion with Andres at PGcon, we decided that this is an
optimization, only for 9.7~ because this has been broken for a long
time. I have also changed XLogIncludeOrigin() to use a more generic
routine to set of status flags for a record being inserted:
XLogSetFlags(). This routine can use two flags:
- INCLUDE_ORIGIN to decide if the origin should be logged or not
- NO_PROGRESS to decide at insertion if a record should update the LSN
progress or not.
Andres mentioned me that we'd want to have something similar to
XLogIncludeOrigin, but while hacking I noticed that grouping both
things under the same umbrella made more sense.

I am adding that to the commit fest of September.

Regards,
--
Michael

Вложения

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Thu, May 19, 2016 at 6:57 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> I am adding that to the commit fest of September.

And a lot of activity has happened here since. Attached are refreshed
patches based on da6c4f6. v11 still applies correctly but it's always
better to avoid hunks when applying them.
--
Michael

Вложения

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
Hi,

I apologize in advance that the comments in this message might
one of the ideas discarded in the past thread.. I might not grasp
the discussion completely X(

The attached patches are rebased to the master and additional one
mentioned below.

At Wed, 18 May 2016 17:57:49 -0400, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com>

> A couple of months back is has been reported to pgsql-bugs that WAL
> segments were always switched with a low value of archive_timeout even
> if a system is completely idle:
> http://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
> In short, a closer look at the problem has showed up that the logic in
> charge of checking if a checkpoint should be skipped or not is
> currently broken, because it completely ignores standby snapshots in
> its calculation of the WAL activity. So a checkpoint always occurs
> after checkpoint_timeout on an idle system since hot_standby has been
> introduced as wal_level. This did not get better from 9.4, since
> standby snapshots are logged every 15s by the background writer
> process. In 9.6, since wal_level = 'archive' and 'hot_standby'
> actually has the same meaning, the skip logic that worked with
> wal_level = 'archive' does not do its job anymore.
> 
> One solution that has been discussed is to track the progress of WAL
> activity when doing record insertion by being able to mark some
> records as not updating the progress of WAL. Standby snapshot records
> enter in this category, making the checkpoint skip logic more robust.
> Attached is a patch implementing a solution for it, by adding in

If I understand the old thread correctly, this still doesn't
solve the main issue, excessive checkpoint and segment
switching. The reason is the interaction between XLOG_SWITCH and
checkpoint as mentioned there. I think we may treat XLOG_SWITCH
as NO_PROGRESS, since we can recover to the lastest state without
it. If it is not wrong, the second patch attached (v12-2) inserts
XLOG_SWITCH as NO_PROGRESS and skips segment switching when no
progress took place for the round.

> WALInsertLock a new field that gets updated for each record to track
> the LSN progress. This allows to reliably skip the generation of
> standby snapshots in the bgwriter or checkpoints on an idle system.

WALInsertLock doesn't seem to me to be a good place for
progressAt even considering the discussion about adding few
instructions (containing a branch) in the
hot-section. BackgroundWriterMain blocks all running
XLogInsertRecord every 200 ms, not 15 or 30 seconds (only for
replica, though). If this is correct, the Amit's suggestion below
will have more significance, and I rather agree with it. XLogCtl
is more suitable place for progressAt for the case.

https://www.postgresql.org/message-id/CAA4eK1LB9HDq+F8Lw8bGRQx6Sw42XaikX1UQ2DZk+AuEGbfjWA@mail.gmail.com
Amit> Taking and releasing locks again and again (which is done in patch)
Amit> matters much more than adding few instructions under it and I think
Amit> if you would have written the code in-a-way as in patch in some of
Amit> the critical path, it would have been regressed badly, but because
Amit> checkpoint doesn't happen often, reproducing regression is difficult.


https://www.postgresql.org/message-id/CAB7nPqSO6HVJ0T6LUT84PCy+_ihitdt64Ze2D+SJrHZy72Y0wg@mail.gmail.com
> > Also, I think it is worth to once take the performance data for
> > write tests (using pgbench 30 minute run or some other way) with
> > minimum checkpoint_timeout (i.e 30s) to see if the additional locking
> > has any impact on performance.  I think taking locks at intervals
> > of 15s or 30s should not matter much, but it is better to be safe.
> 
> I don't think performance will be impacted, but there are no reasons
> to not do any measurements either. I'll try to get some graphs
> tomorrow with runs on my laptop, mainly looking for any effects of
> this patch on TPS when checkpoints show up.

I don't think the impact is measurable on a laptop, where 2 to 4
cores available at most.

> Per discussion with Andres at PGcon, we decided that this is an
> optimization, only for 9.7~ because this has been broken for a long
> time. I have also changed XLogIncludeOrigin() to use a more generic
> routine to set of status flags for a record being inserted:
> XLogSetFlags(). This routine can use two flags:
> - INCLUDE_ORIGIN to decide if the origin should be logged or not
> - NO_PROGRESS to decide at insertion if a record should update the LSN
> progress or not.
> Andres mentioned me that we'd want to have something similar to
> XLogIncludeOrigin, but while hacking I noticed that grouping both
> things under the same umbrella made more sense.

This looks reasonable.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
David Steele
Дата:
On 9/27/16 6:16 AM, Kyotaro HORIGUCHI wrote:

> I apologize in advance that the comments in this message might
> one of the ideas discarded in the past thread.. I might not grasp
> the discussion completely X(
>
> The attached patches are rebased to the master and additional one
> mentioned below.

I tried the attached patch set and noticed an interesting behavior. 
With archive_timeout=5 whenever I made a change I would get a WAL 
segment within a few seconds as expected then another one would follow a 
few minutes later.

Database init:
16M Sep 27 20:05 000000010000000000000001
16M Sep 27 20:09 000000010000000000000002

Create test table:
16M Sep 27 20:13 000000010000000000000003
16M Sep 27 20:15 000000010000000000000004

Insert row into test table:
16M Sep 27 20:46 000000010000000000000005
16M Sep 27 20:49 000000010000000000000006

The cluster was completely idle with no sessions connected in between 
those three commands.  Is it possible this is caused by:

+         * switch segment only when any substantial progress have made from
+         * the last segment switching by timeout. Segment switching by other
+         * reasons will cause last_xlog_switch_lsn stay behind but it doesn't
+         * matter since it just causes possible one excessive segment switch.          */

I would like to give Michael a chance to respond to the updated patches 
before delving deeper.

-- 
-David
david@pgmasters.net



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Tue, Sep 27, 2016 at 7:16 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I apologize in advance that the comments in this message might
> one of the ideas discarded in the past thread.. I might not grasp
> the discussion completely X(

No problem.

> At Wed, 18 May 2016 17:57:49 -0400, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQcPqxEM3S735Bd2RzApNqSNJVietAC=6kfkYv_45dKwA@mail.gmail.com>
>
>> A couple of months back is has been reported to pgsql-bugs that WAL
>> segments were always switched with a low value of archive_timeout even
>> if a system is completely idle:
>> http://www.postgresql.org/message-id/20151016203031.3019.72930@wrigleys.postgresql.org
>> In short, a closer look at the problem has showed up that the logic in
>> charge of checking if a checkpoint should be skipped or not is
>> currently broken, because it completely ignores standby snapshots in
>> its calculation of the WAL activity. So a checkpoint always occurs
>> after checkpoint_timeout on an idle system since hot_standby has been
>> introduced as wal_level. This did not get better from 9.4, since
>> standby snapshots are logged every 15s by the background writer
>> process. In 9.6, since wal_level = 'archive' and 'hot_standby'
>> actually has the same meaning, the skip logic that worked with
>> wal_level = 'archive' does not do its job anymore.
>>
>> One solution that has been discussed is to track the progress of WAL
>> activity when doing record insertion by being able to mark some
>> records as not updating the progress of WAL. Standby snapshot records
>> enter in this category, making the checkpoint skip logic more robust.
>> Attached is a patch implementing a solution for it, by adding in
>
> If I understand the old thread correctly, this still doesn't
> solve the main issue, excessive checkpoint and segment
> switching. The reason is the interaction between XLOG_SWITCH and
> checkpoint as mentioned there. I think we may treat XLOG_SWITCH
> as NO_PROGRESS, since we can recover to the lastest state without
> it. If it is not wrong, the second patch attached (v12-2) inserts
> XLOG_SWITCH as NO_PROGRESS and skips segment switching when no
> progress took place for the round.

Possibly. That's a second problem I did not want to tackle now. I was
going to study that more precisely after this set of patches gets
done. There is already enough complication in them, and they solve a
large portion of the problem.

>> WALInsertLock a new field that gets updated for each record to track
>> the LSN progress. This allows to reliably skip the generation of
>> standby snapshots in the bgwriter or checkpoints on an idle system.
>
> WALInsertLock doesn't seem to me to be a good place for
> progressAt even considering the discussion about adding few
> instructions (containing a branch) in the
> hot-section. BackgroundWriterMain blocks all running
> XLogInsertRecord every 200 ms, not 15 or 30 seconds (only for
> replica, though). If this is correct, the Amit's suggestion below
> will have more significance, and I rather agree with it. XLogCtl
> is more suitable place for progressAt for the case.

Based on my past look at the problem and memories, having a variable
in WALInsertLock allows use to not have to touch the hottest spinlock
code path in WAL insertion and PG: ReserveXLogInsertLocation(). I'd
rather still avoid that.

>> > Also, I think it is worth to once take the performance data for
>> > write tests (using pgbench 30 minute run or some other way) with
>> > minimum checkpoint_timeout (i.e 30s) to see if the additional locking
>> > has any impact on performance.  I think taking locks at intervals
>> > of 15s or 30s should not matter much, but it is better to be safe.
>>
>> I don't think performance will be impacted, but there are no reasons
>> to not do any measurements either. I'll try to get some graphs
>> tomorrow with runs on my laptop, mainly looking for any effects of
>> this patch on TPS when checkpoints show up.
>
> I don't think the impact is measurable on a laptop, where 2 to 4
> cores available at most.

Yeah, I couldn't either.. Still I would like to keep the hot spinlock
section as small as possible if we can.

>> Per discussion with Andres at PGcon, we decided that this is an
>> optimization, only for 9.7~ because this has been broken for a long
>> time. I have also changed XLogIncludeOrigin() to use a more generic
>> routine to set of status flags for a record being inserted:
>> XLogSetFlags(). This routine can use two flags:
>> - INCLUDE_ORIGIN to decide if the origin should be logged or not
>> - NO_PROGRESS to decide at insertion if a record should update the LSN
>> progress or not.
>> Andres mentioned me that we'd want to have something similar to
>> XLogIncludeOrigin, but while hacking I noticed that grouping both
>> things under the same umbrella made more sense.
>
> This looks reasonable.

Thanks. That would be fine as a first, independent patch, but that
would mean that XLogSetFlags has only only value, which is a bit
pointless so I grouped them. And this makes the exiting interface
cleaner as well for replication origins.
-- 
Michael



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Wed, Sep 28, 2016 at 6:12 AM, David Steele <david@pgmasters.net> wrote:
> I tried the attached patch set and noticed an interesting behavior. With
> archive_timeout=5 whenever I made a change I would get a WAL segment within
> a few seconds as expected then another one would follow a few minutes later.

That's intentional. We may be able to make XLOG_SWITCH records as not
updating the progress LSN, but I wanted to tackle that as a separate
patch once we got the basics done correctly, which is still what I
think this patch is doing. I should have been more precise upthread:
this patch makes the handling of checkpoint skip logic correct for
only standby snapshots, not segment switches, and puts the infra to
handle other things.
-- 
Michael



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
David Steele
Дата:
On 9/28/16 3:35 AM, Michael Paquier wrote:
> On Wed, Sep 28, 2016 at 6:12 AM, David Steele <david@pgmasters.net> wrote:
>> I tried the attached patch set and noticed an interesting behavior. With
>> archive_timeout=5 whenever I made a change I would get a WAL segment within
>> a few seconds as expected then another one would follow a few minutes later.
> 
> That's intentional. We may be able to make XLOG_SWITCH records as not
> updating the progress LSN, but I wanted to tackle that as a separate
> patch once we got the basics done correctly, which is still what I
> think this patch is doing. I should have been more precise upthread:
> this patch makes the handling of checkpoint skip logic correct for
> only standby snapshots, not segment switches, and puts the infra to
> handle other things.

OK, I've done functional testing and this patch seems to work as
specified (including the caveat noted above).  Some comments:

* [PATCH 1/3] hs-checkpoints-v12-1

+++ b/src/backend/access/transam/xlog.c
+     * Taking a lock is as well necessary to prevent potential torn reads
+     * on some platforms.

How about, "Taking a lock is also necessary..."

+        LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);

That's a lot of exclusive locks and that would seem to have performance
implications.  It seems to me this is going to be a hard one to
benchmark because the regression (if any) would only be seen under heavy
load on a very large system.

In general I agree with the other comments that this could end up being
a problem.  On the other hand, since the additional locks are only taken
at checkpoint or archive_timeout it may not be that big a deal.

+++ b/src/backend/access/transam/xloginsert.c * Should this record
include the replication origin if one is set up?

Outdated comment from XLogIncludeOrigin().

* [PATCH 2/3] hs-checkpoints-v12-2

+++ b/src/backend/postmaster/checkpointer.c
+            /* OK, it's time to switch */
+            elog(LOG, "Request XLog Switch");

LOG level seems a bit much here, perhaps DEBUG1?

* [PATCH 3/3] hs-checkpoints-v12-3

+         * switch segment only when any substantial progress have made from
+         * reasons will cause last_xlog_switch_lsn stay behind but it doesn't

How about, "Switch segment only when substantial progress has been made
after the last segment was switched by a timeout.  Segment switching for
other reasons..."

+++ b/src/backend/access/transam/xlog.c
+        elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X,
ckpt %X/%X",
+            elog(LOG, "Checkpoint is skipped");
+        elog(LOG, "snapshot taken by checkpoint %X/%X",

Same for the above, seems like it would just be noise for most users.

+++ b/src/backend/postmaster/bgwriter.c
+                elog(LOG, "snapshot taken by bgwriter %X/%X",

Ditto.

I don't see any unintended consequences in this patch but it doesn't
mean there aren't any.  I'm definitely concerned by the exclusive locks
but it may turn out they do not actually represent a bottleneck.

This does seem like the kind of patch that should get committed very
early in the release cycle to allow maximum time for regression testing.

-- 
-David
david@pgmasters.net



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net> wrote:
> OK, I've done functional testing and this patch seems to work as
> specified (including the caveat noted above).  Some comments:

Thanks!

> * [PATCH 1/3] hs-checkpoints-v12-1
>
> +++ b/src/backend/access/transam/xlog.c
> +        * Taking a lock is as well necessary to prevent potential torn reads
> +        * on some platforms.
>
> How about, "Taking a lock is also necessary..."
>
> +               LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
>
> That's a lot of exclusive locks and that would seem to have performance
> implications.  It seems to me this is going to be a hard one to
> benchmark because the regression (if any) would only be seen under heavy
> load on a very large system.
>
> In general I agree with the other comments that this could end up being
> a problem.  On the other hand, since the additional locks are only taken
> at checkpoint or archive_timeout it may not be that big a deal.

Yes, I did some tests on my laptop a couple of months back, that has 4
cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
contention and performing a bunch of INSERT using 4 clients on 4
different relations I could not catch a difference.. Autovacuum was
disabled to eliminate any noise. I tried checkpoint_segments at 30s to
see its effects, as well as larger values to see the impact with the
standby snapshot taken by the bgwriter. Other thoughts are welcome.

> +++ b/src/backend/access/transam/xloginsert.c * Should this record
> include the replication origin if one is set up?
>
> Outdated comment from XLogIncludeOrigin().

Fixed. I added as well some comments on top of XLogSetFlags to mention
what are the flags that can be used. I didn't think that it was
necessary to add an assertion here. Also, I noticed that the comment
on top of XLogInsertRecord mentioned those flags but was incorrect.

> * [PATCH 2/3] hs-checkpoints-v12-2
>
> +++ b/src/backend/postmaster/checkpointer.c
> +                       /* OK, it's time to switch */
> +                       elog(LOG, "Request XLog Switch");
>
> LOG level seems a bit much here, perhaps DEBUG1?

That's from Horiguchi-san's patch, and those would be definitely
better as DEBUG1 by looking at it. Now and in order to keep things
simple I think that we had better discard this patch for now. I was
planning to come back to this thing anyway once we are done with the
first problem.

> * [PATCH 3/3] hs-checkpoints-v12-3
>
> +                * switch segment only when any substantial progress have made from
> +                * reasons will cause last_xlog_switch_lsn stay behind but it doesn't
>
> How about, "Switch segment only when substantial progress has been made
> after the last segment was switched by a timeout.  Segment switching for
> other reasons..."
>
> +++ b/src/backend/access/transam/xlog.c
> +               elog(LOG, "Not a forced or shutdown checkpoint: progress_lsn %X/%X,
> ckpt %X/%X",
> +                       elog(LOG, "Checkpoint is skipped");
> +               elog(LOG, "snapshot taken by checkpoint %X/%X",
>
> Same for the above, seems like it would just be noise for most users.
>
> +++ b/src/backend/postmaster/bgwriter.c
> +                               elog(LOG, "snapshot taken by bgwriter %X/%X",
>
> Ditto.

The original patch was developed to ease debugging, and I chose LOG to
not be polluted with a bunch of DEBUG1 entries :)

Now we can do something, as follows:
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8450,6 +8450,8 @@ CreateCheckPoint(int flags)
    {
        if (progress_lsn == ControlFile->checkPoint)
        {
+           if (log_checkpoints)
+               ereport(LOG, "checkpoint skipped");
            WALInsertLockRelease();
            LWLockRelease(CheckpointLock);
            END_CRIT_SECTION();
Letting users know that the checkpoint has been skipped sounds like a
good idea. Perhaps that's better if squashed with the first patch.

> I don't see any unintended consequences in this patch but it doesn't
> mean there aren't any.  I'm definitely concerned by the exclusive locks
> but it may turn out they do not actually represent a bottleneck.

That's a hard to see a difference. Perhaps I didn't try hard enough..

Well for now attached are two patches, that could just be squashed into one.
--
Michael

Вложения

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
David Steele
Дата:
On 9/28/16 10:32 PM, Michael Paquier wrote:
> On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net> wrote:
>>
>> In general I agree with the other comments that this could end up being
>> a problem.  On the other hand, since the additional locks are only taken
>> at checkpoint or archive_timeout it may not be that big a deal.
>
> Yes, I did some tests on my laptop a couple of months back, that has 4
> cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
> contention and performing a bunch of INSERT using 4 clients on 4
> different relations I could not catch a difference.. Autovacuum was
> disabled to eliminate any noise. I tried checkpoint_segments at 30s to
> see its effects, as well as larger values to see the impact with the
> standby snapshot taken by the bgwriter. Other thoughts are welcome.

I don't have any better ideas than that.

>> +++ b/src/backend/postmaster/checkpointer.c
>> +                       /* OK, it's time to switch */
>> +                       elog(LOG, "Request XLog Switch");
>>
>> LOG level seems a bit much here, perhaps DEBUG1?
>
> That's from Horiguchi-san's patch, and those would be definitely
> better as DEBUG1 by looking at it. Now and in order to keep things
> simple I think that we had better discard this patch for now. I was
> planning to come back to this thing anyway once we are done with the
> first problem.

I still see this:

+++ b/src/backend/postmaster/checkpointer.c         /* OK, it's time to switch */
+        elog(LOG, "Request XLog Switch");

> Well for now attached are two patches, that could just be squashed into one.

Yes, I think that makes sense.

More importantly, there is a regression.  With your new patch the xlogs 
are switching on archive_timeout again even with no changes.  The v12 
worked fine.

The differences are all in 0002-hs-checkpoints-v12-2.patch and as far as 
I can see the patch does not work correctly without these changes.  Am I 
missing something?

-- 
-David
david@pgmasters.net



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
Sorry, I might have torn off this thread somehow..

At Thu, 29 Sep 2016 11:26:29 -0400, David Steele <david@pgmasters.net> wrote in
<30095aea-3910-dbb7-1790-a579fb93fa5e@pgmasters.net>
> On 9/28/16 10:32 PM, Michael Paquier wrote:
> > On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net>
> > wrote:
> >>
> >> In general I agree with the other comments that this could end up
> >> being
> >> a problem.  On the other hand, since the additional locks are only
> >> taken
> >> at checkpoint or archive_timeout it may not be that big a deal.
> >
> > Yes, I did some tests on my laptop a couple of months back, that has 4
> > cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
> > contention and performing a bunch of INSERT using 4 clients on 4
> > different relations I could not catch a difference.. Autovacuum was
> > disabled to eliminate any noise. I tried checkpoint_segments at 30s to
> > see its effects, as well as larger values to see the impact with the
> > standby snapshot taken by the bgwriter. Other thoughts are welcome.
> 
> I don't have any better ideas than that.

I don't see no problem in setting progressAt in XLogInsertRecord.
But I doubt GetProgressRecPtr is harmful, especially when
NUM_XLOGINSERT_LOCKS is *large*. So reducing the number seems
rather alleviates the impact. But it actually doesn't seem so
harmful up to 8. (Even though I don't like the locking in
GetProgressRecPtr..)

Currently possiblly harmful calling of GetProgressRecPtr is that
in BackgroundWriterMain. It should be called with ther interval
BgWriterDelay, and anytime pgwriter recieved SIGUSR1. But I don't
see the issuer of SIGUSR1 of bgwriter..


> >> +++ b/src/backend/postmaster/checkpointer.c
> >> +                       /* OK, it's time to switch */
> >> +                       elog(LOG, "Request XLog Switch");
> >>
> >> LOG level seems a bit much here, perhaps DEBUG1?
> >
> > That's from Horiguchi-san's patch, and those would be definitely
> > better as DEBUG1 by looking at it. Now and in order to keep things
> > simple I think that we had better discard this patch for now. I was
> > planning to come back to this thing anyway once we are done with the
> > first problem.
> 
> I still see this:
> 
> +++ b/src/backend/postmaster/checkpointer.c
>          /* OK, it's time to switch */
> +        elog(LOG, "Request XLog Switch");
> 
> > Well for now attached are two patches, that could just be squashed
> > into one.

Mmmm. Sorry, this was for my quite private instant debug, spilt
outside.. But I don't mind to leave it with DEBUG2 if it seems
useful.

> Yes, I think that makes sense.
> 
> More importantly, there is a regression.  With your new patch the
> xlogs are switching on archive_timeout again even with no changes.
> The v12 worked fine.

As Michael mentioned in this or another thread, it is another
issue that he wants to solve separately. I personally doubt that
this patch (v11 and v13) can be evaluated alone without it, but
we can review this with the excessive switching problem, perhaps?

> The differences are all in 0002-hs-checkpoints-v12-2.patch and as far
> as I can see the patch does not work correctly without these changes.
> Am I missing something?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
Sorry, it wrote wrong thing.

At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>
> Sorry, I might have torn off this thread somehow..
> 
> At Thu, 29 Sep 2016 11:26:29 -0400, David Steele <david@pgmasters.net> wrote in
<30095aea-3910-dbb7-1790-a579fb93fa5e@pgmasters.net>
> > On 9/28/16 10:32 PM, Michael Paquier wrote:
> > > On Thu, Sep 29, 2016 at 7:45 AM, David Steele <david@pgmasters.net>
> > > wrote:
> > >>
> > >> In general I agree with the other comments that this could end up
> > >> being
> > >> a problem.  On the other hand, since the additional locks are only
> > >> taken
> > >> at checkpoint or archive_timeout it may not be that big a deal.
> > >
> > > Yes, I did some tests on my laptop a couple of months back, that has 4
> > > cores. After reducing NUM_XLOGINSERT_LOCKS from 8 to 4 to increase
> > > contention and performing a bunch of INSERT using 4 clients on 4
> > > different relations I could not catch a difference.. Autovacuum was
> > > disabled to eliminate any noise. I tried checkpoint_segments at 30s to
> > > see its effects, as well as larger values to see the impact with the
> > > standby snapshot taken by the bgwriter. Other thoughts are welcome.
> > 
> > I don't have any better ideas than that.
> 
> I don't see no problem in setting progressAt in XLogInsertRecord.
> But I doubt GetProgressRecPtr is harmful, especially when

But I suspect that GetProgressRecPtr could be harmful.

> NUM_XLOGINSERT_LOCKS is *large*. So reducing the number seems
> rather alleviates the impact. But it actually doesn't seem so
> harmful up to 8. (Even though I don't like the locking in
> GetProgressRecPtr..)
> 
> Currently possiblly harmful calling of GetProgressRecPtr is that
> in BackgroundWriterMain. It should be called with ther interval
> BgWriterDelay, and anytime pgwriter recieved SIGUSR1. But I don't
> see the issuer of SIGUSR1 of bgwriter..
> 
> 
> > >> +++ b/src/backend/postmaster/checkpointer.c
> > >> +                       /* OK, it's time to switch */
> > >> +                       elog(LOG, "Request XLog Switch");
> > >>
> > >> LOG level seems a bit much here, perhaps DEBUG1?
> > >
> > > That's from Horiguchi-san's patch, and those would be definitely
> > > better as DEBUG1 by looking at it. Now and in order to keep things
> > > simple I think that we had better discard this patch for now. I was
> > > planning to come back to this thing anyway once we are done with the
> > > first problem.
> > 
> > I still see this:
> > 
> > +++ b/src/backend/postmaster/checkpointer.c
> >          /* OK, it's time to switch */
> > +        elog(LOG, "Request XLog Switch");
> > 
> > > Well for now attached are two patches, that could just be squashed
> > > into one.
> 
> Mmmm. Sorry, this was for my quite private instant debug, spilt
> outside.. But I don't mind to leave it with DEBUG2 if it seems
> useful.
> 
> > Yes, I think that makes sense.
> > 
> > More importantly, there is a regression.  With your new patch the
> > xlogs are switching on archive_timeout again even with no changes.
> > The v12 worked fine.
> 
> As Michael mentioned in this or another thread, it is another
> issue that he wants to solve separately. I personally doubt that
> this patch (v11 and v13) can be evaluated alone without it, but
> we can review this with the excessive switching problem, perhaps?
> 
> > The differences are all in 0002-hs-checkpoints-v12-2.patch and as far
> > as I can see the patch does not work correctly without these changes.
> > Am I missing something?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>
 
>> I don't see no problem in setting progressAt in XLogInsertRecord.
>> But I doubt GetProgressRecPtr is harmful, especially when
>
> But I suspect that GetProgressRecPtr could be harmful.

Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
nproc and reducing checkpoint_timeout. That's what I did but..
-- 
Michael



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Fri, Sep 30, 2016 at 2:51 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>
 
>>> I don't see no problem in setting progressAt in XLogInsertRecord.
>>> But I doubt GetProgressRecPtr is harmful, especially when
>>
>> But I suspect that GetProgressRecPtr could be harmful.
>
> Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
> nproc and reducing checkpoint_timeout. That's what I did but..

Note: I am moving this patch to next CF.
-- 
Michael



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
(Squashing replies)

On Fri, Sep 30, 2016 at 6:13 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Sep 30, 2016 at 2:51 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp> 
>>>> I don't see no problem in setting progressAt in XLogInsertRecord.
>>>> But I doubt GetProgressRecPtr is harmful, especially when
>>>
>>> But I suspect that GetProgressRecPtr could be harmful.
>>
>> Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
>> nproc and reducing checkpoint_timeout. That's what I did but..
>
> Note: I am moving this patch to next CF.

And I am back on it more seriously... And I am taking back what I said upthread.

I looked at the v12 that Horiguchi-san has written, and that seems
correct to me. So I have squashed everything into a single patch,
including the log entry that gets logged with log_checkpoints. Testing
with archive_timeout to 10s, checkpoint_timeout to 30s, sometimes
triggering manual activity with CREATE TABLE/whatever and manual
pg_switch_xlog(), I am able to see that checkpoints can be correctly
skipped or generated.

There was as well a compilation error with ereport(). Not sure how I
missed that... Likely too many patches handled these days.

I have also updated the description of archive_timeout that increasing
checkpoint_timeout would reduce unnecessary checkpoints on a idle
system. With this patch, on an idle system checkpoints are just
skipped as they should.

How does that look?
--
Michael

Вложения

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
Thanks for merging. It still applies on the current master with
some displacements.

At Wed, 5 Oct 2016 15:18:53 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqT4U=OSOLXuFuxMonmfdQFmd5F_0DmKoddvjG-HHWQaBA@mail.gmail.com>
> (Squashing replies)
> 
> On Fri, Sep 30, 2016 at 6:13 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Fri, Sep 30, 2016 at 2:51 PM, Michael Paquier
> > <michael.paquier@gmail.com> wrote:
> >> On Fri, Sep 30, 2016 at 2:05 PM, Kyotaro HORIGUCHI
> >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >>> At Fri, 30 Sep 2016 14:00:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
wrotein <20160930.140015.150178454.horiguchi.kyotaro@lab.ntt.co.jp>
 
> >>>> I don't see no problem in setting progressAt in XLogInsertRecord.
> >>>> But I doubt GetProgressRecPtr is harmful, especially when
> >>>
> >>> But I suspect that GetProgressRecPtr could be harmful.
> >>
> >> Well, you can maximize its effects by doing NUM_XLOGINSERT_LOCKS ==
> >> nproc and reducing checkpoint_timeout. That's what I did but..
> >
> > Note: I am moving this patch to next CF.
> 
> And I am back on it more seriously... And I am taking back what I said upthread.
> 
> I looked at the v12 that Horiguchi-san has written, and that seems
> correct to me. So I have squashed everything into a single patch,

Could you let me struggle a bit more to avoid LWLocks in
GetProgressRecPtr?

I considered two alternatives for updating logic of progressAt
more seriously. One is, as Amit suggested, replacing progressAt
within the SpinLock section in
ReserverXLogInsertLocation. Another is using pg_atomic_u64 for
progressAt. The attached two patches rouhgly implement the aboves
respectively. (But I've not tested them. The patches are to show
the two alternatives concretely.)

I found that the former reuiqres to take insertpos_lck also on
reading. I have to admit that this is too bad. (Even I saw no
degradation by pgbench on my poor environment. It marks 118tr/s
by 10 threads and that doesn't seem make any stress on xlog
logics...)

For the latter, it is free from locks and doesn't reduce parallel
degree but I'm not sure it is proper to use it there and I'm not
sure about actual overheads.  In the worst case, it makes another
SpinLock section for every call to pg_atmoic_* functions.

> including the log entry that gets logged with log_checkpoints. Testing
> with archive_timeout to 10s, checkpoint_timeout to 30s, sometimes
> triggering manual activity with CREATE TABLE/whatever and manual
> pg_switch_xlog(), I am able to see that checkpoints can be correctly
> skipped or generated.
> 
> There was as well a compilation error with ereport(). Not sure how I
> missed that... Likely too many patches handled these days.
> 
> I have also updated the description of archive_timeout that increasing
> checkpoint_timeout would reduce unnecessary checkpoints on a idle
> system. With this patch, on an idle system checkpoints are just
> skipped as they should.
> 
> How does that look?

All except the above point looks good for me. Maybe it is better
that XLOG_INCLUDE_ORIGIN stuff is in a separate patch.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Tue, Nov 1, 2016 at 8:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Could you let me struggle a bit more to avoid LWLocks in
> GetProgressRecPtr?

Be my guest :)

> I considered two alternatives for updating logic of progressAt
> more seriously. One is, as Amit suggested, replacing progressAt
> within the SpinLock section in
> ReserverXLogInsertLocation. Another is using pg_atomic_u64 for
> progressAt. The attached two patches rouhgly implement the aboves
> respectively. (But I've not tested them. The patches are to show
> the two alternatives concretely.)

Okay.

> I found that the former requires to take insertpos_lck also on
> reading. I have to admit that this is too bad. (Even I saw no
> degradation by pgbench on my poor environment. It marks 118tr/s
> by 10 threads and that doesn't seem make any stress on xlog
> logics...)

Interesting...

> For the latter, it is free from locks and doesn't reduce parallel
> degree but I'm not sure it is proper to use it there and I'm not
> sure about actual overheads.  In the worst case, it makes another
> SpinLock section for every call to pg_atmoic_* functions.

The WAL insert slots have been introduced in 9.4, and the PG atomics
in 9.5, so perhaps the first implementation of the WAL insert slots
would have used it. Still that looks quite promising. At the same time
we may be able to do something for insertingAt to make the locks held
more consistent, and just remove WALInsertLocks, even if this makes me
wonder about torn reads and about how we may break things if we rely
on something else than LW_EXCLUSIVE compared to now. To keep things
more simple I' would still favor using WALInsertLocks for this patch,
that looks more consistent, and also because there is no noticeable
difference.

> All except the above point looks good for me. Maybe it is better
> that XLOG_INCLUDE_ORIGIN stuff is in a separate patch.

I have kept that grouped bas XLOG_INCLUDE_ORIGIN is the only
XLogInsert flag present on HEAD. Could the patch be marked as "ready
for committer" then?
-- 
Michael



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
Hello,

At Tue, 8 Nov 2016 14:45:59 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqT-VW5gRbUJwQusmgiu2MKpZSCV-XdrHv84w8FZa286KQ@mail.gmail.com>
> On Tue, Nov 1, 2016 at 8:31 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Could you let me struggle a bit more to avoid LWLocks in
> > GetProgressRecPtr?
> 
> Be my guest :)
> 
> > I considered two alternatives for updating logic of progressAt
> > more seriously. One is, as Amit suggested, replacing progressAt
> > within the SpinLock section in
> > ReserverXLogInsertLocation. Another is using pg_atomic_u64 for
> > progressAt. The attached two patches rouhgly implement the aboves
> > respectively. (But I've not tested them. The patches are to show
> > the two alternatives concretely.)
> 
> Okay.
> 
> > I found that the former requires to take insertpos_lck also on
> > reading. I have to admit that this is too bad. (Even I saw no
> > degradation by pgbench on my poor environment. It marks 118tr/s
> > by 10 threads and that doesn't seem make any stress on xlog
> > logics...)
> 
> Interesting...
> 
> > For the latter, it is free from locks and doesn't reduce parallel
> > degree but I'm not sure it is proper to use it there and I'm not
> > sure about actual overheads.  In the worst case, it makes another
> > SpinLock section for every call to pg_atmoic_* functions.
> 
> The WAL insert slots have been introduced in 9.4, and the PG atomics
> in 9.5, so perhaps the first implementation of the WAL insert slots
> would have used it. Still that looks quite promising. At the same time
> we may be able to do something for insertingAt to make the locks held
> more consistent, and just remove WALInsertLocks, even if this makes me
> wonder about torn reads and about how we may break things if we rely

If I understand you correctly, atomics prevents torn reads by its
definition on cache management and bus arbitration level. It
should be faster than LWlocks but as I said in the previous mail,
on some platforms, if any, it will fallbacks to individual
spinlocks. (atomics.c)

> on something else than LW_EXCLUSIVE compared to now. To keep things
> more simple I' would still favor using WALInsertLocks for this patch,
> that looks more consistent, and also because there is no noticeable
> difference.

Ok, the patch looks fine. So there's nothing for me but to accept
the current shape since the discussion about performance seems
not to be settled with out performance measurement with machines
with many cores.

> > All except the above point looks good for me. Maybe it is better
> > that XLOG_INCLUDE_ORIGIN stuff is in a separate patch.
> 
> I have kept that grouped bas XLOG_INCLUDE_ORIGIN is the only
> XLogInsert flag present on HEAD. Could the patch be marked as "ready
> for committer" then?

Ok, that is not so siginificant point. Well, I'd like to wait for
a couple of days for anyone wants to comment, then mark this
'ready for committer'.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
David Steele
Дата:
On 10/5/16 7:18 AM, Michael Paquier wrote:

>> Note: I am moving this patch to next CF.
>
> And I am back on it more seriously... And I am taking back what I said upthread.
>
> I looked at the v12 that Horiguchi-san has written, and that seems
> correct to me. So I have squashed everything into a single patch,
> including the log entry that gets logged with log_checkpoints. Testing
> with archive_timeout to 10s, checkpoint_timeout to 30s, sometimes
> triggering manual activity with CREATE TABLE/whatever and manual
> pg_switch_xlog(), I am able to see that checkpoints can be correctly
> skipped or generated.
>
> There was as well a compilation error with ereport(). Not sure how I
> missed that... Likely too many patches handled these days.
>
> I have also updated the description of archive_timeout that increasing
> checkpoint_timeout would reduce unnecessary checkpoints on a idle
> system. With this patch, on an idle system checkpoints are just
> skipped as they should.
>
> How does that look?

This looks much better now and exhibits exactly the behavior that I 
expect.

In theory it would be nice if the checkpoint records did not cause 
rotation, but this can be mitigated in the way you have described and 
perhaps for safety's sake it's best.

I had a bit of trouble parsing this paragraph:

+    /*
+     * Update the progress LSN positions. At least one WAL insertion lock
+     * is already taken appropriately before doing that, and it is just more
+     * simple to do that here where WAL record data and type is at hand.
+     * The progress is set at the start position of the record tracked that
+     * is being added, making easier checkpoint progress tracking as the
+     * control file already saves the start LSN position of the last
+     * checkpoint run. If an exclusive lock is taken for WAL insertion,
+     * there is actually no need to update all the progression fields, so

So I did a little reworking:

Update the LSN progress positions. At least one WAL insertion lock is 
already taken appropriately before doing that, and it is simpler to do 
that here when the WAL record data and type are at hand. Progress is set 
at the start position of the tracked record that is being added, making 
checkpoint progress tracking easier as the control file already saves 
the start LSN position of the last checkpoint. If an exclusive lock is 
taken for WAL insertion there is no need to update all the progress 
fields, only the first one.

If that still says what you think it should, then I believe it is 
clearer.  Also:

+         * last time a segment has switched because of a timeout. Segment
+         * switching because of other reasons, like manual trigerring of

typo, should be "triggering".

I don't see any further issues with this patch unless there are 
performance concerns about the locks taken in GetProgressRecPtr().  The 
locks seem reasonable to me but I'd like to see this committed so 
there's plenty of time to detect any regression before 10.0.

As such, my vote is to mark this "Ready for Committer."  I'm fine with 
waiting a few days as Kyotaro suggested, or we can consider my review 
"additional comments" and do it now.

-- 
-David
david@pgmasters.net



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Tue, Nov 8, 2016 at 9:32 PM, David Steele <david@pgmasters.net> wrote:
> I had a bit of trouble parsing this paragraph:
>
> [...]
>
> So I did a little reworking:
>
> [...]
>
> If that still says what you think it should, then I believe it is clearer.

Thanks! I have included your suggestion.

> Also:
>
> +                * last time a segment has switched because of a timeout.
> Segment
> +                * switching because of other reasons, like manual
> trigerring of
>
> typo, should be "triggering".

Right.

> I don't see any further issues with this patch unless there are performance
> concerns about the locks taken in GetProgressRecPtr().  The locks seem
> reasonable to me but I'd like to see this committed so there's plenty of
> time to detect any regression before 10.0.
>
> As such, my vote is to mark this "Ready for Committer."  I'm fine with
> waiting a few days as Kyotaro suggested, or we can consider my review
> "additional comments" and do it now.

Thanks for the review! Waiting for a couple of days more is fine for
me. This won't change much. Attached is v15 with the fixes you
mentioned.
--
Michael

Вложения

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Stephen Frost
Дата:
Michael,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> Thanks for the review! Waiting for a couple of days more is fine for
> me. This won't change much. Attached is v15 with the fixes you
> mentioned.

I figured I'd go ahead and start looking into this (and it's pretty easy
for me to discuss it with David, given he works in the same office ;).

A couple initial comments:

> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> index adab2f8..38c2385 100644
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -2826,12 +2826,9 @@ include_dir 'conf.d'
>          parameter is greater than zero, the server will switch to a new
>          segment file whenever this many seconds have elapsed since the last
>          segment file switch, and there has been any database activity,
> -        including a single checkpoint.  (Increasing
> -        <varname>checkpoint_timeout</> will reduce unnecessary
> -        checkpoints on an idle system.)
> -        Note that archived files that are closed early
> -        due to a forced switch are still the same length as completely full
> -        files.  Therefore, it is unwise to use a very short
> +        including a single checkpoint.  Note that archived files that are
> +        closed early due to a forced switch are still the same length as
> +        completely full files.  Therefore, it is unwise to use a very short
>          <varname>archive_timeout</> — it will bloat your archive
>          storage.  <varname>archive_timeout</> settings of a minute or so are
>          usually reasonable.  You should consider using streaming replication,

We should probably include in here that we may skip a checkpoint if no
activity has happened, meaning that this is a safe setting to set for
environments which are idle for long periods (I'm thinking embedded
systems here).

> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
[...]
> +            if (log_checkpoints)
> +                ereport(LOG, (errmsg("checkpoint skipped")));

Do we really need to log that we're skipping a checkpoint..?  As the
point of this is to avoid write activity on a system which is idle, it
doesn't make sense to me to add a new cause for writes to happen when
we're idle.

Thanks!

Stephen

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
David Steele
Дата:
On 11/10/16 10:28 AM, Stephen Frost wrote:

>> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> [...]
>> +            if (log_checkpoints)
>> +                ereport(LOG, (errmsg("checkpoint skipped")));
>
> Do we really need to log that we're skipping a checkpoint..?  As the
> point of this is to avoid write activity on a system which is idle, it
> doesn't make sense to me to add a new cause for writes to happen when
> we're idle.

log_checkpoints is not enabled by default, though, so if the user does
enable it don't you think they would want to know when checkpoints
*don't* happen?

Or are you thinking the main use of this logging is to determine when
checkpoints are too close together and so skipped checkpoints aren't
very important?

Thanks,
--
-David
david@pgmasters.net


Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
"Joshua D. Drake"
Дата:
On 11/10/2016 09:33 AM, David Steele wrote:
> On 11/10/16 10:28 AM, Stephen Frost wrote:
>
>>> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
>> [...]
>>> +            if (log_checkpoints)
>>> +                ereport(LOG, (errmsg("checkpoint skipped")));
>>
>> Do we really need to log that we're skipping a checkpoint..?  As the
>> point of this is to avoid write activity on a system which is idle, it
>> doesn't make sense to me to add a new cause for writes to happen when
>> we're idle.
>
> log_checkpoints is not enabled by default, though, so if the user does
> enable it don't you think they would want to know when checkpoints
> *don't* happen?

Yes but I don't know that it needs to be anywhere below DEBUG2 (vs 
log_checkpoints).

Sincerely,

JD



-- 
Command Prompt, Inc.                  http://the.postgres.company/                        +1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.
Unless otherwise stated, opinions are my own.



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Stephen Frost
Дата:
On Thursday, November 10, 2016, Joshua D. Drake <jd@commandprompt.com> wrote:
On 11/10/2016 09:33 AM, David Steele wrote:
On 11/10/16 10:28 AM, Stephen Frost wrote:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
[...]
+                       if (log_checkpoints)
+                               ereport(LOG, (errmsg("checkpoint skipped")));

Do we really need to log that we're skipping a checkpoint..?  As the
point of this is to avoid write activity on a system which is idle, it
doesn't make sense to me to add a new cause for writes to happen when
we're idle.

log_checkpoints is not enabled by default, though, so if the user does
enable it don't you think they would want to know when checkpoints
*don't* happen?

Yes but I don't know that it needs to be anywhere below DEBUG2 (vs log_checkpoints).

Agreed. You certainly may wish to log checkpoints, even on an embedded or low I/o system, but logging that nothing is happening doesn't seem useful except perhaps for debugging. 

Thanks!

Stephen 

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
David Steele
Дата:
On 11/10/16 1:03 PM, Stephen Frost wrote:
> On Thursday, November 10, 2016, Joshua D. Drake <jd@commandprompt.com
> <mailto:jd@commandprompt.com>> wrote:
> 
>     On 11/10/2016 09:33 AM, David Steele wrote:
> 
>         On 11/10/16 10:28 AM, Stephen Frost wrote:
> 
>                 diff --git a/src/backend/access/transam/xlog.c
>                 b/src/backend/access/transam/xlog.c
> 
>             [...]
> 
>                 +                       if (log_checkpoints)
>                 +                               ereport(LOG,
>                 (errmsg("checkpoint skipped")));
> 
> 
>             Do we really need to log that we're skipping a
>             checkpoint..?  As the
>             point of this is to avoid write activity on a system which
>             is idle, it
>             doesn't make sense to me to add a new cause for writes to
>             happen when
>             we're idle.
> 
> 
>         log_checkpoints is not enabled by default, though, so if the
>         user does
>         enable it don't you think they would want to know when checkpoints
>         *don't* happen?
> 
> 
>     Yes but I don't know that it needs to be anywhere below DEBUG2 (vs
>     log_checkpoints).
> 
> 
> Agreed. You certainly may wish to log checkpoints, even on an embedded
> or low I/o system, but logging that nothing is happening doesn't seem
> useful except perhaps for debugging. 

Sure, DEBUG1 or DEBUG2 makes sense.

-- 
-David
david@pgmasters.net



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Fri, Nov 11, 2016 at 12:28 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Michael Paquier (michael.paquier@gmail.com) wrote:
>> Thanks for the review! Waiting for a couple of days more is fine for
>> me. This won't change much. Attached is v15 with the fixes you
>> mentioned.
>
> I figured I'd go ahead and start looking into this (and it's pretty easy
> for me to discuss it with David, given he works in the same office ;).

Thanks!

> A couple initial comments:
>
>> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
>> index adab2f8..38c2385 100644
>> --- a/doc/src/sgml/config.sgml
>> +++ b/doc/src/sgml/config.sgml
>> @@ -2826,12 +2826,9 @@ include_dir 'conf.d'
>>          parameter is greater than zero, the server will switch to a new
>>          segment file whenever this many seconds have elapsed since the last
>>          segment file switch, and there has been any database activity,
>> -        including a single checkpoint.  (Increasing
>> -        <varname>checkpoint_timeout</> will reduce unnecessary
>> -        checkpoints on an idle system.)
>> -        Note that archived files that are closed early
>> -        due to a forced switch are still the same length as completely full
>> -        files.  Therefore, it is unwise to use a very short
>> +        including a single checkpoint.  Note that archived files that are
>> +        closed early due to a forced switch are still the same length as
>> +        completely full files.  Therefore, it is unwise to use a very short
>>          <varname>archive_timeout</> — it will bloat your archive
>>          storage.  <varname>archive_timeout</> settings of a minute or so are
>>          usually reasonable.  You should consider using streaming replication,
>
> We should probably include in here that we may skip a checkpoint if no
> activity has happened, meaning that this is a safe setting to set for
> environments which are idle for long periods.

OK, here is the interesting bit I just updated (I cut the diff a bit
as the rest is just reformatting):
         parameter is greater than zero, the server will switch to a new
         segment file whenever this many seconds have elapsed since the last
         segment file switch, and there has been any database activity,
-        including a single checkpoint.  (Increasing
-        <varname>checkpoint_timeout</> will reduce unnecessary
-        checkpoints on an idle system.)
[...]
+        including a single checkpoint.  Checkpoints can however be skipped
+        if there is no database activity, making this parameter a safe
+        setting for environments which are idle for a long period of time.

> (I'm thinking embedded systems here).

(Those are most of my users :{) ).

On Fri, Nov 11, 2016 at 3:23 AM, David Steele <david@pgmasters.net> wrote:
> On 11/10/16 1:03 PM, Stephen Frost wrote:
>> Agreed. You certainly may wish to log checkpoints, even on an embedded
>> or low I/o system, but logging that nothing is happening doesn't seem
>> useful except perhaps for debugging.
>
> Sure, DEBUG1 or DEBUG2 makes sense.

OK. LOG was useful to avoid noise when debugging the thing, but DEBUG1
is fine for me as well in the final version.
--
Michael

Вложения

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
Thank you for the new patch.

At Fri, 11 Nov 2016 16:42:43 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqR73vusv5kQgZzket5mLZLeEcgNF-3hKh7061QtcZiuVw@mail.gmail.com>
> On Fri, Nov 11, 2016 at 12:28 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > We should probably include in here that we may skip a checkpoint if no
> > activity has happened, meaning that this is a safe setting to set for
> > environments which are idle for long periods.
> 
> OK, here is the interesting bit I just updated (I cut the diff a bit
> as the rest is just reformatting):
>          parameter is greater than zero, the server will switch to a new
>          segment file whenever this many seconds have elapsed since the last
>          segment file switch, and there has been any database activity,
> -        including a single checkpoint.  (Increasing
> -        <varname>checkpoint_timeout</> will reduce unnecessary
> -        checkpoints on an idle system.)
> [...]
> +        including a single checkpoint.  Checkpoints can however be skipped
> +        if there is no database activity, making this parameter a safe
> +        setting for environments which are idle for a long period of time.
> 
> > (I'm thinking embedded systems here).
> 
> (Those are most of my users :{) ).

Ok, (FWIW..,) it seems fine for me.

> On Fri, Nov 11, 2016 at 3:23 AM, David Steele <david@pgmasters.net> wrote:
> > On 11/10/16 1:03 PM, Stephen Frost wrote:
> >> Agreed. You certainly may wish to log checkpoints, even on an embedded
> >> or low I/o system, but logging that nothing is happening doesn't seem
> >> useful except perhaps for debugging.
> >
> > Sure, DEBUG1 or DEBUG2 makes sense.
> 
> OK. LOG was useful to avoid noise when debugging the thing, but DEBUG1
> is fine for me as well in the final version.

Agreed. DEBUG2 seems too deep for it.

Well, I think we had the final comment and it has been addressd
so I mark this as ready for committer soon.

Thank you all.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Amit Kapila
Дата:
On Tue, Nov 8, 2016 at 5:18 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
>> on something else than LW_EXCLUSIVE compared to now. To keep things
>> more simple I' would still favor using WALInsertLocks for this patch,
>> that looks more consistent, and also because there is no noticeable
>> difference.
>
> Ok, the patch looks fine. So there's nothing for me but to accept
> the current shape since the discussion about performance seems
> not to be settled with out performance measurement with machines
> with many cores.
>

I think it is good to check the performance impact of this patch on
many core m/c.  Is it possible for you to once check with Alexander
Korotkov to see if he can provide you access to his powerful m/c which
has 70 cores (if I remember correctly)?


@@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks,
xl_standby_lock *locks) XLogBeginInsert(); XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
XLogRegisterData((char*) locks, nlocks * sizeof(xl_standby_lock));
 
+ XLogSetFlags(XLOG_NO_PROGRESS);


Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
This function is called not only in LogStandbySnapshot(), but during
DDL operations as well where lockmode >= AccessExclusiveLock.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Andres Freund
Дата:
Hi,

On 2016-11-11 16:42:43 +0900, Michael Paquier wrote:

> + * This takes also
> + * advantage to avoid 8-byte torn reads on some platforms by using the
> + * fact that each insert lock is located on the same cache line.

Something residing on the same cache line doens't provide that guarantee
on all platforms.


> + * XXX: There is still room for more improvements here, particularly
> + * WAL operations related to unlogged relations (INIT_FORKNUM) should not
> + * update the progress LSN as those relations are reset during crash
> + * recovery so enforcing buffers of such relations to be flushed for
> + * example in the case of a load only on unlogged relations is a waste
> + * of disk write.

Commit records still have to be written, everything else doesn't write
WAL. So I'm doubtful this matters much?

> @@ -997,6 +1022,24 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
>          inserted = true;
>      }
>  
> +    /*
> +     * Update the LSN progress positions. At least one WAL insertion lock
> +     * is already taken appropriately before doing that, and it is simpler
> +     * to do that here when the WAL record data and type are at hand.

But we don't use the "WAL record data and type"?


> + * GetProgressRecPtr -- Returns the newest WAL activity position, or in
> + * other words any activity not referring to standby logging or segment
> + * switches. Finding the last activity position is done by scanning each
> + * WAL insertion lock by taking directly the light-weight lock associated
> + * to it.
> + */

I'd prefer not to list the specific records here - that's just
guaranteed to get out of date. Why not say something "any activity not
requiring a checkpoint to be triggered" or such?


> +     * If this isn't a shutdown or forced checkpoint, and if there has been no
> +     * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
> +     * duplicate checkpoints when the system is idle. That wastes log space,
> +     * and more importantly it exposes us to possible loss of both current and
> +     * previous checkpoint records if the machine crashes just as we're writing
> +     * the update.

Shouldn't this mention archiving and also that we also ignore some forms
of WAL activity?

> -/* Should the in-progress insertion log the origin? */
> -static bool include_origin = false;
> +/* status flags of the in-progress insertion */
> +static uint8 status_flags = 0;

that seems a bit too generic a name. 'curinsert_flags'?

>  /*

> @@ -317,19 +317,23 @@ BackgroundWriterMain(void)
>          {
>              TimestampTz timeout = 0;
>              TimestampTz now = GetCurrentTimestamp();
> +            XLogRecPtr    current_progress_lsn = GetProgressRecPtr();
>  
>              timeout = TimestampTzPlusMilliseconds(last_snapshot_ts,
>                                                    LOG_SNAPSHOT_INTERVAL_MS);
>  
>              /*
> -             * only log if enough time has passed and some xlog record has
> -             * been inserted.
> +             * Only log if enough time has passed, that some WAL activity
> +             * has happened since last checkpoint, and that some new WAL
> +             * records have been inserted since the last time we came here.

I think that sentence needs some polish.


>               */
>              if (now >= timeout &&
> -                last_snapshot_lsn != GetXLogInsertRecPtr())
> +                GetLastCheckpointRecPtr() < current_progress_lsn &&
> +                last_progress_lsn < current_progress_lsn)
>              {

Hm. I don't think it's correct to use GetLastCheckpointRecPtr() here?
Don't we need to do the comparisons here (and when doing the checkpoint
itself) with the REDO pointer of the last checkpoint?


> diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> index 397267c..7ecc00e 100644
> --- a/src/backend/postmaster/checkpointer.c
> +++ b/src/backend/postmaster/checkpointer.c
> @@ -164,6 +164,7 @@ static double ckpt_cached_elapsed;
>  
>  static pg_time_t last_checkpoint_time;
>  static pg_time_t last_xlog_switch_time;
> +static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;

Hm. Is it a good idea to use a static for this? Did you consider
checkpointer restarts?


Greetings,

Andres Freund



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
Hello, thank you for the comment.

At Sat, 12 Nov 2016 10:28:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1K0gGQTBxCyKqi6QnqOWGzEoVVPHCgPJ_RkOBoLPejCTA@mail.gmail.com>
> On Tue, Nov 8, 2016 at 5:18 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hello,
> >
> >> on something else than LW_EXCLUSIVE compared to now. To keep things
> >> more simple I' would still favor using WALInsertLocks for this patch,
> >> that looks more consistent, and also because there is no noticeable
> >> difference.
> >
> > Ok, the patch looks fine. So there's nothing for me but to accept
> > the current shape since the discussion about performance seems
> > not to be settled with out performance measurement with machines
> > with many cores.
> >
> 
> I think it is good to check the performance impact of this patch on
> many core m/c.  Is it possible for you to once check with Alexander
> Korotkov to see if he can provide you access to his powerful m/c which
> has 70 cores (if I remember correctly)?
> 
> 
> @@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks,
> xl_standby_lock *locks)
>   XLogBeginInsert();
>   XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
>   XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
> + XLogSetFlags(XLOG_NO_PROGRESS);
> 
> 
> Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
> This function is called not only in LogStandbySnapshot(), but during
> DDL operations as well where lockmode >= AccessExclusiveLock.

This does not remove any record from WAL. So theoretically any
kind of record can be NO_PROGRESS, but practically as long as
checkpoints are not unreasonably suppressed. Any explicit
database operation must be accompanied with at least commit
record that triggers checkpoint. NO_PROGRESSing there doesn't
seem to me to harm database durability for this reason.

The objective of this patch is skipping WALs on completely-idle
state and the NO_PROGRESSing is necessary to do its work. Of
course we can distinguish exclusive lock with PROGRESS and
without PROGRESS but it is unnecessary complexity.


But rethinking about the above, the namging of "XLOG_NO_PROGRESS"
might be inappropriate. "XLOG_NO_CKPT_TRIGGER" or any sainer name
would be needed.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Mon, Nov 14, 2016 at 12:49 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Sat, 12 Nov 2016 10:28:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1K0gGQTBxCyKqi6QnqOWGzEoVVPHCgPJ_RkOBoLPejCTA@mail.gmail.com>
>> I think it is good to check the performance impact of this patch on
>> many core m/c.  Is it possible for you to once check with Alexander
>> Korotkov to see if he can provide you access to his powerful m/c which
>> has 70 cores (if I remember correctly)?

I heard about a number like that, and there is no reason to not do
tests to be sure. With that many cores we are more likely going to see
the limitation of the number of XLOG insert slots popping up as a
bottleneck, but that's just an assumption without any numbers.
Alexander (added in CC), could it be possible to get an access to this
machine?

>> @@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks,
>> xl_standby_lock *locks)
>>   XLogBeginInsert();
>>   XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
>>   XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
>> + XLogSetFlags(XLOG_NO_PROGRESS);
>>
>>
>> Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
>> This function is called not only in LogStandbySnapshot(), but during
>> DDL operations as well where lockmode >= AccessExclusiveLock.
>
> This does not remove any record from WAL. So theoretically any
> kind of record can be NO_PROGRESS, but practically as long as
> checkpoints are not unreasonably suppressed. Any explicit
> database operation must be accompanied with at least commit
> record that triggers checkpoint. NO_PROGRESSing there doesn't
> seem to me to harm database durability for this reason.
>
> The objective of this patch is skipping WALs on completely-idle
> state and the NO_PROGRESSing is necessary to do its work. Of
> course we can distinguish exclusive lock with PROGRESS and
> without PROGRESS but it is unnecessary complexity.

The point that applies here is that logging the exclusive lock
information is necessary for the *standby* recovery conflicts, not the
primary  which is why it should not influence the checkpoint activity
that is happening on the primary. So marking this record with
NO_PROGRESS is actually fine to me.

> But rethinking about the above, the namging of "XLOG_NO_PROGRESS"
> might be inappropriate. "XLOG_NO_CKPT_TRIGGER" or any sainer name
> would be needed.

I got fond of NO_PROGRESS to be honest with the time, even if I don't
like much the negative meaning it has... Perhaps something like
XLOG_SKIP_PROGRESS would hold more meaning.
-- 
Michael



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Sat, Nov 12, 2016 at 9:01 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-11-11 16:42:43 +0900, Michael Paquier wrote:
>
>> + * This takes also
>> + * advantage to avoid 8-byte torn reads on some platforms by using the
>> + * fact that each insert lock is located on the same cache line.
>
> Something residing on the same cache line doens't provide that guarantee
> on all platforms.

OK. Let's remove it then.

>> + * XXX: There is still room for more improvements here, particularly
>> + * WAL operations related to unlogged relations (INIT_FORKNUM) should not
>> + * update the progress LSN as those relations are reset during crash
>> + * recovery so enforcing buffers of such relations to be flushed for
>> + * example in the case of a load only on unlogged relations is a waste
>> + * of disk write.
>
> Commit records still have to be written, everything else doesn't write
> WAL. So I'm doubtful this matters much?

Hm, okay. In most cases this may not matter... Let's rip that off.

>> @@ -997,6 +1022,24 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
>>               inserted = true;
>>       }
>>
>> +     /*
>> +      * Update the LSN progress positions. At least one WAL insertion lock
>> +      * is already taken appropriately before doing that, and it is simpler
>> +      * to do that here when the WAL record data and type are at hand.
>
> But we don't use the "WAL record data and type"?

Yes, at some point this patch did so...

>> + * GetProgressRecPtr -- Returns the newest WAL activity position, or in
>> + * other words any activity not referring to standby logging or segment
>> + * switches. Finding the last activity position is done by scanning each
>> + * WAL insertion lock by taking directly the light-weight lock associated
>> + * to it.
>> + */
>
> I'd prefer not to list the specific records here - that's just
> guaranteed to get out of date. Why not say something "any activity not
> requiring a checkpoint to be triggered" or such?

OK. Makes sense to minimize maintenance.

>> +      * If this isn't a shutdown or forced checkpoint, and if there has been no
>> +      * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
>> +      * duplicate checkpoints when the system is idle. That wastes log space,
>> +      * and more importantly it exposes us to possible loss of both current and
>> +      * previous checkpoint records if the machine crashes just as we're writing
>> +      * the update.
>
> Shouldn't this mention archiving and also that we also ignore some forms
> of WAL activity?

I have reworded that as:
"If this isn't a shutdown or forced checkpoint, and if there has been
no WAL activity requiring a checkpoint, skip it."

>> -/* Should the in-progress insertion log the origin? */
>> -static bool include_origin = false;
>> +/* status flags of the in-progress insertion */
>> +static uint8 status_flags = 0;
>
> that seems a bit too generic a name. 'curinsert_flags'?

OK.

>>                       /*
>> -                      * only log if enough time has passed and some xlog record has
>> -                      * been inserted.
>> +                      * Only log if enough time has passed, that some WAL activity
>> +                      * has happened since last checkpoint, and that some new WAL
>> +                      * records have been inserted since the last time we came here.
>
> I think that sentence needs some polish.

Let's do this better:
            /*
-            * only log if enough time has passed and some xlog record has
-            * been inserted.
+            * Only log if one of the following conditions is satisfied since
+            * the last time we came here::
+            * - timeout has been reached.
+            * - WAL activity has happened since last checkpoint.
+            * - New WAL records have been inserted.
             */

>>                        */
>>                       if (now >= timeout &&
>> -                             last_snapshot_lsn != GetXLogInsertRecPtr())
>> +                             GetLastCheckpointRecPtr() < current_progress_lsn &&
>> +                             last_progress_lsn < current_progress_lsn)
>>                       {
>
> Hm. I don't think it's correct to use GetLastCheckpointRecPtr() here?
> Don't we need to do the comparisons here (and when doing the checkpoint
> itself) with the REDO pointer of the last checkpoint?

Hm? The progress pointer is pointing to the lastly inserted LSN, which
is not the position of the REDO pointer, but the one of the checkpoint
record. Doing a comparison of the REDO pointer would be a moot
operation, because as the checkpoint completes, the progress LSN will
be updated as well. Or do you mean that the progress LSN should *not*
be updated for a checkpoint record? It seems to me that it should
but...

>> diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
>> index 397267c..7ecc00e 100644
>> --- a/src/backend/postmaster/checkpointer.c
>> +++ b/src/backend/postmaster/checkpointer.c
>> @@ -164,6 +164,7 @@ static double ckpt_cached_elapsed;
>>
>>  static pg_time_t last_checkpoint_time;
>>  static pg_time_t last_xlog_switch_time;
>> +static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;
>
> Hm. Is it a good idea to use a static for this? Did you consider
> checkpointer restarts?

Indeed, I forgot about that and the current approach is not solid. The
best way to do things then is to track the LSN position of the last
switched segment in XLogCtl..
--
Michael

Вложения

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
Hello,

It applies the master and compiled cleanly and no error by
regtest.  (I didn't confirmed that the problem is still fixed but
seemingly no problem)

At Mon, 14 Nov 2016 15:09:09 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqRhzS0fNHNAAtRCE+CqdOKKW+KyrAzy5O_R-7zqucGevA@mail.gmail.com>
> On Sat, Nov 12, 2016 at 9:01 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-11-11 16:42:43 +0900, Michael Paquier wrote:
> >
> >> + * This takes also
> >> + * advantage to avoid 8-byte torn reads on some platforms by using the
> >> + * fact that each insert lock is located on the same cache line.
> >
> > Something residing on the same cache line doens't provide that guarantee
> > on all platforms.
> 
> OK. Let's remove it then.
> 
> >> + * XXX: There is still room for more improvements here, particularly
> >> + * WAL operations related to unlogged relations (INIT_FORKNUM) should not
> >> + * update the progress LSN as those relations are reset during crash
> >> + * recovery so enforcing buffers of such relations to be flushed for
> >> + * example in the case of a load only on unlogged relations is a waste
> >> + * of disk write.
> >
> > Commit records still have to be written, everything else doesn't write
> > WAL. So I'm doubtful this matters much?
> 
> Hm, okay. In most cases this may not matter... Let's rip that off.
> 
> >> @@ -997,6 +1022,24 @@ XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
> >>               inserted = true;
> >>       }
> >>
> >> +     /*
> >> +      * Update the LSN progress positions. At least one WAL insertion lock
> >> +      * is already taken appropriately before doing that, and it is simpler
> >> +      * to do that here when the WAL record data and type are at hand.
> >
> > But we don't use the "WAL record data and type"?
> 
> Yes, at some point this patch did so...
> 
> >> + * GetProgressRecPtr -- Returns the newest WAL activity position, or in
> >> + * other words any activity not referring to standby logging or segment
> >> + * switches. Finding the last activity position is done by scanning each
> >> + * WAL insertion lock by taking directly the light-weight lock associated
> >> + * to it.
> >> + */
> >
> > I'd prefer not to list the specific records here - that's just
> > guaranteed to get out of date. Why not say something "any activity not
> > requiring a checkpoint to be triggered" or such?
> 
> OK. Makes sense to minimize maintenance.
> 
> >> +      * If this isn't a shutdown or forced checkpoint, and if there has been no
> >> +      * WAL activity, skip the checkpoint.  The idea here is to avoid inserting
> >> +      * duplicate checkpoints when the system is idle. That wastes log space,
> >> +      * and more importantly it exposes us to possible loss of both current and
> >> +      * previous checkpoint records if the machine crashes just as we're writing
> >> +      * the update.
> >
> > Shouldn't this mention archiving and also that we also ignore some forms
> > of WAL activity?
> 
> I have reworded that as:
> "If this isn't a shutdown or forced checkpoint, and if there has been
> no WAL activity requiring a checkpoint, skip it."
> 
> >> -/* Should the in-progress insertion log the origin? */
> >> -static bool include_origin = false;
> >> +/* status flags of the in-progress insertion */
> >> +static uint8 status_flags = 0;
> >
> > that seems a bit too generic a name. 'curinsert_flags'?
> 
> OK.
> 
> >>                       /*
> >> -                      * only log if enough time has passed and some xlog record has
> >> -                      * been inserted.
> >> +                      * Only log if enough time has passed, that some WAL activity
> >> +                      * has happened since last checkpoint, and that some new WAL
> >> +                      * records have been inserted since the last time we came here.
> >
> > I think that sentence needs some polish.
> 
> Let's do this better:
>             /*
> -            * only log if enough time has passed and some xlog record has
> -            * been inserted.
> +            * Only log if one of the following conditions is satisfied since
> +            * the last time we came here::
> +            * - timeout has been reached.
> +            * - WAL activity has happened since last checkpoint.
> +            * - New WAL records have been inserted.
>              */
> 
> >>                        */
> >>                       if (now >= timeout &&
> >> -                             last_snapshot_lsn != GetXLogInsertRecPtr())
> >> +                             GetLastCheckpointRecPtr() < current_progress_lsn &&
> >> +                             last_progress_lsn < current_progress_lsn)
> >>                       {
> >
> > Hm. I don't think it's correct to use GetLastCheckpointRecPtr() here?
> > Don't we need to do the comparisons here (and when doing the checkpoint
> > itself) with the REDO pointer of the last checkpoint?
> 
> Hm? The progress pointer is pointing to the lastly inserted LSN, which
> is not the position of the REDO pointer, but the one of the checkpoint
> record. Doing a comparison of the REDO pointer would be a moot
> operation, because as the checkpoint completes, the progress LSN will
> be updated as well. Or do you mean that the progress LSN should *not*
> be updated for a checkpoint record? It seems to me that it should
> but...
> 
> >> diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
> >> index 397267c..7ecc00e 100644
> >> --- a/src/backend/postmaster/checkpointer.c
> >> +++ b/src/backend/postmaster/checkpointer.c
> >> @@ -164,6 +164,7 @@ static double ckpt_cached_elapsed;
> >>
> >>  static pg_time_t last_checkpoint_time;
> >>  static pg_time_t last_xlog_switch_time;
> >> +static XLogRecPtr last_xlog_switch_lsn = InvalidXLogRecPtr;
> >
> > Hm. Is it a good idea to use a static for this? Did you consider
> > checkpointer restarts?
> 
> Indeed, I forgot about that and the current approach is not solid. The
> best way to do things then is to track the LSN position of the last
> switched segment in XLogCtl..

If I'm not missing something, at the worst we have a checkpoint
after a checkpointer restart that should have been supressed. Is
it worth picking it up for the complexity?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Mon, Nov 14, 2016 at 6:10 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> It applies the master and compiled cleanly and no error by
> regtest.  (I didn't confirmed that the problem is still fixed but
> seemingly no problem)

Thanks for double-checking.

> If I'm not missing something, at the worst we have a checkpoint
> after a checkpointer restart that should have been supressed. Is
> it worth picking it up for the complexity?

I think so, that's not that much code if you think about it as there
is already a routine to get the timestamp of the lastly switched
segment that gets used by checkpointer.c.
-- 
Michael



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Amit Kapila
Дата:
On Mon, Nov 14, 2016 at 9:33 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Nov 14, 2016 at 12:49 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> At Sat, 12 Nov 2016 10:28:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1K0gGQTBxCyKqi6QnqOWGzEoVVPHCgPJ_RkOBoLPejCTA@mail.gmail.com>
>>> I think it is good to check the performance impact of this patch on
>>> many core m/c.  Is it possible for you to once check with Alexander
>>> Korotkov to see if he can provide you access to his powerful m/c which
>>> has 70 cores (if I remember correctly)?
>
> I heard about a number like that, and there is no reason to not do
> tests to be sure. With that many cores we are more likely going to see
> the limitation of the number of XLOG insert slots popping up as a
> bottleneck, but that's just an assumption without any numbers.
> Alexander (added in CC), could it be possible to get an access to this
> machine?
>
>>> @@ -1035,6 +1038,7 @@ LogAccessExclusiveLocks(int nlocks,
>>> xl_standby_lock *locks)
>>>   XLogBeginInsert();
>>>   XLogRegisterData((char *) &xlrec, offsetof(xl_standby_locks, locks));
>>>   XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
>>> + XLogSetFlags(XLOG_NO_PROGRESS);
>>>
>>>
>>> Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
>>> This function is called not only in LogStandbySnapshot(), but during
>>> DDL operations as well where lockmode >= AccessExclusiveLock.
>>
>> This does not remove any record from WAL. So theoretically any
>> kind of record can be NO_PROGRESS, but practically as long as
>> checkpoints are not unreasonably suppressed. Any explicit
>> database operation must be accompanied with at least commit
>> record that triggers checkpoint. NO_PROGRESSing there doesn't
>> seem to me to harm database durability for this reason.
>>

By this theory, you can even mark the insert record as no progress
which is not good.

>> The objective of this patch is skipping WALs on completely-idle
>> state and the NO_PROGRESSing is necessary to do its work. Of
>> course we can distinguish exclusive lock with PROGRESS and
>> without PROGRESS but it is unnecessary complexity.
>
> The point that applies here is that logging the exclusive lock
> information is necessary for the *standby* recovery conflicts, not the
> primary  which is why it should not influence the checkpoint activity
> that is happening on the primary. So marking this record with
> NO_PROGRESS is actually fine to me.
>

The progress parameter is used not only for checkpoint activity but by
bgwriter as well for logging standby snapshot.  If you want to keep
this record under no_progress category (which I don't endorse), then
it might be better to add a comment, so that it will be easier for the
readers of this code to understand the reason.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
Hello,

At Mon, 14 Nov 2016 16:53:35 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1KJAXA3PdxH4T1QJKBNOvyUK8UKm_GCvTuT+FC5jpjmjg@mail.gmail.com>
> On Mon, Nov 14, 2016 at 9:33 AM, Michael Paquier
> >>> Is it right to set XLOG_NO_PROGRESS flag in LogAccessExclusiveLocks?
> >>> This function is called not only in LogStandbySnapshot(), but during
> >>> DDL operations as well where lockmode >= AccessExclusiveLock.
> >>
> >> This does not remove any record from WAL. So theoretically any
> >> kind of record can be NO_PROGRESS, but practically as long as
> >> checkpoints are not unreasonably suppressed. Any explicit
> >> database operation must be accompanied with at least commit
> >> record that triggers checkpoint. NO_PROGRESSing there doesn't
> >> seem to me to harm database durability for this reason.
> >>
> 
> By this theory, you can even mark the insert record as no progress
> which is not good.

Of course. So we carefully choose the kinds of records to be
so. If we mark all xlog records to be SKIP_PROGRESS,
archive_timeout gets useless and as its result vacuum may leave
certain number of records not removed for maybe problematic time.

> >> The objective of this patch is skipping WALs on completely-idle
> >> state and the NO_PROGRESSing is necessary to do its work. Of
> >> course we can distinguish exclusive lock with PROGRESS and
> >> without PROGRESS but it is unnecessary complexity.
> >
> > The point that applies here is that logging the exclusive lock
> > information is necessary for the *standby* recovery conflicts, not the
> > primary  which is why it should not influence the checkpoint activity
> > that is happening on the primary. So marking this record with
> > NO_PROGRESS is actually fine to me.
> >
> 
> The progress parameter is used not only for checkpoint activity but by
> bgwriter as well for logging standby snapshot.  If you want to keep
> this record under no_progress category (which I don't endorse), then
> it might be better to add a comment, so that it will be easier for the
> readers of this code to understand the reason.

I rather agree to that. But how detailed it should be?

>LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
>{
>...
>    XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
>    /* Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot */
>    XLogSetFlags(XLOG_SKIP_PROGRESS);

or 

>    /*
>        * Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot.
>        * See the comment for LogCurrentRunningXact for the detail.
>        */

or more detiled?

The term "WAL activity' is used in the comment for
GetProgressRecPtr. Its meaning is not clear but not well
defined. Might need a bit detailed explanation about that or "WAL
activity tracking". What do you think about this?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Amit Kapila
Дата:
On Tue, Nov 15, 2016 at 9:27 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Mon, 14 Nov 2016 16:53:35 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
>> >
>>
>> The progress parameter is used not only for checkpoint activity but by
>> bgwriter as well for logging standby snapshot.  If you want to keep
>> this record under no_progress category (which I don't endorse), then
>> it might be better to add a comment, so that it will be easier for the
>> readers of this code to understand the reason.
>
> I rather agree to that. But how detailed it should be?
>
>>LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks)
>>{
>>...
>>       XLogRegisterData((char *) locks, nlocks * sizeof(xl_standby_lock));
>>       /* Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot */
>>       XLogSetFlags(XLOG_SKIP_PROGRESS);
>
> or
>
>>       /*
>>        * Needs XLOG_SKIP_PROGRESS because called from LogStandbySnapshot.
>>        * See the comment for LogCurrentRunningXact for the detail.
>>        */
>
> or more detiled?
>

I think referring to a place where we have explained why skipping XLOG
progress is okay for this or related WAL records (like comments for
struct WALInsertLock)  will be more suitable.  Also, maybe it is worth
mentioning that this code will skip updating XLOG progress even when
we want to log AccessExclusiveLocks for operations other than a
snapshot.


> The term "WAL activity' is used in the comment for
> GetProgressRecPtr. Its meaning is not clear but not well
> defined. Might need a bit detailed explanation about that or "WAL
> activity tracking". What do you think about this?
>

I would have written it as below:

GetProgressRecPtr -- Returns the WAL progress.  WAL progress is
determined by scanning each WALinsertion lock by taking directly the
light-weight lock associated to it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
David Steele
Дата:
On 11/14/16 4:29 AM, Michael Paquier wrote:
> On Mon, Nov 14, 2016 at 6:10 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> It applies the master and compiled cleanly and no error by
>> regtest.  (I didn't confirmed that the problem is still fixed but
>> seemingly no problem)
> 
> Thanks for double-checking.

Also looks good to me.  I like curinsert_flags and XLOG_SKIP_PROGRESS
better than the old names.

>> If I'm not missing something, at the worst we have a checkpoint
>> after a checkpointer restart that should have been supressed. Is
>> it worth picking it up for the complexity?

That's the way I read it as well.  It's not clear to me how the
checkpointer would get restarted under normal circumstances.

I did a kill on the checkpointer and it was ignored.  After a kill -9
the checkpointer process came back but also switched the xlog.  Is this
the expected behavior?

-- 
-David
david@pgmasters.net



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Amit Kapila
Дата:
On Mon, Nov 14, 2016 at 9:33 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> On Mon, Nov 14, 2016 at 12:49 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> At Sat, 12 Nov 2016 10:28:56 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1K0gGQTBxCyKqi6QnqOWGzEoVVPHCgPJ_RkOBoLPejCTA@mail.gmail.com>
>>> I think it is good to check the performance impact of this patch on
>>> many core m/c.  Is it possible for you to once check with Alexander
>>> Korotkov to see if he can provide you access to his powerful m/c which
>>> has 70 cores (if I remember correctly)?
>
> I heard about a number like that, and there is no reason to not do
> tests to be sure.
>

Okay, I have done some performance tests with this patch and found that it doesn't have any noticeable impact which is good.  Details of performance tests is below:
Machine configuration:
2 sockets, 28 cores (56 including Hyper-Threading)
RAM = 64GB
Data directory is configured on the magnetic disk and WAL on SSD.

Non-default postgresql.conf parameters
shared_buffers=8GB
max_connections=200
bgwriter_delay=10ms
checkpoint_completion_target=0

Keeping above parameters as fixed, I have varied checkpoint_timeout for various tests.  Each of the below results is a median of 3, 15min pgbench TPC-B tests.  All the tests are performed at 64 and or 128 client-count (Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)).  All the tests are done for pgbench scale factor - 300 which means data fits in shared buffers.


checkpoint_timeout=30s
client_count/patch_ver64128
HEAD51766853
Patch49636556

checkpoint_timeout=60s
client_count/patch_ver
64128
HEAD49626894
Patch52286814

checkpoint_timeout=120s
client_count/patch_ver
64128
HEAD54437308
Patch54536937

checkpoint_timeout=150s
client_count/patch_ver
128
HEAD7316
Patch7188


In above results, you can see that in some cases (example, for checkpoint_time=30s, @128-client count) TPS with the patch is slightly lower(1~5%), but I find that as a run-to-run variation, because on repeating the tests, I could not see such regression.  The reason of keeping low values for checkpoint_timeout and bgwriter_delay is to test if there is any impact due to new locking introduced in checkpointer and bgwriter.  The conclusion from my tests is that this patch is okay as far as performance is concerned.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Fri, Nov 18, 2016 at 7:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Okay, I have done some performance tests with this patch and found that it doesn't have any noticeable impact which
isgood.  Details of performance tests is below:
 
> Machine configuration:
> 2 sockets, 28 cores (56 including Hyper-Threading)
> RAM = 64GB
> Data directory is configured on the magnetic disk and WAL on SSD.

Nice spec!

> The conclusion from my tests is that this patch is okay as far as performance is concerned.

Thank you a lot for doing those additional tests!
-- 
Michael



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
David Steele
Дата:
On 11/18/16 12:38 PM, David Steele wrote:

> On 11/14/16 4:29 AM, Michael Paquier wrote:
>> On Mon, Nov 14, 2016 at 6:10 PM, Kyotaro HORIGUCHI
>>> If I'm not missing something, at the worst we have a checkpoint
>>> after a checkpointer restart that should have been supressed. Is
>>> it worth picking it up for the complexity?
> 
> That's the way I read it as well.  It's not clear to me how the
> checkpointer would get restarted under normal circumstances.
> 
> I did a kill on the checkpointer and it was ignored.  After a kill -9
> the checkpointer process came back but also switched the xlog.  Is this
> the expected behavior?

Ah, never mind.  I can see this caused a restart and recovery so the
archive timeout was reset and a switch occurred after timeout.

-- 
-David
david@pgmasters.net



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
Thank you very much for the testing on the nice machine.

At Fri, 18 Nov 2016 20:35:43 -0800, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqRa=igQMCx+FxbfwJ0TzhLU2tE+YOng7qAvZ+1NPm-FOw@mail.gmail.com>
> On Fri, Nov 18, 2016 at 7:00 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Okay, I have done some performance tests with this patch and found that it doesn't have any noticeable impact which
isgood.  Details of performance tests is below:
 
> > Machine configuration:
> > 2 sockets, 28 cores (56 including Hyper-Threading)
> > RAM = 64GB
> > Data directory is configured on the magnetic disk and WAL on SSD.
> 
> Nice spec!

This spec seems enough to see the performance of this patch.

> > The conclusion from my tests is that this patch is okay as far as performance is concerned.
> 
> Thank you a lot for doing those additional tests!

So, all my original concern were cleared. The last one is
resetting by a checkpointer restart.. I'd like to remove that if
Andres agrees.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Tue, Nov 15, 2016 at 9:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Nov 15, 2016 at 9:27 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> The term "WAL activity' is used in the comment for
>> GetProgressRecPtr. Its meaning is not clear but not well
>> defined. Might need a bit detailed explanation about that or "WAL
>> activity tracking". What do you think about this?
>>
>
> I would have written it as below:
>
> GetProgressRecPtr -- Returns the WAL progress.  WAL progress is
> determined by scanning each WALinsertion lock by taking directly the
> light-weight lock associated to it.

Not sure if that's better.. What about something as fancy as that?
 /*
- * Get the time of the last xlog segment switch
+ * GetProgressRecPtr -- Returns the newest WAL progress position.  WAL
+ * progress is determined by scanning each WALinsertion lock by taking
+ * directly the light-weight lock associated to it.  The result of this
+ * routine can be compared with the last checkpoint LSN to check if
+ * a checkpoint can be skipped or not.
+ *
It may be worth mentioning that the result of this routine is
basically used for checkpoint skip logic.
--
Michael

Вложения

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Mon, Nov 21, 2016 at 1:31 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> So, all my original concern were cleared.

Cool. Perhaps this could be marked as ready for committer then?

> The last one is
> resetting by a checkpointer restart.. I'd like to remove that if
> Andres agrees.

Could you clarify this point? v18 makes sure that the last segment
switch stays in shared memory so as we could still skip the activity
of archive_timeout correctly.
-- 
Michael



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
Hello,

At Mon, 21 Nov 2016 14:41:27 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSetnFjhGAB+tE2M68Vc_3BwbsEPe+dCMB8xnH0UYw3aA@mail.gmail.com>
> On Mon, Nov 21, 2016 at 1:31 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > So, all my original concern were cleared.
> 
> Cool. Perhaps this could be marked as ready for committer then?

^^;

> > The last one is
> > resetting by a checkpointer restart.. I'd like to remove that if
> > Andres agrees.
> 
> Could you clarify this point? v18 makes sure that the last segment
> switch stays in shared memory so as we could still skip the activity
> of archive_timeout correctly.

I don't doubt that it works. (I don't comment on the comment:) My
concern is complexity. I don't think we wish to save almost no
harm behavior caused by a thing rarely happens.  But, if you and
others on this thread don't mind the complexity, It's not worth
asserting myself more.

So, after a day waiting, I'll mark this as ready for committer
again.

reagards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
I almost forgot this.

At Mon, 21 Nov 2016 15:44:08 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20161121.154408.47398334.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello,
> 
> At Mon, 21 Nov 2016 14:41:27 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSetnFjhGAB+tE2M68Vc_3BwbsEPe+dCMB8xnH0UYw3aA@mail.gmail.com>
> > On Mon, Nov 21, 2016 at 1:31 PM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > So, all my original concern were cleared.
> > 
> > Cool. Perhaps this could be marked as ready for committer then?
> 
> ^^;
> 
> > > The last one is
> > > resetting by a checkpointer restart.. I'd like to remove that if
> > > Andres agrees.
> > 
> > Could you clarify this point? v18 makes sure that the last segment
> > switch stays in shared memory so as we could still skip the activity
> > of archive_timeout correctly.
> 
> I don't doubt that it works. (I don't comment on the comment:) My
> concern is complexity. I don't think we wish to save almost no
> harm behavior caused by a thing rarely happens.  But, if you and
> others on this thread don't mind the complexity, It's not worth
> asserting myself more.
> 
> So, after a day waiting, I'll mark this as ready for committer
> again.

I have marked this as ready for committer again.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Tue, Nov 22, 2016 at 6:27 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I have marked this as ready for committer again.

And moved to next CF for now.
-- 
Michael



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Amit Kapila
Дата:
On Mon, Nov 21, 2016 at 11:08 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Nov 15, 2016 at 9:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Tue, Nov 15, 2016 at 9:27 AM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>> The term "WAL activity' is used in the comment for
>>> GetProgressRecPtr. Its meaning is not clear but not well
>>> defined. Might need a bit detailed explanation about that or "WAL
>>> activity tracking". What do you think about this?
>>>
>>
>> I would have written it as below:
>>
>> GetProgressRecPtr -- Returns the WAL progress.  WAL progress is
>> determined by scanning each WALinsertion lock by taking directly the
>> light-weight lock associated to it.
>
> Not sure if that's better.. What about something as fancy as that?
>  /*
> - * Get the time of the last xlog segment switch
> + * GetProgressRecPtr -- Returns the newest WAL progress position.  WAL
> + * progress is determined by scanning each WALinsertion lock by taking
> + * directly the light-weight lock associated to it.  The result of this
> + * routine can be compared with the last checkpoint LSN to check if
> + * a checkpoint can be skipped or not.
> + *
> It may be worth mentioning that the result of this routine is
> basically used for checkpoint skip logic.
>

That's okay, but I think you are using it to skip switch segment stuff
as well.  Today, again going through patch, I noticed small anomaly

> + * Switch segment only when WAL has done some progress since the
+ * > last time a segment has switched because of a timeout.

> + if (GetProgressRecPtr() > last_switch_lsn)

Either the above comment is wrong or the code after it has a problem.
last_switch_lsn aka XLogCtl->lastSegSwitchLSN is updated not only for
a timeout but also when there is a lot of WAL activity which makes WAL
Write to cross a segment boundary.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Michael Paquier
Дата:
On Wed, Nov 30, 2016 at 7:53 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Nov 21, 2016 at 11:08 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Tue, Nov 15, 2016 at 9:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Tue, Nov 15, 2016 at 9:27 AM, Kyotaro HORIGUCHI
>>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>>>> The term "WAL activity' is used in the comment for
>>>> GetProgressRecPtr. Its meaning is not clear but not well
>>>> defined. Might need a bit detailed explanation about that or "WAL
>>>> activity tracking". What do you think about this?
>>>>
>>>
>>> I would have written it as below:
>>>
>>> GetProgressRecPtr -- Returns the WAL progress.  WAL progress is
>>> determined by scanning each WALinsertion lock by taking directly the
>>> light-weight lock associated to it.
>>
>> Not sure if that's better.. What about something as fancy as that?
>>  /*
>> - * Get the time of the last xlog segment switch
>> + * GetProgressRecPtr -- Returns the newest WAL progress position.  WAL
>> + * progress is determined by scanning each WALinsertion lock by taking
>> + * directly the light-weight lock associated to it.  The result of this
>> + * routine can be compared with the last checkpoint LSN to check if
>> + * a checkpoint can be skipped or not.
>> + *
>> It may be worth mentioning that the result of this routine is
>> basically used for checkpoint skip logic.
>>
>
> That's okay, but I think you are using it to skip switch segment stuff
> as well.  Today, again going through patch, I noticed small anomaly
>
>> + * Switch segment only when WAL has done some progress since the
> + * > last time a segment has switched because of a timeout.
>
>> + if (GetProgressRecPtr() > last_switch_lsn)
>
> Either the above comment is wrong or the code after it has a problem.
> last_switch_lsn aka XLogCtl->lastSegSwitchLSN is updated not only for
> a timeout but also when there is a lot of WAL activity which makes WAL
> Write to cross a segment boundary.

Right, this should be reworded a bit better to mention both. Done as attached.
--
Michael

Вложения

Re: Fix checkpoint skip logic on idle systems by tracking LSN progress

От
Amit Kapila
Дата:
On Fri, Dec 2, 2016 at 9:50 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Nov 30, 2016 at 7:53 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> + * Switch segment only when WAL has done some progress since the
>> + * > last time a segment has switched because of a timeout.
>>
>>> + if (GetProgressRecPtr() > last_switch_lsn)
>>
>> Either the above comment is wrong or the code after it has a problem.
>> last_switch_lsn aka XLogCtl->lastSegSwitchLSN is updated not only for
>> a timeout but also when there is a lot of WAL activity which makes WAL
>> Write to cross a segment boundary.
>
> Right, this should be reworded a bit better to mention both. Done as attached.
>

+ * Switch segment only when WAL has done some progress since the
+ * last time a segment has switched because of a timeout or because
+ * of some WAL activity.

I think it could be better written as below, but it is up to you to
retain your version or use below one.

Switch segment only when WAL has done some progress since the last
time a segment has switched due to timeout or WAL activity.  Apart
from that patch looks good to me.

Note to Committer:  As discussed above [1], this patch skips logging
for LogAccessExclusiveLocks which can be called from multiple places,
so for clarity purpose either we should document it or skip it only
when absolutely necessary.


[1] - https://www.postgresql.org/message-id/CAA4eK1KJAXA3PdxH4T1QJKBNOvyUK8UKm_GCvTuT%2BFC5jpjmjg%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
Andres Freund
Дата:
Hi,

A mime-type of invalid/octet-stream? That's an, uh, odd choice.

Working on committing this (tomorrow morning, not tonight).  There's
some relatively minor things I want to change:

- I don't like the name XLogSetFlags() - it's completely unclear what that those flags refer to - it could just as well
bereplay related. XLogSetRecordFlags()?
 
- Similarly I don't like the name "progress LSN" much. What does "progress" really mean in that". Maybe "consistency
LSN"?
- It's currently required to avoid triggering archive timeouts and checkpoints triggering each other, but I'm nervous
markingall xlog switches as unimportant. I think it'd be better to only mark timeout triggered switches as such.
 

Otherwise this seems to look good.

Regards,

Andres



Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
Robert Haas
Дата:
On Wed, Dec 21, 2016 at 4:28 PM, Andres Freund <andres@anarazel.de> wrote:
> - Similarly I don't like the name "progress LSN" much. What does
>   "progress" really mean in that". Maybe "consistency LSN"?

Whoa.  -1 from me for "consistency LSN".  Consistency has to with
whether the cluster has recovered up to the minimum recovery point or
whatever -- that is -- questions like "am i going to run into torn
pages?" and "should I expect some heap tuples to maybe be missing
index tuples, or the other way around?".  What I think "progress LSN"
is getting at -- actually fairly well -- is whether we're getting
anything *important* done, not whether we are consistent.  I don't
mind changing the name, but not to consistency LSN.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
Andres Freund
Дата:
On 2016-12-21 16:35:28 -0500, Robert Haas wrote:
> On Wed, Dec 21, 2016 at 4:28 PM, Andres Freund <andres@anarazel.de> wrote:
> > - Similarly I don't like the name "progress LSN" much. What does
> >   "progress" really mean in that". Maybe "consistency LSN"?
> 
> Whoa.  -1 from me for "consistency LSN".  Consistency has to with
> whether the cluster has recovered up to the minimum recovery point or
> whatever -- that is -- questions like "am i going to run into torn
> pages?" and "should I expect some heap tuples to maybe be missing
> index tuples, or the other way around?".

That's imo pretty much what progress LSN currently describes. Have there
been any records which are important for durability/consistency and
hence need to be archived and such.


> What I think "progress LSN"
> is getting at -- actually fairly well -- is whether we're getting
> anything *important* done, not whether we are consistent.  I don't
> mind changing the name, but not to consistency LSN.

Well, progress could just as well be replay. Or the actual insertion
point. Or up to where we've written out. Or synced out. Or
replicated....

Open to other suggestions - I'm not really happy with consistency LSN,
but definitely unhappy with progress LSN.

Greetings,

Andres Freund



Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
David Steele
Дата:
Hi Andres,

On 12/21/16 4:28 PM, Andres Freund wrote:

> Working on committing this (tomorrow morning, not tonight).  There's
> some relatively minor things I want to change:
>
> - I don't like the name XLogSetFlags() - it's completely unclear what
>   that those flags refer to - it could just as well be replay
>   related. XLogSetRecordFlags()?

That sounds a bit more clear.

> - Similarly I don't like the name "progress LSN" much. What does
>   "progress" really mean in that". Maybe "consistency LSN"?

Yes, please.  I think that really cuts to the core of what the patch is 
about.  Progress made perfect sense to me, but consistency is always the 
goal, and what we are saying here is that this is the last xlog record 
that is required to achieve consistency.  Anything that happens to be 
after it is informational only.

> - It's currently required to avoid triggering archive timeouts and
>   checkpoints triggering each other, but I'm nervous marking all xlog
>   switches as unimportant. I think it'd be better to only mark timeout
>   triggered switches as such.

That seems fine to me.  If the system is truly idle that might trigger 
one more xlog switch that is needed, but it seems like a reasonable 
compromise.

-- 
-David
david@pgmasters.net



Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
David Steele
Дата:
On 12/21/16 4:40 PM, Andres Freund wrote:
> On 2016-12-21 16:35:28 -0500, Robert Haas wrote:

>> What I think "progress LSN"
>> is getting at -- actually fairly well -- is whether we're getting
>> anything *important* done, not whether we are consistent.  I don't
>> mind changing the name, but not to consistency LSN.
>
> Well, progress could just as well be replay. Or the actual insertion
> point. Or up to where we've written out. Or synced out. Or
> replicated....
>
> Open to other suggestions - I'm not really happy with consistency LSN,
> but definitely unhappy with progress LSN.

MinConsistencyLSN?

-- 
-David
david@pgmasters.net



Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
"David G. Johnston"
Дата:
On Wed, Dec 21, 2016 at 2:40 PM, Andres Freund <andres@anarazel.de> wrote:
That's imo pretty much what progress LSN currently describes. Have there
been any records which are important for durability/consistency and
hence need to be archived and such.

The above, to me, describes a "milestone LSN"...​

David J.

Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
Michael Paquier
Дата:
On Thu, Dec 22, 2016 at 6:28 AM, Andres Freund <andres@anarazel.de> wrote:
> A mime-type of invalid/octet-stream? That's an, uh, odd choice.

Indeed. I am not sure what kind of accident happened here.
-- 
Michael



Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
Michael Paquier
Дата:
On Thu, Dec 22, 2016 at 6:41 AM, David Steele <david@pgmasters.net> wrote:
> On 12/21/16 4:28 PM, Andres Freund wrote:
>
>> Working on committing this (tomorrow morning, not tonight).  There's
>> some relatively minor things I want to change:

Thanks for looking at this patch.

>> - I don't like the name XLogSetFlags() - it's completely unclear what
>>   that those flags refer to - it could just as well be replay
>>   related. XLogSetRecordFlags()?
>
> That sounds a bit more clear.

Fine for me.

>> - Similarly I don't like the name "progress LSN" much. What does
>>   "progress" really mean in that". Maybe "consistency LSN"?
>
> Yes, please.  I think that really cuts to the core of what the patch is
> about.  Progress made perfect sense to me, but consistency is always the
> goal, and what we are saying here is that this is the last xlog record that
> is required to achieve consistency.  Anything that happens to be after it is
> informational only.

Fine as well.

>> - It's currently required to avoid triggering archive timeouts and
>>   checkpoints triggering each other, but I'm nervous marking all xlog
>>   switches as unimportant. I think it'd be better to only mark timeout
>>   triggered switches as such.
>
> That seems fine to me.  If the system is truly idle that might trigger one
> more xlog switch that is needed, but it seems like a reasonable compromise.

On a long-running embedded system the difference won't matter much. So
I guess I'm fine with this bit as well.
-- 
Michael



Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
Amit Kapila
Дата:
On Thu, Dec 22, 2016 at 3:10 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-12-21 16:35:28 -0500, Robert Haas wrote:
>> On Wed, Dec 21, 2016 at 4:28 PM, Andres Freund <andres@anarazel.de> wrote:
>> > - Similarly I don't like the name "progress LSN" much. What does
>> >   "progress" really mean in that". Maybe "consistency LSN"?
>>
>> Whoa.  -1 from me for "consistency LSN".  Consistency has to with
>> whether the cluster has recovered up to the minimum recovery point or
>> whatever -- that is -- questions like "am i going to run into torn
>> pages?" and "should I expect some heap tuples to maybe be missing
>> index tuples, or the other way around?".
>
> That's imo pretty much what progress LSN currently describes. Have there
> been any records which are important for durability/consistency and
> hence need to be archived and such.
>
>
>> What I think "progress LSN"
>> is getting at -- actually fairly well -- is whether we're getting
>> anything *important* done, not whether we are consistent.  I don't
>> mind changing the name, but not to consistency LSN.
>
> Well, progress could just as well be replay. Or the actual insertion
> point. Or up to where we've written out. Or synced out. Or
> replicated....
>
> Open to other suggestions - I'm not really happy with consistency LSN,
> but definitely unhappy with progress LSN.
>

last_essential_LSN?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
Andres Freund
Дата:
Hi,

On 2016-12-21 13:28:54 -0800, Andres Freund wrote:
> A mime-type of invalid/octet-stream? That's an, uh, odd choice.
>
> Working on committing this (tomorrow morning, not tonight).  There's
> some relatively minor things I want to change:
>
> - I don't like the name XLogSetFlags() - it's completely unclear what
>   that those flags refer to - it could just as well be replay
>   related. XLogSetRecordFlags()?
> - Similarly I don't like the name "progress LSN" much. What does
>   "progress" really mean in that". Maybe "consistency LSN"?
> - It's currently required to avoid triggering archive timeouts and
>   checkpoints triggering each other, but I'm nervous marking all xlog
>   switches as unimportant. I think it'd be better to only mark timeout
>   triggered switches as such.

Here's an updated version of this. Besides the above (with "consistency
LSN" now named "lastImportantAt" instead of the previous
lastProgressAt), I changed how the skipping works in the bgwriter: I
don't see any need to involve the checkpoint location there. This also
allows to drop GetLastCheckpointPtr().  Besides that I did a fair amount
of comment-smithing.

I plan to commit this later today.  Hope I got the reviewers roughly right.

Regards,

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
Andres Freund
Дата:
Hi,

On 2016-12-22 08:32:56 -0800, Andres Freund wrote:
> I plan to commit this later today.  Hope I got the reviewers roughly right.

And pushed. Thanks for the work on this everyone.

Andres



Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
Robert Haas
Дата:
On Thu, Dec 22, 2016 at 2:34 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-12-22 08:32:56 -0800, Andres Freund wrote:
>> I plan to commit this later today.  Hope I got the reviewers roughly right.
>
> And pushed. Thanks for the work on this everyone.

Cool.  Also, +1 for the important/unimportant terminology.  I like that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Fix checkpoint skip logic on idle systems by trackingLSN progress

От
Michael Paquier
Дата:
On Fri, Dec 23, 2016 at 8:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Dec 22, 2016 at 2:34 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-12-22 08:32:56 -0800, Andres Freund wrote:
>>> I plan to commit this later today.  Hope I got the reviewers roughly right.
>>
>> And pushed. Thanks for the work on this everyone.
>
> Cool.  Also, +1 for the important/unimportant terminology.  I like that.

Thanks for the commit.
-- 
Michael



Re: [HACKERS] Fix checkpoint skip logic on idle systems bytracking LSN progress

От
Kyotaro HORIGUCHI
Дата:
At Fri, 23 Dec 2016 11:02:11 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSsBzhOkCXyBh9_ZGUEnr0HCKRcpC9DMk6VVCGBez1pzA@mail.gmail.com>
> On Fri, Dec 23, 2016 at 8:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Dec 22, 2016 at 2:34 PM, Andres Freund <andres@anarazel.de> wrote:
> >> On 2016-12-22 08:32:56 -0800, Andres Freund wrote:
> >>> I plan to commit this later today.  Hope I got the reviewers roughly right.
> >>
> >> And pushed. Thanks for the work on this everyone.
> >
> > Cool.  Also, +1 for the important/unimportant terminology.  I like that.
> 
> Thanks for the commit.

Thanks for commiting.

By the way this issue seems beeing in the ToDo list.

https://wiki.postgresql.org/wiki/Todo#Point-In-Time_Recovery_.28PITR.29

>    Consider avoiding WAL switching via archive_timeout if there
>    has been no database activity
>     - archive_timeout behavior for no activity
>     - Re: archive_timeout behavior for no activity

So I marked it as "done".

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center