Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?

Поиск
Список
Период
Сортировка
От digoal zhou
Тема Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
Дата
Msg-id CAKHd5Ce-bnD=gEEdtXiT2_AY7shquTKd0yHXXk5F4zVEKRPX-w@mail.gmail.com
обсуждение исходный текст
Ответы Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?  (Heikki Linnakangas <hlinnaka@iki.fi>)
Список pgsql-hackers
   PostgreSQL (<=9.4) trend to smooth buffer write smooth in a checkpoint_completion_target (checkpoint_timeout or checkpoint_segments), but when we use synchronous_commit=off, there is a little problem for the checkpoint_segments target, because xlog write fast(for full page write which the first page write after checkpoint), so checkpointer cann't sleep and write buffer not smooth. 
   There is an test:
# stap -DMAXSKIPPED=100000 -v 11111 -e '
global s_var, e_var, stat_var;

/* probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__start") {
  s_var[pid(),1] = gettimeofday_us()
}

/* probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__done") {
  e_var[pid(),1] = gettimeofday_us()
  if ( s_var[pid(),1] > 0 )
    stat_var[pid(),1] <<< e_var[pid(),1] - s_var[pid(),1]
}

/* probe smgr__md__write__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__start") {
  s_var[pid(),2] = gettimeofday_us()
}

/* probe smgr__md__write__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__done") {
  e_var[pid(),2] = gettimeofday_us()
  if ( s_var[pid(),2] > 0 )
    stat_var[pid(),2] <<< e_var[pid(),2] - s_var[pid(),2]
}

probe process("/opt/pgsql/bin/postgres").mark("buffer__sync__start") {
  printf("buffer__sync__start num_buffers: %d, dirty_buffers: %d\n", $NBuffers, $num_to_write)
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__start") {
  printf("checkpoint start\n")
}

probe process("/opt/pgsql/bin/postgres").mark("checkpoint__done") {
  printf("checkpoint done\n")
}

probe timer.s(1) {
  foreach ([v1,v2] in stat_var +) {
    if ( @count(stat_var[v1,v2]) >0 ) {
      printf("r1_or_w2 %d, pid: %d, min: %d, max: %d, avg: %d, sum: %d, count: %d\n", v2, v1, @min(stat_var[v1,v2]), @max(stat_var[v1,v2]), @avg(stat_var[v1,v2]), @sum(stat_var[v1,v2]), @count(stat_var[v1,v2]))
    }
  }
  printf("----------------------------------end-----------------------------\n")
  delete s_var
  delete e_var
  delete stat_var
}'



Use the test table and data:
create table tbl(id primary key,info text,crt_time timestamp);
insert into tbl select generate_series(1,50000000),now(),now();


Use pgbench test it.
$ vi test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id = :id ;


$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 100000000
When on schedule checkpoint occure , the tps:
progress: 255.0 s, 58152.2 tps, lat 0.462 ms stddev 0.504
progress: 256.0 s, 31382.8 tps, lat 0.844 ms stddev 2.331
progress: 257.0 s, 14615.5 tps, lat 1.863 ms stddev 4.554
progress: 258.0 s, 16258.4 tps, lat 1.652 ms stddev 4.139
progress: 259.0 s, 17814.7 tps, lat 1.526 ms stddev 4.035
progress: 260.0 s, 14573.8 tps, lat 1.825 ms stddev 5.592
progress: 261.0 s, 16736.6 tps, lat 1.600 ms stddev 5.018
progress: 262.0 s, 19060.5 tps, lat 1.448 ms stddev 4.818
progress: 263.0 s, 20553.2 tps, lat 1.290 ms stddev 4.146
progress: 264.0 s, 26223.0 tps, lat 1.042 ms stddev 3.711
progress: 265.0 s, 31953.0 tps, lat 0.836 ms stddev 2.837
progress: 266.0 s, 43396.1 tps, lat 0.627 ms stddev 1.615
progress: 267.0 s, 50487.8 tps, lat 0.533 ms stddev 0.647
progress: 268.0 s, 53537.7 tps, lat 0.502 ms stddev 0.598
progress: 269.0 s, 54259.3 tps, lat 0.496 ms stddev 0.624
progress: 270.0 s, 56139.8 tps, lat 0.479 ms stddev 0.524

The parameters for onschedule checkpoint:
checkpoint_segments = 512
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9

stap's output : 
there is 156467 dirty blocks, we can see the buffer write per second, write buffer is not smooth between time target.
but between xlog target.
156467/(4.5*60*0.9) = 579.5 write per second.


checkpoint start
buffer__sync__start num_buffers: 262144, dirty_buffers: 156467
r1_or_w2 2, pid: 19848, min: 41, max: 1471, avg: 49, sum: 425291, count: 8596
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 153, avg: 49, sum: 450597, count: 9078
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 643, avg: 51, sum: 429193, count: 8397
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 1042, avg: 55, sum: 449091, count: 8097
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 254, avg: 52, sum: 296668, count: 5617
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 39, max: 171, avg: 54, sum: 321027, count: 5851
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 138, avg: 60, sum: 300056, count: 4953
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 42, max: 1217, avg: 65, sum: 312859, count: 4748
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 1371, avg: 56, sum: 353905, count: 6304
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 358, avg: 58, sum: 236254, count: 4038
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 34, max: 1239, avg: 63, sum: 296906, count: 4703
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 17408, avg: 63, sum: 415234, count: 6534
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 5486, avg: 57, sum: 190345, count: 3318
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 29, max: 510, avg: 53, sum: 136221, count: 2563
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 32, max: 733, avg: 52, sum: 108327, count: 2070
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 34, max: 382, avg: 53, sum: 96157, count: 1812
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 43, max: 327, avg: 53, sum: 83641, count: 1571
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 33, max: 102, avg: 54, sum: 79991, count: 1468
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 35, max: 88, avg: 53, sum: 74338, count: 1389
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 32, max: 86, avg: 52, sum: 65710, count: 1243
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 30, max: 347, avg: 52, sum: 66866, count: 1263
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 93, avg: 54, sum: 75642, count: 1398
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 33, max: 100, avg: 51, sum: 62302, count: 1216
......



I think we can add an condition to the IsCheckpointOnSchedule, 
                if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
                {
                    recptr = GetInsertRecPtr();
                    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;

                    if (progress < elapsed_xlogs)
                    {
                        ckpt_cached_elapsed = elapsed_xlogs;
                        return false;
                    }
                 }


# vi src/backend/postmaster/checkpointer.c

#include "access/xact.h"

/*
 * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
 *               in time?
 *
 * Compares the current progress against the time/segments elapsed since last
 * checkpoint, and returns true if the progress we've made this far is greater
 * than the elapsed time/segments.
 */
static bool
IsCheckpointOnSchedule(double progress)
{
        XLogRecPtr      recptr;
        struct timeval now;
        double          elapsed_xlogs,
                                elapsed_time;

        Assert(ckpt_active);

        /* Scale progress according to checkpoint_completion_target. */
        progress *= CheckPointCompletionTarget;

        /*
         * Check against the cached value first. Only do the more expensive
         * calculations once we reach the target previously calculated. Since
         * neither time or WAL insert pointer moves backwards, a freshly
         * calculated value can only be greater than or equal to the cached value.
         */
        if (progress < ckpt_cached_elapsed)
                return false;

        /*
         * Check progress against WAL segments written and checkpoint_segments.
         *
         * We compare the current WAL insert location against the location
         * computed before calling CreateCheckPoint. The code in XLogInsert that
         * actually triggers a checkpoint when checkpoint_segments is exceeded
         * compares against RedoRecptr, so this is not completely accurate.
         * However, it's good enough for our purposes, we're only calculating an
         * estimate anyway.
         */
        if (!RecoveryInProgress())
        {
                if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
                {
                    recptr = GetInsertRecPtr();
                    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;

                    if (progress < elapsed_xlogs)
                    {
                        ckpt_cached_elapsed = elapsed_xlogs;
                        return false;
                    }
                 }
        }

        /*
         * Check progress against time elapsed and checkpoint_timeout.
         */
        gettimeofday(&now, NULL);
        elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
                                        now.tv_usec / 1000000.0) / CheckPointTimeout;

        if (progress < elapsed_time)
        {
                ckpt_cached_elapsed = elapsed_time;
                return false;
        }

        /* It looks like we're on schedule. */
        return true;
}

# gmake && gmake install

$ pg_ctl restart -m fast
Test again:
progress: 291.0 s, 63144.9 tps, lat 0.426 ms stddev 0.383
progress: 292.0 s, 55063.7 tps, lat 0.480 ms stddev 1.433
progress: 293.0 s, 12225.3 tps, lat 2.238 ms stddev 4.460
progress: 294.0 s, 16436.4 tps, lat 1.621 ms stddev 4.043
progress: 295.0 s, 18516.5 tps, lat 1.444 ms stddev 3.286
progress: 296.0 s, 21983.7 tps, lat 1.251 ms stddev 2.941
progress: 297.0 s, 25759.7 tps, lat 1.034 ms stddev 2.356
progress: 298.0 s, 33139.4 tps, lat 0.821 ms stddev 1.676
progress: 299.0 s, 41904.9 tps, lat 0.644 ms stddev 1.134
progress: 300.0 s, 52432.9 tps, lat 0.513 ms stddev 0.470
progress: 301.0 s, 57115.4 tps, lat 0.471 ms stddev 0.325
progress: 302.0 s, 59422.1 tps, lat 0.452 ms stddev 0.297
progress: 303.0 s, 59860.5 tps, lat 0.449 ms stddev 0.309

We can see checkpointer wiriter buffer smooth(spread time perid) this time.
checkpoint start
----------------------------------end-----------------------------
buffer__sync__start num_buffers: 262144, dirty_buffers: 156761
r1_or_w2 2, pid: 22334, min: 51, max: 137, avg: 60, sum: 52016, count: 860
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 108, avg: 58, sum: 35526, count: 604
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 145, avg: 71, sum: 39779, count: 559
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 52, max: 172, avg: 79, sum: 47279, count: 594
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 44, max: 160, avg: 63, sum: 36907, count: 581
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 61, sum: 33895, count: 552
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 116, avg: 61, sum: 38177, count: 617
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 62, sum: 34199, count: 550
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 53, max: 109, avg: 65, sum: 39842, count: 606
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 50, max: 118, avg: 64, sum: 35099, count: 545
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 50, max: 107, avg: 64, sum: 39027, count: 606
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 114, avg: 62, sum: 34054, count: 545
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 47, max: 106, avg: 63, sum: 38573, count: 605
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 48, max: 101, avg: 62, sum: 38051, count: 607
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 42, max: 103, avg: 61, sum: 33596, count: 545


But there is also a little problem, When PostgreSQL write xlog reach checkpoint_segments earlier then checkpoint_timeout, the next checkpoint will start soon, so we must tuning the checkpoint_segments larger when the checkpoint occure busy.

Regards,
Digoal

--
公益是一辈子的事,I'm Digoal,Just Do It.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: Multi-xacts and our process problem
Следующее
От: Craig Ringer
Дата:
Сообщение: RFC: Non-user-resettable SET SESSION AUTHORISATION