Обсуждение: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
   PostgreSQL (<=9.4) trend to smooth buffer write smooth in a checkpoint_completion_target (checkpoint_timeout or checkpoint_segments), but when we use synchronous_commit=off, there is a little problem for the checkpoint_segments target, because xlog write fast(for full page write which the first page write after checkpoint), so checkpointer cann't sleep and write buffer not smooth. 
   There is an test:
# stap -DMAXSKIPPED=100000 -v 11111 -e '
global s_var, e_var, stat_var;
/* probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__start") {
  s_var[pid(),1] = gettimeofday_us()
}
/* probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__read__done") {
  e_var[pid(),1] = gettimeofday_us()
  if ( s_var[pid(),1] > 0 )
    stat_var[pid(),1] <<< e_var[pid(),1] - s_var[pid(),1]
}
/* probe smgr__md__write__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__start") {
  s_var[pid(),2] = gettimeofday_us()
}
/* probe smgr__md__write__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int); */
probe process("/opt/pgsql/bin/postgres").mark("smgr__md__write__done") {
  e_var[pid(),2] = gettimeofday_us()
  if ( s_var[pid(),2] > 0 )
    stat_var[pid(),2] <<< e_var[pid(),2] - s_var[pid(),2]
}
probe process("/opt/pgsql/bin/postgres").mark("buffer__sync__start") {
  printf("buffer__sync__start num_buffers: %d, dirty_buffers: %d\n", $NBuffers, $num_to_write)
}
probe process("/opt/pgsql/bin/postgres").mark("checkpoint__start") {
  printf("checkpoint start\n")
}
probe process("/opt/pgsql/bin/postgres").mark("checkpoint__done") {
  printf("checkpoint done\n")
}
probe timer.s(1) {
  foreach ([v1,v2] in stat_var +) {
    if ( @count(stat_var[v1,v2]) >0 ) {
      printf("r1_or_w2 %d, pid: %d, min: %d, max: %d, avg: %d, sum: %d, count: %d\n", v2, v1, @min(stat_var[v1,v2]), @max(stat_var[v1,v2]), @avg(stat_var[v1,v2]), @sum(stat_var[v1,v2]), @count(stat_var[v1,v2]))
    }
  }
  printf("----------------------------------end-----------------------------\n")
  delete s_var
  delete e_var
  delete stat_var
}'
Use the test table and data:
create table tbl(id primary key,info text,crt_time timestamp);
insert into tbl select generate_series(1,50000000),now(),now();
Use pgbench test it.
$ vi test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id = :id ;
$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 100000000
When on schedule checkpoint occure , the tps:
progress: 255.0 s, 58152.2 tps, lat 0.462 ms stddev 0.504
progress: 256.0 s, 31382.8 tps, lat 0.844 ms stddev 2.331
progress: 257.0 s, 14615.5 tps, lat 1.863 ms stddev 4.554
progress: 258.0 s, 16258.4 tps, lat 1.652 ms stddev 4.139
progress: 259.0 s, 17814.7 tps, lat 1.526 ms stddev 4.035
progress: 260.0 s, 14573.8 tps, lat 1.825 ms stddev 5.592
progress: 261.0 s, 16736.6 tps, lat 1.600 ms stddev 5.018
progress: 262.0 s, 19060.5 tps, lat 1.448 ms stddev 4.818
progress: 263.0 s, 20553.2 tps, lat 1.290 ms stddev 4.146
progress: 264.0 s, 26223.0 tps, lat 1.042 ms stddev 3.711
progress: 265.0 s, 31953.0 tps, lat 0.836 ms stddev 2.837
progress: 266.0 s, 43396.1 tps, lat 0.627 ms stddev 1.615
progress: 267.0 s, 50487.8 tps, lat 0.533 ms stddev 0.647
progress: 268.0 s, 53537.7 tps, lat 0.502 ms stddev 0.598
progress: 269.0 s, 54259.3 tps, lat 0.496 ms stddev 0.624
progress: 270.0 s, 56139.8 tps, lat 0.479 ms stddev 0.524
The parameters for onschedule checkpoint:
checkpoint_segments = 512
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9
stap's output : 
there is 156467 dirty blocks, we can see the buffer write per second, write buffer is not smooth between time target.
but between xlog target.
156467/(4.5*60*0.9) = 579.5 write per second.
checkpoint start
buffer__sync__start num_buffers: 262144, dirty_buffers: 156467
r1_or_w2 2, pid: 19848, min: 41, max: 1471, avg: 49, sum: 425291, count: 8596
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 153, avg: 49, sum: 450597, count: 9078
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 643, avg: 51, sum: 429193, count: 8397
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 1042, avg: 55, sum: 449091, count: 8097
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 254, avg: 52, sum: 296668, count: 5617
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 39, max: 171, avg: 54, sum: 321027, count: 5851
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 138, avg: 60, sum: 300056, count: 4953
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 42, max: 1217, avg: 65, sum: 312859, count: 4748
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 1371, avg: 56, sum: 353905, count: 6304
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 41, max: 358, avg: 58, sum: 236254, count: 4038
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 34, max: 1239, avg: 63, sum: 296906, count: 4703
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 17408, avg: 63, sum: 415234, count: 6534
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 5486, avg: 57, sum: 190345, count: 3318
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 29, max: 510, avg: 53, sum: 136221, count: 2563
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 32, max: 733, avg: 52, sum: 108327, count: 2070
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 34, max: 382, avg: 53, sum: 96157, count: 1812
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 43, max: 327, avg: 53, sum: 83641, count: 1571
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 33, max: 102, avg: 54, sum: 79991, count: 1468
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 35, max: 88, avg: 53, sum: 74338, count: 1389
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 32, max: 86, avg: 52, sum: 65710, count: 1243
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 30, max: 347, avg: 52, sum: 66866, count: 1263
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 31, max: 93, avg: 54, sum: 75642, count: 1398
----------------------------------end-----------------------------
r1_or_w2 2, pid: 19848, min: 33, max: 100, avg: 51, sum: 62302, count: 1216
......
I think we can add an condition to the IsCheckpointOnSchedule, 
                if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
                {
                    recptr = GetInsertRecPtr();
                    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
                    if (progress < elapsed_xlogs)
                    {
                        ckpt_cached_elapsed = elapsed_xlogs;
                        return false;
                    }
                 }
# vi src/backend/postmaster/checkpointer.c
#include "access/xact.h"
/*
 * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
 *               in time?
 *
 * Compares the current progress against the time/segments elapsed since last
 * checkpoint, and returns true if the progress we've made this far is greater
 * than the elapsed time/segments.
 */
static bool
IsCheckpointOnSchedule(double progress)
{
        XLogRecPtr      recptr;
        struct timeval now;
        double          elapsed_xlogs,
                                elapsed_time;
        Assert(ckpt_active);
        /* Scale progress according to checkpoint_completion_target. */
        progress *= CheckPointCompletionTarget;
        /*
         * Check against the cached value first. Only do the more expensive
         * calculations once we reach the target previously calculated. Since
         * neither time or WAL insert pointer moves backwards, a freshly
         * calculated value can only be greater than or equal to the cached value.
         */
        if (progress < ckpt_cached_elapsed)
                return false;
        /*
         * Check progress against WAL segments written and checkpoint_segments.
         *
         * We compare the current WAL insert location against the location
         * computed before calling CreateCheckPoint. The code in XLogInsert that
         * actually triggers a checkpoint when checkpoint_segments is exceeded
         * compares against RedoRecptr, so this is not completely accurate.
         * However, it's good enough for our purposes, we're only calculating an
         * estimate anyway.
         */
        if (!RecoveryInProgress())
        {
                if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
                {
                    recptr = GetInsertRecPtr();
                    elapsed_xlogs = (((double) (recptr - ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
                    if (progress < elapsed_xlogs)
                    {
                        ckpt_cached_elapsed = elapsed_xlogs;
                        return false;
                    }
                 }
        }
        /*
         * Check progress against time elapsed and checkpoint_timeout.
         */
        gettimeofday(&now, NULL);
        elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
                                        now.tv_usec / 1000000.0) / CheckPointTimeout;
        if (progress < elapsed_time)
        {
                ckpt_cached_elapsed = elapsed_time;
                return false;
        }
        /* It looks like we're on schedule. */
        return true;
}
# gmake && gmake install
$ pg_ctl restart -m fast
Test again:
progress: 291.0 s, 63144.9 tps, lat 0.426 ms stddev 0.383
progress: 292.0 s, 55063.7 tps, lat 0.480 ms stddev 1.433
progress: 293.0 s, 12225.3 tps, lat 2.238 ms stddev 4.460
progress: 294.0 s, 16436.4 tps, lat 1.621 ms stddev 4.043
progress: 295.0 s, 18516.5 tps, lat 1.444 ms stddev 3.286
progress: 296.0 s, 21983.7 tps, lat 1.251 ms stddev 2.941
progress: 297.0 s, 25759.7 tps, lat 1.034 ms stddev 2.356
progress: 298.0 s, 33139.4 tps, lat 0.821 ms stddev 1.676
progress: 299.0 s, 41904.9 tps, lat 0.644 ms stddev 1.134
progress: 300.0 s, 52432.9 tps, lat 0.513 ms stddev 0.470
progress: 301.0 s, 57115.4 tps, lat 0.471 ms stddev 0.325
progress: 302.0 s, 59422.1 tps, lat 0.452 ms stddev 0.297
progress: 303.0 s, 59860.5 tps, lat 0.449 ms stddev 0.309
We can see checkpointer wiriter buffer smooth(spread time perid) this time.
checkpoint start
----------------------------------end-----------------------------
buffer__sync__start num_buffers: 262144, dirty_buffers: 156761
r1_or_w2 2, pid: 22334, min: 51, max: 137, avg: 60, sum: 52016, count: 860
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 108, avg: 58, sum: 35526, count: 604
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 145, avg: 71, sum: 39779, count: 559
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 52, max: 172, avg: 79, sum: 47279, count: 594
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 44, max: 160, avg: 63, sum: 36907, count: 581
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 61, sum: 33895, count: 552
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 116, avg: 61, sum: 38177, count: 617
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 113, avg: 62, sum: 34199, count: 550
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 53, max: 109, avg: 65, sum: 39842, count: 606
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 50, max: 118, avg: 64, sum: 35099, count: 545
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 50, max: 107, avg: 64, sum: 39027, count: 606
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 51, max: 114, avg: 62, sum: 34054, count: 545
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 47, max: 106, avg: 63, sum: 38573, count: 605
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 48, max: 101, avg: 62, sum: 38051, count: 607
----------------------------------end-----------------------------
r1_or_w2 2, pid: 22334, min: 42, max: 103, avg: 61, sum: 33596, count: 545
But there is also a little problem, When PostgreSQL write xlog reach checkpoint_segments earlier then checkpoint_timeout, the next checkpoint will start soon, so we must tuning the checkpoint_segments larger when the checkpoint occure busy.
Regards,
Digoal
--
公益是一辈子的事,I'm Digoal,Just Do It.
公益是一辈子的事,I'm Digoal,Just Do It.
Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
От
 
		    	Heikki Linnakangas
		    Дата:
		        On 05/12/2015 03:27 AM, digoal zhou wrote:
>     PostgreSQL (<=9.4) trend to smooth buffer write smooth in a
> checkpoint_completion_target (checkpoint_timeout or checkpoint_segments),
> but when we use synchronous_commit=off, there is a little problem for
> the checkpoint_segments
> target, because xlog write fast(for full page write which the first page
> write after checkpoint), so checkpointer cann't sleep and write buffer not
> smooth.
> ...
> I think we can add an condition to the IsCheckpointOnSchedule,
>                  if (synchronous_commit != SYNCHRONOUS_COMMIT_OFF)
>                  {
>                      recptr = GetInsertRecPtr();
>                      elapsed_xlogs = (((double) (recptr -
> ckpt_start_recptr)) / XLogSegSize) / CheckPointSegments;
>
>                      if (progress < elapsed_xlogs)
>                      {
>                          ckpt_cached_elapsed = elapsed_xlogs;
>                          return false;
>                      }
>                   }
This has nothing to do with asynchronous_commit, except that setting
asynchronous_commit=off makes your test case run faster, and hit the
problem harder.
I think the real problem here is that IsCheckpointOnSchedule assumes
that the rate of WAL generated is constant throughout the checkpoint
cycle, but in reality you generate a lot more WAL immediately after the
checkpoint begins, thanks to full_page_writes. For example, in the
beginning of the cycle, you quickly use up, say, 20% of the WAL space in
the first 10 seconds, and the scheduling thinks it's in a lot of hurry
to finish the checkpoint because it extrapolates that the rest of the
WAL will be used up in the next 40 seconds. But in reality, the WAL
consumption levels off, and you have many minutes left until
CheckPointSegments.
Can you try the attached patch? It modifies the above calculation to
take the full-page-write effect into account. I used X^1.5 as the
corrective function, which roughly reflects the typical WAL consumption
pattern. You can adjust the exponent, 1.5, to make the correction more
or less aggressive.
- Heikki
			
		Вложения
Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
От
 
		    	Heikki Linnakangas
		    Дата:
		        (please keep the mailing list CC'd, and please don't top-post) On 05/13/2015 05:00 AM, digoal zhou wrote: > I test it, but use exponent not very perfect in any environment. > why cann't use time only? As you mentioned yourself earlier, if you only use time but you reach checkpoint_segments before checkpoint_timeout, you will not complete the checkpoint until you'd already need to begin the next checkpoint. You can't completely ignore checkpoint_segments. Comparing the numbers you give below with compensate-fpw-effect-on-checkpoint-scheduling-1.patch, with the ones from your first post, it looks like the patch already made the situation much better. You still have a significant burst in the beginning of the checkpoint cycle, but it's a lot smaller than without the patch. Before the patch, the "count" topped at 9078, and below it topped at 2964. There is a strange "lull" after the burst, I'm not sure what's going on there, but overall it seems like a big improvement. Did the patch alleviate the bump in latency that pgbench reports? I put the "count" numbers from your original post and below into a spreadsheet, and created some fancy charts. See attached. It shows the same thing but with pretty pictures. Assuming we want the checkpoint to be spread as evenly as possible across the cycle, the ideal would be a straight line from 0 to about 150000 in 270 seconds in the cumulative chart. You didn't give the full data, but you can extrapolate the lines to get a rough picture of how close the different versions are from that ideal. In summary, the X^1.5 correction seems to work pretty well. It doesn't completely eliminate the problem, but it makes it a lot better. I don't want to over-compensate for the full-page-write effect either, because there are also applications where that effect isn't so big. For example, an application that performs a lot of updates, but all the updates are on a small number of pages, so the full-page-write storm immediately after checkpoint doesn't last long. A worst case for this patch would be such an application - lots of updates on only a few pages - with a long checkpoint_timeoout but relatively small checkpoint_segments, so that checkpoints are always driven by checkpoint_segments. I'd like to see some benchmarking of that worst case before committing anything like this. > ----------------------------------end----------------------------- > checkpoint start > buffer__sync__start num_buffers: 524288, dirty_buffers: 156931 > r1_or_w2 2, pid: 29132, min: 44, max: 151, avg: 52, sum: 49387, count: 932 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 95, avg: 49, sum: 41532, count: 837 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 747, avg: 54, sum: 100419, count: 1849 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 372, avg: 52, sum: 110701, count: 2090 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 115, avg: 57, sum: 147510, count: 2575 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 470, avg: 58, sum: 145217, count: 2476 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 120, avg: 54, sum: 161401, count: 2964 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 208, avg: 59, sum: 170280, count: 2847 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 10089, avg: 62, sum: 136106, count: > 2181 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 41, max: 487, avg: 56, sum: 88990, count: 1570 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 39, max: 102, avg: 55, sum: 59807, count: 1083 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 40, max: 557, avg: 56, sum: 117274, count: 2083 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 537, avg: 58, sum: 169867, count: 2882 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 147, avg: 60, sum: 92835, count: 1538 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 30, max: 93, avg: 55, sum: 14641, count: 264 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 48, max: 92, avg: 56, sum: 11834, count: 210 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 91, avg: 56, sum: 9151, count: 162 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 92, avg: 57, sum: 8621, count: 151 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 36, max: 90, avg: 57, sum: 7962, count: 139 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 48, max: 93, avg: 58, sum: 7194, count: 123 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 33, max: 95, avg: 58, sum: 7143, count: 123 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 89, avg: 57, sum: 6801, count: 118 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 49, max: 100, avg: 58, sum: 6818, count: 117 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 90, avg: 57, sum: 6982, count: 121 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 88, avg: 55, sum: 6459, count: 117 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 48, max: 88, avg: 58, sum: 7022, count: 121 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 47, max: 94, avg: 57, sum: 5952, count: 104 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 49, max: 95, avg: 57, sum: 6871, count: 119 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 49, max: 85, avg: 58, sum: 6829, count: 117 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 89, avg: 57, sum: 6851, count: 119 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 49, max: 100, avg: 57, sum: 6779, count: 117 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 93, avg: 55, sum: 6502, count: 117 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 47, max: 98, avg: 58, sum: 6805, count: 117 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 38, max: 90, avg: 57, sum: 6771, count: 118 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 56, sum: 6593, count: 116 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 35, max: 101, avg: 57, sum: 6809, count: 119 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 100, avg: 57, sum: 6171, count: 107 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 48, max: 105, avg: 57, sum: 6801, count: 119 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 95, avg: 57, sum: 6792, count: 119 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 56, sum: 6693, count: 118 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 57, sum: 6878, count: 120 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 56, sum: 6664, count: 117 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 94, avg: 57, sum: 7051, count: 123 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 92, avg: 57, sum: 6957, count: 120 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 48, max: 94, avg: 57, sum: 6842, count: 119 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 100, avg: 57, sum: 6865, count: 119 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 49, max: 102, avg: 58, sum: 6915, count: 119 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 48, max: 94, avg: 57, sum: 6187, count: 107 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 50, max: 86, avg: 58, sum: 6957, count: 119 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 97, avg: 55, sum: 33636, count: 609 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 36, max: 90, avg: 55, sum: 34180, count: 620 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 29, max: 92, avg: 53, sum: 36569, count: 680 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 40, max: 91, avg: 54, sum: 37374, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 32, max: 86, avg: 54, sum: 33347, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 94, avg: 54, sum: 37603, count: 684 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 32, max: 93, avg: 55, sum: 33777, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 104, avg: 55, sum: 37566, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 35, max: 92, avg: 54, sum: 37037, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 42, max: 106, avg: 57, sum: 35181, count: 614 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 99, avg: 54, sum: 36981, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 88, avg: 53, sum: 33202, count: 622 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 35, max: 89, avg: 54, sum: 36825, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 30, max: 88, avg: 53, sum: 33917, count: 635 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 89, avg: 55, sum: 36234, count: 658 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 30, max: 99, avg: 55, sum: 37719, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 54, sum: 33491, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 54, sum: 37365, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 101, avg: 54, sum: 33481, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 37, max: 93, avg: 54, sum: 37102, count: 685 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 40, max: 87, avg: 54, sum: 36968, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 84, avg: 54, sum: 33565, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 92, avg: 54, sum: 37271, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 48, max: 96, avg: 55, sum: 34272, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 85, avg: 54, sum: 37378, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 107, avg: 53, sum: 36715, count: 680 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 88, avg: 54, sum: 33620, count: 616 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 29, max: 94, avg: 54, sum: 37093, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 39, max: 110, avg: 53, sum: 33013, count: 612 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 40, max: 97, avg: 54, sum: 37215, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 37, max: 90, avg: 54, sum: 37240, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 41, max: 95, avg: 54, sum: 33555, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 38, max: 89, avg: 54, sum: 37503, count: 683 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 38, max: 95, avg: 55, sum: 33803, count: 614 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 89, avg: 56, sum: 38403, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 33, max: 92, avg: 54, sum: 37354, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 93, avg: 55, sum: 33881, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 34, max: 91, avg: 54, sum: 37047, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 32, max: 85, avg: 53, sum: 33003, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 30, max: 92, avg: 53, sum: 36854, count: 683 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 40, max: 92, avg: 54, sum: 36597, count: 673 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 54, sum: 33689, count: 620 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 32, max: 92, avg: 54, sum: 37194, count: 684 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 90, avg: 53, sum: 32813, count: 612 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 32, max: 100, avg: 54, sum: 37485, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 31, max: 97, avg: 54, sum: 33294, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 33, max: 94, avg: 54, sum: 37320, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 28, max: 92, avg: 54, sum: 37067, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 87, avg: 54, sum: 33766, count: 614 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 33, max: 110, avg: 53, sum: 36220, count: 680 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 35, max: 98, avg: 54, sum: 33442, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 33, max: 97, avg: 55, sum: 37692, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 33, max: 95, avg: 54, sum: 37073, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 35, max: 88, avg: 54, sum: 33676, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 30, max: 103, avg: 53, sum: 36770, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 96, avg: 54, sum: 33447, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 91, avg: 55, sum: 37643, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 38, max: 90, avg: 54, sum: 37377, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 103, avg: 56, sum: 34531, count: 614 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 30, max: 121, avg: 54, sum: 37412, count: 683 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 40, max: 89, avg: 54, sum: 33173, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 31, max: 94, avg: 54, sum: 37385, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 28, max: 106, avg: 55, sum: 38132, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 34, max: 96, avg: 55, sum: 33800, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 37, max: 98, avg: 56, sum: 38305, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 28, max: 104, avg: 55, sum: 33744, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 103, avg: 54, sum: 36923, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 36, max: 89, avg: 55, sum: 37797, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 103, avg: 56, sum: 34902, count: 620 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 88, avg: 55, sum: 38025, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 43, max: 102, avg: 56, sum: 34545, count: 614 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 34, max: 94, avg: 55, sum: 37756, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 28, max: 93, avg: 54, sum: 33530, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 31, max: 97, avg: 55, sum: 37992, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 35, max: 99, avg: 55, sum: 37923, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 39, max: 101, avg: 55, sum: 34027, count: 614 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 28, max: 93, avg: 53, sum: 36078, count: 680 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 41, max: 89, avg: 51, sum: 31563, count: 612 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 31, max: 92, avg: 52, sum: 35596, count: 680 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 35, max: 102, avg: 55, sum: 37816, count: 685 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 45, max: 102, avg: 55, sum: 33828, count: 613 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 30, max: 93, avg: 54, sum: 37285, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 30, max: 90, avg: 55, sum: 34037, count: 614 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 30, max: 86, avg: 54, sum: 37584, count: 684 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 32, max: 103, avg: 55, sum: 37946, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 97, avg: 56, sum: 34556, count: 617 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 33, max: 99, avg: 56, sum: 38213, count: 681 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 44, max: 97, avg: 56, sum: 34613, count: 614 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 35, max: 101, avg: 55, sum: 37925, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 35, max: 93, avg: 55, sum: 35504, count: 639 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 46, max: 90, avg: 55, sum: 36459, count: 655 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 30, max: 97, avg: 54, sum: 37369, count: 682 > ----------------------------------end----------------------------- > r1_or_w2 2, pid: 29132, min: 31, max: 93, avg: 54, sum: 33161, count: 612 > ----------------------------------end---------------------------- - Heikki
Вложения
Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
От
 
		    	Heikki Linnakangas
		    Дата:
		        This needs more performance testing.
> This needs more performance testing. Definitely. I may do that some day. However I'm not sure that this is currently the main issue in the checkpointer. -- Fabien.
On 13 May 2015 at 09:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
 
 
-- 
			
		In summary, the X^1.5 correction seems to work pretty well. It doesn't completely eliminate the problem, but it makes it a lot better.
Agreed
I don't want to over-compensate for the full-page-write effect either, because there are also applications where that effect isn't so big. For example, an application that performs a lot of updates, but all the updates are on a small number of pages, so the full-page-write storm immediately after checkpoint doesn't last long. A worst case for this patch would be such an application - lots of updates on only a few pages - with a long checkpoint_timeoout but relatively small checkpoint_segments, so that checkpoints are always driven by checkpoint_segments. I'd like to see some benchmarking of that worst case before committing anything like this.
We could do better, but that is not a reason not to commit this, as is. Commit, please.
This has been in place for a while and still remains: TODO: reduce impact of full page writes
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Jul 2, 2015 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 13 May 2015 at 09:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:In summary, the X^1.5 correction seems to work pretty well. It doesn't completely eliminate the problem, but it makes it a lot better.Agreed
Do we want to consider if wal_compression is enabled as that
can reduce the effect full_page_writes?
Also I am planning to run some tests for this patch, but not sure
if tps and or latency numbers by pgbench are sufficient or do you
people want to see actual read/write count via some form of
dynamic tracing (stap) as done by the reporter of this issue? 
Hello Simon, > We could do better, but that is not a reason not to commit this, as is. > Commit, please. My 0,02€: Please do not commit without further testing... I've submitted a patch to improve checkpoint write scheduling, including X00 hours of performance test on various cases. This patch changes significantly the load distribution over the whole checkpoint, and AFAICS has been tested on rather small cases. I'm not sure that the power 1.5 is the right one for all cases. For a big checkpoint over 30 minutes, it may have, or not, very large and possibly unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what I really think is that it needs performance measures. In conclusion, and very egoistically, I would prefer if this patch could wait for the checkpoint scheduling patch to be considered, as it would basically invalidate the X00 hours of performance tests I ran:-) -- Fabien.
On 3 July 2015 at 06:38, Fabien COELHO <coelho@cri.ensmp.fr> wrote:
 
-- 
			
		
Hello Simon,We could do better, but that is not a reason not to commit this, as is.
Commit, please.
My 0,02€: Please do not commit without further testing...
I've submitted a patch to improve checkpoint write scheduling, including X00 hours of performance test on various cases. This patch changes significantly the load distribution over the whole checkpoint, and AFAICS has been tested on rather small cases.
I'm not sure that the power 1.5 is the right one for all cases. For a big checkpoint over 30 minutes, it may have, or not, very large and possibly unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what I really think is that it needs performance measures.
power 1,5 is almost certainly not right for all cases, but it is simple and better. And easy to remove if something even better arrives.
I don't see the two patches being in conflict.
In conclusion, and very egoistically, I would prefer if this patch could wait for the checkpoint scheduling patch to be considered, as it would basically invalidate the X00 hours of performance tests I ran:-)
 I recommend making peace with yourself that probably 50% of development time is wasted. But we try to keep the best half.
Thank you for your time spent contributing.
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2015-07-03 07:38:15 +0200, Fabien COELHO wrote: > I've submitted a patch to improve checkpoint write scheduling, including X00 > hours of performance test on various cases. This patch changes significantly > the load distribution over the whole checkpoint, and AFAICS has been tested > on rather small cases. > > I'm not sure that the power 1.5 is the right one for all cases. For a big > checkpoint over 30 minutes, it may have, or not, very large and possibly > unwanted effects. Maybe the 1.5 factor should really be a guc. Well, what I > really think is that it needs performance measures. > > In conclusion, and very egoistically, I would prefer if this patch could > wait for the checkpoint scheduling patch to be considered, as it would > basically invalidate the X00 hours of performance tests I ran:-) These two patches target pretty independent mechanics. If you patch were significantly influenced by this something would be wrong. It might decrease the benefit of your patch a mite, but that's not really a problem.
> power 1,5 is almost certainly not right for all cases, but it is simple > and better. It is better "in some cases", as I've been told on my patch. If you have a separate disk for WAL writes the power formula may just degrade performance, or maybe not, or not too much, or it really should be a guc. Well, I just think that it needs more performance testing with various loads and sizes, really. I'm not against this patch at all. > And easy to remove if something even better arrives. > > I don't see the two patches being in conflict. They are not "in conflict" from a git point of view, or even so it would be trivial to solve. They are in conflict as the patch changes the checkpoint load significantly, which would mean that my X00 hours of performance testing on the checkpoint scheduler should more or less be run again. Ok, it is somehow egoistic, but I'm trying to avoid wasting people time. Another point is that I'm not sure I understand the decision process: for some patch in some area extensive performance tests are required, and for other patches in the same area they would not be. -- Fabien.
Hello Andres, >> In conclusion, and very egoistically, I would prefer if this patch could >> wait for the checkpoint scheduling patch to be considered, as it would >> basically invalidate the X00 hours of performance tests I ran:-) > > These two patches target pretty independent mechanics. If you patch were > significantly influenced by this something would be wrong. It might > decrease the benefit of your patch a mite, but that's not really a > problem. That is not the issue I see. On the principle of performance testing it really means that I should rerun the tests, even if I expect that the overall influence would be pretty small in this case. This is my egoistic argument. Well, probably I would just rerun a few cases to check that the impact is "mite", as you said, not all cases. Another point is that I'm not sure that this patch is ripe, in particular I'm skeptical about the hardcoded 1.5 without further testing. Maybe it is good, maybe 1.3 or 1.6 is better, maybe it depends and it should just be a guc with some advises about how to set it. So I really think that it needs more performance figures than "it has a positive effect on one load". Well, this is just my opinion, no need to care too much about it:-) -- Fabien.
On Thu, Jul 2, 2015 at 4:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jul 2, 2015 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> On 13 May 2015 at 09:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>>>
>>> In summary, the X^1.5 correction seems to work pretty well. It doesn't completely eliminate the problem, but it makes it a lot better.
>>
>>
>> Agreed
>
>
> Do we want to consider if wal_compression is enabled as that
> can reduce the effect full_page_writes?
>
>
> Also I am planning to run some tests for this patch, but not sure
> if tps and or latency numbers by pgbench are sufficient or do you
> people want to see actual read/write count via some form of
> dynamic tracing (stap) as done by the reporter of this issue?
>
Use the test table and data:
create table tbl(id int primary key,info text,crt_time timestamp);
insert into tbl select generate_series(1,50000000),now(),now();
2.
Use pgbench to test it.
$ vi test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id = :id ;
3.
$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 300
			
		>
> On Thu, Jul 2, 2015 at 4:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> On 13 May 2015 at 09:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>>>
>>> In summary, the X^1.5 correction seems to work pretty well. It doesn't completely eliminate the problem, but it makes it a lot better.
>>
>>
>> Agreed
>
>
> Do we want to consider if wal_compression is enabled as that
> can reduce the effect full_page_writes?
>
>
> Also I am planning to run some tests for this patch, but not sure
> if tps and or latency numbers by pgbench are sufficient or do you
> people want to see actual read/write count via some form of
> dynamic tracing (stap) as done by the reporter of this issue?
>
I have ran some tests with this patch and the detailed results of the runs are
attached with this mail.  I thought the patch should show difference if I keep
max_wal_size to somewhat lower or moderate value so that checkpoint
should get triggered due to wal size, but I am not seeing any major difference
in the writes spreading.  Below is the configuration, I have used to run the
tests
synchronous_commit=off
checkpoint_timeout = 5min
max_wal_size=1GB
min_wal_size=80MB
checkpoint_completion_target=0.9
shared_buffers=8GB
Test (used almost similar testing mechanism as reported in mail)
------------------------------------------------------------------------------------------------
1.Use the test table and data:
create table tbl(id int primary key,info text,crt_time timestamp);
insert into tbl select generate_series(1,50000000),now(),now();
2.
Use pgbench to test it.
$ vi test.sql
\setrandom id 1 50000000
update tbl set crt_time=now() where id = :id ;
3.
$ pgbench -M prepared -n -r -f ./test.sql -P 1 -c 28 -j 28 -T 300
on another session, I ran system tap script
stap pgbenchrun.stp -DMAXSKIPPED=100000 -o pgbenchrun.out
pgbenchrun.stp (attached with mail).
During configure, you need to use (--enable-dtrace) and need to use
debug-build option if you want capture results using system tap script.
Also to run this SystemTap should be installed in your system.
I should have ran more tests to see in which cases this patch makes
difference, but I thought it is better to first discuss the current results
and see what other kind of tests you want to run for this patch.
Вложения
>>>> In summary, the X^1.5 correction seems to work pretty well. It doesn't >>>> completely eliminate the problem, but it makes it a lot better. I've looked at the maths. I think that the load is distributed as the derivative of this function, that is (1.5 * x ** 0.5): It starts at 0 but very quicky reaches 0.5, it pass the 1.0 (average load) around 40% progress, and ends up at 1.5, that is the finishing load is 1.5 the average load, just before fsyncing files. This looks like a recipee for a bad time: I would say this is too large an overload. I would suggest a much lower value, say around 1.1... The other issue with this function is that it should only degrade performance by disrupting the write distribution if someone has WAL on a different disk. As I understand it this thing does only make sense if the WAL & the data are on the samee disk. This really suggest a guc. > I have ran some tests with this patch and the detailed results of the > runs are attached with this mail. I do not understand really the aggregated figures in the files attached. I guess that maybe between "end" markers there is a summary of figures collected for 28 backends over 300-second runs (?), but I do not know what the min/max/avg/sum/count figures are about. > I thought the patch should show difference if I keep max_wal_size to > somewhat lower or moderate value so that checkpoint should get triggered > due to wal size, but I am not seeing any major difference in the writes > spreading. I'm not sure I understand your point. I would say that at full speed pgbench the disk is always busy writing as much as possible, either checkpoint writes or wal writes, so the write load as such should not be that different anyway? I understood that the point of the patch is to check whether there is a tps dip or not when the checkpoint begins, but I'm not sure how this can be infered from the many aggregated data you sent, and from my recent tests the tps is very variable anyway on HDD. -- Fabien.
On Jul 4, 2015, at 11:34 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote: >>>>> In summary, the X^1.5 correction seems to work pretty well. It doesn't >>>>> completely eliminate the problem, but it makes it a lot better. > > I've looked at the maths. > > I think that the load is distributed as the derivative of this function, that is (1.5 * x ** 0.5): It starts at 0 but veryquicky reaches 0.5, it pass the 1.0 (average load) around 40% progress, and ends up at 1.5, that is the finishing loadis 1.5 the average load, just before fsyncing files. This looks like a recipee for a bad time: I would say this is toolarge an overload. I would suggest a much lower value, say around 1.1... > > The other issue with this function is that it should only degrade performance by disrupting the write distribution if someonehas WAL on a different disk. As I understand it this thing does only make sense if the WAL & the data are on the sameedisk. This really suggest a guc. I am a bit skeptical about this. We need test scenarios that clearly show the benefit of having and of not having this behavior.It might be that doing this always is fine for everyone. ...Robert
Hello Robert, >> I've looked at the maths. >> >> I think that the load is distributed as the derivative of this >> function, that is (1.5 * x ** 0.5): It starts at 0 but very quicky >> reaches 0.5, it pass the 1.0 (average load) around 40% progress, and >> ends up at 1.5, that is the finishing load is 1.5 the average load, >> just before fsyncing files. This looks like a recipee for a bad time: I >> would say this is too large an overload. I would suggest a much lower >> value, say around 1.1... >> The other issue with this function is that it should only degrade >> performance by disrupting the write distribution if someone has WAL on >> a different disk. As I understand it this thing does only make sense if >> the WAL & the data are on the samee disk. This really suggest a guc. > > I am a bit skeptical about this. We need test scenarios that clearly > show the benefit of having and of not having this behavior. It might be > that doing this always is fine for everyone. Do you mean I have to proove that there is an actual problem induced from this patch? The logic fails me: I thought the patch submitter would have to show that his/her patch did not harm performance in various reasonable cases. At least this is what I'm told in another thread:-) Currently this patch changes heavily the checkpoint write load distribution in many cases with a proof which consist in showing that it may improve tps *briefly* on *one* example, as far as I understood the issue and the tests. If this is enough proof to apply the patch, then the minimum is that it should be possible to desactivate it, hence a guc. Having a guc would also help to test the feature with different values than 1.5, which really seems harmful from a math point of view. I'm not sure at all that a power formula is the right approach. The potential impact I see would be to aggravate significantly the write stall issues I'm working on, but the measures provided in these tests do not even look at that or measure that. -- Fabien.
Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
От
 
		    	Heikki Linnakangas
		    Дата:
		        On 07/04/2015 07:34 PM, Fabien COELHO wrote: > >>>>> In summary, the X^1.5 correction seems to work pretty well. It doesn't >>>>> completely eliminate the problem, but it makes it a lot better. > > I've looked at the maths. > > I think that the load is distributed as the derivative of this function, > that is (1.5 * x ** 0.5): It starts at 0 but very quicky reaches 0.5, it > pass the 1.0 (average load) around 40% progress, and ends up at 1.5, that > is the finishing load is 1.5 the average load, just before fsyncing files. > This looks like a recipee for a bad time: I would say this is too large an > overload. I would suggest a much lower value, say around 1.1... Hmm. Load is distributed as a derivate of that, but probably not the way you think. Note that X means the amount of WAL consumed, not time. The goal is that I/O is constant over time, but the consumption of WAL over time is non-linear, with a lot more WAL consumed in the beginning of a checkpoint cycle. The function compensates for that. > The other issue with this function is that it should only degrade > performance by disrupting the write distribution if someone has WAL on a > different disk. As I understand it this thing does only make sense if the > WAL & the data are on the samee disk. This really suggest a guc. No, the I/O storm caused by full-page-writes is a problem even if WAL is on a different disk. Even though the burst of WAL I/O then happens on a different disk, the fact that we consume a lot of WAL in the beginning of a checkpoint makes the checkpointer think that it needs to hurry up, in order to meet the deadline. It will flush a lot of pages in a rush, so you get a burst of I/O on the data disk too. Yes, it's even worse when WAL and data are on the same disk, but even then, I think the random I/O caused by the checkpointer hurrying is more significant than the extra WAL I/O, which is sequential. To illustrate that, imagine that the checkpoint begins now. The checkpointer calculates that it has 10 minutes to complete the checkpoint (checkpoint_timeout), or until 1 GB of WAL has been generated (derived from max_wal_size), whichever happens first. Immediately after the Redo-point has been established, in the very beginning of the checkpoint, the WAL storm begins. Every backend that dirties a page also writes a full-page image. After just 10 seconds, those backends have already written 200 MB of WAL. That's 1/5 of the quota, and based on that, the checkpointer will quickly flush 1/5 of all buffers. In reality, the WAL consumption is not linear, and will slow down as time passes and less full-page writes happen. So in reality, the checkpointer would have a lot more time to complete the checkpoint - it is unnecessarily aggressive in the beginning of the checkpoint. The correction factor in the patch compensates for that. With the X^1.5 formula, when 20% of the WAL has already been consumed, the checkpointer have flushed only ~ 9% of the buffers, not 20% as without the patch. The ideal correction formula f(x), would be such that f(g(X)) = X, where: X is time, 0 = beginning of checkpoint, 1.0 = targeted end of checkpoint (checkpoint_segments), and g(X) is the amount of WAL generated. 0 = beginning of checkpoint, 1.0 = targeted end of checkpoint (derived from max_wal_size). Unfortunately, we don't know the shape of g(X), as that depends on the workload. It might be linear, if there is no effect at all from full_page_writes. Or it could be a step-function, where every write causes a full page write, until all pages have been touched, and after that none do (something like an UPDATE without a where-clause might cause that). In pgbench-like workloads, it's something like sqrt(x). I picked X^1.5 as a reasonable guess. It's close enough to linear that it shouldn't hurt too much if g(x) is linear. But it cuts the worst spike at the very beginning, if g(x) is more like sqrt(x). This is all assuming that the application load is constant. If it's not, g(x) can obviously have any shape, and there's no way we can predict that. But that's a different story, nothing to do with full_page_writes. >> I have ran some tests with this patch and the detailed results of the >> runs are attached with this mail. > > I do not understand really the aggregated figures in the files attached. Me neither. It looks like Amit measured the time spent in mdread and mdwrite, but I'm not sure what conclusions one can draw from that. >> I thought the patch should show difference if I keep max_wal_size to >> somewhat lower or moderate value so that checkpoint should get triggered >> due to wal size, but I am not seeing any major difference in the writes >> spreading. > > I'm not sure I understand your point. I would say that at full speed > pgbench the disk is always busy writing as much as possible, either > checkpoint writes or wal writes, so the write load as such should not be > that different anyway? > > I understood that the point of the patch is to check whether there is a > tps dip or not when the checkpoint begins, but I'm not sure how this can > be infered from the many aggregated data you sent, and from my recent > tests the tps is very variable anyway on HDD. Right, that's my understanding too. If the disk is not saturated, perhaps because you used pgbench's rate-limiting option, then measuring the disk I/O would be useful too: flatter is better. - Heikki
Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
От
 
		    	Heikki Linnakangas
		    Дата:
		        On 07/05/2015 08:19 AM, Fabien COELHO wrote: >> I am a bit skeptical about this. We need test scenarios that clearly >> show the benefit of having and of not having this behavior. It might be >> that doing this always is fine for everyone. > > Do you mean I have to proove that there is an actual problem induced from > this patch? You don't have to do anything if you don't want to. I said myself that this needs performance testing of the worst-case scenario, one where we would expect this to perform worse than without the patch. Then we can look at how bad that effect is, and decide if that's acceptable. That said, if you could do that testing, that would be great! I'm not planning to spend much time on this myself, and it would take me a fair amount of time to set up the hardware and tools to test this. I was hoping Digoal would have the time to do that, since he started this thread, or someone else that has a system ready for this kind of testing. If no-one steps up to the plate to test this more, however, we'll have to just forget about this. > Having a guc would also help to test the feature with different values > than 1.5, which really seems harmful from a math point of view. I'm not > sure at all that a power formula is the right approach. Yeah, a GUC would be helpful in testing this. I'm hoping that we would come up with a reasonable formula that would work well enough for everyone that we wouldn't need to have a GUC in the final patch, though. - Heikki
> You don't have to do anything if you don't want to. Sure:-) What I mean is that I think that this patch is not ripe, and I understood that some people were suggesting that it could be applied as is right away. I'm really disagreeing with that. > I said myself that this needs performance testing of the worst-case > scenario, one where we would expect this to perform worse than without > the patch. Then we can look at how bad that effect is, and decide if > that's acceptable. Ok, I'm fine with that. It's quite different from "looks ok apply now". > That said, if you could do that testing, that would be great! Hmmm. I was not really planing to. On the other hand, I have some scripts and a small setup that I've been using to test checkpointer flushing, and it would be easy to start some tests. >> Having a guc would also help to test the feature with different values >> than 1.5, which really seems harmful from a math point of view. I'm not >> sure at all that a power formula is the right approach. > > Yeah, a GUC would be helpful in testing this. I'm hoping that we would come > up with a reasonable formula that would work well enough for everyone that we > wouldn't need to have a GUC in the final patch, though. Yep. If it is a guc testing is quite easy and I may run my scripts... -- Fabien.
Hello Heikki, >> I think that the load is distributed as the derivative of this function, >> that is (1.5 * x ** 0.5): It starts at 0 but very quicky reaches 0.5, it >> pass the 1.0 (average load) around 40% progress, and ends up at 1.5, that >> is the finishing load is 1.5 the average load, just before fsyncing files. >> This looks like a recipee for a bad time: I would say this is too large an >> overload. I would suggest a much lower value, say around 1.1... > > Hmm. Load is distributed as a derivate of that, but probably not the way you > think. Note that X means the amount of WAL consumed, not time. Interesting point. After a look at IsCheckpointOnSchedule, and if I understand the code correctly, it is actually *both*, so it really depends whether the checkpoint was xlog or time triggered, and especially which one (time/xlog) is proeminent at the beginning of the checkpoint. If it is time triggered and paced my reasonning is probably right and things will go bad/worse in the end, but if it is xlog-triggered and paced your line of argument is probably closer to what happens. This suggest that the corrective function should be applied with more care, maybe only for the xlog-based on schedule test, but not the time-based check. > The goal is that I/O is constant over time, but the consumption of WAL > over time is non-linear, with a lot more WAL consumed in the beginning > of a checkpoint cycle. The function compensates for that. *If* the checkpointer pacing comes from WAL size, which may or may not be the case. > [...] > > Unfortunately, we don't know the shape of g(X), as that depends on the > workload. It might be linear, if there is no effect at all from > full_page_writes. Or it could be a step-function, where every write causes a > full page write, until all pages have been touched, and after that none do > (something like an UPDATE without a where-clause might cause that). If postgresql is running in its cache (i.e. within shared buffers), the usual assumption would be an unknown exponential probability decreasing with time while the same pages are hit over and over. If postgresql is running on memory or disk (effective database size greater than shared buffers), pages are statiscally not reused by another update before being sent out, so the full page write would be always used during the whole checkpoint, there is no WAL storm (or it is always a storm, depending on the point of view) and the corrective factor would only create issues... So basically I would say that what to do heavily depends on the database size and checkpoint trigger (time vs xlog), which really suggest that a guc is indispensible, and maybe that the place the correction is applied is currently not the right one. > In pgbench-like workloads, it's something like sqrt(x). Probably for a small database size? > I picked X^1.5 as a reasonable guess. It's close enough to linear that > it shouldn't hurt too much if g(x) is linear. My understanding is still a 50% overload at the end of the checkpoint just before issuing fsync... I think that could hurt in some case. > But it cuts the worst spike at the very beginning, if g(x) is more like > sqrt(x). Hmmm. It's a balance between saving the 10 first seconds of the checkpoint at the price of risking a panic at the end of the checkpoint. Now the right approach might be for pg to know what is happening by collecting statistics while running, and to apply a correction when it is needed, for the amount needed. -- Fabien.
On Sun, Jul 5, 2015 at 1:19 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote: > Do you mean I have to proove that there is an actual problem induced from > this patch? No, I'm not saying anyone *has* to do anything. What I'm saying is that I'm not convinced by your analysis. I don't think we have enough evidence at this point to conclude that a GUC is necessary, and I hope it isn't, because I can't imagine what advice we would be able to give people about how to set it, other than "try all the value and see what works best", which isn't going to be satisfying. More broadly, I don't really know how to test this patch and show when it helps and when it hurts. And I think we need that, rather than just a theoretical analysis, to tune the behavior. Heikki, can you describe what you think a good test setup would be? Like, what workload should we run, and what measurements should we gather to see what the patch is doing that is good or bad? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-07-05 11:05:28 -0400, Robert Haas wrote: > More broadly, I don't really know how to test this patch and show when > it helps and when it hurts. And I think we need that, rather than > just a theoretical analysis, to tune the behavior. Heikki, can you > describe what you think a good test setup would be? Like, what > workload should we run, and what measurements should we gather to see > what the patch is doing that is good or bad? I think a good start would be to graph the writeout rate over several checkpoints. It'd be cool if there were a better way, but it's probably easiest to just graph the number of bytes written (using iostat) and the number of dirty bytes in the kernel. That'll unfortunately include WAL, but I can't immediately see how to avoid that.
> No, I'm not saying anyone *has* to do anything. What I'm saying is > that I'm not convinced by your analysis. Well, the gist of my analysis is really to say that there are potential performance issues with the proposed change, and that it must be tested thoroughly. The details may varry:-) > I don't think we have enough evidence at this point to conclude that a > GUC is necessary, and I hope it isn't, because I can't imagine what > advice we would be able to give people about how to set it, other than > "try all the value and see what works best", which isn't going to be > satisfying. At least for testing, ISTM that a GUC would be really useful. > More broadly, I don't really know how to test this patch and show when > it helps and when it hurts. And I think we need that, rather than just a > theoretical analysis, to tune the behavior. The point of an analysis is to think about how it works and what to test, but it is not a substitute for testing, obviously. -- Fabien.
On Sun, Jul 5, 2015 at 1:18 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
 
With Regards,
Amit Kapila.
On 07/04/2015 07:34 PM, Fabien COELHO wrote:I have ran some tests with this patch and the detailed results of the
runs are attached with this mail.
I do not understand really the aggregated figures in the files attached.
Me neither. It looks like Amit measured the time spent in mdread and mdwrite, but I'm not sure what conclusions one can draw from that.
As Heikki has pointed, it is stats data for mdread and mdwrite
between the checkpoints (in the data, you need to search for
"checkpoint start"/"checkpoint done").  In between checkpoint
start and checkpoint done, all the data shows the amount of read/
write done (I am just trying to reproduce what Digoal has reported, so
I am using his script and I also don't understand every thing, but I think
we can look at count between checkpoints to deduce whether the IO
is flattened after patch).  Digoal was seeing a spike at the beginning of
checkpoint (after checkpoint start) in his configuration without this patch
and the spike seems to be reduced after this patch where as in my tests
I don't see the spike immediately after checkpoint (although there are some
spikes in-between) even without patch which means that either I might not
be using the right configuration to measure the IO or there is some other
difference between the way Digoal ran the test and I ran the tests.  I have done
the setup (even though hardware will not be same, but at least I can run the
tests and collect the data in the format similar to Digoal), so if you guys have
suggestions about which kind of parameters we should tweek or some tests
to gather the results, I can do that present the results here for further discussion. 
I thought the patch should show difference if I keep max_wal_size to
somewhat lower or moderate value so that checkpoint should get triggered
due to wal size, but I am not seeing any major difference in the writes
spreading.
I'm not sure I understand your point. I would say that at full speed
pgbench the disk is always busy writing as much as possible, either
checkpoint writes or wal writes, so the write load as such should not be
that different anyway?
I understood that the point of the patch is to check whether there is a
tps dip or not when the checkpoint begins, but I'm not sure how this can
be infered from the many aggregated data you sent, and from my recent
tests the tps is very variable anyway on HDD.
Yes, we definitely want to see the effect on TPS at the beginning of checkpoint,
but even measuring the IO during checkpoint with the way Digoal was capturing
the data can show the effect of this patch.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 6, 2015 at 12:30 PM, Amit Kapila wrote: > Yes, we definitely want to see the effect on TPS at the beginning of > checkpoint, > but even measuring the IO during checkpoint with the way Digoal was > capturing > the data can show the effect of this patch. I am marking this patch as returned with feedback. -- Michael
Hi,
I was planning to do some review/testing on this patch, but then I 
noticed it was rejected with feedback in 2015-07 and never resubmitted 
into another CF. So I won't waste time in testing this unless someone 
shouts that I should do that anyway. Instead I'll just post some ideas 
about how we might improve the patch, because I'd forget about them 
otherwise.
On 07/05/2015 09:48 AM, Heikki Linnakangas wrote:
>
> The ideal correction formula f(x), would be such that f(g(X)) = X, where:
>
>   X is time, 0 = beginning of checkpoint, 1.0 = targeted end of
> checkpoint (checkpoint_segments), and
>
>   g(X) is the amount of WAL generated. 0 = beginning of checkpoint, 1.0
> = targeted end of checkpoint (derived from max_wal_size).
>
> Unfortunately, we don't know the shape of g(X), as that depends on the
> workload. It might be linear, if there is no effect at all from
> full_page_writes. Or it could be a step-function, where every write
> causes a full page write, until all pages have been touched, and after
> that none do (something like an UPDATE without a where-clause might
> cause that). In pgbench-like workloads, it's something like sqrt(x). I
> picked X^1.5 as a reasonable guess. It's close enough to linear that it
> shouldn't hurt too much if g(x) is linear. But it cuts the worst spike
> at the very beginning, if g(x) is more like sqrt(x).
Exactly. I think the main "problem" here is that we do mix two types of 
WAL records, with quite different characteristics:
 (a) full_page_writes - very high volume right after checkpoint, then     usually drops to much lower volume
 (b) regular records - about the same volume over time (well, lower     volume right after the checkpoint, as that's
whereFPWs happen)
 
We completely ignore this when computing elapsed_xlogs, because we 
compute it (about) like this:
    elapsed_xlogs = wal_since_checkpoint / CheckPointSegments;
which of course gets confused when we write a lot of WAL right after a 
checkpoint, because of FPW. But what if we actually tracked the amount 
of WAL produced by FWP in a checkpoint (which we current don't AFAIK)?
Then we could compute the expected *remaining* amount of WAL to be 
produced within the checkpoint interval, and use that to compute a 
better progress like this:
  wal_bytes          - WAL (total)  wal_fpw_bytes      - WAL (due to FPW)  prev_wal_bytes     - WAL (total) in previous
checkpoint prev_wal_fpw_bytes - WAL (due to FPW) in previous checkpoint
 
So we know that we should expect about
  (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes)
  (       regular WAL        ) + (              FPW WAL             )
to be produced until the end of the current checkpoint. I don't have a 
clear idea how to transform this into the 'progress' yet, but I'm pretty 
sure tracking the two types of WAL is a key to a better solution. The 
x^1.5 is probably a step in the right direction, but I don't feel 
particularly confident about the 1.5 (which is rather arbitrary).
regards
--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
			
		On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > So we know that we should expect about > > (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes) > > ( regular WAL ) + ( FPW WAL ) > > to be produced until the end of the current checkpoint. I don't have a clear > idea how to transform this into the 'progress' yet, but I'm pretty sure > tracking the two types of WAL is a key to a better solution. The x^1.5 is > probably a step in the right direction, but I don't feel particularly > confident about the 1.5 (which is rather arbitrary). If it works well empirically, does it really matter that it's arbitrary? I mean, the entire planner is full of fairly arbitrary assumptions about which things to consider in the cost model and which to ignore. The proof that we have made good decisions there is in the query plans it generates. (The proof that we have made bad decisions in some cases in the query plans, too.) I think a bigger problem for this patch is that Heikki seems to have almost completely disappeared. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
От
 
		    	Heikki Linnakangas
		    Дата:
		        On 17/12/15 19:07, Robert Haas wrote: > On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >> So we know that we should expect about >> >> (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes) >> >> ( regular WAL ) + ( FPW WAL ) >> >> to be produced until the end of the current checkpoint. I don't have a clear >> idea how to transform this into the 'progress' yet, but I'm pretty sure >> tracking the two types of WAL is a key to a better solution. The x^1.5 is >> probably a step in the right direction, but I don't feel particularly >> confident about the 1.5 (which is rather arbitrary). > > If it works well empirically, does it really matter that it's > arbitrary? I mean, the entire planner is full of fairly arbitrary > assumptions about which things to consider in the cost model and which > to ignore. The proof that we have made good decisions there is in the > query plans it generates. (The proof that we have made bad decisions > in some cases in the query plans, too.) Agreed. > I think a bigger problem for this patch is that Heikki seems to have > almost completely disappeared. Yeah, there's that problem too :-). The reason I didn't commit this back then was lack of performance testing. I'm fairly confident that this would be a significant improvement for some workloads, and shouldn't hurt much even in the worst case. But I did only a little testing on my laptop. I think Simon was in favor of just committing it immediately, and Fabien wanted to see more performance testing before committing. I was hoping that Digoal would re-ran his original test case, and report back on whether it helps. Fabien had a performance test setup, for testing another patch, but he didn't want to run it to test this patch. Amit did some testing, but didn't see a difference. We can take that as a positive sign - no regression - or as a negative sign, but I think that basically means that his test was just not sensitive to the FPW issue. So Tomas, if you're willing to do some testing on this, that would be brilliant! - Heikki
Hi, On 12/21/2015 12:03 PM, Heikki Linnakangas wrote: > On 17/12/15 19:07, Robert Haas wrote: >> On Mon, Dec 14, 2015 at 6:08 PM, Tomas Vondra >> <tomas.vondra@2ndquadrant.com> wrote: >>> So we know that we should expect about >>> >>> (prev_wal_bytes - wal_bytes) + (prev_wal_fpw_bytes - wal_fpw_bytes) >>> >>> ( regular WAL ) + ( FPW WAL ) >>> >>> to be produced until the end of the current checkpoint. I don't >>> have a clear idea how to transform this into the 'progress' yet, >>> but I'm pretty sure tracking the two types of WAL is a key to a >>> better solution. The x^1.5 is probably a step in the right >>> direction, but I don't feel particularly confident about the 1.5 >>> (which is rather arbitrary). >> >> If it works well empirically, does it really matter that it's >> arbitrary? I mean, the entire planner is full of fairly arbitrary >> assumptions about which things to consider in the cost model and >> which to ignore. The proof that we have made good decisions there >> is in the query plans it generates. (The proof that we have made >> bad decisions in some cases in the query plans, too.) > > Agreed. What if it only seems to work well because it was tested on cases it was designed for? What about the workloads that behave differently? Whenever we do changes to costing and query planning, we carefully consider counter-examples and cases where it might fail. I see nothing like that in this thread - all I see is a bunch of pgbench tests, which seems rather insufficient to me. > >> I think a bigger problem for this patch is that Heikki seems to have >> almost completely disappeared. > > Yeah, there's that problem too :-). > > The reason I didn't commit this back then was lack of performance > testing. I'm fairly confident that this would be a significant > improvement for some workloads, and shouldn't hurt much even in the > worst case. But I did only a little testing on my laptop. I think > Simon was in favor of just committing it immediately, and Fabien > wanted to see more performance testing before committing. > > I was hoping that Digoal would re-ran his original test case, and > report back on whether it helps. Fabien had a performance test setup, > for testing another patch, but he didn't want to run it to test this > patch. Amit did some testing, but didn't see a difference. We can > take that as a positive sign - no regression - or as a negative sign, > but I think that basically means that his test was just not sensitive > to the FPW issue. > > So Tomas, if you're willing to do some testing on this, that would > be brilliant! I'm ready to spend some time on this, assuming we can agree on what tests to run. Can we come up with realistic workloads where we expect the patch might actually work poorly? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Let PostgreSQL's On Schedule checkpoint write buffer smooth spread cycle by tuning IsCheckpointOnSchedule?
От
 
		    	Heikki Linnakangas
		    Дата:
		        On 21/12/15 13:53, Tomas Vondra wrote: > On 12/21/2015 12:03 PM, Heikki Linnakangas wrote: >> On 17/12/15 19:07, Robert Haas wrote: >>> If it works well empirically, does it really matter that it's >>> arbitrary? I mean, the entire planner is full of fairly arbitrary >>> assumptions about which things to consider in the cost model and >>> which to ignore. The proof that we have made good decisions there >>> is in the query plans it generates. (The proof that we have made >>> bad decisions in some cases in the query plans, too.) >> >> Agreed. > > What if it only seems to work well because it was tested on cases it was > designed for? What about the workloads that behave differently? > > Whenever we do changes to costing and query planning, we carefully > consider counter-examples and cases where it might fail. I see nothing > like that in this thread - all I see is a bunch of pgbench tests, which > seems rather insufficient to me. Agreed on that too. > I'm ready to spend some time on this, assuming we can agree on what > tests to run. Can we come up with realistic workloads where we expect > the patch might actually work poorly? I think the worst case scenario would be the case where there is no FPW-related WAL burst at all, and checkpoints are always triggered by max_wal_size rather than checkpoint_timeout. In that scenario, the compensation formula will cause the checkpoint to be too lazy in the beginning, and it will have to catch up more aggressively towards the end of the checkpoint cycle. One such scenario might be to do only COPYs into a table with no indexes. Or hack pgbench to do concentrate all the updates on only a few very rows. There will be a FPW on those few pages initially, but the spike will be much shorter. Or turn full_page_writes=off, and hack the patch to do compensation even when fullpage_writes=off, and then just run pgbench. - Heikki
Hello Heikki, > The reason I didn't commit this back then was lack of performance testing. > I'm fairly confident that this would be a significant improvement for some > workloads, and shouldn't hurt much even in the worst case. But I did only a > little testing on my laptop. I think Simon was in favor of just committing it > immediately, and > Fabien wanted to see more performance testing before committing. I confirm. To summarize my opinion: I think that the 1.5 value somewhere in the patch is much too high for the purpose because it shifts the checkpoint load quite a lot (50% more load at the end of the checkpoint) just for the purpose of avoiding a spike which lasts a few seconds (I think) at the beginning. A much smaller value should be used (1.0 <= factor < 1.1), as it would be much less disruptive and would probably avoid the issue just the same. I recommend not to commit with a 1.5 factor in any case. Another issue I raised is that the load change occurs both with xlog and time triggered checkpoints, and I'm sure it should be applied in both case. Another issue is that the patch makes sense when the WAL & relations are on the same disk, but might degrade performance otherwise. Another point that it interacts potentially with a patch I submitted which has a large impact on performance (order of magnitude better in some cases by sorting & flushing blocks on checkpoints), so it would make sense to check that. So more testing is definitely needed. A guc would be nice for this purpose, especially to look at different factors. > I was hoping that Digoal would re-ran his original test case, and report > back on whether it helps. Fabien had a performance test setup, for > testing another patch, but he didn't want to run it to test this patch. Indeed, I have, but I'm quite behind at the moment, I cannot promise anything. Moreover, I'm not sure I see this "spike" issue in my setting, AFAICR. -- Fabien.
On Mon, Dec 21, 2015 at 7:51 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote: > I think that the 1.5 value somewhere in the patch is much too high for the > purpose because it shifts the checkpoint load quite a lot (50% more load at > the end of the checkpoint) just for the purpose of avoiding a spike which > lasts a few seconds (I think) at the beginning. A much smaller value should > be used (1.0 <= factor < 1.1), as it would be much less disruptive and would > probably avoid the issue just the same. I recommend not to commit with a 1.5 > factor in any case. Wait, what? On what workload does the FPW spike last only a few seconds? That's certainly not the case in testing I've done. It would have to be the case that almost all the writes were concentrated on a very few pages. > Another issue I raised is that the load change occurs both with xlog and > time triggered checkpoints, and I'm sure it should be applied in both case. Is this sentence missing a "not"? > Another issue is that the patch makes sense when the WAL & relations are on > the same disk, but might degrade performance otherwise. Yes, that would be a good case to test. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello Robert, >> I think that the 1.5 value somewhere in the patch is much too high for the >> purpose because it shifts the checkpoint load quite a lot (50% more load at >> the end of the checkpoint) just for the purpose of avoiding a spike which >> lasts a few seconds (I think) at the beginning. A much smaller value should >> be used (1.0 <= factor < 1.1), as it would be much less disruptive and would >> probably avoid the issue just the same. I recommend not to commit with a 1.5 >> factor in any case. > > Wait, what? On what workload does the FPW spike last only a few > seconds? [...] Ok. AFAICR, a relatively small part at the beginning of the checkpoint, but possibly more that a few seconds. My actual point is that it should be tested with different and especially smaller values, because 1.5 changes the overall load distribution *a lot*. For testing purpose I suggested that a guc would help, but the patch author has never been back to intervene on the thread, discuss the arguments not provide another patch. >> Another issue I raised is that the load change occurs both with xlog and >> time triggered checkpoints, and I'm sure it should be applied in both case. > > Is this sentence missing a "not"? Indeed. I think that it make sense for xlog triggered checkpoints, but less so with time triggered checkpoints. I may be wrong, but I think that this deserve careful analysis. -- Fabien.
On Wed, Dec 23, 2015 at 9:22 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote: >> Wait, what? On what workload does the FPW spike last only a few >> seconds? [...] > > Ok. AFAICR, a relatively small part at the beginning of the checkpoint, but > possibly more that a few seconds. On a pgbench test, and probably many other workloads, the impact of FPWs declines exponentially (or maybe geometrically, but I think exponentially) as we get further into the checkpoint. The first write is dead certain to need an FPW; after that, if access is more or less random, the chance of needing an FPW for the next write increases in proportion to the number of FPWs already written. As the chances of NOT needing an FPW grow higher, the tps rate starts to increase, initially just a bit, but then faster and faster as the percentage of the working set that has already had an FPW grows. If the working set is large, we're still doing FPWs pretty frequently when the next checkpoint hits - if it's small, then it'll tail off sooner. > My actual point is that it should be tested with different and especially > smaller values, because 1.5 changes the overall load distribution *a lot*. > For testing purpose I suggested that a guc would help, but the patch author > has never been back to intervene on the thread, discuss the arguments not > provide another patch. Well, somebody else should be able to hack a GUC into the patch. I think one thing that this conversation exposes is that the size of the working set matters a lot. For example, if the workload is pgbench, you're going to see a relatively short FPW-related spike at scale factor 100, but at scale factor 3000 it's going to be longer and at some larger scale factor it will be longer still. Therefore you're probably right that 1.5 is unlikely to be optimal for everyone. Another point (which Jan Wieck made me think of) is that the optimal behavior here likely depends on whether xlog and data are on the same disk controller. If they aren't, the FPW spike and background writes may not interact as much. >>> Another issue I raised is that the load change occurs both with xlog and >>> time triggered checkpoints, and I'm sure it should be applied in both >>> case. >> >> Is this sentence missing a "not"? > Indeed. I think that it make sense for xlog triggered checkpoints, but less > so with time triggered checkpoints. I may be wrong, but I think that this > deserve careful analysis. Hmm, off-hand I don't see why that should make any difference. No matter what triggers the checkpoint, there is going to be a spike of FPI activity at the beginning. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello Robert, > On a pgbench test, and probably many other workloads, the impact of > FPWs declines exponentially (or maybe geometrically, but I think > exponentially) as we get further into the checkpoint. Indeed. If the probability of hitting a page is uniform, I think that the FPW probability is exp(-n/N) for the n-th page access. > The first write is dead certain to need an FPW; after that, if access is > more or less random, the chance of needing an FPW for the next write > increases in proportion to the number of FPWs already written. As the > chances of NOT needing an FPW grow higher, the tps rate starts to > increase, initially just a bit, but then faster and faster as the > percentage of the working set that has already had an FPW grows. If the > working set is large, we're still doing FPWs pretty frequently when the > next checkpoint hits - if it's small, then it'll tail off sooner. Yes. >> My actual point is that it should be tested with different and especially >> smaller values, because 1.5 changes the overall load distribution *a lot*. >> For testing purpose I suggested that a guc would help, but the patch author >> has never been back to intervene on the thread, discuss the arguments not >> provide another patch. > > Well, somebody else should be able to hack a GUC into the patch. Yep. But I'm so far behind everything that I was basically waiting for the author to do it:-) > I think one thing that this conversation exposes is that the size of > the working set matters a lot. For example, if the workload is > pgbench, you're going to see a relatively short FPW-related spike at > scale factor 100, but at scale factor 3000 it's going to be longer and > at some larger scale factor it will be longer still. Therefore you're > probably right that 1.5 is unlikely to be optimal for everyone. > > Another point (which Jan Wieck made me think of) is that the optimal > behavior here likely depends on whether xlog and data are on the same > disk controller. If they aren't, the FPW spike and background writes > may not interact as much. Yep, I pointed out that as well. In which case the patch just disrupts the checkpoint load for no benefit... Which would make a guc mandatory. >> [...]. I think that it make sense for xlog triggered checkpoints, but >> less so with time triggered checkpoints. I may be wrong, but I think >> that this deserve careful analysis. > > Hmm, off-hand I don't see why that should make any difference. No > matter what triggers the checkpoint, there is going to be a spike of > FPI activity at the beginning. Hmmm. Let us try with both hands: AFAICR with xlog-triggered checkpoints, the checkpointer progress is measured with respect to the size of the WAL file, which does not grow linearly in time for the reason you pointed above (a lot of FPW at the beginning, less in the end). As the WAL file is growing quickly, the checkpointer thinks that it is late and that it has some catchup to do, so it will start to try writing quickly as well. There is a double whammy as both are writing more, and are probably not succeeding. For time triggered checkpoints, the WAL file gets filled up *but* the checkpointer load is balanced against time. This is a "simple" whammy, where the checkpointer uses IO bandwith which is needed for the WAL, and it could wait a little bit because the WAL will need less later, but it is not trying to catch up by even writing more, so the load shifting needed in this case is not the same as the previous case. As you point out there is a WAL spike in both case, but in one case there is also a checkpointer spike and in the other the checkpointer load is flat. So I think that the correction should not be the same in both cases. Moreover no correction is needed if WAL & relations are on different disks. Also, as you pointed out, it also depends on the load (for a large base the FPW is spead more evenly, for smaller bases there is a spike), so the corrective formula should take that information into account, which means that some evaluation of the FPW distribution should be collected... All this is non trivial. I may do some math to try to solve this, but I'm pretty sure that a blank 1.5 correction in all cases is not the solution. -- Fabien.
Hi, On 12/23/2015 03:38 PM, Robert Haas wrote: > > I think one thing that this conversation exposes is that the size of > the working set matters a lot. For example, if the workload is > pgbench, you're going to see a relatively short FPW-related spike at > scale factor 100, but at scale factor 3000 it's going to be longer > and at some larger scale factor it will be longer still. Therefore > you're probably right that 1.5 is unlikely to be optimal for > everyone. Right. Also, when you say "pgbench" you probably mean the default uniform distribution. But we now have gaussian and exponential distributions which might be handy to simulate other types of workloads. > > Another point (which Jan Wieck made me think of) is that the optimal > behavior here likely depends on whether xlog and data are on the same > disk controller. If they aren't, the FPW spike and background writes > may not interact as much. I'm not sure what exactly you mean by "optimal behavior" here. Surely if you want to minimize interference between WAL and regular I/O, you'll do that. But I don't see what that has to do with the writes generated by the checkpoint? If we do much more writes at the beginning of the checkpoint (due to getting confused by FPW), and OS starts flushing that to disk because we exceed dirty_(background)_bytes, that surely interferes with reads (which is a major issue for queries). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Dec 23, 2015 at 10:37 AM, Fabien COELHO <coelho@cri.ensmp.fr> wrote: > Hmmm. Let us try with both hands: > > AFAICR with xlog-triggered checkpoints, the checkpointer progress is > measured with respect to the size of the WAL file, which does not grow > linearly in time for the reason you pointed above (a lot of FPW at the > beginning, less in the end). As the WAL file is growing quickly, the > checkpointer thinks that it is late and that it has some catchup to do, so > it will start to try writing quickly as well. There is a double whammy as > both are writing more, and are probably not succeeding. > > For time triggered checkpoints, the WAL file gets filled up *but* the > checkpointer load is balanced against time. This is a "simple" whammy, where > the checkpointer uses IO bandwith which is needed for the WAL, and it could > wait a little bit because the WAL will need less later, but it is not trying > to catch up by even writing more, so the load shifting needed in this case > is not the same as the previous case. I see your point, but this isn't a function of what triggered the checkpoint. It's a function of how we measure whether the already-triggered checkpoint is on schedule - we may be behind either because of time, or because of xlog, or both. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 23, 2015 at 2:16 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: >> Another point (which Jan Wieck made me think of) is that the optimal >> behavior here likely depends on whether xlog and data are on the same >> disk controller. If they aren't, the FPW spike and background writes >> may not interact as much. > > I'm not sure what exactly you mean by "optimal behavior" here. Surely if you > want to minimize interference between WAL and regular I/O, you'll do that. > > But I don't see what that has to do with the writes generated by the > checkpoint? If we do much more writes at the beginning of the checkpoint > (due to getting confused by FPW), and OS starts flushing that to disk > because we exceed dirty_(background)_bytes, that surely interferes with > reads (which is a major issue for queries). Well, it's true that the checkpointer dirty page writes could interfere with reads, but if you've also got lots of FPW-bloated WAL records being written to the same disk at the same time, I would think that'd be worse. No? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/21/2015 01:11 PM, Heikki Linnakangas wrote: > On 21/12/15 13:53, Tomas Vondra wrote: >> On 12/21/2015 12:03 PM, Heikki Linnakangas wrote: >>> On 17/12/15 19:07, Robert Haas wrote: >>>> If it works well empirically, does it really matter that it's >>>> arbitrary? I mean, the entire planner is full of fairly arbitrary >>>> assumptions about which things to consider in the cost model and >>>> which to ignore. The proof that we have made good decisions there >>>> is in the query plans it generates. (The proof that we have made >>>> bad decisions in some cases in the query plans, too.) >>> >>> Agreed. >> >> What if it only seems to work well because it was tested on cases it was >> designed for? What about the workloads that behave differently? >> >> Whenever we do changes to costing and query planning, we carefully >> consider counter-examples and cases where it might fail. I see nothing >> like that in this thread - all I see is a bunch of pgbench tests, which >> seems rather insufficient to me. > > Agreed on that too. > >> I'm ready to spend some time on this, assuming we can agree on what >> tests to run. Can we come up with realistic workloads where we expect >> the patch might actually work poorly? > > I think the worst case scenario would be the case where there is no > FPW-related WAL burst at all, and checkpoints are always triggered by > max_wal_size rather than checkpoint_timeout. In that scenario, the > compensation formula will cause the checkpoint to be too lazy in the > beginning, and it will have to catch up more aggressively towards the > end of the checkpoint cycle. > > One such scenario might be to do only COPYs into a table with no > indexes. Or hack pgbench to do concentrate all the updates on only a few > very rows. There will be a FPW on those few pages initially, but the > spike will be much shorter. Or turn full_page_writes=off, and hack the > patch to do compensation even when fullpage_writes=off, and then just > run pgbench. OK, the COPY scenario works interesting and also realistic because it probably applies to systems doing batch loads. So that's one test to do, can we come up with some other? We probably do want to do a bunch of pgbench tests, with various scales and also distributions - the gaussian/exponential distributions seem useful for simulating OLTP systems that usually have just s small active set (instead of touching all the data). This surely affects how much FPW we do and at what point - my expectetion is that the non-uniform distributions will have a long tail of FPW. So I was thinking about these combinations: * modes: uniform, gaussian, exponential * scales: 1000 (15GB), 10000 (150GB) * clients: 1, 2, 4, 8, 16 (to see impact on scalability, if any) Each combination needs to run for at least an hour or two, possibly with multiple runs. I'll also try running this both on SSD-based sytem and a system with 10k drives, because those will probably behave differently. Also, are we tracking the amount of FPW during the checkpoint, somewhere? That'd be useful, at least for this patch. Or do we need to just track the amount of WAL produced? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 12/23/2015 08:22 PM, Robert Haas wrote: > On Wed, Dec 23, 2015 at 2:16 PM, Tomas Vondra > <tomas.vondra@2ndquadrant.com> wrote: >>> Another point (which Jan Wieck made me think of) is that the optimal >>> behavior here likely depends on whether xlog and data are on the same >>> disk controller. If they aren't, the FPW spike and background writes >>> may not interact as much. >> >> I'm not sure what exactly you mean by "optimal behavior" here. Surely if you >> want to minimize interference between WAL and regular I/O, you'll do that. >> >> But I don't see what that has to do with the writes generated by the >> checkpoint? If we do much more writes at the beginning of the checkpoint >> (due to getting confused by FPW), and OS starts flushing that to disk >> because we exceed dirty_(background)_bytes, that surely interferes with >> reads (which is a major issue for queries). > > Well, it's true that the checkpointer dirty page writes could > interfere with reads, but if you've also got lots of FPW-bloated WAL > records being written to the same disk at the same time, I would think > that'd be worse. No? Yes, sure. My point was that in both cases the "optimal behavior" is not to get confused by the initially higher amount of WAL (due to FPW), and track the "real" un-skewed checkpoint progress. Placing both data and WAL on the same device/controller makes the interference worse, especially when we have a lot of FPW at the beginning of the checkpoint. I.e. there's only one "optimal" behavior for both cases. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>> AFAICR with xlog-triggered checkpoints, the checkpointer progress is >> measured with respect to the size of the WAL file, which does not grow >> linearly in time for the reason you pointed above (a lot of FPW at the >> beginning, less in the end). As the WAL file is growing quickly, the >> checkpointer thinks that it is late and that it has some catchup to do, so >> it will start to try writing quickly as well. There is a double whammy as >> both are writing more, and are probably not succeeding. >> >> For time triggered checkpoints, the WAL file gets filled up *but* the >> checkpointer load is balanced against time. This is a "simple" whammy, where >> the checkpointer uses IO bandwith which is needed for the WAL, and it could >> wait a little bit because the WAL will need less later, but it is not trying >> to catch up by even writing more, so the load shifting needed in this case >> is not the same as the previous case. > > I see your point, but this isn't a function of what triggered the > checkpoint. It's a function of how we measure whether the > already-triggered checkpoint is on schedule - we may be behind either > because of time, or because of xlog, or both. Yes. Indeed the current implementation does some kind of both time & xlog. My reasonning was that for time triggered checkpoints (probably average to low load) the time is likely to be used for the checkpoint schedule, while for xlog-triggered checkpoints (probably higher load) it would be more likely to be the xlog, which is skewed. Anyway careful thinking is needed to balance WAL and checkpointer IOs, only when needed, not a rough formula applied blindly. -- Fabien.
On Wed, Dec 23, 2015 at 04:37:00PM +0100, Fabien COELHO wrote: > Hmmm. Let us try with both hands: > > AFAICR with xlog-triggered checkpoints, the checkpointer progress is > measured with respect to the size of the WAL file, which does not > grow linearly in time for the reason you pointed above (a lot of FPW > at the beginning, less in the end). As the WAL file is growing > quickly, the checkpointer thinks that it is late and that it has > some catchup to do, so it will start to try writing quickly as well. > There is a double whammy as both are writing more, and are probably > not succeeding. > > For time triggered checkpoints, the WAL file gets filled up *but* > the checkpointer load is balanced against time. This is a "simple" > whammy, where the checkpointer uses IO bandwith which is needed for > the WAL, and it could wait a little bit because the WAL will need > less later, but it is not trying to catch up by even writing more, > so the load shifting needed in this case is not the same as the > previous case. > > As you point out there is a WAL spike in both case, but in one case > there is also a checkpointer spike and in the other the checkpointer > load is flat. Coming in late here, but I always thought the fact that the FPW happen mostly at the start of the checkpoint, and the checkpoint writes/fsyncs happen mostly in the first half of the checkpoint period was always suboptimal, i.e. it would be nice of one of these was more active in the second half of the checkpoint period. I assume that is what is being discussed here. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Roman grave inscription +
> Coming in late here, but I always thought the fact that the FPW happen > mostly at the start of the checkpoint, and the checkpoint writes/fsyncs > happen mostly in the first half of the checkpoint period was always > suboptimal, i.e. it would be nice of one of these was more active in the > second half of the checkpoint period. I assume that is what is being > discussed here. Yes, this is the subject of the thread. On the one end hand, whether is the first half or first quarter of first tenth really depends on the actual load, so how much to rebalance depends on that dynamic information. At the beginning there should be a short spike for index pages which are quickly reused, and a longer spike about data pages depending on the pattern of access and size of table. On the other hand the rebalancing also depends on the measure chosen to know about the overall progress, either WAL writing or time, and their behavior is not the same, so this should be taken into account. My conclusion is that there is no simple static fix to this issue, as proposed in the submitted patch. The problem needs thinking and maths. -- Fabien.