Spread checkpoint sync

From: Greg Smith
Subject: Spread checkpoint sync
Date:
Msg-id: 4CE07548.4030709@2ndquadrant.com
Replies: Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
         Re: Spread checkpoint sync  (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-hackers
Final patch in this series for today spreads out the individual
checkpoint fsync calls over time, and was written by myself and Simon
Riggs.  Patch is based against a system that's already had the two
patches I sent over earlier today applied, rather than HEAD, as both are
useful for measuring how well this one works.  You can grab a tree with
all three from my Github repo, via the "checkpoint" branch:
https://github.com/greg2ndQuadrant/postgres/tree/checkpoint

This is a work in progress.  While I've seen this reduce checkpoint
spike latency significantly on a large system, I don't have any
referenceable performance numbers I can share yet.  There are also a
couple of problems I know about, and I'm sure others I haven't thought
of yet.  The first known issue is that it delays manual or other
"forced" checkpoints, which is not necessarily wrong if you really are
serious about spreading syncs out, but it is certainly surprising when
you run into it.  I notice this most when running createdb on a busy
system.  There's no real reason for this to happen; the code passes
down the fact that it's a forced checkpoint but just doesn't act on it yet.

The second issue is that the delay between sync calls is currently
hard-coded, at 3 seconds.  I believe the right path here is to consider
the current checkpoint_completion_target to still be valid, then work
back from there.  That raises the question of what percentage of the
time writes should now be compressed into relative to that, to leave
some time to spread the sync calls.  If we're willing to say "writes
finish in first 1/2 of target, syncs execute in second 1/2", I could
implement that here.  Maybe that ratio needs to be another
tunable.  Still thinking about that part, and it's certainly open to
community debate.  The thing to realize that complicates the design is
that the actual sync execution may take a considerable period of time.
That's much more likely to happen than with an individual write, which
is what the current spread checkpoint code paces, because writes
usually just land in the OS cache.  In the spread sync case, it's easy
for one slow sync to force the rest to fire in quick succession, to
make up for lost time.
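To make the half-and-half idea above concrete, here's a sketch in C of how the per-sync spacing might be derived from checkpoint_timeout and checkpoint_completion_target.  The function name and the fixed 50/50 split are illustrative assumptions for discussion, not something from the patch:

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of the patch: derive the pause to insert
 * between individual fsync calls, assuming "writes finish in the first
 * 1/2 of the completion target, syncs execute in the second 1/2".
 */
double
sync_spacing_secs(double checkpoint_timeout_secs, double completion_target,
                  int files_to_sync)
{
    double      budget = checkpoint_timeout_secs * completion_target;
    double      sync_phase = budget / 2.0;  /* second half reserved for syncs */

    if (files_to_sync <= 0)
        return 0.0;
    return sync_phase / files_to_sync;
}
```

With the defaults of checkpoint_timeout = 300s and completion_target = 0.5, a checkpoint with 25 files to sync would space them 3 seconds apart, which happens to match the hard-coded delay in the patch.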

There's some history behind this design that impacts review.  Circa 8.3
development in 2007, I had experimented with putting some delay between
each of the fsync calls that the background writer executes during a
checkpoint.  It didn't help smooth things out at all at the time.  It
turns out that's mainly because all my tests were on Linux using ext3.
On that filesystem, fsync is not very granular.  It's quite likely it
will push out data you haven't asked to sync yet, which means one giant
sync is almost impossible to avoid no matter how you space the fsync
calls.  If you try and review this on ext3, I expect you'll find a big
spike early in each checkpoint (where it flushes just about everything
out) and then quick response for the later files involved.

The system this patch originated to help fix was running XFS.  There,
I've confirmed that problem doesn't exist: individual syncs only
seem to push out the data related to one file.  The same should be true
on ext4, but I haven't tested that myself.  Not sure how granular the
fsync calls are on Solaris, FreeBSD, Darwin, etc. yet.  Note that it's
still possible to get hung on one sync call for a while, even on XFS.
The worst case seems to be if you've created a new 1GB database table
chunk and fully populated it since the last checkpoint, on a system
that's just cached the whole thing so far.

One change that turned out to be necessary rather than optional--to get
good performance from the system under tuning--was to make regular
background writer activity, including fsync absorb checks, happen during
these sync pauses.  The existing code ran the checkpoint sync work in a
pretty tight loop, which as I alluded to in an earlier patch today can
lead to the backends competing with the background writer to get their
sync calls executed.  This squashes that problem if the background
writer is set up properly.

What does properly mean?  Well, it can't do that cleanup if the
background writer is sleeping.  This whole area was refactored.  The
current sync absorb code uses the constant WRITES_PER_ABSORB to make
decisions.  This new version replaces that hard-coded value with
something that scales to the system size.  It now ignores doing work
until the number of pending absorb requests has reached 10% of the
number possible to store (BgWriterShmem->max_requests, which is set to
the size of shared_buffers in 8K pages, AKA NBuffers).  This may
actually postpone this work for too long on systems with large
shared_buffers settings; that's one area I'm still investigating.
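As a standalone restatement of that trigger logic (the patch does this inline in AbsorbFsyncRequests; the function here is illustrative only):

```c
#include <assert.h>

/*
 * Sketch of the scaled absorb threshold.  max_requests is sized to
 * shared_buffers in 8K pages (NBuffers); absorb work is skipped until
 * the pending queue reaches 1/10 of that, unless the caller forces it.
 */
int
absorb_should_run(int num_requests, int max_requests, int force)
{
    const int   divisor = 10;       /* ABSORB_ACTION_DIVISOR in the patch */

    if (force)
        return 1;
    return num_requests >= (max_requests / divisor);
}
```

For example, with shared_buffers = 1GB, NBuffers is 131072, so absorbing only starts once 13107 requests are queued.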

As for concerns about this 10% setting not doing enough work, which is
something I do see, you can always make absorbing happen more often by
decreasing bgwriter_delay now--which brings other benefits too.
For example, if you run the fsync-stress-v2.sh script I included with
the last patch I sent, you'll discover the spread sync version of the
server leaves just as many unabsorbed writes behind as the old code
did.  Those are happening because of periods the background writer is
sleeping.  They drop as you decrease the delay; here's a table showing
some values I tested here, with all three patches installed:

bgwriter_delay    buffers_backend_sync
200 ms    90
50 ms    28
25 ms    3
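The trend in that table follows from the sleep interval: an otherwise idle background writer only wakes every bgwriter_delay milliseconds, so it gets roughly 1000 / bgwriter_delay chances per second to absorb backend requests.  A back-of-the-envelope model of just that effect (illustrative only; the real loop also runs during checkpoint writes and sync pauses):

```c
#include <assert.h>

/*
 * Illustrative only: absorb opportunities per second for an otherwise
 * idle background writer sleeping bgwriter_delay milliseconds at a time.
 */
int
absorb_chances_per_sec(int bgwriter_delay_ms)
{
    return 1000 / bgwriter_delay_ms;
}
```

Going from 200 ms to 25 ms raises that from 5 to 40 chances per second, consistent with the drop in buffers_backend_sync above.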

There's a bunch of performance related review work that needs to be done
here, in addition to the usual code review for the patch.  My hope is
that I can get enough of that done to validate this does what it's
supposed to on public hardware that a later version of this patch is
considered for the next CommitFest.  It's a little more raw than I'd
like still, but the idea has been tested enough here that I believe it's
fundamentally sound and valuable.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us


diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 43a149e..0ce8e2b 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct

 static BgWriterShmemStruct *BgWriterShmem;

-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB        1000
+/* Fraction of fsync absorb queue that needs to be filled before acting */
+#define ABSORB_ACTION_DIVISOR    10

 /*
  * GUC parameters
@@ -382,7 +382,7 @@ BackgroundWriterMain(void)
         /*
          * Process any requests or signals received recently.
          */
-        AbsorbFsyncRequests();
+        AbsorbFsyncRequests(false);

         if (got_SIGHUP)
         {
@@ -636,7 +636,7 @@ BgWriterNap(void)
         (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
             break;
         pg_usleep(1000000L);
-        AbsorbFsyncRequests();
+        AbsorbFsyncRequests(true);
         udelay -= 1000000L;
     }

@@ -684,8 +684,6 @@ ImmediateCheckpointRequested(void)
 void
 CheckpointWriteDelay(int flags, double progress)
 {
-    static int    absorb_counter = WRITES_PER_ABSORB;
-
     /* Do nothing if checkpoint is being executed by non-bgwriter process */
     if (!am_bg_writer)
         return;
@@ -705,22 +703,65 @@ CheckpointWriteDelay(int flags, double progress)
             ProcessConfigFile(PGC_SIGHUP);
         }

-        AbsorbFsyncRequests();
-        absorb_counter = WRITES_PER_ABSORB;
+        AbsorbFsyncRequests(false);

         BgBufferSync();
         CheckArchiveTimeout();
         BgWriterNap();
     }
-    else if (--absorb_counter <= 0)
+    else
     {
         /*
-         * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-         * operations even when we don't sleep, to prevent overflow of the
-         * fsync request queue.
+         * Check for overflow of the fsync request queue.
          */
-        AbsorbFsyncRequests();
-        absorb_counter = WRITES_PER_ABSORB;
+        AbsorbFsyncRequests(false);
+    }
+}
+
+/*
+ * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+ *
+ * This function is called after each file sync performed by mdsync().
+ * It is responsible for keeping the bgwriter's normal activities in
+ * progress during a long checkpoint.
+ */
+void
+CheckpointSyncDelay(void)
+{
+    pg_time_t    now;
+    pg_time_t    sync_start_time;
+    int            sync_delay_secs;
+
+    /*
+     * Delay after each sync, in seconds.  This could be a parameter.  But
+     * since ideally this will be auto-tuning in the near future, not
+     * assigning it a GUC setting yet.
+     */
+#define EXTRA_SYNC_DELAY    3
+
+    /* Do nothing if checkpoint is being executed by non-bgwriter process */
+    if (!am_bg_writer)
+        return;
+
+    sync_start_time = (pg_time_t) time(NULL);
+
+    /*
+     * Perform the usual bgwriter duties.
+     */
+    for (;;)
+    {
+        AbsorbFsyncRequests(false);
+        BgBufferSync();
+        CheckArchiveTimeout();
+        BgWriterNap();
+
+        /*
+         * Are we there yet?
+         */
+        now = (pg_time_t) time(NULL);
+        sync_delay_secs = now - sync_start_time;
+        if (sync_delay_secs >= EXTRA_SYNC_DELAY)
+            break;
     }
 }

@@ -1116,16 +1157,41 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  * non-bgwriter processes, do nothing if not bgwriter.
  */
 void
-AbsorbFsyncRequests(void)
+AbsorbFsyncRequests(bool force)
 {
     BgWriterRequest *requests = NULL;
     BgWriterRequest *request;
     int            n;

+    /*
+     * Divide the size of the request queue by this to determine when
+     * absorption action needs to be taken.  Default here aims to empty the
+     * queue whenever 1 / 10 = 10% of it is full.  If this isn't good enough,
+     * you probably need to lower bgwriter_delay, rather than presume
+     * this needs to be a tunable you can decrease.
+     */
+    int            absorb_action_divisor = 10;
+
     if (!am_bg_writer)
         return;

     /*
+     * If the queue isn't very large, don't worry about absorbing yet.
+     * Access integer counter without lock, to avoid queuing.
+     */
+    if (!force && BgWriterShmem->num_requests <
+            (BgWriterShmem->max_requests / ABSORB_ACTION_DIVISOR))
+    {
+        if (BgWriterShmem->num_requests > 0)
+            elog(DEBUG1,"Absorb queue: %d fsync requests, not processing",
+                BgWriterShmem->num_requests);
+        return;
+    }
+
+    elog(DEBUG1,"Absorb queue: %d fsync requests, processing",
+        BgWriterShmem->num_requests);
+
+    /*
      * We have to PANIC if we fail to absorb all the pending requests (eg,
      * because our hashtable runs out of memory).  This is because the system
      * cannot run safely if we are unable to fsync what we have been told to
@@ -1167,4 +1233,9 @@ AbsorbFsyncRequests(void)
         pfree(requests);

     END_CRIT_SECTION();
+
+    /*
+     * Send off activity statistics to the stats collector
+     */
+    pgstat_send_bgwriter();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 7140b94..57066c4 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -36,9 +36,6 @@
  */
 #define DEBUG_FSYNC    1

-/* interval for calling AbsorbFsyncRequests in mdsync */
-#define FSYNCS_PER_ABSORB        10
-
 /* special values for the segno arg to RememberFsyncRequest */
 #define FORGET_RELATION_FSYNC    (InvalidBlockNumber)
 #define FORGET_DATABASE_FSYNC    (InvalidBlockNumber-1)
@@ -931,7 +928,6 @@ mdsync(void)

     HASH_SEQ_STATUS hstat;
     PendingOperationEntry *entry;
-    int            absorb_counter;

 #ifdef DEBUG_FSYNC
     /* Statistics on sync times */
@@ -958,7 +954,7 @@ mdsync(void)
      * queued an fsync request before clearing the buffer's dirtybit, so we
      * are safe as long as we do an Absorb after completing BufferSync().
      */
-    AbsorbFsyncRequests();
+    AbsorbFsyncRequests(true);

     /*
      * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
@@ -1001,7 +997,6 @@ mdsync(void)
     mdsync_in_progress = true;

     /* Now scan the hashtable for fsync requests to process */
-    absorb_counter = FSYNCS_PER_ABSORB;
     hash_seq_init(&hstat, pendingOpsTable);
     while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
     {
@@ -1026,17 +1021,9 @@ mdsync(void)
             int            failures;

             /*
-             * If in bgwriter, we want to absorb pending requests every so
-             * often to prevent overflow of the fsync request queue.  It is
-             * unspecified whether newly-added entries will be visited by
-             * hash_seq_search, but we don't care since we don't need to
-             * process them anyway.
+             * If in bgwriter, perform normal duties.
              */
-            if (--absorb_counter <= 0)
-            {
-                AbsorbFsyncRequests();
-                absorb_counter = FSYNCS_PER_ABSORB;
-            }
+            CheckpointSyncDelay();

             /*
              * The fsync table could contain requests to fsync segments that
@@ -1131,10 +1118,9 @@ mdsync(void)
                 pfree(path);

                 /*
-                 * Absorb incoming requests and check to see if canceled.
+                 * If in bgwriter, perform normal duties.
                  */
-                AbsorbFsyncRequests();
-                absorb_counter = FSYNCS_PER_ABSORB;        /* might as well... */
+                CheckpointSyncDelay();

                 if (entry->canceled)
                     break;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index e251da6..4939604 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -26,10 +26,11 @@ extern void BackgroundWriterMain(void);

 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointSyncDelay(void);

 extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
                     BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern void AbsorbFsyncRequests(bool force);

 extern Size BgWriterShmemSize(void);
 extern void BgWriterShmemInit(void);
