Re: Spread checkpoint sync

From: Greg Smith
Subject: Re: Spread checkpoint sync
Date:
Msg-id: 4CF55EC5.5000108@2ndquadrant.com
In reply to: Re: Spread checkpoint sync  (Ron Mayer <rm_pg@cheapcomplexdevices.com>)
Responses: Re: Spread checkpoint sync  (Josh Berkus <josh@agliodbs.com>)
Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Ron Mayer wrote:
> Might smoother checkpoints be better solved by talking
> to the OS vendors & virtual-memory-tuning-knob-authors
> to work with them on exposing the ideal knobs; rather than
> saying that our only tool is a hammer (fsync) so the problem
> must be handled as a nail.
>

Maybe, but it's hard to argue that the current implementation--just
doing all of the sync calls as fast as possible, one after the other--isn't
going to produce worst-case behavior in a lot of situations.  Given that
it's not a huge amount of code to do better, I'd rather do some work in
that direction than presume the kernel authors will ever make this go
away.  Spreading the writes out as part of the checkpoint rework in 8.3
worked better than any kernel changes I've tested since then, and I'm
not really optimistic about this getting resolved at the system level.
So long as the database changes aren't antagonistic toward kernel
improvements, I'd prefer to have some options here that become effective
as soon as the database code is done.
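As a loose illustration of the spreading idea (not the patch's actual code; the helper name, the fixed delay, and the error handling are all invented here), pacing a batch of fsync calls rather than issuing them back-to-back looks something like this:

```c
#include <unistd.h>

/*
 * Hypothetical sketch only: sync each file in turn, sleeping between
 * calls so the storage subsystem gets some breathing room, instead of
 * issuing every fsync as fast as possible.  Returns the number of
 * files synced, or -1 on the first failure.
 */
static int
sync_files_spread(const int *fds, int nfiles, unsigned delay_usec)
{
    int synced = 0;

    for (int i = 0; i < nfiles; i++)
    {
        if (fsync(fds[i]) != 0)
            return -1;              /* caller decides how to recover */
        synced++;
        if (i + 1 < nfiles)
            usleep(delay_usec);     /* spread the sync load out */
    }
    return synced;
}
```

In the patch itself, the pause between syncs is where CheckpointSyncDelay() keeps the bgwriter's normal duties running.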

I've attached an updated version of the initial sync spreading patch
here, one that applies cleanly both on top of HEAD and on top of the
sync instrumentation patch.  The conflict that previously made that
difficult is gone now.

Having the pg_stat_bgwriter.buffers_backend_fsync patch available all
the time now has made me reconsider how important one potential bit of
refactoring here would be.  On my test system, I managed to catch one of
the situations where really popular relations were being heavily updated
in a way that competed with the checkpoint (I can happily share the
logs), with the instrumentation patch applied but not the spread sync
one:

LOG:  checkpoint starting: xlog
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 7747 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 42688 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 9723 of relation base/16424/16442
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 58117 of relation base/16424/16437
DEBUG:  could not forward fsync request because request queue is full
CONTEXT:  writing block 165128 of relation base/16424/16437
[330 of these total, all referring to the same two relations]

DEBUG:  checkpoint sync: number=1 file=base/16424/16448_fsm time=10132.830000 msec
DEBUG:  checkpoint sync: number=2 file=base/16424/11645 time=0.001000 msec
DEBUG:  checkpoint sync: number=3 file=base/16424/16437 time=7.796000 msec
DEBUG:  checkpoint sync: number=4 file=base/16424/16448 time=4.679000 msec
DEBUG:  checkpoint sync: number=5 file=base/16424/11607 time=0.001000 msec
DEBUG:  checkpoint sync: number=6 file=base/16424/16437.1 time=3.101000 msec
DEBUG:  checkpoint sync: number=7 file=base/16424/16442 time=4.172000 msec
DEBUG:  checkpoint sync: number=8 file=base/16424/16428_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=9 file=base/16424/16437_fsm time=0.001000 msec
DEBUG:  checkpoint sync: number=10 file=base/16424/16428 time=0.001000 msec
DEBUG:  checkpoint sync: number=11 file=base/16424/16425 time=0.000000 msec
DEBUG:  checkpoint sync: number=12 file=base/16424/16437_vm time=0.001000 msec
DEBUG:  checkpoint sync: number=13 file=base/16424/16425_vm time=0.001000 msec
LOG:  checkpoint complete: wrote 3032 buffers (74.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=1.742 s, sync=10.153 s, total=37.654 s; sync files=13, longest=10.132 s, average=0.779 s

Note here how the checkpoint was hung trying to get 16448_fsm written
out, while the backends were issuing a constant stream of competing
fsync requests against those other relations.  This is very similar to
the production case this patch was written to address, which I hadn't
been able to share a good example of yet.  That's essentially what it
looks like, except with the contention going on for minutes instead of
seconds.
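To make the mechanism behind those "request queue is full" messages concrete, here is a toy model of the forwarding path (the names and the tiny queue size are invented for the sketch; this is not the server code): a backend first tries to hand its sync request to the background writer's fixed-size queue, and only falls back to calling fsync itself, which is what buffers_backend_fsync counts, when the queue is full:

```c
#include <stdbool.h>

/* Deliberately tiny capacity so the fallback path is easy to exercise. */
#define QUEUE_CAPACITY 4

typedef struct
{
    int  requests[QUEUE_CAPACITY];  /* pending block numbers */
    int  num_requests;
} SketchQueue;

/* Try to queue a request for the background writer; false when full. */
static bool
forward_request(SketchQueue *q, int block)
{
    if (q->num_requests >= QUEUE_CAPACITY)
        return false;
    q->requests[q->num_requests++] = block;
    return true;
}

/*
 * What a backend write boils down to in this model: forward the sync
 * request if possible, otherwise count a backend-side fsync (where the
 * real server would fsync the file itself).  Returns the running count.
 */
static int
backend_write(SketchQueue *q, int block, int *backend_fsync_count)
{
    if (!forward_request(q, block))
        (*backend_fsync_count)++;
    return *backend_fsync_count;
}
```

With six writes against a four-slot queue, two of them end up as backend fsyncs; in the log above it was 330 forwarding failures against the same two relations.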

One of the ideas Simon and I had been considering at one point was
adding some better de-duplication logic to the fsync absorb code; the
pattern here reminds me that it might be helpful independently of the
other improvements.
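A minimal sketch of what that de-duplication might look like, assuming a flat request array and invented type names (the real absorb code works against shared memory and a hash table):

```c
#include <stdbool.h>

/* One pending sync request: a relation and a segment within it. */
typedef struct
{
    unsigned rel;
    unsigned segno;
} SyncRequest;

/*
 * Compact the pending request array in place so repeated
 * (relation, segment) pairs -- like the hundreds of requests against
 * the same two relations in the log above -- collapse into one entry.
 * O(n^2) scan for clarity only; returns the new request count.
 */
static int
compact_requests(SyncRequest *req, int n)
{
    int kept = 0;

    for (int i = 0; i < n; i++)
    {
        bool dup = false;

        for (int j = 0; j < kept; j++)
        {
            if (req[j].rel == req[i].rel && req[j].segno == req[i].segno)
            {
                dup = true;
                break;
            }
        }
        if (!dup)
            req[kept++] = req[i];
    }
    return kept;
}
```

Since one fsync covers every write to a file since the last sync, duplicate entries only cost queue space, which is exactly what was running out above.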

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 620b197..501cab8 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -143,8 +143,8 @@ typedef struct

 static BgWriterShmemStruct *BgWriterShmem;

-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB        1000
+/* Fraction of fsync absorb queue that needs to be filled before acting */
+#define ABSORB_ACTION_DIVISOR    10

 /*
  * GUC parameters
@@ -382,7 +382,7 @@ BackgroundWriterMain(void)
         /*
          * Process any requests or signals received recently.
          */
-        AbsorbFsyncRequests();
+        AbsorbFsyncRequests(false);

         if (got_SIGHUP)
         {
@@ -636,7 +636,7 @@ BgWriterNap(void)
         (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
             break;
         pg_usleep(1000000L);
-        AbsorbFsyncRequests();
+        AbsorbFsyncRequests(true);
         udelay -= 1000000L;
     }

@@ -684,8 +684,6 @@ ImmediateCheckpointRequested(void)
 void
 CheckpointWriteDelay(int flags, double progress)
 {
-    static int    absorb_counter = WRITES_PER_ABSORB;
-
     /* Do nothing if checkpoint is being executed by non-bgwriter process */
     if (!am_bg_writer)
         return;
@@ -705,22 +703,65 @@ CheckpointWriteDelay(int flags, double progress)
             ProcessConfigFile(PGC_SIGHUP);
         }

-        AbsorbFsyncRequests();
-        absorb_counter = WRITES_PER_ABSORB;
+        AbsorbFsyncRequests(false);

         BgBufferSync();
         CheckArchiveTimeout();
         BgWriterNap();
     }
-    else if (--absorb_counter <= 0)
+    else
     {
         /*
-         * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-         * operations even when we don't sleep, to prevent overflow of the
-         * fsync request queue.
+         * Check for overflow of the fsync request queue.
          */
-        AbsorbFsyncRequests();
-        absorb_counter = WRITES_PER_ABSORB;
+        AbsorbFsyncRequests(false);
+    }
+}
+
+/*
+ * CheckpointSyncDelay -- yield control to bgwriter during a checkpoint
+ *
+ * This function is called after each file sync performed by mdsync().
+ * It is responsible for keeping the bgwriter's normal activities in
+ * progress during a long checkpoint.
+ */
+void
+CheckpointSyncDelay(void)
+{
+    pg_time_t    now;
+    pg_time_t    sync_start_time;
+    int            sync_delay_secs;
+
+    /*
+     * Delay after each sync, in seconds.  This could be a parameter.  But
+     * since ideally this will be auto-tuning in the near future, not
+     * assigning it a GUC setting yet.
+     */
+#define EXTRA_SYNC_DELAY    3
+
+    /* Do nothing if checkpoint is being executed by non-bgwriter process */
+    if (!am_bg_writer)
+        return;
+
+    sync_start_time = (pg_time_t) time(NULL);
+
+    /*
+     * Perform the usual bgwriter duties.
+     */
+    for (;;)
+    {
+        AbsorbFsyncRequests(false);
+        BgBufferSync();
+        CheckArchiveTimeout();
+        BgWriterNap();
+
+        /*
+         * Are we there yet?
+         */
+        now = (pg_time_t) time(NULL);
+        sync_delay_secs = now - sync_start_time;
+        if (sync_delay_secs >= EXTRA_SYNC_DELAY)
+            break;
     }
 }

@@ -1116,16 +1157,41 @@ ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  * non-bgwriter processes, do nothing if not bgwriter.
  */
 void
-AbsorbFsyncRequests(void)
+AbsorbFsyncRequests(bool force)
 {
     BgWriterRequest *requests = NULL;
     BgWriterRequest *request;
     int            n;

+    /*
+     * The request queue size is divided by ABSORB_ACTION_DIVISOR to
+     * determine when absorption action needs to be taken.  The default
+     * aims to empty the queue whenever 1 / 10 = 10% of it is full.  If
+     * this isn't good enough, you probably need to lower bgwriter_delay,
+     * rather than presume this needs to be a tunable you can decrease.
+     */
+
     if (!am_bg_writer)
         return;

     /*
+     * If the queue isn't very large, don't worry about absorbing yet.
+     * Access integer counter without lock, to avoid queuing.
+     */
+    if (!force && BgWriterShmem->num_requests <
+            (BgWriterShmem->max_requests / ABSORB_ACTION_DIVISOR))
+    {
+        if (BgWriterShmem->num_requests > 0)
+            elog(DEBUG1,"Absorb queue: %d fsync requests, not processing",
+                BgWriterShmem->num_requests);
+        return;
+    }
+
+    elog(DEBUG1,"Absorb queue: %d fsync requests, processing",
+        BgWriterShmem->num_requests);
+
+    /*
      * We have to PANIC if we fail to absorb all the pending requests (eg,
      * because our hashtable runs out of memory).  This is because the system
      * cannot run safely if we are unable to fsync what we have been told to
@@ -1164,4 +1230,9 @@ AbsorbFsyncRequests(void)
         pfree(requests);

     END_CRIT_SECTION();
+
+    /*
+     * Send off activity statistics to the stats collector
+     */
+    pgstat_send_bgwriter();
 }
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index cadd938..c89486e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -31,9 +31,6 @@
 #include "pg_trace.h"


-/* interval for calling AbsorbFsyncRequests in mdsync */
-#define FSYNCS_PER_ABSORB        10
-
 /* special values for the segno arg to RememberFsyncRequest */
 #define FORGET_RELATION_FSYNC    (InvalidBlockNumber)
 #define FORGET_DATABASE_FSYNC    (InvalidBlockNumber-1)
@@ -926,7 +923,6 @@ mdsync(void)

     HASH_SEQ_STATUS hstat;
     PendingOperationEntry *entry;
-    int            absorb_counter;

     /* Statistics on sync times */
     int processed = 0;
@@ -951,7 +947,7 @@ mdsync(void)
      * queued an fsync request before clearing the buffer's dirtybit, so we
      * are safe as long as we do an Absorb after completing BufferSync().
      */
-    AbsorbFsyncRequests();
+    AbsorbFsyncRequests(true);

     /*
      * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
@@ -994,7 +990,6 @@ mdsync(void)
     mdsync_in_progress = true;

     /* Now scan the hashtable for fsync requests to process */
-    absorb_counter = FSYNCS_PER_ABSORB;
     hash_seq_init(&hstat, pendingOpsTable);
     while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
     {
@@ -1019,17 +1014,9 @@ mdsync(void)
             int            failures;

             /*
-             * If in bgwriter, we want to absorb pending requests every so
-             * often to prevent overflow of the fsync request queue.  It is
-             * unspecified whether newly-added entries will be visited by
-             * hash_seq_search, but we don't care since we don't need to
-             * process them anyway.
+             * If in bgwriter, perform normal duties.
              */
-            if (--absorb_counter <= 0)
-            {
-                AbsorbFsyncRequests();
-                absorb_counter = FSYNCS_PER_ABSORB;
-            }
+            CheckpointSyncDelay();

             /*
              * The fsync table could contain requests to fsync segments that
@@ -1121,10 +1108,9 @@ mdsync(void)
                 pfree(path);

                 /*
-                 * Absorb incoming requests and check to see if canceled.
+                 * If in bgwriter, perform normal duties.
                  */
-                AbsorbFsyncRequests();
-                absorb_counter = FSYNCS_PER_ABSORB;        /* might as well... */
+                CheckpointSyncDelay();

                 if (entry->canceled)
                     break;
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index e251da6..4939604 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -26,10 +26,11 @@ extern void BackgroundWriterMain(void);

 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
+extern void CheckpointSyncDelay(void);

 extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
                     BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern void AbsorbFsyncRequests(bool force);

 extern Size BgWriterShmemSize(void);
 extern void BgWriterShmemInit(void);
