Discussion: Background writer process

Background writer process

From
Jan Wieck
Date:
The attached diff is another attempt for distributing the write IO.

It is a separate background process much like the checkpointer. Its
purpose is to keep the number of dirty blocks in the buffer cache at a
reasonable level and to try to ensure that the buffers returned by the
strategy for replacement are always clean. This current shot does it
this way:

     - get a list of all dirty blocks in strategy replacement order
     - flush n percent of that list or a maximum of m buffers
       (whatever is smaller)
     - issue a sync()
     - sleep for x milliseconds

If there is nothing to do, it will sleep for 10 seconds before checking
again at all. It acquires a checkpoint lock during the flush, so it will
yield for a real checkpoint.
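
The per-round limit described above can be sketched in C as follows. This is a minimal sketch that mirrors the capping logic in the BufferSync() hunk of the attached diff; the helper name buffers_to_flush is hypothetical and does not appear in the patch. Note that, as in the patch, the percentage is only applied when more than 10 buffers are dirty, and a non-positive percent means "flush everything", which is what the checkpoint callers request with BufferSync(-1, -1).

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): given the number of dirty
 * buffers found in strategy replacement order, return how many of them
 * one background writer round would write out.
 */
static int
buffers_to_flush(int num_dirty, int percent, int maxpages)
{
    /* percent <= 0: caller wants all dirty buffers written */
    if (percent > 0 && num_dirty > 10)
    {
        assert(percent <= 100);

        /* take n percent of the list ... */
        num_dirty = (num_dirty * percent) / 100;

        /* ... but never more than m buffers per round */
        if (maxpages > 0 && num_dirty > maxpages)
            num_dirty = maxpages;
    }
    return num_dirty;
}
```

With percent = 5 and maxpages = 100, a list of 1000 dirty buffers yields 50 writes in one round, while a list of 10000 dirty buffers is capped at 100.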

For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.
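
To illustrate the difference between a global sync() and a per-file fsync(), here is a minimal sketch using plain POSIX calls. It is not the planned PostgreSQL implementation: the patch as posted still calls sync(), and the helper name write_and_fsync is an assumption made for this example only.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to path, then ask the kernel to
 * flush only that file.  Returns 0 on success, -1 on any failure.
 */
static int
write_and_fsync(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    /* write() may be partial or interrupted, so loop until done */
    while (len > 0)
    {
        ssize_t n = write(fd, data, len);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            close(fd);
            return -1;
        }
        data += n;
        len -= (size_t) n;
    }

    /* unlike sync(), this only targets the file behind fd */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

A background writer built along these lines would have to remember which files it wrote to during a round and call fsync() on each of them, instead of forcing out every dirty page in the system.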


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Index: src/backend/bootstrap/bootstrap.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/bootstrap/bootstrap.c,v
retrieving revision 1.166
diff -c -r1.166 bootstrap.c
*** src/backend/bootstrap/bootstrap.c    2003/09/02 19:04:12    1.166
--- src/backend/bootstrap/bootstrap.c    2003/11/13 18:39:51
***************
*** 428,435 ****

      BaseInit();

      if (IsUnderPostmaster)
!         InitDummyProcess();        /* needed to get LWLocks */

      /*
       * XLOG operations
--- 428,447 ----

      BaseInit();

+     /* needed to get LWLocks */
      if (IsUnderPostmaster)
!     {
!         switch (xlogop)
!         {
!             case BS_XLOG_BGWRITER:
!                 InitDummyProcess(DUMMY_PROC_BGWRITER);
!                 break;
!
!             default:
!                 InitDummyProcess(DUMMY_PROC_DEFAULT);
!                 break;
!         }
!     }

      /*
       * XLOG operations
***************
*** 451,456 ****
--- 463,473 ----
              CreateCheckPoint(false, false);
              SetSavedRedoRecPtr();        /* pass redo ptr back to
                                           * postmaster */
+             proc_exit(0);        /* done */
+
+         case BS_XLOG_BGWRITER:
+             CreateDummyCaches();
+             BufferBackgroundWriter();
              proc_exit(0);        /* done */

          case BS_XLOG_STARTUP:
Index: src/backend/catalog/index.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/catalog/index.c,v
retrieving revision 1.221
diff -c -r1.221 index.c
*** src/backend/catalog/index.c    2003/11/12 21:15:48    1.221
--- src/backend/catalog/index.c    2003/11/13 16:19:07
***************
*** 1043,1049 ****
          /* Send out shared cache inval if necessary */
          if (!IsBootstrapProcessingMode())
              CacheInvalidateHeapTuple(pg_class, tuple);
!         BufferSync();
      }
      else if (dirty)
      {
--- 1043,1049 ----
          /* Send out shared cache inval if necessary */
          if (!IsBootstrapProcessingMode())
              CacheInvalidateHeapTuple(pg_class, tuple);
!         BufferSync(-1, -1);
      }
      else if (dirty)
      {
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/commands/dbcommands.c,v
retrieving revision 1.126
diff -c -r1.126 dbcommands.c
*** src/backend/commands/dbcommands.c    2003/11/12 21:15:50    1.126
--- src/backend/commands/dbcommands.c    2003/11/13 16:19:07
***************
*** 317,323 ****
       * up-to-date for the copy.  (We really only need to flush buffers for
       * the source database...)
       */
!     BufferSync();

      /*
       * Close virtual file descriptors so the kernel has more available for
--- 317,323 ----
       * up-to-date for the copy.  (We really only need to flush buffers for
       * the source database...)
       */
!     BufferSync(-1, -1);

      /*
       * Close virtual file descriptors so the kernel has more available for
***************
*** 454,460 ****
       * will see the new database in pg_database right away.  (They'll see
       * an uncommitted tuple, but they don't care; see GetRawDatabaseInfo.)
       */
!     BufferSync();
  }


--- 454,460 ----
       * will see the new database in pg_database right away.  (They'll see
       * an uncommitted tuple, but they don't care; see GetRawDatabaseInfo.)
       */
!     BufferSync(-1, -1);
  }


***************
*** 591,597 ****
       * (They'll see an uncommitted deletion, but they don't care; see
       * GetRawDatabaseInfo.)
       */
!     BufferSync();
  }


--- 591,597 ----
       * (They'll see an uncommitted deletion, but they don't care; see
       * GetRawDatabaseInfo.)
       */
!     BufferSync(-1, -1);
  }


***************
*** 688,694 ****
       * see an uncommitted tuple, but they don't care; see
       * GetRawDatabaseInfo.)
       */
!     BufferSync();
  }


--- 688,694 ----
       * see an uncommitted tuple, but they don't care; see
       * GetRawDatabaseInfo.)
       */
!     BufferSync(-1, -1);
  }


Index: src/backend/postmaster/postmaster.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/postmaster/postmaster.c,v
retrieving revision 1.348
diff -c -r1.348 postmaster.c
*** src/backend/postmaster/postmaster.c    2003/11/11 01:09:42    1.348
--- src/backend/postmaster/postmaster.c    2003/11/13 19:39:58
***************
*** 205,210 ****
--- 205,213 ----
  int            CheckPointTimeout = 300;
  int            CheckPointWarning = 30;
  time_t        LastSignalledCheckpoint = 0;
+ int            BgWriterDelay = 500;
+ int            BgWriterPercent = 0;
+ int            BgWriterMaxpages = 100;

  bool        log_hostname;        /* for ps display */
  bool        LogSourcePort;
***************
*** 224,230 ****
  /* Startup/shutdown state */
  static pid_t StartupPID = 0,
              ShutdownPID = 0,
!             CheckPointPID = 0;
  static time_t checkpointed = 0;

  #define            NoShutdown        0
--- 227,234 ----
  /* Startup/shutdown state */
  static pid_t StartupPID = 0,
              ShutdownPID = 0,
!             CheckPointPID = 0,
!             BgWriterPID = 0;
  static time_t checkpointed = 0;

  #define            NoShutdown        0
***************
*** 298,303 ****
--- 302,308 ----

  #define StartupDataBase()        SSDataBase(BS_XLOG_STARTUP)
  #define CheckPointDataBase()    SSDataBase(BS_XLOG_CHECKPOINT)
+ #define StartBackgroundWriter()    SSDataBase(BS_XLOG_BGWRITER)
  #define ShutdownDataBase()        SSDataBase(BS_XLOG_SHUTDOWN)


***************
*** 1056,1061 ****
--- 1061,1077 ----
          }

          /*
+          * If no background writer process is running and we should
+          * do background writing, start one. It doesn't matter if
+          * this fails, we'll just try again later.
+          */
+         if (BgWriterPID == 0 && BgWriterPercent > 0 &&
+                 Shutdown == NoShutdown && !FatalError && random_seed != 0)
+         {
+             BgWriterPID = StartBackgroundWriter();
+         }
+
+         /*
           * Wait for something to happen.
           */
          memcpy((char *) &rmask, (char *) &readmask, sizeof(fd_set));
***************
*** 1478,1483 ****
--- 1494,1506 ----
                                   backendPID)));
          return;
      }
+     else if (backendPID == BgWriterPID)
+     {
+         ereport(DEBUG2,
+                 (errmsg_internal("ignoring cancel request for bgwriter process %d",
+                                  backendPID)));
+         return;
+     }
      else if (ExecBackend)
          AttachSharedMemoryAndSemaphores();

***************
*** 1660,1665 ****
--- 1683,1695 ----
          SignalChildren(SIGHUP);
          load_hba();
          load_ident();
+
+         /*
+          * Tell the background writer to terminate so that we
+          * will start a new one with a possibly changed config
+          */
+         if (BgWriterPID != 0)
+             kill(BgWriterPID, SIGTERM);
      }

      PG_SETMASK(&UnBlockSig);
***************
*** 1692,1697 ****
--- 1722,1729 ----
               *
               * Wait for children to end their work and ShutdownDataBase.
               */
+             if (BgWriterPID != 0)
+                 kill(BgWriterPID, SIGTERM);
              if (Shutdown >= SmartShutdown)
                  break;
              Shutdown = SmartShutdown;
***************
*** 1724,1729 ****
--- 1756,1763 ----
               * abort all children with SIGTERM (rollback active transactions
               * and exit) and ShutdownDataBase when they are gone.
               */
+             if (BgWriterPID != 0)
+                 kill(BgWriterPID, SIGTERM);
              if (Shutdown >= FastShutdown)
                  break;
              ereport(LOG,
***************
*** 1770,1775 ****
--- 1804,1811 ----
               * abort all children with SIGQUIT and exit without attempt to
               * properly shutdown data base system.
               */
+             if (BgWriterPID != 0)
+                 kill(BgWriterPID, SIGQUIT);
              ereport(LOG,
                      (errmsg("received immediate shutdown request")));
              if (ShutdownPID > 0)
***************
*** 1877,1882 ****
--- 1913,1924 ----
              CheckPointPID = 0;
              checkpointed = time(NULL);

+             if (BgWriterPID == 0 && BgWriterPercent > 0 &&
+                 Shutdown == NoShutdown && !FatalError && random_seed != 0)
+             {
+                 BgWriterPID = StartBackgroundWriter();
+             }
+
              /*
               * Go to shutdown mode if a shutdown request was pending.
               */
***************
*** 1983,1988 ****
--- 2025,2032 ----
                  GetSavedRedoRecPtr();
              }
          }
+         else if (pid == BgWriterPID)
+             BgWriterPID = 0;
          else
              pgstat_beterm(pid);

***************
*** 1996,2001 ****
--- 2040,2046 ----
      {
          LogChildExit(LOG,
                   (pid == CheckPointPID) ? gettext("checkpoint process") :
+                  (pid == BgWriterPID) ? gettext("bgwriter process") :
                       gettext("server process"),
                       pid, exitstatus);
          ereport(LOG,
***************
*** 2044,2049 ****
--- 2089,2098 ----
          CheckPointPID = 0;
          checkpointed = 0;
      }
+     else if (pid == BgWriterPID)
+     {
+         BgWriterPID = 0;
+     }
      else
      {
          /*
***************
*** 2754,2759 ****
--- 2803,2810 ----
      }
      if (CheckPointPID != 0)
          cnt--;
+     if (BgWriterPID != 0)
+         cnt--;
      return cnt;
  }

***************
*** 2827,2832 ****
--- 2878,2886 ----
              case BS_XLOG_CHECKPOINT:
                  statmsg = "checkpoint subprocess";
                  break;
+             case BS_XLOG_BGWRITER:
+                 statmsg = "bgwriter subprocess";
+                 break;
              case BS_XLOG_SHUTDOWN:
                  statmsg = "shutdown subprocess";
                  break;
***************
*** 2883,2888 ****
--- 2937,2946 ----
                  ereport(LOG,
                        (errmsg("could not fork checkpoint process: %m")));
                  break;
+             case BS_XLOG_BGWRITER:
+                 ereport(LOG,
+                       (errmsg("could not fork bgwriter process: %m")));
+                 break;
              case BS_XLOG_SHUTDOWN:
                  ereport(LOG,
                          (errmsg("could not fork shutdown process: %m")));
***************
*** 2895,2913 ****

          /*
           * fork failure is fatal during startup/shutdown, but there's no
!          * need to choke if a routine checkpoint fails.
           */
          if (xlop == BS_XLOG_CHECKPOINT)
              return 0;
          ExitPostmaster(1);
      }

      /*
       * The startup and shutdown processes are not considered normal
!      * backends, but the checkpoint process is.  Checkpoint must be added
!      * to the list of backends.
       */
!     if (xlop == BS_XLOG_CHECKPOINT)
      {
          if (!(bn = (Backend *) malloc(sizeof(Backend))))
          {
--- 2953,2974 ----

          /*
           * fork failure is fatal during startup/shutdown, but there's no
!          * need to choke if a routine checkpoint or starting a background
!          * writer fails.
           */
          if (xlop == BS_XLOG_CHECKPOINT)
              return 0;
+         if (xlop == BS_XLOG_BGWRITER)
+             return 0;
          ExitPostmaster(1);
      }

      /*
       * The startup and shutdown processes are not considered normal
!      * backends, but the checkpoint and bgwriter processes are.
!      * They must be added to the list of backends.
       */
!     if (xlop == BS_XLOG_CHECKPOINT || xlop == BS_XLOG_BGWRITER)
      {
          if (!(bn = (Backend *) malloc(sizeof(Backend))))
          {
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.144
diff -c -r1.144 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    2003/11/13 14:57:15    1.144
--- src/backend/storage/buffer/bufmgr.c    2003/11/13 18:41:48
***************
*** 44,49 ****
--- 44,50 ----
  #include <sys/file.h>
  #include <math.h>
  #include <signal.h>
+ #include <unistd.h>

  #include "lib/stringinfo.h"
  #include "miscadmin.h"
***************
*** 679,688 ****
  /*
   * BufferSync -- Write all dirty buffers in the pool.
   *
!  * This is called at checkpoint time and writes out all dirty shared buffers.
   */
! void
! BufferSync(void)
  {
      int            i;
      BufferDesc *bufHdr;
--- 680,690 ----
  /*
   * BufferSync -- Write all dirty buffers in the pool.
   *
!  * This is called at checkpoint time and writes out all dirty shared buffers,
!  * and by the background writer process to write out some of the dirty blocks.
   */
! int
! BufferSync(int percent, int maxpages)
  {
      int            i;
      BufferDesc *bufHdr;
***************
*** 703,714 ****
       * have to wait until the next checkpoint.
       */
      buffer_dirty = (int *)palloc(NBuffers * sizeof(int));
!     num_buffer_dirty = 0;
!
      LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
      num_buffer_dirty = StrategyDirtyBufferList(buffer_dirty, NBuffers);
      LWLockRelease(BufMgrLock);

      for (i = 0; i < num_buffer_dirty; i++)
      {
          Buffer        buffer;
--- 705,728 ----
       * have to wait until the next checkpoint.
       */
      buffer_dirty = (int *)palloc(NBuffers * sizeof(int));
!
      LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
      num_buffer_dirty = StrategyDirtyBufferList(buffer_dirty, NBuffers);
      LWLockRelease(BufMgrLock);

+     /*
+      * If called by the background writer, we are usually asked to
+      * only write out some percentage of dirty buffers now, to prevent
+      * the IO storm at checkpoint time.
+      */
+     if (percent > 0 && num_buffer_dirty > 10)
+     {
+         Assert(percent <= 100);
+         num_buffer_dirty = (num_buffer_dirty * percent) / 100;
+         if (maxpages > 0 && num_buffer_dirty > maxpages)
+             num_buffer_dirty = maxpages;
+     }
+
      for (i = 0; i < num_buffer_dirty; i++)
      {
          Buffer        buffer;
***************
*** 854,859 ****
--- 868,875 ----

      /* Pop the error context stack */
      error_context_stack = errcontext.previous;
+
+     return num_buffer_dirty;
  }

  /*
***************
*** 984,991 ****
  void
  FlushBufferPool(void)
  {
!     BufferSync();
      smgrsync();
  }

  /*
--- 1000,1064 ----
  void
  FlushBufferPool(void)
  {
!     BufferSync(-1, -1);
      smgrsync();
+ }
+
+ void
+ BufferBackgroundWriter(void)
+ {
+     if (BgWriterPercent == 0)
+         return;
+
+     for (;;)
+     {
+         int n;
+
+         /*
+          * Acquire a CheckpointLock to suspend background writing
+          * while a real checkpoint is going on.
+          */
+         while (!LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
+         {
+             if (InterruptPending)
+                 return;
+             sleep(1);
+         }
+
+         /*
+          * Call BufferSync() with instructions to keep just the
+          * LRU heads clean.
+          */
+         n = BufferSync(BgWriterPercent, BgWriterMaxpages);
+
+         /*
+          * Release the CheckpointLock
+          */
+         LWLockRelease(CheckpointLock);
+
+         /*
+          * Whatever signal is sent to us, let's just die gallantly. If
+          * it wasn't meant that way, the postmaster will reincarnate us.
+          */
+         if (InterruptPending)
+             return;
+
+         /*
+          * If there was nothing to flush, sleep for 10 seconds. If there
+          * was, pg_fsync() recently written files and nap.
+          */
+         if (n > 0)
+         {
+             /*
+              * TODO: This sync must be replaced with calls to
+              *       pg_fdatasync() for recently written files.
+              */
+             sync();
+             PG_DELAY(BgWriterDelay);
+         }
+         else
+             sleep(10);
+     }
  }

  /*
Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.34
diff -c -r1.34 freelist.c
*** src/backend/storage/buffer/freelist.c    2003/11/13 14:57:15    1.34
--- src/backend/storage/buffer/freelist.c    2003/11/13 19:45:32
***************
*** 190,197 ****
--- 190,217 ----
          if (StrategyControl->stat_report + BufferStrategyStatInterval < now)
          {
              long    all_hit, b1_hit, t1_hit, t2_hit, b2_hit;
+             int        id, t1_clean, t2_clean;
              ErrorContextCallback    *errcxtold;

+             id = StrategyControl->listHead[STRAT_LIST_T1];
+             t1_clean = 0;
+             while (id >= 0)
+             {
+                 if (BufferDescriptors[StrategyCDB[id].buf_id].flags & BM_DIRTY)
+                     break;
+                 t1_clean++;
+                 id = StrategyCDB[id].next;
+             }
+             id = StrategyControl->listHead[STRAT_LIST_T2];
+             t2_clean = 0;
+             while (id >= 0)
+             {
+                 if (BufferDescriptors[StrategyCDB[id].buf_id].flags & BM_DIRTY)
+                     break;
+                 t2_clean++;
+                 id = StrategyCDB[id].next;
+             }
+
              if (StrategyControl->num_lookup == 0)
              {
                  all_hit = b1_hit = t1_hit = t2_hit = b2_hit = 0;
***************
*** 215,220 ****
--- 235,242 ----
                      T1_TARGET, B1_LENGTH, T1_LENGTH, T2_LENGTH, B2_LENGTH);
              elog(DEBUG1, "ARC total   =%4ld%% B1hit=%4ld%% T1hit=%4ld%% T2hit=%4ld%% B2hit=%4ld%%",
                      all_hit, b1_hit, t1_hit, t2_hit, b2_hit);
+             elog(DEBUG1, "ARC clean buffers at LRU       T1=   %5d T2=   %5d",
+                     t1_clean, t2_clean);
              error_context_stack = errcxtold;

              StrategyControl->num_lookup = 0;
Index: src/backend/storage/lmgr/proc.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/storage/lmgr/proc.c,v
retrieving revision 1.136
diff -c -r1.136 proc.c
*** src/backend/storage/lmgr/proc.c    2003/10/16 20:59:35    1.136
--- src/backend/storage/lmgr/proc.c    2003/11/13 18:03:24
***************
*** 71,76 ****
--- 71,77 ----
  static PROC_HDR *ProcGlobal = NULL;

  static PGPROC *DummyProc = NULL;
+ static int    dummy_proc_type = -1;

  static bool waitingForLock = false;
  static bool waitingForSignal = false;
***************
*** 163,176 ****
           * processes, too.    This does not get linked into the freeProcs
           * list.
           */
!         DummyProc = (PGPROC *) ShmemAlloc(sizeof(PGPROC));
          if (!DummyProc)
              ereport(FATAL,
                      (errcode(ERRCODE_OUT_OF_MEMORY),
                       errmsg("out of shared memory")));
!         MemSet(DummyProc, 0, sizeof(PGPROC));
!         DummyProc->pid = 0;        /* marks DummyProc as not in use */
!         PGSemaphoreCreate(&DummyProc->sem);

          /* Create ProcStructLock spinlock, too */
          ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
--- 164,180 ----
           * processes, too.    This does not get linked into the freeProcs
           * list.
           */
!         DummyProc = (PGPROC *) ShmemAlloc(sizeof(PGPROC) * NUM_DUMMY_PROCS);
          if (!DummyProc)
              ereport(FATAL,
                      (errcode(ERRCODE_OUT_OF_MEMORY),
                       errmsg("out of shared memory")));
!         MemSet(DummyProc, 0, sizeof(PGPROC) * NUM_DUMMY_PROCS);
!         for (i = 0; i < NUM_DUMMY_PROCS; i++)
!         {
!             DummyProc[i].pid = 0;        /* marks DummyProc as not in use */
!             PGSemaphoreCreate(&(DummyProc[i].sem));
!         }

          /* Create ProcStructLock spinlock, too */
          ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
***************
*** 270,277 ****
   * sema that are assigned are the extra ones created during InitProcGlobal.
   */
  void
! InitDummyProcess(void)
  {
      /*
       * ProcGlobal should be set by a previous call to InitProcGlobal (we
       * inherit this by fork() from the postmaster).
--- 274,283 ----
   * sema that are assigned are the extra ones created during InitProcGlobal.
   */
  void
! InitDummyProcess(int proctype)
  {
+     PGPROC    *dummyproc;
+
      /*
       * ProcGlobal should be set by a previous call to InitProcGlobal (we
       * inherit this by fork() from the postmaster).
***************
*** 282,293 ****
      if (MyProc != NULL)
          elog(ERROR, "you already exist");

      /*
!      * DummyProc should not presently be in use by anyone else
       */
!     if (DummyProc->pid != 0)
!         elog(FATAL, "DummyProc is in use by PID %d", DummyProc->pid);
!     MyProc = DummyProc;

      /*
       * Initialize all fields of MyProc, except MyProc->sem which was set
--- 288,304 ----
      if (MyProc != NULL)
          elog(ERROR, "you already exist");

+     Assert(dummy_proc_type < 0);
+     dummy_proc_type = proctype;
+     dummyproc = &DummyProc[proctype];
+
      /*
!      * dummyproc should not presently be in use by anyone else
       */
!     if (dummyproc->pid != 0)
!         elog(FATAL, "DummyProc[%d] is in use by PID %d",
!                 proctype, dummyproc->pid);
!     MyProc = dummyproc;

      /*
       * Initialize all fields of MyProc, except MyProc->sem which was set
***************
*** 310,316 ****
      /*
       * Arrange to clean up at process exit.
       */
!     on_shmem_exit(DummyProcKill, 0);

      /*
       * We might be reusing a semaphore that belonged to a failed process.
--- 321,327 ----
      /*
       * Arrange to clean up at process exit.
       */
!     on_shmem_exit(DummyProcKill, proctype);

      /*
       * We might be reusing a semaphore that belonged to a failed process.
***************
*** 446,453 ****
  static void
  DummyProcKill(void)
  {
!     Assert(MyProc != NULL && MyProc == DummyProc);

      /* Release any LW locks I am holding */
      LWLockReleaseAll();

--- 457,470 ----
  static void
  DummyProcKill(void)
  {
!     PGPROC    *dummyproc;

+     Assert(dummy_proc_type >= 0 && dummy_proc_type < NUM_DUMMY_PROCS);
+
+     dummyproc = &DummyProc[dummy_proc_type];
+
+     Assert(MyProc != NULL && MyProc == dummyproc);
+
      /* Release any LW locks I am holding */
      LWLockReleaseAll();

***************
*** 463,468 ****
--- 480,487 ----

      /* PGPROC struct isn't mine anymore */
      MyProc = NULL;
+
+     dummy_proc_type = -1;
  }


Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/utils/misc/guc.c,v
retrieving revision 1.169
diff -c -r1.169 guc.c
*** src/backend/utils/misc/guc.c    2003/11/13 14:57:15    1.169
--- src/backend/utils/misc/guc.c    2003/11/13 19:40:10
***************
*** 74,79 ****
--- 74,82 ----
  extern int    CommitSiblings;
  extern char *preload_libraries_string;
  extern int    BufferStrategyStatInterval;
+ extern int    BgWriterDelay;
+ extern int    BgWriterPercent;
+ extern int    BgWriterMaxpages;

  #ifdef HAVE_SYSLOG
  extern char *Syslog_facility;
***************
*** 1198,1203 ****
--- 1201,1233 ----
          },
          &BufferStrategyStatInterval,
          0, 0, 600, NULL, NULL
+     },
+
+     {
+         {"bgwriter_delay", PGC_SIGHUP, RESOURCES,
+             gettext_noop("Background writer sleep time between rounds in milliseconds"),
+             NULL
+         },
+         &BgWriterDelay,
+         500, 10, 5000, NULL, NULL
+     },
+
+     {
+         {"bgwriter_percent", PGC_SIGHUP, RESOURCES,
+             gettext_noop("Background writer percentage of dirty buffers to flush per round"),
+             NULL
+         },
+         &BgWriterPercent,
+         0, 0, 100, NULL, NULL
+     },
+
+     {
+         {"bgwriter_maxpages", PGC_SIGHUP, RESOURCES,
+             gettext_noop("Background writer maximum number of pages to flush per round"),
+             NULL
+         },
+         &BgWriterMaxpages,
+         100, 1, 1000, NULL, NULL
      },

      /* End-of-list marker */
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.95
diff -c -r1.95 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample    2003/11/13 14:57:15    1.95
--- src/backend/utils/misc/postgresql.conf.sample    2003/11/13 21:20:03
***************
*** 60,65 ****
--- 60,70 ----
  #vacuum_mem = 8192        # min 1024, size in KB
  #buffer_strategy_status_interval = 0    # 0-600 seconds

+ # - Background writer -
+ #bgwriter_delay = 500        # 10-5000 milliseconds
+ #bgwriter_percent = 0        # 0-100% of dirty buffers
+ #bgwriter_maxpages = 100    # 1-1000 buffers max at once
+
  # - Free Space Map -

  #max_fsm_pages = 20000        # min max_fsm_relations*16, 6 bytes each
Index: src/include/bootstrap/bootstrap.h
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/include/bootstrap/bootstrap.h,v
retrieving revision 1.31
diff -c -r1.31 bootstrap.h
*** src/include/bootstrap/bootstrap.h    2003/08/04 02:40:10    1.31
--- src/include/bootstrap/bootstrap.h    2003/11/13 16:19:07
***************
*** 59,64 ****
  #define BS_XLOG_BOOTSTRAP    1
  #define BS_XLOG_STARTUP        2
  #define BS_XLOG_CHECKPOINT    3
! #define BS_XLOG_SHUTDOWN    4

  #endif   /* BOOTSTRAP_H */
--- 59,65 ----
  #define BS_XLOG_BOOTSTRAP    1
  #define BS_XLOG_STARTUP        2
  #define BS_XLOG_CHECKPOINT    3
! #define BS_XLOG_BGWRITER    4
! #define BS_XLOG_SHUTDOWN    5

  #endif   /* BOOTSTRAP_H */
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/include/storage/bufmgr.h,v
retrieving revision 1.70
diff -c -r1.70 bufmgr.h
*** src/include/storage/bufmgr.h    2003/08/10 19:48:08    1.70
--- src/include/storage/bufmgr.h    2003/11/13 17:12:15
***************
*** 37,42 ****
--- 37,47 ----
  extern DLLIMPORT Block *LocalBufferBlockPointers;
  extern long *LocalRefCount;

+ /* in postmaster.c ... they don't belong here */
+ extern int    BgWriterDelay;
+ extern int    BgWriterPercent;
+ extern int    BgWriterMaxpages;
+
  /* special pageno for bget */
  #define P_NEW    InvalidBlockNumber        /* grow the file to get a new page */

***************
*** 186,192 ****
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern void BufferSync(void);

  extern void InitLocalBuffer(void);

--- 191,198 ----
  extern void AbortBufferIO(void);

  extern void BufmgrCommit(void);
! extern int    BufferSync(int percent, int maxpages);
! extern void BufferBackgroundWriter(void);

  extern void InitLocalBuffer(void);

Index: src/include/storage/proc.h
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/include/storage/proc.h,v
retrieving revision 1.64
diff -c -r1.64 proc.h
*** src/include/storage/proc.h    2003/08/04 02:40:15    1.64
--- src/include/storage/proc.h    2003/11/13 17:55:02
***************
*** 86,91 ****
--- 86,96 ----
  } PROC_HDR;


+ #define    DUMMY_PROC_DEFAULT    0
+ #define    DUMMY_PROC_BGWRITER    1
+ #define    NUM_DUMMY_PROCS        2
+
+
  /* configurable options */
  extern int    DeadlockTimeout;
  extern int    StatementTimeout;
***************
*** 97,103 ****
  extern int    ProcGlobalSemas(int maxBackends);
  extern void InitProcGlobal(int maxBackends);
  extern void InitProcess(void);
! extern void InitDummyProcess(void);
  extern void ProcReleaseLocks(bool isCommit);

  extern void ProcQueueInit(PROC_QUEUE *queue);
--- 102,108 ----
  extern int    ProcGlobalSemas(int maxBackends);
  extern void InitProcGlobal(int maxBackends);
  extern void InitProcess(void);
! extern void InitDummyProcess(int proctype);
  extern void ProcReleaseLocks(bool isCommit);

  extern void ProcQueueInit(PROC_QUEUE *queue);

Re: Background writer process

From
Kurt Roeckx
Date:
On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:
> For sure the sync() needs to be replaced by the discussed fsync() of 
> recently written files. And I think the algorithm how much and how often 
> to flush can be significantly improved. But after all, this does not 
> change the real checkpointing at all, and the general framework having a 
> separate process is what we probably want.

Why is the sync() needed at all?  My understanding was that it
was only needed in case of a checkpoint.


Kurt



Re: Background writer process

From
Bruce Momjian
Date:
Kurt Roeckx wrote:
> On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:
> > For sure the sync() needs to be replaced by the discussed fsync() of 
> > recently written files. And I think the algorithm how much and how often 
> > to flush can be significantly improved. But after all, this does not 
> > change the real checkpointing at all, and the general framework having a 
> > separate process is what we probably want.
> 
> Why is the sync() needed at all?  My understanding was that it
> was only needed in case of a checkpoint.

He found that write() itself didn't encourage the kernel to write the
buffers to disk fast enough.  I think the final solution will be to use
fsync or O_SYNC.
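
For comparison, the O_SYNC alternative mentioned here moves the flush into each write() call. The following is a minimal POSIX sketch, not code from the patch; the helper name open_synchronous is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open (or create) a file so that every write()
 * on the returned descriptor completes only after the operating system
 * reports the data as written to the underlying storage.
 */
static int
open_synchronous(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
}
```

The trade-off against fsync() is that each individual write() waits for the I/O, so the kernel gets less opportunity to batch and reorder the writes.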

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
 


Re: Background writer process

From
Jan Wieck
Date:
Bruce Momjian wrote:

> Kurt Roeckx wrote:
>> On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:
>> > For sure the sync() needs to be replaced by the discussed fsync() of 
>> > recently written files. And I think the algorithm how much and how often 
>> > to flush can be significantly improved. But after all, this does not 
>> > change the real checkpointing at all, and the general framework having a 
>> > separate process is what we probably want.
>> 
>> Why is the sync() needed at all?  My understanding was that it
>> was only needed in case of a checkpoint.
> 
> He found that write() itself didn't encourage the kernel to write the
> buffers to disk fast enough.  I think the final solution will be to use
> fsync or O_SYNC.
> 

write() alone doesn't encourage the kernel to do any physical IO at all. 
As long as you have enough OS buffers, it does happy write caching until 
you checkpoint and sync(), and then the system freezes.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: Background writer process

From
Bruce Momjian
Date:
Jan Wieck wrote:
> Bruce Momjian wrote:
> 
> > Kurt Roeckx wrote:
> >> On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:
> >> > For sure the sync() needs to be replaced by the discussed fsync() of 
> >> > recently written files. And I think the algorithm how much and how often 
> >> > to flush can be significantly improved. But after all, this does not 
> >> > change the real checkpointing at all, and the general framework having a 
> >> > separate process is what we probably want.
> >> 
> >> Why is the sync() needed at all?  My understanding was that it
> >> was only needed in case of a checkpoint.
> > 
> > He found that write() itself didn't encourage the kernel to write the
> > buffers to disk fast enough.  I think the final solution will be to use
> > fsync or O_SYNC.
> > 
> 
> write() alone doesn't encourage the kernel to do any physical IO at all. 
> As long as you have enough OS buffers, it does happy write caching until 
> you checkpoint and sync(), and then the system freezes.

That's not completely true.  Some kernels have trickle sync, meaning
they sync a little bit regularly rather than all at once, so write() does
help get those shared buffers into the kernel for possible writing.
Also, it is possible the kernel will issue a sync() on its own.

 


Re: Background writer process

From
Kurt Roeckx
Date:
On Thu, Nov 13, 2003 at 05:39:32PM -0500, Bruce Momjian wrote:
> Jan Wieck wrote:
> > Bruce Momjian wrote:
> > > He found that write() itself didn't encourage the kernel to write the
> > > buffers to disk fast enough.  I think the final solution will be to use
> > > fsync or O_SYNC.
> > 
> > write() alone doesn't encourage the kernel to do any physical IO at all. 
> > As long as you have enough OS buffers, it does happy write caching until 
> > you checkpoint and sync(), and then the system freezes.
> 
> That's not completely true.  Some kernels with trickle sync, meaning
> they sync a little bit regularly rather than all at once so write() does
> help get those shared buffers into the kernel for possible writing. 
> Also, it is possible the kernel will issue a sync() on its own.

So basically, on some kernels you want them to flush their dirty
buffers faster.

I have a feeling that how we ask the kernel not to keep dirty data in
memory too long should depend on the system, and that sync(), fsync()
or O_SYNC could be a fallback for when it's needed and there is no
better way of doing it.

Maybe something like posix_fadvise() might be useful too on systems
that have it?
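
As an illustration only (not code from the patch), the "hint where available, blocking flush as a fallback" idea could be sketched in C as below. The helper name encourage_writeback is hypothetical, and whether a kernel actually starts writing dirty pages in response to POSIX_FADV_DONTNEED is system-dependent, so this is an assumption to be measured rather than a guarantee:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the posted patch): ask the kernel not
 * to hold on to this file's cached data for long.
 * Returns 0 on success, nonzero on failure.
 */
int
encourage_writeback(int fd)
{
    assert(fd >= 0);
#if defined(POSIX_FADV_DONTNEED)
    /* Advisory only; offset 0 with length 0 covers the whole file. */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
#else
    /* Fallback where the hint is unavailable: a blocking flush. */
    return fsync(fd);
#endif
}
```

The compile-time check keeps the choice per platform, which matches the suggestion that the mechanism should depend on the system.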


Kurt



Re: Background writer process

From
Jan Wieck
Date:
Kurt Roeckx wrote:
> On Thu, Nov 13, 2003 at 05:39:32PM -0500, Bruce Momjian wrote:
>> Jan Wieck wrote:
>> > Bruce Momjian wrote:
>> > > He found that write() itself didn't encourage the kernel to write the
>> > > buffers to disk fast enough.  I think the final solution will be to use
>> > > fsync or O_SYNC.
>> > 
>> > write() alone doesn't encourage the kernel to do any physical IO at all. 
>> > As long as you have enough OS buffers, it does happy write caching until 
>> > you checkpoint and sync(), and then the system freezes.
>> 
>> That's not completely true.  Some kernels with trickle sync, meaning
>> they sync a little bit regularly rather than all at once so write() does
>> help get those shared buffers into the kernel for possible writing. 
>> Also, it is possible the kernel will issue a sync() on its own.
> 
> So basicly on some kernels you want them to flush their dirty
> buffers faster.
> 
> I have a feeling we should more make it depend on the system how
> we ask them not to keep it in memory too long and that maybe the
> sync(), fsync() or O_SYNC could be a fallback in case it's needed
> and there are no better ways of doing it.
> 
> Maybe something as posix_fadvise() might be useful too on systems
> that have it?

That is all fine and, as said, how often, how much and how forcefully we
do the IO can all be configurable and as flexible as people see fit. But
whether you use sync(), fsync(), fdatasync(), O_SYNC, O_DSYNC or
posix_fadvise(), somewhere you have to do the write(). And that write
has to be coordinated with the buffer cache replacement strategy so that
you write those buffers that are likely to be replaced soon, and don't
write those that the strategy intends to keep for longer anyway. Except
at a checkpoint, when you have to write whatever is dirty.

The patch I posted does this write() in coordination with the strategy
in a separate background process, so that the regular backends don't
have to write under normal circumstances (there are some places in DDL
statements that call BufferSync(); those are exceptions IMHO). Can we agree
on this general outline? Or do we have any better proposals?
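
For illustration, the per-round quota described in the first message of this thread ("flush n percent of that list or a maximum of m buffers, whatever is smaller") could be computed roughly as follows. The function name and the round-up behaviour are assumptions made for this sketch and are not taken from the patch:

```c
#include <assert.h>

/*
 * Hypothetical helper: how many buffers one background-writer round
 * should write, given the length of the dirty list (in replacement
 * order), a percentage n and a hard cap m.
 */
int
buffers_to_flush(int num_dirty, int percent, int max_buffers)
{
    assert(num_dirty >= 0 && max_buffers >= 0);
    assert(percent >= 0 && percent <= 100);

    /*
     * n percent of the list. Rounding up is an assumption of this
     * sketch, so that a short dirty list still makes progress.
     */
    int by_percent = (int) (((long) num_dirty * percent + 99) / 100);

    /* "whatever is smaller" */
    return by_percent < max_buffers ? by_percent : max_buffers;
}
```

With 1000 dirty buffers, n = 10 and m = 50, the cap wins and 50 buffers are written; with n = 1 the percentage wins and 10 are written. Because the list is in replacement order, taking the first k entries writes the buffers most likely to be evicted next.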


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: Background writer process

From
Bruce Momjian
Date:
Kurt Roeckx wrote:
> On Thu, Nov 13, 2003 at 05:39:32PM -0500, Bruce Momjian wrote:
> > Jan Wieck wrote:
> > > Bruce Momjian wrote:
> > > > He found that write() itself didn't encourage the kernel to write the
> > > > buffers to disk fast enough.  I think the final solution will be to use
> > > > fsync or O_SYNC.
> > > 
> > > write() alone doesn't encourage the kernel to do any physical IO at all. 
> > > As long as you have enough OS buffers, it does happy write caching until 
> > > you checkpoint and sync(), and then the system freezes.
> > 
> > That's not completely true.  Some kernels with trickle sync, meaning
> > they sync a little bit regularly rather than all at once so write() does
> > help get those shared buffers into the kernel for possible writing. 
> > Also, it is possible the kernel will issue a sync() on its own.
> 
> So basicly on some kernels you want them to flush their dirty
> buffers faster.
> 
> I have a feeling we should more make it depend on the system how
> we ask them not to keep it in memory too long and that maybe the
> sync(), fsync() or O_SYNC could be a fallback in case it's needed
> and there are no better ways of doing it.

I think the final plan is to have a GUC variable that controls how the
kernel is _encouraged_ to write dirty buffers to disk.

 


Re: Background writer process

From
Bruce Momjian
Date:
Jan Wieck wrote:
> That is all right and as said, how often, how much and how forced we do 
> the IO can all be configurable and as flexible as people see fit. But 
> whether you use sync(), fsync(), fdatasync(), O_SYNC, O_DSYNC or 
> posix_fadvise(), somewhere you have to do the write(). And that write 
> has to be coordinated with the buffer cache replacement strategy so that 
> you write those buffers that are likely to be replaced soon, and don't 
> write those that the strategy thinks keeping for longer anyway. Except 
> at a checkpoint, then you have to write whatever is dirty.
> 
> The patch I posted does this write() in coordination with the strategy 
> in a separate background process, so that the regular backends don't 
> have to write under normal circumstances (there are some places in DDL 
> statements that call BufferSync(), that's exceptions IMHO). Can we agree 
> on this general outline? Or do we have any better proposals?

Agreed.  Background write() is a win on all OS's.  It is just the
kernel-to-disk part that we will have to make configurable, I think.

 


Re: Background writer process

From
Shridhar Daithankar
Date:
On Friday 14 November 2003 03:05, Jan Wieck wrote:
> For sure the sync() needs to be replaced by the discussed fsync() of
> recently written files. And I think the algorithm how much and how often
> to flush can be significantly improved. But after all, this does not
> change the real checkpointing at all, and the general framework having a
> separate process is what we probably want.

Would fsync for regular data files and sync for WAL segments be a
comfortable compromise?  Or is this going to use fsync for all of them?

IMO, with fsync, we tell the kernel that it can write this buffer. It may or
may not write it immediately, unless it is a hard sync.

Since postgresql can afford lazy writes for data files, I think this could
work.

Just a thought..
Shridhar



Re: Background writer process

From
Jan Wieck
Date:
Shridhar Daithankar wrote:

> On Friday 14 November 2003 03:05, Jan Wieck wrote:
>> For sure the sync() needs to be replaced by the discussed fsync() of
>> recently written files. And I think the algorithm how much and how often
>> to flush can be significantly improved. But after all, this does not
>> change the real checkpointing at all, and the general framework having a
>> separate process is what we probably want.
> 
> Having fsync for regular data files and sync for WAL segment a comfortable 
> compramise?  Or this is going to use fsync for all of them.
> 
> IMO, with fsync, we tell kernel that you can write this buffer. It may or may 
> not write it immediately, unless it is hard sync. 

I think it's more the other way around. On some systems sync() might 
return before all buffers are flushed to disk, while fsync() does not.

> 
> Since postgresql can afford lazy writes for data files, I think this could 
> work.

The whole point of a checkpoint is to know for certain that a specific 
change is in the datafile, so that it is safe to throw away older WAL 
segments.


Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: Background writer process

From
Bruce Momjian
Date:
Shridhar Daithankar wrote:
> On Friday 14 November 2003 03:05, Jan Wieck wrote:
> > For sure the sync() needs to be replaced by the discussed fsync() of
> > recently written files. And I think the algorithm how much and how often
> > to flush can be significantly improved. But after all, this does not
> > change the real checkpointing at all, and the general framework having a
> > separate process is what we probably want.
> 
> Having fsync for regular data files and sync for WAL segment a comfortable 
> compramise?  Or this is going to use fsync for all of them.

I think we still need sync() for WAL because sometimes backends are
going to have to write their own buffers, and we don't want them using
fsync or it will be very slow.

> IMO, with fsync, we tell kernel that you can write this buffer. It may or may 
> not write it immediately, unless it is hard sync. 
> 
> Since postgresql can afford lazy writes for data files, I think this could 
> work.

fsync() doesn't return until the data is on the disk.  It doesn't
schedule the write then return, as far as I know.  sync() does schedule
the writes, I think, which can be bad, but we delay a little to wait for
it to complete.

 


Re: Background writer process

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Shridhar Daithankar wrote:
>> Having fsync for regular data files and sync for WAL segment a comfortable 
>> compramise?  Or this is going to use fsync for all of them.

> I think we still need sync() for WAL because sometimes backends are
> going to have to write their own buffers, and we don't want them using
> fsync or it will be very slow.

sync() for WAL is a complete nonstarter, because it gives you no
guarantees at all about whether the write has occurred.  I don't really
care what you say about speed; this is a correctness point.
        regards, tom lane


Re: Background writer process

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Shridhar Daithankar wrote:
> >> Having fsync for regular data files and sync for WAL segment a comfortable 
> >> compramise?  Or this is going to use fsync for all of them.
> 
> > I think we still need sync() for WAL because sometimes backends are
> > going to have to write their own buffers, and we don't want them using
> > fsync or it will be very slow.
> 
> sync() for WAL is a complete nonstarter, because it gives you no
> guarantees at all about whether the write has occurred.  I don't really
> care what you say about speed; this is a correctness point.

Sorry, I meant sync() is needed for recycling WAL (checkpoint), not for
WAL writes.  I assume that's what Shridhar meant, but now I am not sure.

 


Re: Background writer process

From
Shridhar Daithankar
Date:
On Friday 14 November 2003 22:10, Jan Wieck wrote:
> Shridhar Daithankar wrote:
> > On Friday 14 November 2003 03:05, Jan Wieck wrote:
> >> For sure the sync() needs to be replaced by the discussed fsync() of
> >> recently written files. And I think the algorithm how much and how often
> >> to flush can be significantly improved. But after all, this does not
> >> change the real checkpointing at all, and the general framework having a
> >> separate process is what we probably want.
> >
> > Having fsync for regular data files and sync for WAL segment a
> > comfortable compramise?  Or this is going to use fsync for all of them.
> >
> > IMO, with fsync, we tell kernel that you can write this buffer. It may or
> > may not write it immediately, unless it is hard sync.
>
> I think it's more the other way around. On some systems sync() might
> return before all buffers are flushed to disk, while fsync() does not.

Oops.. that's bad.

> > Since postgresql can afford lazy writes for data files, I think this
> > could work.
>
> The whole point of a checkpoint is to know for certain that a specific
> change is in the datafile, so that it is safe to throw away older WAL
> segments.

I just made another posting on patches for a thread crossing win32-devel.

Essentially I said

1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC (not sure if the current code
does it; the hackery in xlog.c is not exactly trivial).
2. Open data files normally and fsync them only in the background writer process.

Now the BGWriter process will flush everything at checkpoint time. It
does not need to flush WAL because of O_SYNC (ideally, but an additional fsync
won't hurt). So it just flushes all the file descriptors touched since the last
checkpoint, which should not be much of a load because it is flushing those
files intermittently anyway.
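
A minimal sketch of the bookkeeping this implies, assuming a small fixed-size table of descriptors: each file written since the last flush is remembered once, and the flush step fsyncs each of them. The type and function names (TouchedFiles, touched_add, touched_flush) are hypothetical and do not come from the patch or from xlog.c:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>      /* open() flags for callers of this sketch */
#include <unistd.h>

#define MAX_TOUCHED 64  /* arbitrary table size for the sketch */

/* Hypothetical bookkeeping: descriptors written since the last flush. */
typedef struct
{
    int fds[MAX_TOUCHED];
    int count;
} TouchedFiles;

/* Remember fd once. Returns 0 on success, -1 if the table is full. */
int
touched_add(TouchedFiles *t, int fd)
{
    assert(t != NULL && fd >= 0);

    for (int i = 0; i < t->count; i++)
        if (t->fds[i] == fd)
            return 0;           /* already recorded */

    if (t->count == MAX_TOUCHED)
        return -1;

    t->fds[t->count++] = fd;
    return 0;
}

/*
 * fsync every remembered descriptor, then forget them all.
 * Returns the number of files synced, or -1 on the first failure
 * (in which case the table is left untouched).
 */
int
touched_flush(TouchedFiles *t)
{
    assert(t != NULL);

    int synced = 0;

    for (int i = 0; i < t->count; i++)
    {
        if (fsync(t->fds[i]) != 0)
            return -1;
        synced++;
    }
    t->count = 0;
    return synced;
}
```

A table private to one process only covers files that process wrote itself; covering writes made by other backends would need the table in shared memory, which this sketch does not attempt.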

It could also work nicely if only the background writer fsyncs the data files.
Backends can either wait or proceed to other business while the disk is
being flushed. Backends certainly need to wait while committing, and syncing
to disk in the current process, as opposed to in the background process,
should be a rather small delay.

In case of a commit, BGWriter could get away with syncing the files touched
in the transaction plus WAL, as opposed to all files touched since the last
checkpoint plus WAL in case of a checkpoint. I don't know how difficult that
would be.

What is different in the current BGWriter implementation? Use of sync()?
Shridhar



Re: Background writer process

From
"Zeugswetter Andreas SB SD"
Date:
> 1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if

Without grouping WAL writes that does not fly. Iff, however, such grouping
is implemented, that should deliver optimal performance. I don't think flushing
WAL to the OS early (before a tx commits) is necessary, since writing 8k or 256k
to disk with one call takes nearly the same time. The WAL write would need to be
done as soon as either 256k fills up or a txn commits.
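
The grouping rule in the last sentence can be sketched as a small accumulator that is flushed when 256k has filled or when a transaction commits, whichever comes first. This is an illustration under assumptions: the names are hypothetical, and wal_group_flush is only a stand-in that counts flushes where a real implementation would issue one large write to the WAL file:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WAL_GROUP_BYTES (256 * 1024)    /* threshold named in the message */

/* Hypothetical accumulator for grouped WAL records. */
typedef struct
{
    char   data[WAL_GROUP_BYTES];
    size_t used;        /* bytes currently buffered */
    int    flushes;     /* physical writes the group would have caused */
} WalGroup;

/* Stand-in for the single large write of everything buffered so far. */
static void
wal_group_flush(WalGroup *g)
{
    if (g->used == 0)
        return;                 /* nothing buffered, nothing to write */
    g->flushes++;
    g->used = 0;
}

/* Append one record, flushing first if it would not fit. */
void
wal_group_append(WalGroup *g, const void *rec, size_t len)
{
    assert(g != NULL && rec != NULL && len <= WAL_GROUP_BYTES);

    if (g->used + len > WAL_GROUP_BYTES)
        wal_group_flush(g);

    memcpy(g->data + g->used, rec, len);
    g->used += len;

    /* "as soon as 256k fills up" */
    if (g->used == WAL_GROUP_BYTES)
        wal_group_flush(g);
}

/* "or a txn commits": a commit forces out whatever has accumulated. */
void
wal_group_commit(WalGroup *g)
{
    assert(g != NULL);
    wal_group_flush(g);
}
```

Three 8k records followed by a commit cost one flush here rather than three, which is the saving the grouping argument relies on.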

Andreas


Re: Background writer process

From
Bruce Momjian
Date:
Shridhar Daithankar wrote:
> On Friday 14 November 2003 22:10, Jan Wieck wrote:
> > Shridhar Daithankar wrote:
> > > On Friday 14 November 2003 03:05, Jan Wieck wrote:
> > >> For sure the sync() needs to be replaced by the discussed fsync() of
> > >> recently written files. And I think the algorithm how much and how often
> > >> to flush can be significantly improved. But after all, this does not
> > >> change the real checkpointing at all, and the general framework having a
> > >> separate process is what we probably want.
> > >
> > > Having fsync for regular data files and sync for WAL segment a
> > > comfortable compromise?  Or this is going to use fsync for all of them.
> > >
> > > IMO, with fsync, we tell kernel that you can write this buffer. It may or
> > > may not write it immediately, unless it is hard sync.
> >
> > I think it's more the other way around. On some systems sync() might
> > return before all buffers are flushed to disk, while fsync() does not.
> 
> Oops.. that's bad.

Yes, one idea I had was to do an fsync on a new file _after_ issuing
sync, hoping that this will complete after all the sync buffers are
done.

> > > Since postgresql can afford lazy writes for data files, I think this
> > > could work.
> >
> > The whole point of a checkpoint is to know for certain that a specific
> > change is in the datafile, so that it is safe to throw away older WAL
> > segments.
> 
> I just made another posing on patches for a thread crossing win32-devel.
> 
> Essentially I said
> 
> 1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if current code does 
> it. The hackery in xlog.c is not exactly trivial.)

We write WAL, then fsync, so if we write multiple blocks, we can write
them and fsync once, rather than O_SYNC every write.
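
The "write several blocks, then fsync once" pattern might look like the sketch below. It is not the xlog.c code: the function name, the file handling and the block layout are assumptions chosen to keep the example self-contained:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write nblocks blocks of blksz bytes from buf to
 * the file at path, then force them to disk with a single fsync().
 * Returns 0 on success, -1 on failure.
 */
int
write_blocks_then_fsync(const char *path, const char *buf,
                        size_t blksz, int nblocks)
{
    assert(path != NULL && buf != NULL && nblocks >= 0);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    for (int i = 0; i < nblocks; i++)
    {
        /* Each write() normally returns once the data is in the OS cache. */
        ssize_t n = write(fd, buf + (size_t) i * blksz, blksz);

        if (n < 0 || (size_t) n != blksz)
        {
            close(fd);
            return -1;
        }
    }

    /* One blocking fsync() covers every block written above. */
    int rc = fsync(fd);

    if (close(fd) != 0)
        rc = -1;
    return rc == 0 ? 0 : -1;
}
```

Opening the file with O_SYNC instead would make every write() in the loop wait for the disk, which is the per-write cost being contrasted here.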

> 2. Open data files normally and fsync them only in background writer process.
> 
> Now BGWriter process will flush everything at the time of checkpointing. It 
> does not need to flush WAL because of O_SYNC(ideally but an additional fsync 
> won't hurt). So it just flushes all the file descriptors touched since last 
> checkpoint, which should not be much of a load because it is flushing those 
> files intermittently anyways.
> 
> It could also work nicely if only background writer fsync the data files. 
> Backends can either wait or proceed to other business by the time disk is 
> flushed. Backends needs to wait for certain while committing and it should be 
> rather small delay of syncing to disk in current process as opposed to in  
> background process. 
> 
> In case of commit, BGWriter could get away with files touched in transaction
> +WAL as opposed to all files touched since last checkpoint+WAL in case of 
> checkpoint. I don't know how difficult that would be.
> 
> What is different in current BGwriter implementation? Use of sync()?

Well, basically we are still discussing how to do this.  Right now the
background writer patch uses sync(), but the final version will use fsync
or O_SYNC, or maybe nothing.

The open item is whether a background process can keep the dirty
buffers cleaned fast enough to keep up with the maximum number of
backends.  We might need to use multiple processes or threads to do
this.   We certainly will have a background writer in 7.5 --- the big
question is whether _all_ writes will go through it.   It certainly would
be nice if they could, and Tom thinks they can, so we are still exploring
this.

If the background writer uses fsync, it can write and allow the buffer
to be reused and fsync later, while if we use O_SYNC, we have to wait
for the O_SYNC write to happen before reusing the buffer;  that will be
slower.

Another open issue is, _if_ the background writer can't keep up with the
normal backends, do we allow normal backends to write dirty buffers, and
do they use fsync(), or can we record the file in a shared area and have
the background writer do the fsync?  This is the issue of whether one
process can fsync all dirty buffers for the file or just the buffers it
wrote.

I think these are the basics of the current discussion.

 


Re: Background writer process

From
Shridhar Daithankar
Date:
Bruce Momjian wrote:

> Shridhar Daithankar wrote:
> 
>>On Friday 14 November 2003 22:10, Jan Wieck wrote:
>>
>>>Shridhar Daithankar wrote:
>>>
>>>>On Friday 14 November 2003 03:05, Jan Wieck wrote:
>>>>
>>>>>For sure the sync() needs to be replaced by the discussed fsync() of
>>>>>recently written files. And I think the algorithm how much and how often
>>>>>to flush can be significantly improved. But after all, this does not
>>>>>change the real checkpointing at all, and the general framework having a
>>>>>separate process is what we probably want.
>>>>
>>>>Having fsync for regular data files and sync for WAL segment a
>>>>comfortable compromise?  Or this is going to use fsync for all of them.
>>>>
>>>>IMO, with fsync, we tell kernel that you can write this buffer. It may or
>>>>may not write it immediately, unless it is hard sync.
>>>
>>>I think it's more the other way around. On some systems sync() might
>>>return before all buffers are flushed to disk, while fsync() does not.
>>
>>Oops.. that's bad.
> 
> 
> Yes, one I idea I had was to do an fsync on a new file _after_ issuing
> sync, hoping that this will complete after all the sync buffers are
> done.
> 
> 
>>>>Since postgresql can afford lazy writes for data files, I think this
>>>>could work.
>>>
>>>The whole point of a checkpoint is to know for certain that a specific
>>>change is in the datafile, so that it is safe to throw away older WAL
>>>segments.
>>
>>I just made another posing on patches for a thread crossing win32-devel.
>>
>>Essentially I said
>>
>>1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if current code does 
>>it. The hackery in xlog.c is not exactly trivial.)
> 
> 
> We write WAL, then fsync, so if we write multiple blocks, we can write
> them and fsync once, rather than O_SYNC every write.
> 
> 
>>2. Open data files normally and fsync them only in background writer process.
>>
>>Now BGWriter process will flush everything at the time of checkpointing. It 
>>does not need to flush WAL because of O_SYNC(ideally but an additional fsync 
>>won't hurt). So it just flushes all the file descriptors touched since last 
>>checkpoint, which should not be much of a load because it is flushing those 
>>files intermittently anyways.
>>
>>It could also work nicely if only background writer fsync the data files. 
>>Backends can either wait or proceed to other business by the time disk is 
>>flushed. Backends needs to wait for certain while committing and it should be 
>>rather small delay of syncing to disk in current process as opposed to in  
>>background process. 
>>
>>In case of commit, BGWriter could get away with files touched in transaction
>>+WAL as opposed to all files touched since last checkpoint+WAL in case of 
>>checkpoint. I don't know how difficult that would be.
>>
>>What is different in current BGwriter implementation? Use of sync()?
> 
> 
> Well, basically we are still discussing how to do this.  Right now the
> backend writer patch uses sync(), but the final version will use fsync
> or O_SYNC, or maybe nothing.
> 
> The open items are whether a background process can keep the dirty
> buffers cleaned fast enough to keep up with the maximum number of
> backends.  We might need to use multiple processes or threads to do
> this.   We certainly will have a background writer in 7.5 --- the big
> question is whether _all_ write will go through it.   It certainly would
> be nice if it could, and Tom thinks it can, so we are still exploring
> this.

Given that fsync is blocking, the background writer has to scale up in terms of
processes/threads and load w.r.t. disk flushing.

I would vote for threads for the simple reason that, in BGWriter, threads are
needed only to flush the files: get the fd, fsync it and get the next one. No need to
make the entire process thread-safe.

Furthermore, BGWriter has to detect the disk limit. If adding threads does not
improve fsyncing speed, it should stop adding them and wait. There is nothing to
do when the disk is saturated.

> If the background writer uses fsync, it can write and allow the buffer
> to be reused and fsync later, while if we use O_SYNC, we have to wait
> for the O_SYNC write to happen before reusing the buffer;  that will be
> slower.

Certainly. However, a file opened with O_SYNC would not require a separate fsync. I
suggested it only for WAL. But with WAL block grouping as suggested in another
post, using fsync for all files might be a good idea.

Just a thought.
 Shridhar



Re: Background writer process

From
"Zeugswetter Andreas SB SD"
Date:
> If the background writer uses fsync, it can write and allow the buffer
> to be reused and fsync later, while if we use O_SYNC, we have to wait
> for the O_SYNC write to happen before reusing the buffer;
> that will be slower.

You can forget O_SYNC for datafiles for now. There would simply be too much to
do currently to allow decent performance, like scatter/gather IO, ...
IMHO the reasonable target should be to write from all backends but sync (fsync)
from the background writer only. (Tune the OS if it actually waits until the
pg-invoked sync (== 5 minutes by default).)

Andreas


Re: Background writer process

From
Shridhar Daithankar
Date:
Zeugswetter Andreas SB SD wrote:
>>1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if 
> Without grouping WAL writes that does not fly. Iff however such grouping
> is implemented that should deliver optimal performance. I don't think flushing 
> WAL to the OS early (before a tx commits) is necessary, since writing 8k or 256k 
> to disk with one call takes nearly the same time. The WAL write would need to be 
> done as soon as eighter 256k fill or a txn commits.

Does that mean no special treatment for WAL files? If it works, great. There would be
a single class of files to take care of w.r.t. the sync issue. Even simpler.
 Shridhar



Re: Background writer process

From
"Zeugswetter Andreas SB SD"
Date:
> >>1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if
> > Without grouping WAL writes that does not fly. Iff however such grouping
> > is implemented that should deliver optimal performance. I don't think flushing
> > WAL to the OS early (before a tx commits) is necessary, since writing 8k or 256k
> > to disk with one call takes nearly the same time. The WAL write would need to be
> > done as soon as eighter 256k fill or a txn commits.
>
> That means no special treatment to WAL files? If it works, great. There would be
> single class of files to take care w.r.t sync. issue. Even more simpler.

No, WAL needs special handling. Either leave it as is with write + f[data]sync,
or implement O_SYNC|O_DIRECT with grouping of writes (the current O_SYNC implementation
is only good for small (<8kb) transactions).

Andreas