[HACKERS] Reducing pg_ctl's reaction time

Поиск
Список
Период
Сортировка
От Tom Lane
Тема [HACKERS] Reducing pg_ctl's reaction time
Дата
Msg-id 18444.1498428798@sss.pgh.pa.us
обсуждение исходный текст
Ответы Re: [HACKERS] Reducing pg_ctl's reaction time  (Michael Paquier <michael.paquier@gmail.com>)
Список pgsql-hackers
I still have a bee in my bonnet about how slow the recovery TAP tests
are, and especially about how low the CPU usage is while they run,
suggesting that a lot of the wall clock time is being expended on
useless sleeps.  Some analysis I did today found some low-hanging fruit
there: a significant part of the time is being spent in pg_ctl waiting
for postmaster start/stop operations.  It waits in quanta of 1 second,
but the postmaster usually starts or stops in much less than that.
(In these tests, most of the shutdown checkpoints have little to do,
so that in many cases the postmaster stops in just a couple of msec.
Startup isn't very many msec either.)

The attached proposed patch adjusts pg_ctl to check every 100msec,
instead of every second, for the postmaster to be done starting or
stopping.  This cuts the runtime of the recovery TAP tests from around
4m30s to around 3m10s on my machine, a good 30% savings.  I experimented
with different check frequencies but there doesn't seem to be anything
to be gained by cutting the delay further (and presumably, it becomes
counterproductive at some point due to pg_ctl chewing too many cycles).

This change probably doesn't offer much real-world benefit, since few
people start/stop their postmasters often, and shutdown checkpoints are
seldom so cheap on production installations.  Still, it doesn't seem
like it could hurt.  The original choice to use once-per-second checks
was made for hardware a lot slower than what most of us use now; I do
not think it's been revisited since the first implementation of pg_ctl
in 1999 (cf 5b912b089).

Barring objections I'd like to push this soon.

            regards, tom lane

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 82ac62d..1e6cb69 100644
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
*************** typedef enum
*** 68,73 ****
--- 68,75 ----

  #define DEFAULT_WAIT    60

+ #define WAITS_PER_SEC    10        /* should divide 1000000 evenly */
+
  static bool do_wait = true;
  static int    wait_seconds = DEFAULT_WAIT;
  static bool wait_seconds_arg = false;
*************** test_postmaster_connection(pgpid_t pm_pi
*** 531,537 ****

      connstr[0] = '\0';

!     for (i = 0; i < wait_seconds; i++)
      {
          /* Do we need a connection string? */
          if (connstr[0] == '\0')
--- 533,539 ----

      connstr[0] = '\0';

!     for (i = 0; i < wait_seconds * WAITS_PER_SEC; i++)
      {
          /* Do we need a connection string? */
          if (connstr[0] == '\0')
*************** test_postmaster_connection(pgpid_t pm_pi
*** 701,724 ****
  #endif

          /* No response, or startup still in process; wait */
! #ifdef WIN32
!         if (do_checkpoint)
          {
!             /*
!              * Increment the wait hint by 6 secs (connection timeout + sleep)
!              * We must do this to indicate to the SCM that our startup time is
!              * changing, otherwise it'll usually send a stop signal after 20
!              * seconds, despite incrementing the checkpoint counter.
!              */
!             status.dwWaitHint += 6000;
!             status.dwCheckPoint++;
!             SetServiceStatus(hStatus, (LPSERVICE_STATUS) &status);
!         }
!         else
  #endif
!             print_msg(".");

!         pg_usleep(1000000);        /* 1 sec */
      }

      /* return result of last call to PQping */
--- 703,730 ----
  #endif

          /* No response, or startup still in process; wait */
!         if (i % WAITS_PER_SEC == 0)
          {
! #ifdef WIN32
!             if (do_checkpoint)
!             {
!                 /*
!                  * Increment the wait hint by 6 secs (connection timeout +
!                  * sleep). We must do this to indicate to the SCM that our
!                  * startup time is changing, otherwise it'll usually send a
!                  * stop signal after 20 seconds, despite incrementing the
!                  * checkpoint counter.
!                  */
!                 status.dwWaitHint += 6000;
!                 status.dwCheckPoint++;
!                 SetServiceStatus(hStatus, (LPSERVICE_STATUS) &status);
!             }
!             else
  #endif
!                 print_msg(".");
!         }

!         pg_usleep(1000000 / WAITS_PER_SEC); /* 1/WAITS_PER_SEC sec */
      }

      /* return result of last call to PQping */
*************** do_stop(void)
*** 998,1009 ****

          print_msg(_("waiting for server to shut down..."));

!         for (cnt = 0; cnt < wait_seconds; cnt++)
          {
              if ((pid = get_pgpid(false)) != 0)
              {
!                 print_msg(".");
!                 pg_usleep(1000000); /* 1 sec */
              }
              else
                  break;
--- 1004,1016 ----

          print_msg(_("waiting for server to shut down..."));

!         for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
          {
              if ((pid = get_pgpid(false)) != 0)
              {
!                 if (cnt % WAITS_PER_SEC == 0)
!                     print_msg(".");
!                 pg_usleep(1000000 / WAITS_PER_SEC); /* 1/WAITS_PER_SEC sec */
              }
              else
                  break;
*************** do_restart(void)
*** 1088,1099 ****

          /* always wait for restart */

!         for (cnt = 0; cnt < wait_seconds; cnt++)
          {
              if ((pid = get_pgpid(false)) != 0)
              {
!                 print_msg(".");
!                 pg_usleep(1000000); /* 1 sec */
              }
              else
                  break;
--- 1095,1107 ----

          /* always wait for restart */

!         for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
          {
              if ((pid = get_pgpid(false)) != 0)
              {
!                 if (cnt % WAITS_PER_SEC == 0)
!                     print_msg(".");
!                 pg_usleep(1000000 / WAITS_PER_SEC); /* 1/WAITS_PER_SEC sec */
              }
              else
                  break;
*************** do_promote(void)
*** 1225,1241 ****
      if (do_wait)
      {
          DBState        state = DB_STARTUP;

          print_msg(_("waiting for server to promote..."));
!         while (wait_seconds > 0)
          {
              state = get_control_dbstate();
              if (state == DB_IN_PRODUCTION)
                  break;

!             print_msg(".");
!             pg_usleep(1000000); /* 1 sec */
!             wait_seconds--;
          }
          if (state == DB_IN_PRODUCTION)
          {
--- 1233,1250 ----
      if (do_wait)
      {
          DBState        state = DB_STARTUP;
+         int            cnt;

          print_msg(_("waiting for server to promote..."));
!         for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
          {
              state = get_control_dbstate();
              if (state == DB_IN_PRODUCTION)
                  break;

!             if (cnt % WAITS_PER_SEC == 0)
!                 print_msg(".");
!             pg_usleep(1000000 / WAITS_PER_SEC); /* 1/WAITS_PER_SEC sec */
          }
          if (state == DB_IN_PRODUCTION)
          {

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests
Следующее
От: Mark Rofail
Дата:
Сообщение: Re: [HACKERS] GSoC 2017: Foreign Key Arrays