Обсуждение: Rethinking placement of latch self-pipe initialization

Поиск
Список
Период
Сортировка

Rethinking placement of latch self-pipe initialization

От
Tom Lane
Дата:
Sean Chittenden recently reported that 9.2 can crash after logging
"FATAL: pipe() failed" if the kernel is short of file descriptors:
http://archives.postgresql.org/pgsql-general/2012-10/msg00202.php

The only match to that error text is in initSelfPipe().  What I believe
is happening is that InitProcess is calling OwnLatch which calls
initSelfPipe, and the latter fails, and then the postmaster thinks that
was a backend crash because we have armed the dead-man switch but not
set up on_shmem_exit(ProcKill) which would disarm it.

It's possible we could fix this by changing the order of operations
in InitProcess and OwnLatch, but it'd be tricky, not least because
ProcKill calls DisownLatch which asserts that OwnLatch was done.

What I think would be a better idea is to fix things so that OwnLatch
cannot fail except as a result of internal logic errors, by splitting
out the acquisition of the self-pipe into a separate function named say
InitializeLatchSupport.  The question then becomes where to put the
InitializeLatchSupport calls.  My first thought is to put them near the
signal-setup stanzas for the various processes (ie, the pqsignal calls)
similarly to what we did recently for initialization of timeout support.
However there might be a better idea.

Comments?
        regards, tom lane



Re: Rethinking placement of latch self-pipe initialization

От
Amit Kapila
Дата:
 On Sunday, October 07, 2012 10:58 PM Tom Lane wrote:
> Sean Chittenden recently reported that 9.2 can crash after logging
> "FATAL: pipe() failed" if the kernel is short of file descriptors:
> http://archives.postgresql.org/pgsql-general/2012-10/msg00202.php
> 
> The only match to that error text is in initSelfPipe().  What I believe
> is happening is that InitProcess is calling OwnLatch which calls
> initSelfPipe, and the latter fails, and then the postmaster thinks that
> was a backend crash because we have armed the dead-man switch but not
> set up on_shmem_exit(ProcKill) which would disarm it.
> 
> It's possible we could fix this by changing the order of operations
> in InitProcess and OwnLatch, but it'd be tricky, not least because
> ProcKill calls DisownLatch which asserts that OwnLatch was done.
 Why can't before Disowning, it can be checked that if it's already owned
or not. Something similar is done for lwlockreleaseall() where it checks if it
holds any locks  then only it frees it. Also similar is done in SyncRepCleanupAtProcExit().

> What I think would be a better idea is to fix things so that OwnLatch
> cannot fail except as a result of internal logic errors, by splitting
> out the acquisition of the self-pipe into a separate function named say
> InitializeLatchSupport.  The question then becomes where to put the
> InitializeLatchSupport calls.  My first thought is to put them near the
> signal-setup stanzas for the various processes (ie, the pqsignal calls)
> similarly to what we did recently for initialization of timeout support.
> However there might be a better idea.
> 
> Comments?

With Regards,
Amit Kapila.




Re: Rethinking placement of latch self-pipe initialization

От
Simon Riggs
Дата:
On 7 October 2012 18:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Sean Chittenden recently reported that 9.2 can crash after logging
> "FATAL: pipe() failed" if the kernel is short of file descriptors:
> http://archives.postgresql.org/pgsql-general/2012-10/msg00202.php
>
> The only match to that error text is in initSelfPipe().  What I believe
> is happening is that InitProcess is calling OwnLatch which calls
> initSelfPipe, and the latter fails, and then the postmaster thinks that
> was a backend crash because we have armed the dead-man switch but not
> set up on_shmem_exit(ProcKill) which would disarm it.
>
> It's possible we could fix this by changing the order of operations
> in InitProcess and OwnLatch, but it'd be tricky, not least because
> ProcKill calls DisownLatch which asserts that OwnLatch was done.
>
> What I think would be a better idea is to fix things so that OwnLatch
> cannot fail except as a result of internal logic errors, by splitting
> out the acquisition of the self-pipe into a separate function named say
> InitializeLatchSupport.  The question then becomes where to put the
> InitializeLatchSupport calls.  My first thought is to put them near the
> signal-setup stanzas for the various processes (ie, the pqsignal calls)
> similarly to what we did recently for initialization of timeout support.
> However there might be a better idea.
>
> Comments?

We still have to consider how Postgres would operate without the
latches. I don't see that it can, so a shutdown seems appropriate. Is
the purpose of this just to allow a cleaner and more informative
shutdown? Or do you think we can avoid?

If we did move the init calls, would that alter things for code that
creates new used defined latches?

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: Rethinking placement of latch self-pipe initialization

От
Tom Lane
Дата:
Simon Riggs <simon@2ndQuadrant.com> writes:
> We still have to consider how Postgres would operate without the
> latches. I don't see that it can, so a shutdown seems appropriate. Is
> the purpose of this just to allow a cleaner and more informative
> shutdown? Or do you think we can avoid?

The point is that right now, if a new backend fails its initialization
at this specific step, that gets translated into a database-wide crash
and restart cycle, for no good reason.  It should just result in that
particular session failing.

> If we did move the init calls, would that alter things for code that
> creates new used defined latches?

Only to the extent that it'd have to make sure it called the
initialization function at some appropriate point.  It's not like
required initialization functions are a foreign concept in our code.
        regards, tom lane



Re: Rethinking placement of latch self-pipe initialization

От
Tom Lane
Дата:
I wrote:
> Sean Chittenden recently reported that 9.2 can crash after logging
> "FATAL: pipe() failed" if the kernel is short of file descriptors:
> http://archives.postgresql.org/pgsql-general/2012-10/msg00202.php
> ...
> What I think would be a better idea is to fix things so that OwnLatch
> cannot fail except as a result of internal logic errors, by splitting
> out the acquisition of the self-pipe into a separate function named say
> InitializeLatchSupport.  The question then becomes where to put the
> InitializeLatchSupport calls.  My first thought is to put them near the
> signal-setup stanzas for the various processes (ie, the pqsignal calls)
> similarly to what we did recently for initialization of timeout support.

I experimented with that approach, and found out it didn't work at all,
because there are assorted code paths in which InitProcess and
InitAuxiliaryProcess are called well before we enable signals.
I find that a bit disturbing; it means there is a significant amount
of process-startup code that has a latch available but can't actually
wait on the latch.  (Well, it could, but it would never awaken because
it would never react to the signal.)  At some point in the future we
may have to do some refactoring in this area.  On the other hand, moving
signal-enabling earlier carries its own hazards: the signal handlers
might expect yet other infrastructure to be alive already, and any bug
of that ilk would probably be a lot harder to reproduce and diagnose.

Anyway, the simplest working solution proved to be to put the
InitializeLatchSupport calls in InitProcess and InitAuxiliaryProcess,
plus add them in a few background process types that use InitLatch but
don't call either of those functions.  Patch attached.  Any objections?

            regards, tom lane

diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index 335e9f66afb52163470ee9b69119b7fa16d2bc1c..bbd1810ea530547c75255f02c5be44c559cc03de 100644
*** a/src/backend/port/unix_latch.c
--- b/src/backend/port/unix_latch.c
*************** static volatile sig_atomic_t waiting = f
*** 60,70 ****
  static int    selfpipe_readfd = -1;
  static int    selfpipe_writefd = -1;

! /* private function prototypes */
! static void initSelfPipe(void);
! static void drainSelfPipe(void);
  static void sendSelfPipeByte(void);


  /*
   * Initialize a backend-local latch.
--- 60,100 ----
  static int    selfpipe_readfd = -1;
  static int    selfpipe_writefd = -1;

! /* Private function prototypes */
  static void sendSelfPipeByte(void);
+ static void drainSelfPipe(void);
+
+
+ /*
+  * Initialize the process-local latch infrastructure.
+  *
+  * This must be called once during startup of any process that can wait on
+  * latches, before it issues any InitLatch() or OwnLatch() calls.
+  */
+ void
+ InitializeLatchSupport(void)
+ {
+     int            pipefd[2];
+
+     Assert(selfpipe_readfd == -1);
+
+     /*
+      * Set up the self-pipe that allows a signal handler to wake up the
+      * select() in WaitLatch. Make the write-end non-blocking, so that
+      * SetLatch won't block if the event has already been set many times
+      * filling the kernel buffer. Make the read-end non-blocking too, so that
+      * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK.
+      */
+     if (pipe(pipefd) < 0)
+         elog(FATAL, "pipe() failed: %m");
+     if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0)
+         elog(FATAL, "fcntl() failed on read-end of self-pipe: %m");
+     if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0)
+         elog(FATAL, "fcntl() failed on write-end of self-pipe: %m");

+     selfpipe_readfd = pipefd[0];
+     selfpipe_writefd = pipefd[1];
+ }

  /*
   * Initialize a backend-local latch.
*************** static void sendSelfPipeByte(void);
*** 72,80 ****
  void
  InitLatch(volatile Latch *latch)
  {
!     /* Initialize the self-pipe if this is our first latch in the process */
!     if (selfpipe_readfd == -1)
!         initSelfPipe();

      latch->is_set = false;
      latch->owner_pid = MyProcPid;
--- 102,109 ----
  void
  InitLatch(volatile Latch *latch)
  {
!     /* Assert InitializeLatchSupport has been called in this process */
!     Assert(selfpipe_readfd >= 0);

      latch->is_set = false;
      latch->owner_pid = MyProcPid;
*************** InitSharedLatch(volatile Latch *latch)
*** 116,126 ****
  void
  OwnLatch(volatile Latch *latch)
  {
!     Assert(latch->is_shared);

!     /* Initialize the self-pipe if this is our first latch in this process */
!     if (selfpipe_readfd == -1)
!         initSelfPipe();

      /* sanity check */
      if (latch->owner_pid != 0)
--- 145,154 ----
  void
  OwnLatch(volatile Latch *latch)
  {
!     /* Assert InitializeLatchSupport has been called in this process */
!     Assert(selfpipe_readfd >= 0);

!     Assert(latch->is_shared);

      /* sanity check */
      if (latch->owner_pid != 0)
*************** latch_sigusr1_handler(void)
*** 514,543 ****
          sendSelfPipeByte();
  }

- /* initialize the self-pipe */
- static void
- initSelfPipe(void)
- {
-     int            pipefd[2];
-
-     /*
-      * Set up the self-pipe that allows a signal handler to wake up the
-      * select() in WaitLatch. Make the write-end non-blocking, so that
-      * SetLatch won't block if the event has already been set many times
-      * filling the kernel buffer. Make the read-end non-blocking too, so that
-      * we can easily clear the pipe by reading until EAGAIN or EWOULDBLOCK.
-      */
-     if (pipe(pipefd) < 0)
-         elog(FATAL, "pipe() failed: %m");
-     if (fcntl(pipefd[0], F_SETFL, O_NONBLOCK) < 0)
-         elog(FATAL, "fcntl() failed on read-end of self-pipe: %m");
-     if (fcntl(pipefd[1], F_SETFL, O_NONBLOCK) < 0)
-         elog(FATAL, "fcntl() failed on write-end of self-pipe: %m");
-
-     selfpipe_readfd = pipefd[0];
-     selfpipe_writefd = pipefd[1];
- }
-
  /* Send one byte to the self-pipe, to wake up WaitLatch */
  static void
  sendSelfPipeByte(void)
--- 542,547 ----
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
index 1f1ed33dc2d358ce27d7d769f4e6849d7d2d810d..0c089fc7ecc80b4c5fff806866ea6b8b46700ded 100644
*** a/src/backend/port/win32_latch.c
--- b/src/backend/port/win32_latch.c
***************
*** 31,36 ****
--- 31,42 ----


  void
+ InitializeLatchSupport(void)
+ {
+     /* currently, nothing to do here for Windows */
+ }
+
+ void
  InitLatch(volatile Latch *latch)
  {
      latch->is_set = false;
diff --git a/src/backend/postmaster/pgarch.c b/src/backend/postmaster/pgarch.c
index d5d8be0587d663631e27886da66cdae71f8448db..a6c0aea3d6a957ee4f4a1652b82987f1d1aeae9f 100644
*** a/src/backend/postmaster/pgarch.c
--- b/src/backend/postmaster/pgarch.c
*************** PgArchiverMain(int argc, char *argv[])
*** 234,241 ****

      MyProcPid = getpid();        /* reset MyProcPid */

-     InitLatch(&mainloop_latch); /* initialize latch used in main loop */
-
      MyStartTime = time(NULL);    /* record Start Time for logging */

      /*
--- 234,239 ----
*************** PgArchiverMain(int argc, char *argv[])
*** 247,252 ****
--- 245,254 ----
          elog(FATAL, "setsid() failed: %m");
  #endif

+     InitializeLatchSupport();        /* needed for latch waits */
+
+     InitLatch(&mainloop_latch); /* initialize latch used in main loop */
+
      /*
       * Ignore all signals usually bound to some action in the postmaster,
       * except for SIGHUP, SIGTERM, SIGUSR1, SIGUSR2, and SIGQUIT.
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8389d5c4ae932a24d22a8d3fe20920a2bdede7c8..be3adf16d922039a636b59842f7b1b3f1440707d 100644
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
*************** PgstatCollectorMain(int argc, char *argv
*** 3022,3027 ****
--- 3022,3029 ----
          elog(FATAL, "setsid() failed: %m");
  #endif

+     InitializeLatchSupport();        /* needed for latch waits */
+
      /* Initialize private latch for use by signal handlers */
      InitLatch(&pgStatLatch);

diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index 0febf64d87f2b337b70e4c9e49d4399b6aa03c0e..cccec743b0d28922964acd5b597ed74548588856 100644
*** a/src/backend/postmaster/syslogger.c
--- b/src/backend/postmaster/syslogger.c
*************** SysLoggerMain(int argc, char *argv[])
*** 251,256 ****
--- 251,258 ----
          elog(FATAL, "setsid() failed: %m");
  #endif

+     InitializeLatchSupport();        /* needed for latch waits */
+
      /* Initialize private latch for use by signal handlers */
      InitLatch(&sysLoggerLatch);

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 677042962a0e9e5723dc1fcf81d0cbb2f44446c2..5ae1506038cc59ebea9d6a7cee3ccebd3adc9819 100644
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
*************** InitProcess(void)
*** 280,285 ****
--- 280,292 ----
          elog(ERROR, "you already exist");

      /*
+      * Initialize process-local latch support.  This could fail if the kernel
+      * is low on resources, and if so we want to exit cleanly before acquiring
+      * any shared-memory resources.
+      */
+     InitializeLatchSupport();
+
+     /*
       * Try to get a proc struct from the free list.  If this fails, we must be
       * out of PGPROC structures (not to mention semaphores).
       *
*************** InitAuxiliaryProcess(void)
*** 452,457 ****
--- 459,471 ----
          elog(ERROR, "you already exist");

      /*
+      * Initialize process-local latch support.  This could fail if the kernel
+      * is low on resources, and if so we want to exit cleanly before acquiring
+      * any shared-memory resources.
+      */
+     InitializeLatchSupport();
+
+     /*
       * We use the ProcStructLock to protect assignment and releasing of
       * AuxiliaryProcs entries.
       *
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 71fb4868a000ccdd1afc6dd38bbbc8102c07f87f..f5c7fe1ca5ee9d0ee9706035f0b13f7a02bc1968 100644
*** a/src/include/storage/latch.h
--- b/src/include/storage/latch.h
*************** typedef struct
*** 111,116 ****
--- 111,117 ----
  /*
   * prototypes for functions in latch.c
   */
+ extern void InitializeLatchSupport(void);
  extern void InitLatch(volatile Latch *latch);
  extern void InitSharedLatch(volatile Latch *latch);
  extern void OwnLatch(volatile Latch *latch);

Re: Rethinking placement of latch self-pipe initialization

От
Peter Geoghegan
Дата:
On 14 October 2012 19:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Anyway, the simplest working solution proved to be to put the
> InitializeLatchSupport calls in InitProcess and InitAuxiliaryProcess,
> plus add them in a few background process types that use InitLatch but
> don't call either of those functions.  Patch attached.  Any objections?

Looks good to me.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services