Overhauling our interrupt handling (was Escaping from blocked send() reprised.)

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
Дата
Msg-id 20150115020335.GZ5245@awork2.anarazel.de
обсуждение исходный текст
Ответы Re: Overhauling our interrupt handling  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)  (Robert Haas <robertmhaas@gmail.com>)
Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.)  (Noah Misch <noah@leadboat.com>)
Список pgsql-hackers
Hi,

This mail is a answer to
http://archives.postgresql.org/message-id/20150110022542.GF12509%40alap3.anarazel.de
but I decided that it's a good go move it into a new thread since the
patchseries has outgrown its original target. It's invasive and touches
many arcane areas of the code, so that I think more eyes than a long
deep thread is likely to have, are a good thing.

As a short description of the goal of the patchseries:
This started out as steps on the way to be able to safely handle
terminate_backend() et al when we're reading/writing from the
client. E.g. because the client is dead and we haven't noticed yet and
are stuck writing, or because we want to implement a idle_in_transaction
timeout.

Making these changes allowed me to see that not doing significant work
in signal handlers can make code much simpler and more robust. The
number of bugs postgres had in the past that involved doing too much in
signal handlers is significant.  Thus I've since extended the
patchseries to get rid of nearly all nontrivial work in signal
handlers.

All the patches in the series up to 0008 hav ecommit messages providing
more detail. A short description of each patch follows:

0001: Replace walsender's latch with the general shared latch.

      New patch that removes ImmediateInteruptOK behaviour from walsender. I
      think that's a rather good idea, because walsender currently seems to
      assume WaitLatchOrSocket is reentrant - which I don't think is really
      guaranteed.
      Hasn't been reviewed yet, but I think it's not far from being
      committable.

0002: Use a nonblocking socket for FE/BE communication and block using
      latches.

      Has previously been reviewed by Heikki. I think Noah also had a
      look, although I'm not sure how close that was.

      I think this can be committed soon.

0003: Introduce and use infrastructure for interrupt processing during client reads.

      From here on ImmediateInterruptOK isn't set during client
      communication. Normal interrupts and sinval/async interrupts are
      processed outside of signal handlers. Especially the sinval/async
      greatly simplify the respective code.

      Has been reviewed by Heikki in an earlier version - I think I
      addressed all the review comments.

      I'd like somebody else to look at the sinval/async.c changes
      before committing. I really like the simplification this change
      allowed.

      I think this patch might not be safe without 0004 because I can't
      convince myself that it's safe to interrupt latch waits with work
      that actually might also use the same latch (sinval
      interrupts). But it's easier to review this way as 0004 is quite
      big.

0004: Process 'die' interrupts while reading/writing from the client socket.

      This is the reason Horiguchi-san started this thread.

      I think the important debate here is whether we think it's
      acceptable that there are situations (a full socket buffer, but a
      alive connection) where the client previously got an error, but
      not anymore afterwards. I think that's much better than having
      unkillable connections, but I can see others being of a different
      opinion.

0005: Move deadlock and other interrupt handling in proc.c out of signal handlers.

      I'm surprised how comparatively simple this turned out to be. For
      robustness and understanding I think this is a great improvement.

      New and not reviewed at all. Needs significant review. But works
      in my simple testcases.

0006: Don't allow immediate interupts during authentication anymore.

      So far we've set ImmediateInterruptOK to true during large parts
      of the client authentication - that's not all that pretty,
      interrupts might arrive while we're in some system routines.

      Due to patches 0003/0004 we now are able to safely serve
      interrupts during client communication which is the major are
      where we want to adhere to authentication_timeout.

      I additionally placed some CHECK_FOR_INTERRUPTS()s in some
      somewhat randomly chosen places in auth.c. Those don't completely
      get back the previous 'appruptness' (?) of reacting to
      interrupts, but that's partially for the better, because we don't
      interrupt foreign code anymore.

0007: Remove the option to service interrupts during PGSemaphoreLock().

      Due to patch 0005 there's no user of PGSemaphoreLock(lock, interruptOK = true)

      anymore, so remove the code to handle that. I find it rather odd
      that the code did a CHECK_FOR_INTERRUPTS before the semwait even
      when interruptOK was set to to false - that only happened to work
      because lwlock.c did a HOLD_INTERRUPTS() around the semaphore
      code. It's now removed entirely.

      This is a API break because the signature of PGSemaphoreLock()
      changed. But I have a hard time seing that as a problem. For one I
      don't think there's many users of PGSemaphoreLock, for another
      they should change their interrupt handling.

      New and not reviewed.

0008: Remove remnants of ImmediateInterruptOK handling.

      Now that ImmediateInterruptOK is never set to true anymore, we can
      remove related code and comments.

      New and not reviewed.

0009: WIP: lwlock: use latches

      Optional patch that removes the use of semaphores from the lwlock
      code. There's no benefit for lwlock.c itself due to this. But it
      does get rid of the last user of semaphores in a normal postgres
      build. I primarily wrote this to test the performance of latches.

      I guess we want to do this anyway.

0010: WIP: unix_latch: WIP efficiency hackery

      Nothing ready, although I think using eventfds on linux is worth
      the cost.


Questions I have about the series right now:

1) Do others agree that getting rid of ImmediateInterruptOK is
   worthwile. I personally think that we really should pursue that
   aggressively as a goal.

2) Does anybody see a significant problem with the reduction of cases
   where we immediately react to authentication_timeout? Since we still
   react immediately when we're waiting for the client (except maybe in
   sspi/kerberos?) I think that's ok. The waiting cases are slow
   ident/radius/kerberos/ldap/... servers. But we really shouldn't jump
   out of the relevant code anyway.

3) If we do everything (in corrected), upto 0009, there's no remaining
   user of semaphores in postgres, except the spinlock emulation.

   Does anybody see a need for PGPROC->sem to remain? Removing it would
   have the significant benefit that semaphores are the last thing users
   frequently need to tune to get postgres to start up when using a
   higher max_connection or multiple clusters.

   If we remove PGPROC->sem does anybody see a way to get rid of the
   semaphore code alltogether? I personally still think we should just
   gank --disable-spinlocks. Now that it's only a couple lines (in an
   preexisting patch) to rely on compiler intrinsics for spinlocks that
   doesn't sound like a big deal.  Even if not, we could decide to get
   rid of win32_sem.c...

4) There remains one user of something like ImmediateInterruptOK -
   walreceiver. It has WalRcvImmediateInterruptOK. We definitely should
   get rid of that as well, but that's a independent patch that can be
   written in the future.

5) In a future patch I think we could also handle catchup interrupts
   during other client reads, not just ReadCommand(). Does anybody see
   that as a worthwile goal? I can't remember many problems with not
   enough catchup happening, but I think someone mentioned that as a
   problem in the past.

6) Review ;)

Comments?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jim Nasby
Дата:
Сообщение: Re: Parallel Seq Scan
Следующее
От: Amit Langote
Дата:
Сообщение: Partitioning: issues/ideas (Was: Re: On partitioning)