Overhauling our interrupt handling (was Escaping from blocked send() reprised.)
От | Andres Freund |
---|---|
Тема | Overhauling our interrupt handling (was Escaping from blocked send() reprised.) |
Дата | |
Msg-id | 20150115020335.GZ5245@awork2.anarazel.de обсуждение исходный текст |
Ответы |
Re: Overhauling our interrupt handling
(Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.) (Robert Haas <robertmhaas@gmail.com>) Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.) (Heikki Linnakangas <hlinnakangas@vmware.com>) Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.) (Heikki Linnakangas <hlinnakangas@vmware.com>) Re: Overhauling our interrupt handling (was Escaping from blocked send() reprised.) (Noah Misch <noah@leadboat.com>) |
Список | pgsql-hackers |
Hi, This mail is a answer to http://archives.postgresql.org/message-id/20150110022542.GF12509%40alap3.anarazel.de but I decided that it's a good go move it into a new thread since the patchseries has outgrown its original target. It's invasive and touches many arcane areas of the code, so that I think more eyes than a long deep thread is likely to have, are a good thing. As a short description of the goal of the patchseries: This started out as steps on the way to be able to safely handle terminate_backend() et al when we're reading/writing from the client. E.g. because the client is dead and we haven't noticed yet and are stuck writing, or because we want to implement a idle_in_transaction timeout. Making these changes allowed me to see that not doing significant work in signal handlers can make code much simpler and more robust. The number of bugs postgres had in the past that involved doing too much in signal handlers is significant. Thus I've since extended the patchseries to get rid of nearly all nontrivial work in signal handlers. All the patches in the series up to 0008 hav ecommit messages providing more detail. A short description of each patch follows: 0001: Replace walsender's latch with the general shared latch. New patch that removes ImmediateInteruptOK behaviour from walsender. I think that's a rather good idea, because walsender currently seems to assume WaitLatchOrSocket is reentrant - which I don't think is really guaranteed. Hasn't been reviewed yet, but I think it's not far from being committable. 0002: Use a nonblocking socket for FE/BE communication and block using latches. Has previously been reviewed by Heikki. I think Noah also had a look, although I'm not sure how close that was. I think this can be committed soon. 0003: Introduce and use infrastructure for interrupt processing during client reads. From here on ImmediateInterruptOK isn't set during client communication. Normal interrupts and sinval/async interrupts are processed outside of signal handlers. Especially the sinval/async greatly simplify the respective code. Has been reviewed by Heikki in an earlier version - I think I addressed all the review comments. I'd like somebody else to look at the sinval/async.c changes before committing. I really like the simplification this change allowed. I think this patch might not be safe without 0004 because I can't convince myself that it's safe to interrupt latch waits with work that actually might also use the same latch (sinval interrupts). But it's easier to review this way as 0004 is quite big. 0004: Process 'die' interrupts while reading/writing from the client socket. This is the reason Horiguchi-san started this thread. I think the important debate here is whether we think it's acceptable that there are situations (a full socket buffer, but a alive connection) where the client previously got an error, but not anymore afterwards. I think that's much better than having unkillable connections, but I can see others being of a different opinion. 0005: Move deadlock and other interrupt handling in proc.c out of signal handlers. I'm surprised how comparatively simple this turned out to be. For robustness and understanding I think this is a great improvement. New and not reviewed at all. Needs significant review. But works in my simple testcases. 0006: Don't allow immediate interupts during authentication anymore. So far we've set ImmediateInterruptOK to true during large parts of the client authentication - that's not all that pretty, interrupts might arrive while we're in some system routines. Due to patches 0003/0004 we now are able to safely serve interrupts during client communication which is the major are where we want to adhere to authentication_timeout. I additionally placed some CHECK_FOR_INTERRUPTS()s in some somewhat randomly chosen places in auth.c. Those don't completely get back the previous 'appruptness' (?) of reacting to interrupts, but that's partially for the better, because we don't interrupt foreign code anymore. 0007: Remove the option to service interrupts during PGSemaphoreLock(). Due to patch 0005 there's no user of PGSemaphoreLock(lock, interruptOK = true) anymore, so remove the code to handle that. I find it rather odd that the code did a CHECK_FOR_INTERRUPTS before the semwait even when interruptOK was set to to false - that only happened to work because lwlock.c did a HOLD_INTERRUPTS() around the semaphore code. It's now removed entirely. This is a API break because the signature of PGSemaphoreLock() changed. But I have a hard time seing that as a problem. For one I don't think there's many users of PGSemaphoreLock, for another they should change their interrupt handling. New and not reviewed. 0008: Remove remnants of ImmediateInterruptOK handling. Now that ImmediateInterruptOK is never set to true anymore, we can remove related code and comments. New and not reviewed. 0009: WIP: lwlock: use latches Optional patch that removes the use of semaphores from the lwlock code. There's no benefit for lwlock.c itself due to this. But it does get rid of the last user of semaphores in a normal postgres build. I primarily wrote this to test the performance of latches. I guess we want to do this anyway. 0010: WIP: unix_latch: WIP efficiency hackery Nothing ready, although I think using eventfds on linux is worth the cost. Questions I have about the series right now: 1) Do others agree that getting rid of ImmediateInterruptOK is worthwile. I personally think that we really should pursue that aggressively as a goal. 2) Does anybody see a significant problem with the reduction of cases where we immediately react to authentication_timeout? Since we still react immediately when we're waiting for the client (except maybe in sspi/kerberos?) I think that's ok. The waiting cases are slow ident/radius/kerberos/ldap/... servers. But we really shouldn't jump out of the relevant code anyway. 3) If we do everything (in corrected), upto 0009, there's no remaining user of semaphores in postgres, except the spinlock emulation. Does anybody see a need for PGPROC->sem to remain? Removing it would have the significant benefit that semaphores are the last thing users frequently need to tune to get postgres to start up when using a higher max_connection or multiple clusters. If we remove PGPROC->sem does anybody see a way to get rid of the semaphore code alltogether? I personally still think we should just gank --disable-spinlocks. Now that it's only a couple lines (in an preexisting patch) to rely on compiler intrinsics for spinlocks that doesn't sound like a big deal. Even if not, we could decide to get rid of win32_sem.c... 4) There remains one user of something like ImmediateInterruptOK - walreceiver. It has WalRcvImmediateInterruptOK. We definitely should get rid of that as well, but that's a independent patch that can be written in the future. 5) In a future patch I think we could also handle catchup interrupts during other client reads, not just ReadCommand(). Does anybody see that as a worthwile goal? I can't remember many problems with not enough catchup happening, but I think someone mentioned that as a problem in the past. 6) Review ;) Comments? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Вложения
- 0001-Replace-walsender-s-latch-with-the-general-shared-la.patch
- 0002-Use-a-nonblocking-socket-for-FE-BE-communication-and.patch
- 0003-Introduce-and-use-infrastructure-for-interrupt-proce.patch
- 0004-Process-die-interrupts-while-reading-writing-from-th.patch
- 0005-Move-deadlock-and-other-interrupt-handling-in-proc.c.patch
- 0006-Don-t-allow-immediate-interupts-during-authenticatio.patch
- 0007-Remove-the-option-to-service-interrupts-during-PGSem.patch
- 0008-Remove-remnants-of-ImmediateInterruptOK-handling.patch
- 0009-WIP-lwlock.c-use-latches.patch
- 0010-WIP-unix_latch.c-efficiency-hackery.patch
В списке pgsql-hackers по дате отправления: