Discussion: Redesigning postmaster death handling
Hi,
Here's an experimental patch to fix our shutdown strategy on postmaster death, as discussed in a nearby report[1]. Maybe it's possible to switch to _exit() without also switching to preemptive handling, but it seems fragile and painful for no gain.
Following that line of thinking, we might as well just ask the kernel to hit our existing SIGQUIT handler at parent exit, on Linux/FreeBSD. Job done.
For systems lacking that facility, the idea I'm trying out here is that backends that detect the condition in WaitEventSetWait() should themselves blast all backends with SIGQUIT, in a sense taking over the role of the departed postmaster. I didn't really want any consensus/negotiation over who's going to do that, so... they all do.
Most of the patch is just removing hundreds of lines of errors and conditions and comments that were now unreachable.
Better ideas, glaring holes in the plan, etc., welcome.
[1] https://www.postgresql.org/message-id/flat/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
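On systems without a kernel facility, a child has to notice postmaster death synchronously. One way to do that, assumed here purely for illustration, works like this: the postmaster holds the only write end of a pipe, nobody ever writes to it, and a child polls the read end and treats end-of-file as postmaster death. This is a minimal sketch with a hypothetical function name. Depending on the platform, a closed pipe shows up as POLLHUP, POLLIN, or both, so the sketch confirms with a read() that returns 0. Nothing happens until some process actually performs the check.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical helper: non-blocking check whether the last holder of the
 * write end of the pipe behind 'death_fd' (the postmaster) has gone away.
 * Nobody ever writes to this pipe, so reading EOF means the parent is gone.
 */
static bool
parent_pipe_closed(int death_fd)
{
    struct pollfd pfd = { .fd = death_fd, .events = POLLIN };
    int rc;
    char c;

    do
        rc = poll(&pfd, 1, 0);          /* timeout 0: never block */
    while (rc < 0 && errno == EINTR);

    if (rc <= 0)
        return false;                   /* nothing pending (or poll error) */

    /* Linux reports a writerless pipe as POLLHUP; some systems use POLLIN. */
    if ((pfd.revents & (POLLIN | POLLHUP)) == 0)
        return false;

    /* Confirm: EOF reads 0 bytes, unexpected data would read 1. */
    return read(death_fd, &c, 1) == 0;
}
```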
Thomas Munro <thomas.munro@gmail.com> writes:
> Here's an experimental patch to fix our shutdown strategy on
> postmaster death, as discussed in a nearby report[1].
Thanks for tackling this topic.
> For systems lacking that facility, the idea I'm trying out here is
> that backends that detect the condition in WaitEventSetWait() should
> themselves blast all backends with SIGQUIT, in a sense taking over the
> role of the departed postmaster.
Hmm. Up to now, we have not had an assumption that postmaster
children are aware of every other postmaster child. In particular,
not all postmaster children have PGPROC entries. How much does
this matter? What happens if the shared PGPROC array is corrupt?
> I didn't really want any
> consensus/negotiation over who's going to do that, so... they all do.
Agreed on that point.
> Most of the patch is just removing hundreds of lines of errors and
> conditions and comments that were now unreachable.
The patch would likely be a lot more readable if you split out the
"delete unreachable code" part into a separate step.
regards, tom lane
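The "they all do" approach needs no negotiation because repeated signals are harmless. A second SIGQUIT to a process that is already exiting changes nothing, and signalling a pid that has already exited just makes kill() fail with ESRCH. This is a minimal sketch with a hypothetical function name. It assumes each backend can obtain a snapshot of its siblings' pids from somewhere; in the patch that is shared memory, which is exactly what the PGPROC question above is about. The guard against non-positive pids matters if that memory is corrupt. kill(0, sig) would hit the caller's own process group, and kill(-1, sig) would hit every process the caller is permitted to signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sketch: a backend that noticed postmaster death sends 'sig'
 * to every pid in a snapshot of its siblings.  There is no election: every
 * backend that notices does this, so duplicates are expected and harmless.
 * Returns the number of processes successfully signalled.
 */
static size_t
signal_all_siblings(const pid_t *pids, size_t npids, int sig)
{
    pid_t self = getpid();
    size_t sent = 0;

    for (size_t i = 0; i < npids; i++)
    {
        pid_t pid = pids[i];

        /*
         * Skip empty slots, ourselves, and anything that is not a plain pid:
         * kill() with pid <= 0 targets process groups, not single processes.
         */
        if (pid <= 0 || pid == self)
            continue;

        if (kill(pid, sig) == 0)
            sent++;
        /*
         * Failures are ignored: ESRCH means it already exited (perhaps from
         * another backend's signal); EPERM could mean a recycled pid.  Keep
         * going rather than stop halfway.
         */
    }
    return sent;
}
```

In this scheme, the caller would then take its own SIGQUIT path last, after signalling everyone else.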
Thomas Munro <thomas.munro@gmail.com> writes:
> Following that line of thinking, we might as well just ask the kernel
> to hit our existing SIGQUIT handler at parent exit, on Linux/FreeBSD.
> Job done.
One other thought here: do we *really* want such a critical-and-hard-
to-test aspect of our behavior to be handled completely differently
on different platforms? I'd lean to ignoring the Linux/FreeBSD
facilities, because otherwise we're basically doubling our testing
problems in exchange for not much.
regards, tom lane
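For reference, the Linux and FreeBSD facilities in question are prctl(PR_SET_PDEATHSIG) and procctl(PROC_PDEATHSIG_CTL). This is a minimal sketch of a child requesting SIGQUIT at parent exit, with hypothetical helper names. It shows how small the set-and-forget path is. It also shows one subtlety: if the parent has already exited before the call, the signal never arrives. So the sketch compares getppid() with the parent pid the caller expects and raises the signal itself on a mismatch.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>
#if defined(__linux__)
#include <sys/prctl.h>
#elif defined(__FreeBSD__)
#include <sys/procctl.h>
#endif

/*
 * Hypothetical helper: ask the kernel to send 'sig' to this process when its
 * parent exits.  Returns false where no such facility exists, in which case
 * the caller must detect postmaster death some other way.
 */
static bool
request_parent_death_signal(int sig, pid_t expected_parent)
{
#if defined(__linux__)
    if (prctl(PR_SET_PDEATHSIG, (unsigned long) sig) != 0)
        return false;
#elif defined(__FreeBSD__)
    if (procctl(P_PID, 0, PROC_PDEATHSIG_CTL, &sig) != 0)
        return false;
#else
    (void) sig;
    (void) expected_parent;
    return false;               /* no kernel support on this platform */
#endif

#if defined(__linux__) || defined(__FreeBSD__)
    /* Close the race: if the parent died before we asked, nobody signals us. */
    if (getppid() != expected_parent)
        raise(sig);
    return true;
#endif
}

/* Read back the configured signal; 0 if none is set or it is unsupported. */
static int
current_parent_death_signal(void)
{
    int sig = 0;

#if defined(__linux__)
    if (prctl(PR_GET_PDEATHSIG, (unsigned long) &sig) != 0)
        return 0;
#elif defined(__FreeBSD__)
    if (procctl(P_PID, 0, PROC_PDEATHSIG_STATUS, &sig) != 0)
        return 0;
#endif
    return sig;
}
```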
On Thu, Aug 21, 2025 at 5:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hmm. Up to now, we have not had an assumption that postmaster
> children are aware of every other postmaster child. In particular,
> not all postmaster children have PGPROC entries. How much does
> this matter? What happens if the shared PGPROC array is corrupt?
It's also how we set latches, but yeah, it's certainly an issue. Other ideas:
1. My other patch that used O_ASYNC (= ask the kernel to send SIGIO when the pipe becomes readable) worked, but required a pipe or socket pair per backend and is not actually in any standard. I think it is available almost everywhere anyway. I could rejuvenate that just to try it out again.
2. I wonder if we could make better use of session IDs. I understand that we use them to signal e.g. the archiver and its children, but I wonder if we could use a different granularity: the postmaster's sid for most stuff, and per-backend sids only when really needed. Then you'd just have to signal a small number of sessions, perhaps more than one but not many more. We pretend that setsid() is optional, but it's old POSIX and available everywhere. I also know that Windows has a similar thing; I just haven't looked into it.
> > Most of the patch is just removing hundreds of lines of errors and
> > conditions and comments that were now unreachable.
>
> The patch would likely be a lot more readable if you split out the
> "delete unreachable code" part into a separate step.
Will do.
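Idea 1 relies on O_ASYNC. On Linux and BSD-derived systems, a descriptor with O_ASYNC set delivers SIGIO to its owner (set with F_SETOWN) when it becomes readable. A pipe whose last writer has closed counts as readable, because it is at end-of-file. This is a minimal sketch with hypothetical names. It assumes a pipe whose write end only the parent holds. The owner is a property of the open file description, which a fork()ed child shares with its parent and siblings. That is why each backend would need its own pipe, matching the cost described above.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/*
 * Set from the SIGIO handler.  A real backend would start its SIGQUIT-style
 * shutdown here instead of just recording the event.
 */
static volatile sig_atomic_t async_parent_gone = 0;

static void
handle_sigio(int signo)
{
    (void) signo;
    async_parent_gone = 1;
}

/*
 * Hypothetical helper: arrange for the kernel to send SIGIO to this process
 * when 'read_fd' (read end of a pipe whose only writer is the parent)
 * becomes readable, which includes EOF once the writer goes away.
 */
static bool
arm_async_death_notification(int read_fd)
{
    struct sigaction sa;
    int flags;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_sigio;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) != 0)
        return false;

    /* The owner belongs to the open file description: one pipe per child. */
    if (fcntl(read_fd, F_SETOWN, getpid()) != 0)
        return false;

    flags = fcntl(read_fd, F_GETFL);
    if (flags < 0)
        return false;
    return fcntl(read_fd, F_SETFL, flags | O_ASYNC) == 0;
}
```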
On Thu, Aug 21, 2025 at 5:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> One other thought here: do we *really* want such a critical-and-hard-
> to-test aspect of our behavior to be handled completely differently
> on different platforms? I'd lean to ignoring the Linux/FreeBSD
> facilities, because otherwise we're basically doubling our testing
> problems in exchange for not much.
Yeah. The attraction is that it's extremely simple and reliable: set-and-forget, adding one line that sends you into well-tested immediate shutdown code. Combined with the fact that most of our user base has it, that seemed compelling.
The reliability aspects I was thinking of are: (1) the kernel's knowledge of the process tree is infallible by definition, and (2) it's handled asynchronously on postmaster exit, not after a POLLHUP, EVFILT_PROCESS, or process HANDLE event that must be consumed synchronously by at least one child.
For (2), in practice I think it's close to 100% certain that one backend is currently, or will very soon be, in WaitEventSetWait() and will thus drive the cleanup operation, and I think that's probably good enough. For example, even if your backends are all busy, there's basically always a bunch of "launchers" and other auxiliary processes ready and waiting to deal with it. But it's possible to dream up extreme theoretical scenarios where that bet fails: imagine that every single backend except one is currently waiting for a lock in sem_wait() (let's say it's the same lock, for simplicity). I previously said in some throwaway comment that they can't all be blocked in sem_wait() or you already have a deadlock (a programming bug that isn't this system's fault), but if the postmaster AND the backend that holds the lock are both killed by the OOM killer, you lose. Those backends would need to be cleaned up manually by an administrator in all released versions of PostgreSQL, and it'd be no better with the v1 patch on Windows and macOS.
They'd all eat SIGQUIT on a Linux or FreeBSD system with the v1 patch, so on paper, at least, it's more hole-proof. I agree that it would be nice to have just one system, though, and of course to make it completely reliable everywhere without complicated theories.
One argument I thought of against PROC_PDEATHSIG_CTL is that its simplicity also takes away some possibilities. Yesterday I wrote "taking over the role of the departed postmaster", and realised it's not the whole enchilada: do we also want the "issuing SIGKILL to recalcitrant children" bit? I don't want this system to be complicated, rather the opposite, but I wonder if there is a nice way to make it run *literally* the same code as the postmaster. We'd need bulletproof data structure sharing or, preferably, no sharing of modifiable data at all. Some ideas I'm looking into: better use of process groups, or maybe doing the bookkeeping in memory that is not even mapped into children until they need it. Or something. Researching...
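On the process-group idea: suppose children stayed in the postmaster's process group rather than getting their own session. Then any child could signal every member with a single kill(-pgid, sig), without consulting shared memory. POSIX does not reuse a process group ID until the group's lifetime ends, so the postmaster's pgid stays a valid target after the postmaster itself has exited, as long as children remain. This is a minimal sketch with hypothetical helper names. It assumes each child records the group ID at startup. The sender is also a member and receives the signal too, which is fine when the goal is to exit anyway.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: called once early in each child, while the postmaster lives. */
static pid_t
remember_postmaster_group(void)
{
    return getpgrp();
}

/*
 * Hypothetical helper: send 'sig' to every member of process group 'pgid',
 * including the caller if it is a member.
 */
static bool
signal_process_group(pid_t pgid, int sig)
{
    /*
     * kill(0, sig) would target the caller's current group and kill(-1, sig)
     * every process we may signal; refuse values that a corrupted or
     * uninitialized variable could produce.
     */
    if (pgid <= 1)
        return false;
    return kill(-pgid, sig) == 0;
}
```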