Re: VM corruption on standby
From | Andres Freund |
---|---|
Subject | Re: VM corruption on standby |
Date | |
Msg-id | o22gkxevs5c3ilid7czbo3idnmtv6aljczod37s2pi7gcnrbe4@bggjoejg6gy2 |
In response to | Re: VM corruption on standby (Thomas Munro <thomas.munro@gmail.com>) |
Responses | Re: VM corruption on standby |
List | pgsql-hackers |
Hi,

On 2025-08-20 03:19:38 +1200, Thomas Munro wrote:
> On Wed, Aug 20, 2025 at 2:57 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote:
> > > > On linux - the primary OS with OOM killer troubles - I'm pretty sure'll lwlock
> > > > waiters would get killed due to the postmaster death signal we've configured
> > > > (c.f. PostmasterDeathSignalInit()).
> > >
> > > No, that has a handler that just sets a global variable. That was
> > > done because recovery used to try to read() from the postmaster pipe
> > > after replaying every record. Also we currently have some places that
> > > don't want to be summarily killed (off the top of my head, syncrep
> > > wants to send a special error message, and the logger wants to survive
> > > longer than everyone else to catch as much output as possible, things
> > > I've been thinking about in the context of threads).
> >
> > That makes no sense. We should just _exit(). If postmaster has been killed,
> > trying to stay up longer just makes everything more fragile. Waiting for the
> > logger is *exactly* what we should *not* do - what if the logger also crashed?
> > There's no postmaster around to start it.
>
> Nobody is waiting for the logger.

Error messages that we might be printing will wait for logger if the pipe is
full, no?

> The logger waits for everyone else to exit first to collect forensics:
>
>     * Unlike all other postmaster child processes, we'll ignore postmaster
>     * death because we want to collect final log output from all backends and
>     * then exit last. We'll do that by running until we see EOF on the
>     * syslog pipe, which implies that all other backends have exited
>     * (including the postmaster).
>
> The syncrep case is a bit weirder: it wants to tell the user that
> syncrep is broken, so its own WaitEventSetWait() has
> WL_POSTMASTER_DEATH, but that's basically bogus because the backend
> can reach WaitEventSetWait(WL_EXIT_ON_PM_DEATH) in many other code
> paths. I've proposed nuking that before.

Yea, that's just bogus. I think this is one more instance of "let's try hard
to continue limping along" making things way more fragile than the simpler
"let's just do crash-restart in the most normal way possible".

Greetings,

Andres Freund