Thread: PANIC serves too many masters
Hi,

Right now we use PANIC for very different kinds of errors.

Sometimes for errors that are persistent, where crash-restarting and trying
again won't help:
  ereport(PANIC,
          (errmsg("could not locate a valid checkpoint record")));
or
  ereport(PANIC,
          (errmsg("online backup was canceled, recovery cannot continue")));

Sometimes for errors that could be transient, e.g. when running out of space
while trying to write out WAL:
  ereport(ERROR,
          (errcode_for_file_access(),
           errmsg("could not write to file \"%s\": %m", tmppath)));
(the ERROR is often promoted to a PANIC due to critical sections)
or
  ereport(PANIC,
          (errcode_for_file_access(),
           errmsg("could not write to log file \"%s\" at offset %u, length %zu: %m",
                  xlogfname, startoffset, nleft)));

Sometimes for "should never happen" checks that are important enough to check
in production builds:
  elog(PANIC, "stuck spinlock detected at %s, %s:%d",
or
  elog(PANIC, "failed to re-find shared proclock object");

I have two main issues with this:

1) If core dumps are allowed, we trigger core dumps for all of these. That's
good for "should never happen" types of errors like a stuck spinlock. But it
makes no sense for things like the on-disk state being wrong at startup, or
running out of space while writing WAL - if anything, it might make that
worse!

It's very useful to be able to collect dumps for crashes in production, but
it's not useful to generate thousands of identical cores because crash
recovery fails with out-of-space and we retry over and over.

2) For errors where crash-restarting won't fix anything, using PANIC doesn't
allow postmaster to distinguish between an error that should lead postmaster
to exit itself (after killing other processes, obviously) and the normal
crash-restart cycle.

I've been trying to do some fleet-wide analyses of the causes of crashes, but
having core dumps for lots of stuff that aren't crashes, often repeated many
times, makes that much harder.
Filtering out abort()s and just looking at segfaults filters out far too much.

I don't quite know what we should do. But the current situation decidedly
doesn't seem great. Maybe we could have:

- PANIC_BUG
  triggers abort() followed by a crash-restart cycle
  to be used for things like a stuck spinlock

- PANIC_RETRY
  causes a crash-restart cycle, no core dump
  to be used for things like ENOSPC during WAL writes

- PANIC_EXIT
  causes postmaster to exit(1)
  to be used for things where retrying won't help, like "requested recovery
  stop point is before consistent recovery point"

One could argue that some of the PANICs that want to just shut down the
server should instead be FATALs, with knowledge in postmaster about which/when
such errors should trigger exiting. We do have something like this for the
startup process, but only when errors happen "early enough", and without being
able to distinguish between "retryable" and "should exit" type errors. But
ISTM that that requires adding more and more knowledge to postmaster.c,
instead of leaving it with the code that raises the error.

Greetings,

Andres Freund
Hi,

On Sat, 2023-11-18 at 14:29 -0800, Andres Freund wrote:
> I don't quite know what we should do. But the current situation
> decidedly doesn't seem great.

Agreed. Better classification is nice, but it also requires more discipline,
and it might not always be obvious which category something fits in. What
about an error loop resulting in:

  ereport(PANIC, (errmsg_internal("ERRORDATA_STACK_SIZE exceeded")));

We'd want a core file, but I don't think we want to restart in that case,
right?

Also, can we do a change like this incrementally by updating a few PANIC
sites at a time? Is it fine to leave plain PANICs in place for the foreseeable
future, or do you want all of them to eventually move?

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> On Sat, 2023-11-18 at 14:29 -0800, Andres Freund wrote:
>> I don't quite know what we should do. But the current situation
>> decidedly doesn't seem great.

> Agreed.

+1

> Better classification is nice, but it also requires more
> discipline and it might not always be obvious which category something
> fits in. What about an error loop resulting in:
> ereport(PANIC, (errmsg_internal("ERRORDATA_STACK_SIZE exceeded")));
> We'd want a core file, but I don't think we want to restart in that
> case, right?

Why not restart? There's no strong reason to assume this will repeat. It
might be worth having some independent logic in the postmaster that causes it
to give up after too many crashes in a row. But with many/most of these call
sites, by definition we're not too sure what is wrong.

> Also, can we do a change like this incrementally by updating a few
> PANIC sites at a time? Is it fine to leave plain PANICs in place for
> the foreseeable future, or do you want all of them to eventually move?

I'd be inclined to keep PANIC with its current meaning, and incrementally
change call sites where we decide that's not the best behavior. I think those
will be a minority, maybe a small minority. (PANIC_EXIT had darn well better
be a small minority.)

			regards, tom lane
On Mon, 2023-11-20 at 17:12 -0500, Tom Lane wrote:
> I'd be inclined to keep PANIC with its current meaning, and
> incrementally change call sites where we decide that's not the
> best behavior. I think those will be a minority, maybe a small
> minority. (PANIC_EXIT had darn well better be a small minority.)

Is the error level the right way to express what we want to happen? It seems
like what we really want is to decide on the behavior, i.e. restart or not,
and generate a core or not. That could be done a different way, like:

  ereport(PANIC,
          (errmsg("could not locate a valid checkpoint record"),
           errabort(false), errrestart(false)));

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> Is the error level the right way to express what we want to happen? It
> seems like what we really want is to decide on the behavior, i.e.
> restart or not, and generate core or not. That could be done a
> different way, like:

> ereport(PANIC,
>         (errmsg("could not locate a valid checkpoint record"),
>          errabort(false), errrestart(false)));

Yeah, I was wondering about that too. It feels to me that PANIC_EXIT is an
error level (even more severe than PANIC). But maybe "no core dump please"
should be conveyed separately, since it's just a minor adjustment that doesn't
fundamentally change what happens. It's plausible that you'd want a core, or
not want one, for different cases that all seem to require PANIC_EXIT.

(Need a better name than PANIC_EXIT. OMIGOD?)

			regards, tom lane
Hi,

On 2023-11-20 17:55:32 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > Is the error level the right way to express what we want to happen? It
> > seems like what we really want is to decide on the behavior, i.e.
> > restart or not, and generate core or not. That could be done a
> > different way, like:
> >
> > ereport(PANIC,
> >         (errmsg("could not locate a valid checkpoint record"),
> >          errabort(false), errrestart(false)));
>
> Yeah, I was wondering about that too. It feels to me that
> PANIC_EXIT is an error level (even more severe than PANIC).
> But maybe "no core dump please" should be conveyed separately,
> since it's just a minor adjustment that doesn't fundamentally
> change what happens.

I guess I was thinking of an error level because that'd be easier to search
for in logs. It seems reasonable to want to specifically search for errors
that cause core dumps, since IMO they should all be "should never happen"
kinds of paths.

> It's plausible that you'd want a core,
> or not want one, for different cases that all seem to require
> PANIC_EXIT.

I can't immediately think of a case where you'd want PANIC_EXIT but also want
a core dump? In my mental model, to use PANIC_EXIT we'd need to have a decent
understanding that the situation isn't going to change after a crash-restart -
in which case a core dump presumably isn't interesting?

> (Need a better name than PANIC_EXIT. OMIGOD?)

CRITICAL?

I agree with the point made upthread that we'd want to leave PANIC around -
it's not realistic to annotate everything, and then there are obviously also
extensions (although I hope there aren't many PANICs in extensions). If that
weren't the case, something like this could make sense:

PANIC:    crash-restart
CRITICAL: crash-shutdown
BUG:      crash-restart, abort()

Greetings,

Andres Freund