Discussion: Severity of elog(FATAL) should vary by process

Severity of elog(FATAL) should vary by process

From
Tom Lane
Date:
Awhile back I noted that we had a problem with the postmaster failing
to recognize error exit from the startup process:
http://archives.postgresql.org/pgsql-hackers/2006-07/msg01485.php
The discussion with Stephen Harris about signal response brought this
back to mind --- as things stand, the only way that the xlog.c code
could report an unrecoverable error is to elog(PANIC).  The problem
noted in the above message only applied in early startup of a
subprocess, but really we've got an issue with elog(FATAL) exits at
any point in a subprocess.  (Note: in the startup process, any
elog(ERROR) is auto-promoted to elog(FATAL) by elog.c, because of the
lack of a setjmp handler to return to.)  So the solution I proposed
before isn't enough.

The backend code is quite littered with elog(FATAL) calls that are meant
to indicate "this backend seems hopelessly confused, but there's no
reason to suppose there's a system-wide problem".  So we don't want the
postmaster to engage in a panic restart if a normal backend goes down
with elog(FATAL).  I claim, however, that that *would* be a good idea
for the startup process, and probably for the bgwriter too.

Rather than try to change a lot of elog call sites, what I'm thinking
would be a good plan is to make the FATAL-exit case in elog.c always
exit with exit(1) (right now it tests a couple of different conditions
to decide what to return).  Then, in the postmaster, consider an exit
code of 1 to be either OK or not OK depending on which child it came
from.  I think there are a small number of exit(1) calls that might
need to be changed to exit(2) because they are trying to force the
postmaster to do a panic restart, but it should be a minimal patch.

Comments?
        regards, tom lane
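
A minimal sketch of the exit-code interpretation proposed above, to make the per-child decision concrete. The enum, its members, and the function name are hypothetical and are not taken from the PostgreSQL source. Treating the bgwriter like the startup process follows the "probably" in the message, and the handling of exit codes other than 1 is an assumption.

```c
#include <stdbool.h>

/* Hypothetical child categories; these names are illustrative only. */
typedef enum
{
    CHILD_BACKEND,   /* ordinary client backend */
    CHILD_STARTUP,   /* startup (recovery) process */
    CHILD_BGWRITER   /* background writer */
} ChildKind;

/*
 * Decide whether the postmaster should do a panic restart after a child
 * exits with the given exit code.
 *
 * Exit code 1 is what the FATAL path would always return under the
 * proposal: acceptable from a normal backend, not acceptable from the
 * startup process or (probably) the bgwriter.
 *
 * Assumptions: 0 is a clean exit, and any other nonzero code (such as 2)
 * forces a panic restart regardless of which child it came from.
 */
static bool
exit_needs_panic_restart(ChildKind kind, int exit_code)
{
    if (exit_code == 0)
        return false;
    if (exit_code == 1)
        return kind != CHILD_BACKEND;
    return true;
}
```

With this sketch, exit_needs_panic_restart(CHILD_BACKEND, 1) is false, while exit_needs_panic_restart(CHILD_STARTUP, 1) is true.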


Re: Severity of elog(FATAL) should vary by process

From
Alvaro Herrera
Date:
Tom Lane wrote:

> Rather than try to change a lot of elog call sites, what I'm thinking
> would be a good plan is to make the FATAL-exit case in elog.c always
> exit with exit(1) (right now it tests a couple of different conditions
> to decide what to return).  Then, in the postmaster, consider an exit
> code of 1 to be either OK or not OK depending on which child it came
> from.  I think there are a small number of exit(1) calls that might
> need to be changed to exit(2) because they are trying to force the
> postmaster to do a panic restart, but it should be a minimal patch.

I was going to suggest using symbolic names to exit codes instead of
hardcoding 1 or 2.  We do that in Mammoth replicator, and use the exit
codes to determine whether the postmaster needs to take special action
for different replication scenarios, e.g. when one needs to promote a
master server to slave or vice versa.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
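
A rough illustration of the symbolic-names suggestion, assuming the two literal values discussed in the thread (1 for a FATAL exit, 2 to force a panic restart). The identifiers are hypothetical and are taken neither from PostgreSQL nor from Mammoth Replicator.

```c
/*
 * Hypothetical symbolic exit codes. The thread only mentions the literal
 * values 1 and 2; the names and the value 0 for a clean exit are assumptions.
 */
typedef enum
{
    CHILD_EXIT_OK = 0,     /* clean shutdown */
    CHILD_EXIT_FATAL = 1,  /* elog(FATAL): severity depends on the child */
    CHILD_EXIT_PANIC = 2   /* force the postmaster to do a panic restart */
} ChildExitCode;

/* Return a short label for an exit code, e.g. for a postmaster log line. */
static const char *
child_exit_name(int code)
{
    switch (code)
    {
        case CHILD_EXIT_OK:
            return "ok";
        case CHILD_EXIT_FATAL:
            return "fatal";
        case CHILD_EXIT_PANIC:
            return "panic";
        default:
            return "unknown";
    }
}
```

Call sites would then read exit(CHILD_EXIT_PANIC) rather than exit(2), and further codes for special actions could be added to the enum without touching the callers.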