Unintended restart after recovery error

Поиск
Список
Период
Сортировка
От Antonin Houska
Тема Unintended restart after recovery error
Дата
Msg-id 15802.1415785969@localhost
обсуждение исходный текст
Ответы Re: Unintended restart after recovery error  (Antonin Houska <ah@cybertec.at>)
Re: Unintended restart after recovery error  (Fujii Masao <masao.fujii@gmail.com>)
Список pgsql-hackers
While looking at postmaster.c:reaper(), one problematic case occurred to me.


1. Startup process signals PMSIGNAL_RECOVERY_STARTED.

2. Checkpointer process is forked and immediately dies.

3. reaper() catches this failure, calls HandleChildCrash() and thus sets
FatalError to true.

4. Startup process exits with non-zero status code too - either due to SIGQUIT
received from HandleChildCrash or due to some other failure of the startup
process itself. However, FatalError is already set, because of the previous
crash of the checkpointer. Thus reaper() does not set RecoveryError.

5. As RecoverError failed to be set to true, postmaster will try to restart
the cluster, although it apparently should not.


I could simulate the problem using the following changes (and by installing
recovery.conf into DATA directory):


diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 99f702c..0cbd1c1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6178,6 +6178,12 @@ StartupXLOG(void)            SetForwardFsyncRequests();
SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);           bgwriterLaunched = true; 
+
+            /*
+             * Accidental delay, ensuring that checkpointer's crash is caught
+             * by PM first.
+             */
+            pg_usleep(5000000L);        }        /*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 6c814ba..4585119 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -194,6 +194,8 @@ CheckpointerMain(void)    sigjmp_buf    local_sigjmp_buf;    MemoryContext checkpointer_context;
+    ereport(FATAL, (errmsg("early failure")));
+    CheckpointerShmem->checkpointer_pid = MyProcPid;    /*


This works for me as a fix:

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 6220a8e..0fb13bb 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2573,15 +2573,11 @@ reaper(SIGNAL_ARGS)             * After PM_STARTUP, any unexpected exit (including FATAL exit)
of            * the startup process is catastrophic, so kill other children,             * and set RecoveryError so we
don'ttry to reinitialize after 
-             * they're gone.  Exception: if FatalError is already set, that
-             * implies we previously sent the startup process a SIGQUIT, so
-             * that's probably the reason it died, and we do want to try to
-             * restart in that case.
+             * they're gone.             */            if (!EXIT_STATUS_0(exitstatus))            {
-                if (!FatalError)
-                    RecoveryError = true;
+                RecoveryError = true;                HandleChildCrash(pid, exitstatus,
_("startupprocess"));                continue; 

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Atri Sharma
Дата:
Сообщение: Re: using custom scan nodes to prototype parallel sequential scan
Следующее
От: Antonin Houska
Дата:
Сообщение: Re: Unintended restart after recovery error