Обсуждение: assertion at postmaster start

Поиск
Список
Период
Сортировка

assertion at postmaster start

От
Alvaro Herrera
Дата:
Once in a blue moon I get this assertion failure on server start:

2019-06-15 12:00:29.650 -04 [30080] LOG:  iniciando PostgreSQL 12beta1 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
7.4.0-1ubuntu1~18.04)7.4.0, 64-bit
 
2019-06-15 12:00:29.650 -04 [30080] LOG:  escuchando en la dirección IPv4 «127.0.0.1», port 55432
2019-06-15 12:00:29.650 -04 [30080] LOG:  escuchando en el socket Unix «/tmp/.s.PGSQL.55432»
2019-06-15 12:00:29.658 -04 [30956] LOG:  el sistema de bases de datos fue apagado en 2019-06-15 12:00:24 -04
2019-06-15 12:00:29.659 -04 [30080] LOG:  proceso de servidor (PID 30107) terminó con código de salida 15
2019-06-15 12:00:29.659 -04 [30080] LOG:  terminando todos los otros procesos de servidor activos
TRAP: FailedAssertion(«!(AbortStartTime == 0)», Archivo: «/pgsql/source/master/src/backend/postmaster/postmaster.c»,
Línea:2957)
 
Aborted (core dumped)

Apologies for the Spanish -- I cannot readily reproduce this.  In
essence, this shows a normal startup, until suddenly process 30107
terminates with exit code 15, and then while shutting everything down,
postmaster hits the aforementioned assertion and terminates.

One problem with debugging this is that I don't know what process 30107
is, since the logs don't mention it.

No idea what is going on.  But I'm going to set my script to start the
server with log_min_messages=debug1, in case I hit it again ...

Has anybody else seen this?

-- 
Álvaro Herrera



Re: assertion at postmaster start

От
Andres Freund
Дата:
Hi,

On 2019-06-15 12:09:50 -0400, Alvaro Herrera wrote:
> Once in a blue moon I get this assertion failure on server start:
> 
> 2019-06-15 12:00:29.650 -04 [30080] LOG:  iniciando PostgreSQL 12beta1 on x86_64-pc-linux-gnu, compiled by gcc
(Ubuntu7.4.0-1ubuntu1~18.04) 7.4.0, 64-bit
 
> 2019-06-15 12:00:29.650 -04 [30080] LOG:  escuchando en la dirección IPv4 «127.0.0.1», port 55432
> 2019-06-15 12:00:29.650 -04 [30080] LOG:  escuchando en el socket Unix «/tmp/.s.PGSQL.55432»
> 2019-06-15 12:00:29.658 -04 [30956] LOG:  el sistema de bases de datos fue apagado en 2019-06-15 12:00:24 -04
> 2019-06-15 12:00:29.659 -04 [30080] LOG:  proceso de servidor (PID 30107) terminó con código de salida 15
> 2019-06-15 12:00:29.659 -04 [30080] LOG:  terminando todos los otros procesos de servidor activos
> TRAP: FailedAssertion(«!(AbortStartTime == 0)», Archivo: «/pgsql/source/master/src/backend/postmaster/postmaster.c»,
Línea:2957)
 
> Aborted (core dumped)
> 
> Apologies for the Spanish -- I cannot readily reproduce this.  In
> essence, this shows a normal startup, until suddenly process 30107
> terminates with exit code 15, and then while shutting everything down,
> postmaster hits the aforementioned assertion and terminates.

I assume this is on master as of a few days ago? This doesn't even look
to be *after* a crash-restart? And I assume core files weren't enabled?


> One problem with debugging this is that I don't know what process 30107
> is, since the logs don't mention it.

Hm - it probably can't be that many processes, it looks like 30107 has
to have started pretty soon after the startup process (which IIRC is the
one emitting "el sistema de bases de datos fue apagado en"), and as soon
as that's done 30107 is noticed as having crashed.

Unfortunately, as this appears to be a start in a clean database, we
don't really know which phase of startup this is. There's IIRC no
messages to be expected before "database system is ready to accept
connections" in a clean start.

What is a bit odd is that:

> 2019-06-15 12:00:29.659 -04 [30080] LOG:  proceso de servidor (PID 30107) terminó con código de salida 15

comes from:
#. translator: %s is a noun phrase describing a child process, such as
#. "server process"
#: postmaster/postmaster.c:3656
#, c-format
msgid "%s (PID %d) exited with exit code %d"
msgstr "%s (PID %d) terminó con código de salida %d"

#: postmaster/postmaster.c:3301 postmaster/postmaster.c:3321
#: postmaster/postmaster.c:3328 postmaster/postmaster.c:3346
msgid "server process"
msgstr "proceso de servidor"

And "server process" is afaict only used for actual backends, not other
types of processes. But we've not yet seen "database system is ready to
accept accept connections", so IIRC it could only be a "dead_end" type
backend? But we didn't yet see an error from that...


> No idea what is going on.

Seems to indicate a logic error in postmaster's state machine. Perhaps
something related to dead_end processes?

Greetings,

Andres Freund



Re: assertion at postmaster start

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2019-06-15 12:09:50 -0400, Alvaro Herrera wrote:
>> Once in a blue moon I get this assertion failure on server start:
>> TRAP: FailedAssertion(«!(AbortStartTime == 0)», Archivo: «/pgsql/source/master/src/backend/postmaster/postmaster.c»,
Línea:2957) 

> And "server process" is afaict only used for actual backends, not other
> types of processes. But we've not yet seen "database system is ready to
> accept accept connections", so IIRC it could only be a "dead_end" type
> backend? But we didn't yet see an error from that...
> Seems to indicate a logic error in postmaster's state machine. Perhaps
> something related to dead_end processes?

So if Andres is guessing right, this must be from something trying to
connect before the postmaster is ready?  Seems like that could be
tested for ...

            regards, tom lane



Re: assertion at postmaster start

От
Tom Lane
Дата:
I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2019-06-15 12:09:50 -0400, Alvaro Herrera wrote:
>>> Once in a blue moon I get this assertion failure on server start:
>>> TRAP: FailedAssertion("!(AbortStartTime == 0)", Archivo:
"/pgsql/source/master/src/backend/postmaster/postmaster.c",Linea: 2957) 

>> And "server process" is afaict only used for actual backends, not other
>> types of processes. But we've not yet seen "database system is ready to
>> accept accept connections", so IIRC it could only be a "dead_end" type
>> backend? But we didn't yet see an error from that...
>> Seems to indicate a logic error in postmaster's state machine. Perhaps
>> something related to dead_end processes?

> So if Andres is guessing right, this must be from something trying to
> connect before the postmaster is ready?  Seems like that could be
> tested for ...

I got around to trying to test for this, and I find that I can reproduce
the symptom exactly by applying the attached hack and then trying to
connect a couple of seconds after starting the postmaster.

Basically what seems to be happening in Alvaro's report is that

(1) before the startup process is done, something tries to connect,
causing a dead_end child to be launched;

(2) for reasons mysterious, that child exits with exit code 15 rather
than the expected 0 or 1;

(3) the postmaster therefore initiates a system-wide restart cycle;

(4) the startup process completes normally anyway, indicating that the
SIGQUIT arrived too late to affect it;

(5) then we hit the Assert, since we reach the transition-to-normal-run
code even though HandleChildCrash set AbortStartTime in step (3).

The timing window for (4) to happen is extremely tight normally.  The
attached patch makes it wider by the expedient of just not sending the
SIGQUIT to the startup process ;-).  Then you just need enough of a delay
in startup to perform a manual connection, plus some hack to make the
dead_end child exit with an unexpected exit code.

I think what this demonstrates is that that Assert is just wrong:
we *can* reach PM_RUN with the flag still set, so we should do

            StartupStatus = STARTUP_NOT_RUNNING;
            FatalError = false;
-            Assert(AbortStartTime == 0);
+            AbortStartTime = 0;
            ReachedNormalRunning = true;
            pmState = PM_RUN;

Probably likewise for the similar Assert in sigusr1_handler.

A larger question is whether we should modify the postmaster logic
so that crashes of dead_end children aren't treated as reasons to
perform a system restart.  I'm dubious about this, because frankly,
such crashes shouldn't be happening.  There is very little code
that a dead_end child will traverse before exiting ... so how the
devil did it reach an exit(15)?  Alvaro, are you running any
nonstandard code in the postmaster (shared_preload_libraries, maybe)?

            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3339804..ef53522 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2260,6 +2260,7 @@ retry1:
     switch (port->canAcceptConnections)
     {
         case CAC_STARTUP:
+            exit(15);
             ereport(FATAL,
                     (errcode(ERRCODE_CANNOT_CONNECT_NOW),
                      errmsg("the database system is starting up")));
@@ -3512,7 +3513,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
                 (errmsg_internal("sending %s to process %d",
                                  (SendStop ? "SIGSTOP" : "SIGQUIT"),
                                  (int) StartupPID)));
-        signal_child(StartupPID, (SendStop ? SIGSTOP : SIGQUIT));
+//        signal_child(StartupPID, (SendStop ? SIGSTOP : SIGQUIT));
         StartupStatus = STARTUP_SIGNALED;
     }

@@ -5383,6 +5384,9 @@ StartChildProcess(AuxProcType type)
         MemoryContextDelete(PostmasterContext);
         PostmasterContext = NULL;

+        if (type == StartupProcess)
+            pg_usleep(5000000);
+
         AuxiliaryProcessMain(ac, av);
         ExitPostmaster(0);
     }

Re: assertion at postmaster start

От
Tom Lane
Дата:
I wrote:
> I think what this demonstrates is that that Assert is just wrong:
> we *can* reach PM_RUN with the flag still set, so we should do
>             StartupStatus = STARTUP_NOT_RUNNING;
>             FatalError = false;
> -            Assert(AbortStartTime == 0);
> +            AbortStartTime = 0;
>             ReachedNormalRunning = true;
>             pmState = PM_RUN;
> Probably likewise for the similar Assert in sigusr1_handler.

Poking further at this, I noticed that the code just above here completely
fails to do what the comments say it intends to do, which is restart the
startup process after we've SIGQUIT'd it.  That's because the careful
manipulation of StartupStatus in reaper (lines 2943ff in HEAD) is stomped
on by HandleChildCrash, which will just unconditionally reset it to
STARTUP_CRASHED (line 3507).  So we end up shutting down the database
after all, which is not what the intention seems to be.  Hence,
commit 45811be94 was still a few bricks shy of a load :-(.

I propose the attached.  I'm inclined to think that the risk/benefit
of back-patching this is not very good, so I just want to stick it in
HEAD, unless somebody can explain why dead_end children are likely to
crash in the field.

            regards, tom lane

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 3339804..ecb7892 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2920,7 +2920,9 @@ reaper(SIGNAL_ARGS)
              * during PM_STARTUP is treated as catastrophic. There are no
              * other processes running yet, so we can just exit.
              */
-            if (pmState == PM_STARTUP && !EXIT_STATUS_0(exitstatus))
+            if (pmState == PM_STARTUP &&
+                StartupStatus != STARTUP_SIGNALED &&
+                !EXIT_STATUS_0(exitstatus))
             {
                 LogChildExit(LOG, _("startup process"),
                              pid, exitstatus);
@@ -2937,11 +2939,24 @@ reaper(SIGNAL_ARGS)
              * then we previously sent the startup process a SIGQUIT; so
              * that's probably the reason it died, and we do want to try to
              * restart in that case.
+             *
+             * This stanza also handles the case where we sent a SIGQUIT
+             * during PM_STARTUP due to some dead_end child crashing: in that
+             * situation, if the startup process dies on the SIGQUIT, we need
+             * to transition to PM_WAIT_BACKENDS state which will allow
+             * PostmasterStateMachine to restart the startup process.  (On the
+             * other hand, the startup process might complete normally, if we
+             * were too late with the SIGQUIT.  In that case we'll fall
+             * through and commence normal operations.)
              */
             if (!EXIT_STATUS_0(exitstatus))
             {
                 if (StartupStatus == STARTUP_SIGNALED)
+                {
                     StartupStatus = STARTUP_NOT_RUNNING;
+                    if (pmState == PM_STARTUP && Shutdown == NoShutdown)
+                        pmState = PM_WAIT_BACKENDS;
+                }
                 else
                     StartupStatus = STARTUP_CRASHED;
                 HandleChildCrash(pid, exitstatus,
@@ -2954,7 +2969,7 @@ reaper(SIGNAL_ARGS)
              */
             StartupStatus = STARTUP_NOT_RUNNING;
             FatalError = false;
-            Assert(AbortStartTime == 0);
+            AbortStartTime = 0;
             ReachedNormalRunning = true;
             pmState = PM_RUN;

@@ -3504,7 +3519,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
     if (pid == StartupPID)
     {
         StartupPID = 0;
-        StartupStatus = STARTUP_CRASHED;
+        /* Caller adjusts StartupStatus, so don't touch it here */
     }
     else if (StartupPID != 0 && take_action)
     {
@@ -5100,7 +5115,7 @@ sigusr1_handler(SIGNAL_ARGS)
     {
         /* WAL redo has started. We're out of reinitialization. */
         FatalError = false;
-        Assert(AbortStartTime == 0);
+        AbortStartTime = 0;

         /*
          * Crank up the background tasks.  It doesn't matter if this fails,

Re: assertion at postmaster start

От
Tom Lane
Дата:
I wrote:
> I propose the attached.  I'm inclined to think that the risk/benefit
> of back-patching this is not very good, so I just want to stick it in
> HEAD, unless somebody can explain why dead_end children are likely to
> crash in the field.

Pushed at ee3278239.

I'm still curious as to the explanation for a dead_end child exiting
with code 15, but I have no way to pursue the point.

            regards, tom lane



Re: assertion at postmaster start

От
Alvaro Herrera
Дата:
On 2019-Aug-26, Tom Lane wrote:

> I wrote:
> > I propose the attached.  I'm inclined to think that the risk/benefit
> > of back-patching this is not very good, so I just want to stick it in
> > HEAD, unless somebody can explain why dead_end children are likely to
> > crash in the field.
> 
> Pushed at ee3278239.
> 
> I'm still curious as to the explanation for a dead_end child exiting
> with code 15, but I have no way to pursue the point.

Many thanks for all the investigation and fix!

Sadly, I have *no* idea what could have happened that would have caused
a connection at that point (my start scripts don't do it).  It is
possible that I had a terminal running some shell loop on psql ("watch
psql -c something" perhaps).  But I'm sure I didn't notice that when I
reported this, or I would have mentioned it.  However, I have no idea
why would it have died with code 15.  From my notes of what I was doing
that day, I can't find any evidence that I would have had anything in
shared_preload_libraries.  (I don't have Frost's complete timestamped
shell history, however.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services