Discussion: What to do when dynamic shared memory control segment is corrupt


What to do when dynamic shared memory control segment is corrupt

From: Sherrylyn Branchaw
Date:
Greetings,

We are using Postgres 9.6.8 (planning to upgrade to 9.6.9 soon) on RHEL 6.9.

We recently experienced two similar outages on two different prod databases. The error messages from the logs were as follows:

LOG:  server process (PID 138529) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

We are still investigating what may have triggered these errors, since there were no recent changes to these databases. Unfortunately, core dumps were not configured correctly, so we may have to wait for the next outage before we can do a good root cause analysis.
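
Getting core dumps fixed is in progress; for reference, the knobs involved on our RHEL 6 boxes are roughly the following (paths and values are illustrative, not a tested recipe):

    # /etc/security/limits.conf: let the postgres user write core files
    postgres  soft  core  unlimited

    # send cores somewhere predictable; by default they land in the
    # process's current directory, which for backends is the data directory
    sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p

    # the init script also needs to raise the limit ("ulimit -c unlimited")
    # before launching the postmaster, since daemons inherit it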

My question, meanwhile, is around remedial actions to take when this happens.

In one case, the logs recorded

LOG:  all server processes terminated; reinitializing
LOG:  incomplete data in "postmaster.pid": found only 1 newlines while trying to add line 7
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 365/FDFA738
LOG:  invalid record length at 365/12420978: wanted 24, got 0
LOG:  redo done at 365/12420950
LOG:  last completed transaction was at log time 2018-06-05 10:59:27.049443-05
LOG:  checkpoint starting: end-of-recovery immediate
LOG:  checkpoint complete: wrote 5343 buffers (0.5%); 0 transaction log file(s) added, 1 removed, 0 recycled; write=0.131 s, sync=0.009 s, total=0.164 s; sync files=142, longest=0.005 s, average=0.000 s; distance=39064 kB, estimate=39064 kB
LOG:  MultiXact member wraparound protections are now enabled
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections

In that case, the database restarted immediately, with only 30 seconds of downtime.

In the other case, the logs recorded

LOG:  all server processes terminated; reinitializing
LOG:  dynamic shared memory control segment is corrupt
LOG:  incomplete data in "postmaster.pid": found only 1 newlines while trying to add line 7

In that case, the database did not restart on its own. It was 5 am on Sunday, so the on-call SRE just manually started the database up, and it appears to have been running fine since.

My question is whether the corrupt shared memory control segment, and the failure of Postgres to automatically restart, mean the database should not be automatically started up, and if there's something we should be doing before restarting. 

Do we potentially have corrupt data or indices as a result of our last outage? If so, what should we do to investigate?

Thanks,
Sherrylyn

Re: What to do when dynamic shared memory control segment is corrupt

From: Tom Lane
Date:
Sherrylyn Branchaw <sbranchaw@gmail.com> writes:
> We are using Postgres 9.6.8 (planning to upgrade to 9.6.9 soon) on RHEL 6.9.
> We recently experienced two similar outages on two different prod
> databases. The error messages from the logs were as follows:
> LOG:  server process (PID 138529) was terminated by signal 6: Aborted

Hm ... were these installations built with --enable-cassert?  If not,
an abort trap seems pretty odd.
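
You can check with something like this; debug_assertions is a read-only
GUC that reports whether the binaries were built with assertions enabled:

    # prints "off" on a normal production build
    psql -c 'SHOW debug_assertions;'

    # or inspect how the packages were configured
    pg_config --configure | grep cassert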

> In one case, the logs recorded
> LOG:  all server processes terminated; reinitializing
> LOG:  incomplete data in "postmaster.pid": found only 1 newlines while
> trying to add line 7
> ...

> In the other case, the logs recorded
> LOG:  all server processes terminated; reinitializing
> LOG:  dynamic shared memory control segment is corrupt
> LOG:  incomplete data in "postmaster.pid": found only 1 newlines while
> trying to add line 7
> ...

Those "incomplete data" messages are quite unexpected and disturbing.
I don't know of any mechanism within Postgres proper that would result
in corruption of the postmaster.pid file that way.  (I wondered briefly
if trying to start a conflicting postmaster would result in such a
situation, but experimentation here says not.)  I'm suspicious that
this may indicate a bug or unwarranted assumption in whatever scripts
you use to start/stop the postmaster.  Whether that is at all related
to your crash issue is hard to say, but it bears looking into.

> My question is whether the corrupt shared memory control segment, and the
> failure of Postgres to automatically restart, mean the database should not
> be automatically started up, and if there's something we should be doing
> before restarting.

No, that looks like fairly typical crash recovery to me: corrupt shared
memory contents are expected and recovered from after a crash.  However,
we don't expect postmaster.pid to get mucked with.

            regards, tom lane


Re: What to do when dynamic shared memory control segment is corrupt

From: Andres Freund
Date:
On 2018-06-18 12:30:13 -0400, Tom Lane wrote:
> Sherrylyn Branchaw <sbranchaw@gmail.com> writes:
> > We are using Postgres 9.6.8 (planning to upgrade to 9.6.9 soon) on RHEL 6.9.
> > We recently experienced two similar outages on two different prod
> > databases. The error messages from the logs were as follows:
> > LOG:  server process (PID 138529) was terminated by signal 6: Aborted
> 
> Hm ... were these installations built with --enable-cassert?  If not,
> an abort trap seems pretty odd.

PANIC does an abort, so that's not too surprising...

    if (elevel >= PANIC)
    {
        /*
         * Serious crash time. Postmaster will observe SIGABRT process exit
         * status and kill the other backends too.
         *
         * XXX: what if we are *in* the postmaster?  abort() won't kill our
         * children...
         */
        fflush(stdout);
        fflush(stderr);
        abort();
    }

Greetings,

Andres Freund


Re: What to do when dynamic shared memory control segment is corrupt

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-06-18 12:30:13 -0400, Tom Lane wrote:
>> Sherrylyn Branchaw <sbranchaw@gmail.com> writes:
>>> LOG:  server process (PID 138529) was terminated by signal 6: Aborted

>> Hm ... were these installations built with --enable-cassert?  If not,
>> an abort trap seems pretty odd.

> PANIC does an abort, so that's not too surprising...

Hm, I supposed that Sherrylyn would've noticed any PANIC entries in
the log.  The TRAP message from an assertion failure could've escaped
notice though, even assuming that her logging setup captured it.

            regards, tom lane


Re: What to do when dynamic shared memory control segment is corrupt

From: Peter Geoghegan
Date:
On Mon, Jun 18, 2018 at 1:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hm, I supposed that Sherrylyn would've noticed any PANIC entries in
> the log.  The TRAP message from an assertion failure could've escaped
> notice though, even assuming that her logging setup captured it.

Unhandled C++ exceptions end up calling std::abort(). I've seen bugs
in modules like PL/V8 that were caused by this. The symptom was a
mysterious message in the logs about SIGABRT. Perhaps that's what
happened here?

What extensions are installed, if any?
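
For example, in each database:

    psql -d somedb -c 'SELECT extname, extversion FROM pg_extension;'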

-- 
Peter Geoghegan


Re: What to do when dynamic shared memory control segment is corrupt

From: Sherrylyn Branchaw
Date:
> Hm ... were these installations built with --enable-cassert?  If not,
> an abort trap seems pretty odd.

The packages are installed directly from the yum repos for RHEL. I'm not aware that --enable-cassert is being used, and we're certainly not installing from source.

> Those "incomplete data" messages are quite unexpected and disturbing.
> I don't know of any mechanism within Postgres proper that would result
> in corruption of the postmaster.pid file that way.  (I wondered briefly
> if trying to start a conflicting postmaster would result in such a
> situation, but experimentation here says not.)  I'm suspicious that
> this may indicate a bug or unwarranted assumption in whatever scripts
> you use to start/stop the postmaster.  Whether that is at all related
> to your crash issue is hard to say, but it bears looking into.

We're using the stock initd script from the yum repo, but I dug into this issue, and it looks like we're passing the path to the postmaster.pid as the $pidfile variable in our sysconfig file, meaning the initd script is managing the postmaster.pid file, and specifically is overwriting it with a single line containing just the pid. I'm not sure why it's set up like this, and I'm thinking we should change it, but it seems harmless and unrelated to the crash. In particular, manual initd actions such as stop, start, restart, and status all work fine.

> No, that looks like fairly typical crash recovery to me: corrupt shared
> memory contents are expected and recovered from after a crash. 

That's reassuring. But if it's safe for us to immediately start the server back up, why did Postgres not automatically start the server up like it did the first time? I was assuming it was due to the presence of the corrupt memory segment, as that was the only difference in the logs, although I could be wrong. Automatic restart would have saved us a great deal of downtime: in the first case we had total recovery within 30 seconds, while in the second case we had many minutes of downtime while someone got paged, troubleshot the issue, and eventually decided to try starting the database back up.

At any rate, if it's safe, we can write a script to detect this failure mode and automatically restart, although it would be less error-prone if Postgres restarted automatically.
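
What I have in mind is minimal, something along these lines (an untested sketch; the service name is from the PGDG packaging and may differ):

    #!/bin/sh
    # run from cron every minute or so
    pg_isready -q
    case $? in
        2)  # no response at all: the postmaster is likely gone
            service postgresql-9.6 start
            echo "postgres down, restart attempted" | logger -t pg-watchdog
            ;;
        1)  # alive but rejecting connections (e.g. crash recovery
            # still in progress): leave it alone
            ;;
    esac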

> Hm, I supposed that Sherrylyn would've noticed any PANIC entries in
> the log.

No PANICs. The log lines I pasted were the only ones that looked relevant in the Postgres logs. I can try to dig through the application logs, but I was planning to wait until the next time this happens, since we should have core dumps fixed and that might make things easier.

> What extensions are installed, if any?

In the first database, the one without the corrupt memory segment and that restarted automatically: plpgsql and postgres_fdw.

In the second database, the one where the memory segment got corrupted and that didn't restart automatically: dblink, hstore, pg_trgm, pgstattuple, plpgsql, and tablefunc.

I forgot to mention that the queries that got killed were innocuous-looking SELECTs that completed successfully for me in less than a second when I ran them manually. In other words, the problem was not reproducible.

Sherrylyn

Re: What to do when dynamic shared memory control segment is corrupt

From: Tom Lane
Date:
Sherrylyn Branchaw <sbranchaw@gmail.com> writes:
>> Hm ... were these installations built with --enable-cassert?  If not,
>> an abort trap seems pretty odd.

> The packages are installed directly from the yum repos for RHEL. I'm not
> aware that --enable-cassert is being used, and we're certainly not
> installing from source.

OK, I'm pretty sure nobody builds production RPMs with --enable-cassert.
But your extensions (as listed below) don't include any C++ code, so
that still leaves us wondering where the abort trap came from.  A stack
trace would almost certainly help clear that up.

>> Those "incomplete data" messages are quite unexpected and disturbing.

> We're using the stock initd script from the yum repo, but I dug into this
> issue, and it looks like we're passing the path to the postmaster.pid as
> the $pidfile variable in our sysconfig file, meaning the initd script is
> managing the postmaster.pid file, and specifically is overwriting it with a
> single line containing just the pid. I'm not sure why it's set up like
> this, and I'm thinking we should change it, but it seems harmless and
> unrelated to the crash. In particular, manual initd actions such as stop,
> start, restart, and status all work fine.

This is bad; a normal postmaster.pid file contains half a dozen lines
besides the PID proper.  You might get away with this for now, but it'll
break pg_ctl as of v10 or so, and might confuse other external tools
sooner than that.  Still, it doesn't seem related to your crash problem.
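
For reference, on 9.6 the file should look roughly like this (values are
illustrative; line 7, the one the log message was trying to add, is the
shared memory key):

    138529                       line 1: postmaster PID
    /var/lib/pgsql/9.6/data      line 2: data directory
    1528214367                   line 3: start time (epoch)
    5432                         line 4: port
    /var/run/postgresql          line 5: socket directory
    *                            line 6: first listen_addresses entry
     1804289383  43122689        line 7: shared memory key and ID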

>> No, that looks like fairly typical crash recovery to me: corrupt shared
>> memory contents are expected and recovered from after a crash.

> That's reassuring. But if it's safe for us to immediately start the server
> back up, why did Postgres not automatically start the server up like it did
> the first time?

Yeah, I'd like to know that too.  The complaint about corrupt shared
memory may be just an unrelated red herring, or it might be a separate
effect of whatever the primary failure was ... but I think it was likely
not the direct cause of the failure-to-restart.  But we've got no real
evidence as to what that direct cause was.

> At any rate, if it's safe, we can write a script to detect this failure
> mode and automatically restart, although it would be less error-prone if
> Postgres restarted automatically.

I realize that you're most focused on less-downtime, but from my
perspective it'd be good to worry about collecting evidence as to
what happened exactly.  Capturing core files is a good start --- and
don't forget the possibility that there's more than one.  A plausible
guess as to why the system didn't restart is that the postmaster crashed
too, so we'd need to see its core to figure out why.
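
Once you do have a core, all we'd really need is a backtrace, along the
lines of

    # adjust the binary path to match your install; installing the
    # matching debuginfo package first makes the trace far more useful
    gdb /usr/pgsql-9.6/bin/postgres /path/to/core -ex bt -ex quit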

Anyway, I would not be afraid to try restarting the postmaster manually
if it died.  Maybe don't do that repeatedly without human intervention;
but PG is pretty robust against crashes.  We developers crash it all the
time, and we don't lose data.

            regards, tom lane


Re: What to do when dynamic shared memory control segment is corrupt

From: Alban Hertroys
Date:
> On 18 Jun 2018, at 17:34, Sherrylyn Branchaw <sbranchaw@gmail.com> wrote:
>
> In the other case, the logs recorded
>
> LOG:  all server processes terminated; reinitializing
> LOG:  dynamic shared memory control segment is corrupt
> LOG:  incomplete data in "postmaster.pid": found only 1 newlines while trying to add line 7
>
> In that case, the database did not restart on its own. It was 5 am on Sunday, so the on-call SRE just manually started the database up, and it appears to have been running fine since.

That rings a bell. Some versions of systemd apparently clean up shared memory belonging to a user when it detects the user logs out. ISTR that we had to disable that on our CentOS 7 server to stop crashes from happening.

More details here: https://wiki.postgresql.org/wiki/Systemd
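
The fix on our box amounted to something like this (check the wiki page rather than copying blindly):

    # /etc/systemd/logind.conf
    RemoveIPC=no

    # then restart logind so it takes effect
    systemctl restart systemd-logind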

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



Re: What to do when dynamic shared memory control segment is corrupt

From: Sherrylyn Branchaw
Date:
> Yeah, I'd like to know that too.  The complaint about corrupt shared
> memory may be just an unrelated red herring, or it might be a separate
> effect of whatever the primary failure was ... but I think it was likely
> not the direct cause of the failure-to-restart.

> Anyway, I would not be afraid to try restarting the postmaster manually
> if it died.  Maybe don't do that repeatedly without human intervention;
> but PG is pretty robust against crashes.  We developers crash it all the
> time, and we don't lose data.

Understood, and thanks. I was basing my concern on a message on the mailing lists suggesting that Postgres might fail to start up in the event of a corrupted memory segment. I would link to the message directly, but I keep getting backend server errors when I try to search for it today. At any rate, it looked like there was a chance that the failure to restart was a deliberate design choice, and I didn't want to ignore it if so. It's good to know that this is not the case.

> I realize that you're most focused on less-downtime, but from my
> perspective it'd be good to worry about collecting evidence as to
> what happened exactly.

Absolutely. I would love to know why this is happening too. However, our priorities have been set in part by a very tight deadline handed down from the C-levels to migrate to Aurora, so we have to focus our energies accordingly. I will be back with core files if this happens again before we're completely migrated over. Meanwhile, thank you for assuring me we have no current data corruption and that it's safe to restart next time without taking additional action to avoid or detect corruption.

Best,
Sherrylyn