Обсуждение: PostgreSQL 9.1 "database system was interrupted; last known up at"

Поиск
Список
Период
Сортировка

PostgreSQL 9.1 "database system was interrupted; last known up at"

От
"Boylan, Ross"
Дата:
I had to power cycle my system because it became unresponsive.  Now PosgtreSQL will not start.  I would like advice
abouthow to proceed; I think pg_resetxlog is my next step.  I have made a copy of the current database files. 

<log file="postgresql-9.1-main.log">
2015-05-25 10:44:21 PDT LOG:  database system was interrupted; last known up at 2015-05-22 09:22:25 PDT
2015-05-25 10:44:21 PDT LOG:  incomplete startup packet
2015-05-25 10:44:21 PDT FATAL:  could not open file "/etc/ssl/certs/ssl-cert-snakeoil.pem": Permission denied
2015-05-25 10:44:21 PDT LOG:  startup process (PID 5180) exited with exit code 1
2015-05-25 10:44:21 PDT LOG:  aborting startup due to startup process failure
</log>

I am running PostgreSQL 9.1 on Debian wheezy aka 7 aka oldstable.
Installed via the Debian package.  I think I accepted the defaults, and have not changed the configuration since.
Linux  3.2.0, stock Debian kernel, amd64.
Connect via emacs sql-postgresql or psql.  An init script controls startup.

When the system became unresponsive I was able to ssh in; the X process had gone crazy and could not be killed.  Most
keyfile systems had been remounted read-only, and many commands (includiing shutdown and telinit) produced errors,
oftenI/O errors, when run.  The last kern.log entries showed a process being killed.  There was quite a lot of inode
deletionand log replaying on restart.  The log message above was from just after the restart. 

I have no backups*, but could recreate the database in the worst case.  I haven't done anything with the database in at
leasta week, I think, and so if I could get back the state as of 5/22 that would be fine. 

Filesystem is ext3 on dm-crypt on LVM.

The permission error on the snakeoil cert is weird, since it is readable by all.  I'm guessing it's a side effect of
theearlier problems. 

Thanks for any guidance.

Ross Boylan

(*) It would be a bit of mess even if I did, since I use bacula with postgres as the database.  I think I was also
dumpingthe database when I had backup going. 

Re: PostgreSQL 9.1 "database system was interrupted; last known up at"

От
Tom Lane
Дата:
"Boylan, Ross" <Ross.Boylan@ucsf.edu> writes:
> I had to power cycle my system because it became unresponsive.  Now PosgtreSQL will not start.  I would like advice
abouthow to proceed; I think pg_resetxlog is my next step.  I have made a copy of the current database files. 
> <log file="postgresql-9.1-main.log">
> 2015-05-25 10:44:21 PDT LOG:  database system was interrupted; last known up at 2015-05-22 09:22:25 PDT
> 2015-05-25 10:44:21 PDT LOG:  incomplete startup packet
> 2015-05-25 10:44:21 PDT FATAL:  could not open file "/etc/ssl/certs/ssl-cert-snakeoil.pem": Permission denied

You did not say, but I'm betting you recently updated to last week's
release, 9.1.16?

If so, this is an unforeseen side-effect of a patch to forcibly fsync
everything in the PG data directory after a crash, see
http://www.postgresql.org/message-id/flat/20150523172627.GA24277@msg.df7cb.de

We're debating how to fix it for real, but in the meantime the easiest
workaround is to change the permissions on that file so it's writable by
Postgres.  (Yeah, I know, ick.  Make a note to undo that after we fix
the fsync'ing logic.)

            regards, tom lane


Re: PostgreSQL 9.1 "database system was interrupted; last known up at"

От
"Boylan, Ross"
Дата:
Thanks for your reply.
1. Yes, I updated to 9.1.16.
2. Changing the permission of the snakeoil file makes produced a similar error about the private key.
Changing the permission of the private key produced an error that the key no longer had valid permissions.
3. So is the problem reading the certificate my only real problem?

Details:
1. Since security backports do not necessarily include all upstream changes, here's the changelog.Debian entry for the
lastupdate: 

postgresql-9.1 (9.1.16-0+deb7u1) wheezy-security; urgency=medium

  * New upstream version.

    + Avoid possible crash when client disconnects just before the
      authentication timeout expires (Benkocs Norbert Attila)

      If the timeout interrupt fired partway through the session shutdown
      sequence, SSL-related state would be freed twice, typically causing a
      crash and hence denial of service to other sessions.  Experimentation
      shows that an unauthenticated remote attacker could trigger the bug
      somewhat consistently, hence treat as security issue. (CVE-2015-3165)

    + Improve detection of system-call failures (Noah Misch)

      Our replacement implementation of snprintf() failed to check for errors
      reported by the underlying system library calls; the main case that
      might be missed is out-of-memory situations. In the worst case this
      might lead to information exposure, due to our code assuming that a
      buffer had been overwritten when it hadn't been. Also, there were a few
      places in which security-relevant calls of other system library
      functions did not check for failure.

      It remains possible that some calls of the *printf() family of functions
      are vulnerable to information disclosure if an out-of-memory error
      occurs at just the wrong time.  We judge the risk to not be large, but
      will continue analysis in this area. (CVE-2015-3166)

    + In contrib/pgcrypto, uniformly report decryption failures as Wrong key
      or corrupt data (Noah Misch)

      Previously, some cases of decryption with an incorrect key could report
      other error message texts.  It has been shown that such variance in
      error reports can aid attackers in recovering keys from other systems.
      While it's unknown whether pgcrypto's specific behaviors are likewise
      exploitable, it seems better to avoid the risk by using a
      one-size-fits-all message. (CVE-2015-3167)

  * Repository moved to git, update Vcs headers.

 -- Christoph Berg <christoph.berg@credativ.de>  Thu, 21 May 2015 16:35:28 +0200

2. Note Debian has a wrapper to support running multiple clusters of postgres at multiple versions of the database.
So the command below has the version (9.1) and cluster name (main) as arguments.
root@tempserver:/etc# date; pg_ctlcluster 9.1 main start
Mon May 25 21:25:29 PDT 2015
The PostgreSQL server failed to start. Please check the log output:
2015-05-25 21:25:29 PDT LOG:  database system was interrupted; last known up at 2015-05-22 09:22:25 PDT
2015-05-25 21:25:30 PDT LOG:  incomplete startup packet
2015-05-25 21:25:30 PDT FATAL:  could not open file "/etc/ssl/private/ssl-cert-snakeoil.key": Permission denied
2015-05-25 21:25:30 PDT LOG:  startup process (PID 28524) exited with exit code 1
2015-05-25 21:25:30 PDT LOG:  aborting startup due to startup process failure
root@tempserver:/etc# ls -l /etc/ssl/certs/ssl-cert-snakeoil.pem  # previously changed group and group write
-rw-rw-r-- 1 root postgres 1005 Apr 22  2013 /etc/ssl/certs/ssl-cert-snakeoil.pem
root@tempserver:/etc# ls -ld /etc/ssl/certs
drwxr-xr-x 3 root root 22528 Oct 18  2014 /etc/ssl/certs
# Realize that the "could not open file" is different from original error.  Try to adjust.
root@tempserver:/etc# ls -l /etc/ssl/private/ssl-cert-snakeoil.key
-rw-r----- 1 root ssl-cert 1704 Apr 22  2013 /etc/ssl/private/ssl-cert-snakeoil.key
root@tempserver:/etc# grep ssl-cert /etc/group
ssl-cert:x:118:postgres
root@tempserver:/etc# date; chmod g+w /etc/ssl/private/ssl-cert-snakeoil.key
Mon May 25 21:37:45 PDT 2015
root@tempserver:/etc# ls -l /etc/ssl/private/ssl-cert-snakeoil.key
-rw-rw---- 1 root ssl-cert 1704 Apr 22  2013 /etc/ssl/private/ssl-cert-snakeoil.key
root@tempserver:/etc# date; pg_ctlcluster 9.1 main start
Mon May 25 21:37:58 PDT 2015
The PostgreSQL server failed to start. Please check the log output:
2015-05-25 21:37:58 PDT FATAL:  private key file "server.key" has group or world access
2015-05-25 21:37:58 PDT DETAIL:  File must be owned by the database user or root, must have no write permission for
"group",and must have no permissions for "oth\ 
er".


Ross


________________________________________
From: Tom Lane [tgl@sss.pgh.pa.us]
Sent: Monday, May 25, 2015 7:15 PM
To: Boylan, Ross
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] PostgreSQL 9.1 "database system was interrupted; last known up at"

"Boylan, Ross" <Ross.Boylan@ucsf.edu> writes:
> I had to power cycle my system because it became unresponsive.  Now PosgtreSQL will not start.  I would like advice
abouthow to proceed; I think pg_resetxlog is my next step.  I have made a copy of the current database files. 
> <log file="postgresql-9.1-main.log">
> 2015-05-25 10:44:21 PDT LOG:  database system was interrupted; last known up at 2015-05-22 09:22:25 PDT
> 2015-05-25 10:44:21 PDT LOG:  incomplete startup packet
> 2015-05-25 10:44:21 PDT FATAL:  could not open file "/etc/ssl/certs/ssl-cert-snakeoil.pem": Permission denied

You did not say, but I'm betting you recently updated to last week's
release, 9.1.16?

If so, this is an unforeseen side-effect of a patch to forcibly fsync
everything in the PG data directory after a crash, see
http://www.postgresql.org/message-id/flat/20150523172627.GA24277@msg.df7cb.de

We're debating how to fix it for real, but in the meantime the easiest
workaround is to change the permissions on that file so it's writable by
Postgres.  (Yeah, I know, ick.  Make a note to undo that after we fix
the fsync'ing logic.)

                        regards, tom lane


Re: PostgreSQL 9.1 "database system was interrupted; last known up at"

От
Albe Laurenz
Дата:
Ross Boylan wrote:
> I had to power cycle my system because it became unresponsive.  Now PosgtreSQL will not start.  I
> would like advice about how to proceed; I think pg_resetxlog is my next step.  I have made a copy of
> the current database files.
> 
> <log file="postgresql-9.1-main.log">
> 2015-05-25 10:44:21 PDT LOG:  database system was interrupted; last known up at 2015-05-22 09:22:25 PDT
> 2015-05-25 10:44:21 PDT LOG:  incomplete startup packet
> 2015-05-25 10:44:21 PDT FATAL:  could not open file "/etc/ssl/certs/ssl-cert-snakeoil.pem": Permission denied
> 2015-05-25 10:44:21 PDT LOG:  startup process (PID 5180) exited with exit code 1
> 2015-05-25 10:44:21 PDT LOG:  aborting startup due to startup process failure
> </log>
> 
> I am running PostgreSQL 9.1 on Debian wheezy aka 7 aka oldstable.
> Installed via the Debian package.  I think I accepted the defaults, and have not changed the
> configuration since.
> Linux  3.2.0, stock Debian kernel, amd64.
> Connect via emacs sql-postgresql or psql.  An init script controls startup.
> 
> When the system became unresponsive I was able to ssh in; the X process had gone crazy and could not
> be killed.  Most key file systems had been remounted read-only, and many commands (includiing shutdown
> and telinit) produced errors, often I/O errors, when run.  The last kern.log entries showed a process
> being killed.  There was quite a lot of inode deletion and log replaying on restart.  The log message
> above was from just after the restart.
> 
> I have no backups*, but could recreate the database in the worst case.  I haven't done anything with
> the database in at least a week, I think, and so if I could get back the state as of 5/22 that would
> be fine.
> 
> Filesystem is ext3 on dm-crypt on LVM.
> 
> The permission error on the snakeoil cert is weird, since it is readable by all.  I'm guessing it's a
> side effect of the earlier problems.

You didn't say which exact version of PostgreSQL you are running, but I bet you it is 9.1.16
and you are hitting the "Fsync Permissions Bug" introduced with that release:
https://wiki.postgresql.org/wiki/May_2015_Fsync_Permissions_Bug

There are plans to release a fix for that shortly.
A better workaround than the one specified in the current version of the Wiki page
might be to replace the symbolic link with a copy and change ownership of the files to "postgres".

A restore wouldn't help you in this case, but since you probably felt the pain,
please implement a backup strategy.  It may come handy some day if recreating the database
is not a nice option.

Yours,
Laurenz Albe

Re: PostgreSQL 9.1 "database system was interrupted; last known up at"

От
"Boylan, Ross"
Дата:
Mostly following the instructions at
https://wiki.postgresql.org/wiki/May_2015_Fsync_Permissions_Bug#I.27ve_hit_this_bug_and_I_can.27t_restart_Postgres._What_do_I_do.3F
I was able to restart the postgres.
<log>
2015-05-30 13:45:38 PDT LOG:  database system was interrupted; last known up at 2015-05-22 09:22:25 PDT
2015-05-30 13:45:38 PDT LOG:  incomplete startup packet
2015-05-30 13:45:39 PDT LOG:  database system was not properly shut down; automatic recovery in progress
2015-05-30 13:45:39 PDT LOG:  record with zero length at 0/41856D18
2015-05-30 13:45:39 PDT LOG:  redo is not required
2015-05-30 13:45:39 PDT LOG:  database system is ready to accept connections
2015-05-30 13:45:39 PDT LOG:  autovacuum launcher started
</log>

I actually tried first renaming server.{crt,key} and copying files as suggested on the wiki.  I did a chown just on the
newserver.{crt,key}.  Postgres gave the same permission error on my renamed links, so I deleted them.  Then
/etc/init.d/postgresqlstart 
succeeded.

Thanks to Albe, Tom, and those contributing to the wiki for your help.

Ross Boylan
________________________________________
From: Albe Laurenz [laurenz.albe@wien.gv.at]
Sent: Tuesday, May 26, 2015 2:12 AM
To: Boylan, Ross; pgsql-admin@postgresql.org
Subject: RE: PostgreSQL 9.1 "database system was interrupted; last known up at"

Ross Boylan wrote:
> I had to power cycle my system because it became unresponsive.  Now PosgtreSQL will not start.  I
> would like advice about how to proceed; I think pg_resetxlog is my next step.  I have made a copy of
> the current database files.
>
> <log file="postgresql-9.1-main.log">
> 2015-05-25 10:44:21 PDT LOG:  database system was interrupted; last known up at 2015-05-22 09:22:25 PDT
> 2015-05-25 10:44:21 PDT LOG:  incomplete startup packet
> 2015-05-25 10:44:21 PDT FATAL:  could not open file "/etc/ssl/certs/ssl-cert-snakeoil.pem": Permission denied
> 2015-05-25 10:44:21 PDT LOG:  startup process (PID 5180) exited with exit code 1
> 2015-05-25 10:44:21 PDT LOG:  aborting startup due to startup process failure
> </log>
>
> I am running PostgreSQL 9.1 on Debian wheezy aka 7 aka oldstable.
> Installed via the Debian package.  I think I accepted the defaults, and have not changed the
> configuration since.
> Linux  3.2.0, stock Debian kernel, amd64.
> Connect via emacs sql-postgresql or psql.  An init script controls startup.
>
> When the system became unresponsive I was able to ssh in; the X process had gone crazy and could not
> be killed.  Most key file systems had been remounted read-only, and many commands (includiing shutdown
> and telinit) produced errors, often I/O errors, when run.  The last kern.log entries showed a process
> being killed.  There was quite a lot of inode deletion and log replaying on restart.  The log message
> above was from just after the restart.
>
> I have no backups*, but could recreate the database in the worst case.  I haven't done anything with
> the database in at least a week, I think, and so if I could get back the state as of 5/22 that would
> be fine.
>
> Filesystem is ext3 on dm-crypt on LVM.
>
> The permission error on the snakeoil cert is weird, since it is readable by all.  I'm guessing it's a
> side effect of the earlier problems.

You didn't say which exact version of PostgreSQL you are running, but I bet you it is 9.1.16
and you are hitting the "Fsync Permissions Bug" introduced with that release:
https://wiki.postgresql.org/wiki/May_2015_Fsync_Permissions_Bug

There are plans to release a fix for that shortly.
A better workaround than the one specified in the current version of the Wiki page
might be to replace the symbolic link with a copy and change ownership of the files to "postgres".

A restore wouldn't help you in this case, but since you probably felt the pain,
please implement a backup strategy.  It may come handy some day if recreating the database
is not a nice option.

Yours,
Laurenz Albe