Обсуждение: System in Recovery Mode

Поиск
Список
Период
Сортировка

System in Recovery Mode

От
"Aaron Bono"
Дата:
Our database was going along just fine this morning when, out of the blue, we started getting the following errors:

FATAL:  the database system is in recovery mode

We are using PostgreSQL 8.1.3 on CentOS.  I was the only person on the machine and had recently started a small batch that had just finished querying and updating the database.  This batch runs constantly and has not been changed in months.

What are the things that could be causing this?

Thanks!
Aaron

--
==================================================================
  Aaron Bono
  Aranya Software Technologies, Inc.
  http://www.aranya.com
  http://codeelixir.com
==================================================================

Re: System in Recovery Mode

От
"Aaron Bono"
Дата:
On Mon, Feb 25, 2008 at 9:42 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, 2008-02-25 at 08:11 -0600, Aaron Bono wrote:
> Our database was going along just fine this morning when, out of the
> blue, we started getting the following errors:
>
> FATAL:  the database system is in recovery mode
>
> We are using PostgreSQL 8.1.3 on CentOS.  I was the only person on the
> machine and had recently started a small batch that had just finished
> querying and updating the database.  This batch runs constantly and
> has not been changed in months.

Your database has crashed and is recovering. You should look at the
server log to find out when and why this happened, then either correct
the problem or report the bug.


Here is what I am seeing:

<2008-02-25 07:15:24 CST username 127.0.0.1>ERROR:  unrecognized node type: 4260493
<2008-02-25 07:15:25 CST [unknown] >LOG:  connection received: host=127.0.0.1 port=51295
<2008-02-25 07:15:25 CST username 127.0.0.1>LOG:  connection authorized: user=username database=my_db
<2008-02-25 07:15:25 CST username 127.0.0.1>LOG:  disconnection: session time: 0:00:00.04 user=username database=my_db host=127.0.0.1 port=51295
<2008-02-25 07:15:25 CST [unknown] >LOG:  connection received: host=127.0.0.1 port=51296
<2008-02-25 07:15:25 CST username 127.0.0.1>LOG:  connection authorized: user=username database=clockwork_db_prod
<2008-02-25 07:15:38 CST username 127.0.0.1>ERROR:  unrecognized node type: 4260493
<2008-02-25 07:15:49 CST  >LOG:  server process (PID 10876) was terminated by signal 11
<2008-02-25 07:15:49 CST  >LOG:  terminating any other active server processes
<2008-02-25 07:15:49 CST username 127.0.0.1>WARNING:  terminating connection because of crash of another server process
<2008-02-25 07:15:49 CST username 127.0.0.1>DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
<2008-02-25 07:15:49 CST username 127.0.0.1>HINT:  In a moment you should be able to reconnect to the database and repeat your command.
<2008-02-25 07:15:49 CST username 127.0.0.1>WARNING:  terminating connection because of crash of another server process

and it continues like this as it terminated all connections followed by:

<2008-02-25 07:15:49 CST username 127.0.0.1>CONTEXT:  PL/pgSQL function "get_branch_for_zip" line 61 at fetch
<2008-02-25 07:15:50 CST  >LOG:  all server processes terminated; reinitializing
<2008-02-25 07:15:50 CST  >LOG:  database system was interrupted at 2008-02-25 07:11:10 CST
<2008-02-25 07:15:50 CST  >LOG:  checkpoint record is at 14/9F015F54
<2008-02-25 07:15:50 CST  >LOG:  redo record is at 14/9F015F54; undo record is at 0/0; shutdown FALSE
<2008-02-25 07:15:50 CST  >LOG:  next transaction ID: 911288200; next OID: 90133045
<2008-02-25 07:15:50 CST  >LOG:  next MultiXactId: 27095; next MultiXactOffset: 66130
<2008-02-25 07:15:50 CST  >LOG:  database system was not properly shut down; automatic recovery in progress
<2008-02-25 07:15:50 CST  >LOG:  redo starts at 14/9F015F98
<2008-02-25 07:15:50 CST  >LOG:  record with zero length at 14/9F08C23C
<2008-02-25 07:15:50 CST  >LOG:  redo done at 14/9F08C214
<2008-02-25 07:15:50 CST  >LOG:  could not truncate directory "pg_multixact/members": apparent wraparound
<2008-02-25 07:15:50 CST  >LOG:  database system is ready
<2008-02-25 07:15:50 CST  >LOG:  transaction ID wrap limit is 1576224515, limited by database "another_db"
<2008-02-25 07:15:56 CST [unknown] >LOG:  connection received: host=127.0.0.1 port=51987
<2008-02-25 07:15:56 CST username 127.0.0.1>LOG:  connection authorized: user=onehour database=my_db
<2008-02-25 07:15:56 CST  >LOG:  server process (PID 14433) was terminated by signal 11
<2008-02-25 07:15:56 CST  >LOG:  terminating any other active server processes
<2008-02-25 07:15:56 CST  >LOG:  all server processes terminated; reinitializing
<2008-02-25 07:15:56 CST  >LOG:  database system was interrupted at 2008-02-25 07:15:50 CST
<2008-02-25 07:15:56 CST  >LOG:  checkpoint record is at 14/9F08C23C
<2008-02-25 07:15:56 CST  >LOG:  redo record is at 14/9F08C23C; undo record is at 0/0; shutdown TRUE
<2008-02-25 07:15:56 CST  >LOG:  next transaction ID: 911293548; next OID: 90133045
<2008-02-25 07:15:56 CST  >LOG:  next MultiXactId: 27095; next MultiXactOffset: 66130
<2008-02-25 07:15:56 CST  >LOG:  database system was not properly shut down; automatic recovery in progress
<2008-02-25 07:15:56 CST  >LOG:  record with zero length at 14/9F08C280
<2008-02-25 07:15:56 CST  >LOG:  redo is not required
<2008-02-25 07:15:56 CST  >LOG:  could not truncate directory "pg_multixact/members": apparent wraparound
<2008-02-25 07:15:56 CST  >LOG:  database system is ready
<2008-02-25 07:15:56 CST  >LOG:  transaction ID wrap limit is 1576224515, limited by database "another_db"

and it then continues with this.


--
==================================================================
  Aaron Bono
  Aranya Software Technologies, Inc.
  http://www.aranya.com
  http://codeelixir.com
==================================================================

Re: System in Recovery Mode

От
Simon Riggs
Дата:
On Mon, 2008-02-25 at 08:11 -0600, Aaron Bono wrote:
> Our database was going along just fine this morning when, out of the
> blue, we started getting the following errors:
>
> FATAL:  the database system is in recovery mode
>
> We are using PostgreSQL 8.1.3 on CentOS.  I was the only person on the
> machine and had recently started a small batch that had just finished
> querying and updating the database.  This batch runs constantly and
> has not been changed in months.

Your database has crashed and is recovering. You should look at the
server log to find out when and why this happened, then either correct
the problem or report the bug.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: System in Recovery Mode

От
"Scott Marlowe"
Дата:
On Mon, Feb 25, 2008 at 8:11 AM, Aaron Bono <postgresql@aranya.com> wrote:
> Our database was going along just fine this morning when, out of the blue,
> we started getting the following errors:
>
> FATAL:  the database system is in recovery mode
>
> We are using PostgreSQL 8.1.3 on CentOS.  I was the only person on the
> machine and had recently started a small batch that had just finished
> querying and updating the database.  This batch runs constantly and has not
> been changed in months.
>
> What are the things that could be causing this?

I'm not saying it definitely is the problem,  but you are missing 2
years of updates.  8.1.11 came out just last month, 8.1.3 came out 2
years ago.

Re: System in Recovery Mode

От
"Aaron Bono"
Дата:
On Mon, Feb 25, 2008 at 10:48 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, 2008-02-25 at 09:53 -0600, Aaron Bono wrote:

<2008-02-25 07:15:24 CST username 127.0.0.1>ERROR:  unrecognized node
type: 4260493
>

...

> <2008-02-25 07:15:49 CST username 127.0.0.1>CONTEXT:  PL/pgSQL
> function "get_branch_for_zip" line 61 at fetch

...

> <2008-02-25 07:15:56 CST  >LOG:  could not truncate directory
> "pg_multixact/members": apparent wraparound

Some interesting failures there.

Is this a modified Postgres server? Are you running externally supplied
C language functions? Have you had hardware problems?

This is NOT a modified PostgreSQL server.  We are not using any externally supplied C language functions or modules.  However, we did have to upgrade from what CPanel put on the server (it used version 7.x) and I don't know what CPanel has done.

We did have a hard drive problem (errors on the drive) in the past and they had to swap out a drive.  It is a RAID configuration so no backup/restore was necessary.  We do have intermittent problems with the server and believe it is related to Postgres.  Our SA assures us there are currently no system errors or other indications of hard drive problems.

We want to do an upgrade but are apprehensive as this is a production server.  Is there a place we can go to to get a good idea of how to perform an upgrade with as little interruption to service as possible?

Thanks,
Aaron


--
==================================================================
  Aaron Bono
  Aranya Software Technologies, Inc.
  http://www.aranya.com
  http://codeelixir.com
==================================================================

Re: System in Recovery Mode

От
Simon Riggs
Дата:
On Mon, 2008-02-25 at 09:53 -0600, Aaron Bono wrote:

<2008-02-25 07:15:24 CST username 127.0.0.1>ERROR:  unrecognized node
type: 4260493
>

...

> <2008-02-25 07:15:49 CST username 127.0.0.1>CONTEXT:  PL/pgSQL
> function "get_branch_for_zip" line 61 at fetch

...

> <2008-02-25 07:15:56 CST  >LOG:  could not truncate directory
> "pg_multixact/members": apparent wraparound

Some interesting failures there.

Is this a modified Postgres server? Are you running externally supplied
C language functions? Have you had hardware problems?

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com


Re: System in Recovery Mode

От
"Aaron Bono"
Дата:
On Mon, Feb 25, 2008 at 9:53 AM, Aaron Bono <postgresql@aranya.com> wrote:
On Mon, Feb 25, 2008 at 9:42 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, 2008-02-25 at 08:11 -0600, Aaron Bono wrote:

Now I am getting:

<2008-02-25 10:51:08 CST username 127.0.0.1>LOG:  unexpected EOF on client connection
<2008-02-25 10:51:08 CST username 127.0.0.1>LOG:  disconnection: session time: 3:17:39.14 user=username database=my_db host=127.0.0.1 port=12345

...

<2008-02-25 10:51:27 CST [unknown] >LOG:  connection received: host=127.0.0.1 port=43222
<2008-02-25 10:51:27 CST username 127.0.0.1>LOG:  connection authorized: user=username database=my_db
<2008-02-25 10:52:27 CST  >LOG:  autovacuum: processing database "postgres"
<2008-02-25 10:53:18 CST [unknown] >LOG:  connection received: host=127.0.0.1 port=43284
<2008-02-25 10:53:18 CST username 127.0.0.1>LOG:  connection authorized: user=username database=my_db
<2008-02-25 10:53:22 CST  >LOG:  server process (PID 17043) was terminated by signal 11
<2008-02-25 10:53:22 CST  >LOG:  terminating any other active server processes
<2008-02-25 10:53:22 CST username 127.0.0.1>WARNING:  terminating connection because of crash of another server process
<2008-02-25 10:53:22 CST username 127.0.0.1>DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly
corrupted shared memory.
<2008-02-25 10:53:22 CST username 127.0.0.1>HINT:  In a moment you should be able to reconnect to the database and repeat your command.
<2008-02-25 10:53:22 CST username 127.0.0.1>CONTEXT:  PL/pgSQL function "get_branch_for_zip" line 61 at fetch
<2008-02-25 10:53:22 CST username 127.0.0.1>WARNING:  terminating connection because of crash of another server process
<2008-02-25 10:53:22 CST username 127.0.0.1>DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly
 corrupted shared memory.
<2008-02-25 10:53:22 CST username 127.0.0.1>HINT:  In a moment you should be able to reconnect to the database and repeat your command.
<2008-02-25 10:53:22 CST username 127.0.0.1>WARNING:  terminating connection because of crash of another server process
<2008-02-25 10:53:22 CST username 127.0.0.1>DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly
corrupted shared memory.
<2008-02-25 10:53:22 CST username 127.0.0.1>HINT:  In a moment you should be able to reconnect to the database and repeat your command.
<2008-02-25 10:53:22 CST username 127.0.0.1>WARNING:  terminating connection because of crash of another server process
<2008-02-25 10:53:22 CST username 127.0.0.1>DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly
corrupted shared memory.
<2008-02-25 10:53:22 CST username 127.0.0.1>HINT:  In a moment you should be able to reconnect to the database and repeat your command.
<2008-02-25 10:53:22 CST username 127.0.0.1>WARNING:  terminating connection because of crash of another server process
<2008-02-25 10:53:22 CST username 127.0.0.1>DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly
 corrupted shared memory.
<2008-02-25 10:53:22 CST username 127.0.0.1>HINT:  In a moment you should be able to reconnect to the database and repeat your command.
<2008-02-25 10:53:22 CST  >LOG:  all server processes terminated; reinitializing
<2008-02-25 10:53:22 CST  >LOG:  database system was interrupted at 2008-02-25 10:50:59 CST
<2008-02-25 10:53:22 CST  >LOG:  checkpoint record is at 14/9F8126D8
<2008-02-25 10:53:22 CST  >LOG:  redo record is at 14/9F8126D8; undo record is at 0/0; shutdown FALSE
<2008-02-25 10:53:22 CST  >LOG:  next transaction ID: 911531009; next OID: 90141237
<2008-02-25 10:53:22 CST  >LOG:  next MultiXactId: 27095; next MultiXactOffset: 66130
<2008-02-25 10:53:22 CST  >LOG:  database system was not properly shut down; automatic recovery in progress
<2008-02-25 10:53:22 CST  >LOG:  redo starts at 14/9F81271C
<2008-02-25 10:53:22 CST  >LOG:  record with zero length at 14/9F826D68
<2008-02-25 10:53:22 CST  >LOG:  redo done at 14/9F826D40
<2008-02-25 10:53:22 CST  >LOG:  could not truncate directory "pg_multixact/members": apparent wraparound
<2008-02-25 10:53:22 CST  >LOG:  database system is ready
<2008-02-25 10:53:22 CST  >LOG:  transaction ID wrap limit is 1576224515, limited by database "another_db"
<2008-02-25 10:53:23 CST [unknown] >LOG:  connection received: host=127.0.0.1 port=43289
<2008-02-25 10:53:23 CST username 127.0.0.1>LOG:  connection authorized: user=username database=my_db


 

> Our database was going along just fine this morning when, out of the
> blue, we started getting the following errors:
>
> FATAL:  the database system is in recovery mode
>
> We are using PostgreSQL 8.1.3 on CentOS.  I was the only person on the
> machine and had recently started a small batch that had just finished
> querying and updating the database.  This batch runs constantly and
> has not been changed in months.

Your database has crashed and is recovering. You should look at the
server log to find out when and why this happened, then either correct
the problem or report the bug.


Here is what I am seeing:

<2008-02-25 07:15:24 CST username 127.0.0.1>ERROR:  unrecognized node type: 4260493
<2008-02-25 07:15:25 CST [unknown] >LOG:  connection received: host=127.0.0.1 port=51295
<2008-02-25 07:15:25 CST username 127.0.0.1>LOG:  connection authorized: user=username database=my_db
<2008-02-25 07:15:25 CST username 127.0.0.1>LOG:  disconnection: session time: 0:00:00.04 user=username database=my_db host=127.0.0.1 port=51295
<2008-02-25 07:15:25 CST [unknown] >LOG:  connection received: host=127.0.0.1 port=51296
<2008-02-25 07:15:25 CST username 127.0.0.1>LOG:  connection authorized: user=username database=clockwork_db_prod
<2008-02-25 07:15:38 CST username 127.0.0.1>ERROR:  unrecognized node type: 4260493
<2008-02-25 07:15:49 CST  >LOG:  server process (PID 10876) was terminated by signal 11
<2008-02-25 07:15:49 CST  >LOG:  terminating any other active server processes
<2008-02-25 07:15:49 CST username 127.0.0.1>WARNING:  terminating connection because of crash of another server process
<2008-02-25 07:15:49 CST username 127.0.0.1>DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
<2008-02-25 07:15:49 CST username 127.0.0.1>HINT:  In a moment you should be able to reconnect to the database and repeat your command.
<2008-02-25 07:15:49 CST username 127.0.0.1>WARNING:  terminating connection because of crash of another server process

and it continues like this as it terminated all connections followed by:

<2008-02-25 07:15:49 CST username 127.0.0.1>CONTEXT:  PL/pgSQL function "get_branch_for_zip" line 61 at fetch
<2008-02-25 07:15:50 CST  >LOG:  all server processes terminated; reinitializing
<2008-02-25 07:15:50 CST  >LOG:  database system was interrupted at 2008-02-25 07:11:10 CST
<2008-02-25 07:15:50 CST  >LOG:  checkpoint record is at 14/9F015F54
<2008-02-25 07:15:50 CST  >LOG:  redo record is at 14/9F015F54; undo record is at 0/0; shutdown FALSE
<2008-02-25 07:15:50 CST  >LOG:  next transaction ID: 911288200; next OID: 90133045
<2008-02-25 07:15:50 CST  >LOG:  next MultiXactId: 27095; next MultiXactOffset: 66130
<2008-02-25 07:15:50 CST  >LOG:  database system was not properly shut down; automatic recovery in progress
<2008-02-25 07:15:50 CST  >LOG:  redo starts at 14/9F015F98
<2008-02-25 07:15:50 CST  >LOG:  record with zero length at 14/9F08C23C
<2008-02-25 07:15:50 CST  >LOG:  redo done at 14/9F08C214
<2008-02-25 07:15:50 CST  >LOG:  could not truncate directory "pg_multixact/members": apparent wraparound
<2008-02-25 07:15:50 CST  >LOG:  database system is ready
<2008-02-25 07:15:50 CST  >LOG:  transaction ID wrap limit is 1576224515, limited by database "another_db"
<2008-02-25 07:15:56 CST [unknown] >LOG:  connection received: host=127.0.0.1 port=51987
<2008-02-25 07:15:56 CST username 127.0.0.1>LOG:  connection authorized: user=onehour database=my_db
<2008-02-25 07:15:56 CST  >LOG:  server process (PID 14433) was terminated by signal 11
<2008-02-25 07:15:56 CST  >LOG:  terminating any other active server processes
<2008-02-25 07:15:56 CST  >LOG:  all server processes terminated; reinitializing
<2008-02-25 07:15:56 CST  >LOG:  database system was interrupted at 2008-02-25 07:15:50 CST
<2008-02-25 07:15:56 CST  >LOG:  checkpoint record is at 14/9F08C23C
<2008-02-25 07:15:56 CST  >LOG:  redo record is at 14/9F08C23C; undo record is at 0/0; shutdown TRUE
<2008-02-25 07:15:56 CST  >LOG:  next transaction ID: 911293548; next OID: 90133045
<2008-02-25 07:15:56 CST  >LOG:  next MultiXactId: 27095; next MultiXactOffset: 66130
<2008-02-25 07:15:56 CST  >LOG:  database system was not properly shut down; automatic recovery in progress
<2008-02-25 07:15:56 CST  >LOG:  record with zero length at 14/9F08C280
<2008-02-25 07:15:56 CST  >LOG:  redo is not required
<2008-02-25 07:15:56 CST  >LOG:  could not truncate directory "pg_multixact/members": apparent wraparound
<2008-02-25 07:15:56 CST  >LOG:  database system is ready
<2008-02-25 07:15:56 CST  >LOG:  transaction ID wrap limit is 1576224515, limited by database "another_db"

and it then continues with this.





--
==================================================================
  Aaron Bono
  Aranya Software Technologies, Inc.
  http://www.aranya.com
  http://codeelixir.com
==================================================================

Re: System in Recovery Mode

От
Michael Monnerie
Дата:
On Montag, 25. Februar 2008 Aaron Bono wrote:
> We did have a hard drive problem (errors on the drive) in the past
> and they had to swap out a drive.  It is a RAID configuration so no
> backup/restore was necessary.

Even with a RAID system, a broken hard drive that doesn't show that it's
broken can cause severe data destroying: We've had this several times
already, with different RAID and drive combinations, SATA and SCSI,
doesn't matter. The symptom was that you write data to the drive, it
says OK, but upon reading you receive other data than you wrote. Even
RAID 1/5/6 doesn't help you here.

Doesn't mean you have the same, but from what you wrote I just wanted to
inform you that it's a fact that a single broken drive can destroy even
a RAIDs data very nasty.

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0676/846 914 666                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: www.keyserver.net                   Key-ID: 1C1209B4

Вложения

Re: System in Recovery Mode

От
"Aaron Bono"
Дата:
On Mon, Feb 25, 2008 at 2:48 PM, Michael Monnerie <michael.monnerie@it-management.at> wrote:
On Montag, 25. Februar 2008 Aaron Bono wrote:
> We did have a hard drive problem (errors on the drive) in the past
> and they had to swap out a drive.  It is a RAID configuration so no
> backup/restore was necessary.

Even with a RAID system, a broken hard drive that doesn't show that it's
broken can cause severe data destroying: We've had this several times
already, with different RAID and drive combinations, SATA and SCSI,
doesn't matter. The symptom was that you write data to the drive, it
says OK, but upon reading you receive other data than you wrote. Even
RAID 1/5/6 doesn't help you here.

Doesn't mean you have the same, but from what you wrote I just wanted to
inform you that it's a fact that a single broken drive can destroy even
a RAIDs data very nasty.

mfg zmi


If I do a pg_dump from this potentially "bad" database and then use the output to do a restore to a new DB on another server, aside from having some data corrupted, would we be able to recover from this problem?  If not, what could I do to "fix" the data so the database stops having issues?

Thanks!
Aaron

--
==================================================================
  Aaron Bono
  Aranya Software Technologies, Inc.
  http://www.aranya.com
  http://codeelixir.com
==================================================================