Thread: Possible database corruption - urgent

Possible database corruption - urgent

From
"Benjamin Krajmalnik"
Date:

I have a situation where pg_xlog started growing until it filled up the disk drive.

I got alerted to the error and started investigating.

Checked the logs and I am seeing the following entry repeatedly:

 

2013-01-07 01:49:12 GMT ERROR:  could not open file "base/16748/181979366_fsm": No such file or directory

2013-01-07 01:49:12 GMT CONTEXT:  writing block 1 of relation base/16748/181979366_fsm

2013-01-07 01:49:12 GMT WARNING:  could not write block 1 of base/16748/181979366_fsm

 

I checked the actual file system, and that file is indeed missing.  181979366 exists.
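A minimal sketch of how the repeated entries could be summarised, assuming the log format shown above and the default-tablespace layout `base/<database OID>/<filenode>_<fork>`. The helper names `missing_relation_files` and `split_fork_path` are made up for this illustration and are not part of PostgreSQL.

```python
import re
from collections import Counter

# Matches the ERROR line format shown in the log excerpt above.
MISSING_FILE = re.compile(
    r'ERROR:\s+could not open file "([^"]+)": No such file or directory'
)

def missing_relation_files(log_lines):
    """Count how often each missing file path appears in ERROR lines."""
    counts = Counter()
    for line in log_lines:
        match = MISSING_FILE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def split_fork_path(path):
    """Split 'base/<db oid>/<filenode>_<fork>' into (db oid, filenode, fork).

    Assumption: only this simple layout is handled; paths in other
    tablespaces or with segment suffixes are out of scope for the sketch.
    """
    _, db_oid, name = path.split("/")
    filenode, _, fork = name.partition("_")
    return int(db_oid), int(filenode), fork or "main"

sample = [
    '2013-01-07 01:49:12 GMT ERROR:  could not open file "base/16748/181979366_fsm": No such file or directory',
    '2013-01-07 01:49:12 GMT CONTEXT:  writing block 1 of relation base/16748/181979366_fsm',
    '2013-01-07 01:49:12 GMT WARNING:  could not write block 1 of base/16748/181979366_fsm',
]

found = missing_relation_files(sample)
parts = [split_fork_path(path) for path in found]
print(dict(found), parts)
```

The `_fsm` suffix marks the free space map fork of the relation whose main file is 181979366, which is why the main file can exist while the fork is missing.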

Is there a way to get the system back up and running?

I stopped the postmaster and am moving the pg_xlog directory to a partition that has room left, but I still need to resolve this missing file problem.

Re: Possible database corruption - urgent

From
"Benjamin Krajmalnik"
Date:

I forgot to mention: this is PostgreSQL 9.0, my apologies.

Can I just recreate the file using touch so that it exists, and then restart PostgreSQL?

The system core dumped and was attempting to go into recovery mode.

2013-01-07 01:49:12 GMT ERROR:  could not open file "base/16748/181979366_fsm": No such file or directory

2013-01-07 01:49:12 GMT CONTEXT:  writing block 1 of relation base/16748/181979366_fsm

2013-01-07 01:49:12 GMT WARNING:  could not write block 1 of base/16748/181979366_fsm

.

.

.

 

2013-01-07 01:49:12 GMT ERROR:  could not open file "base/16748/181979366_fsm": No such file or directory

2013-01-07 01:49:12 GMT CONTEXT:  writing block 1 of relation base/16748/181979366_fsm

2013-01-07 01:49:12 GMT WARNING:  could not write block 1 of base/16748/181979366_fsm

 

 

From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Benjamin Krajmalnik
Sent: Monday, January 07, 2013 2:22 PM
To: pgsql-admin@postgresql.org
Subject: [ADMIN] Possible database corruption - urgent

 

<snip>

 

 

Re: Possible database corruption - urgent

From
"Benjamin Krajmalnik"
Date:

Sorry for the cut and paste error.

This is the log entry when the pg_xlog partition ran out of space:

 

2013-01-07 20:50:22 GMT [local]PANIC:  could not write to file "pg_xlog/xlogtemp.49680": No space left on device

2013-01-07 20:50:22 GMT [local]STATEMENT:  INSERT INTO tbltmptests (testhash, testtime, statusid, replytxt, replyval, groupid) V

2013-01-07 20:50:23 GMT LOG:  server process (PID 49680) was terminated by signal 6: Abort trap

2013-01-07 20:50:23 GMT LOG:  terminating any other active server processes

2013-01-07 20:50:23 GMT [local]WARNING:  terminating connection because of crash of another server process

2013-01-07 20:50:23 GMT [local]DETAIL:  The postmaster has commanded this server process to roll back the current transaction an

2013-01-07 20:50:23 GMT [local]HINT:  In a moment you should be able to reconnect to the database and repeat your command.

.

.

.

2013-01-07 20:50:23 GMT [local]FATAL:  the database system is in recovery mode

2013-01-07 20:50:23 GMT LOG:  all server processes terminated; reinitializing

2013-01-07 20:50:24 GMT LOG:  database system was interrupted; last known up at 2013-01-07 00:31:02 GMT

2013-01-07 20:50:24 GMT LOG:  database system was not properly shut down; automatic recovery in progress

2013-01-07 20:50:24 GMT LOG:  consistent recovery state reached at 52F/8CE57490

2013-01-07 20:50:24 GMT LOG:  redo starts at 52F/7BABC118

2013-01-07 20:50:38 GMT [local]FATAL:  the database system is in recovery mode

2013-01-07 20:50:53 GMT [local]FATAL:  the database system is in recovery mode

2013-01-07 20:51:08 GMT [local]FATAL:  the database system is in recovery mode

2013-01-07 20:51:24 GMT [local]FATAL:  the database system is in recovery mode

2013-01-07 20:51:39 GMT [local]FATAL:  the database system is in recovery mode

2013-01-07 20:51:54 GMT [local]FATAL:  the database system is in recovery mode
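As a rough illustration of how to read the two WAL locations in the log above, the sketch below converts them to byte positions and subtracts them. It assumes the pre-9.3 location format `<xlogid>/<xrecoff>` in hexadecimal, with each xlogid covering 255 segments of 16 MiB; the function names are invented for this example. The result is only the distance between the redo start and the consistency point, not the total amount of WAL the server had to replay.

```python
# Assumption: PostgreSQL releases before 9.3 print a WAL location as
# "<xlogid>/<xrecoff>" in hex, and each xlogid spans 255 segments of 16 MiB.
XLOG_FILE_SIZE = 255 * 16 * 1024 * 1024  # 0xFF000000 bytes per xlogid

def lsn_to_bytes(lsn):
    """Convert an 'xlogid/xrecoff' string into an absolute byte position."""
    xlogid, xrecoff = (int(part, 16) for part in lsn.split("/"))
    return xlogid * XLOG_FILE_SIZE + xrecoff

def wal_distance(start, end):
    """Number of WAL bytes between two locations."""
    return lsn_to_bytes(end) - lsn_to_bytes(start)

redo_start = "52F/7BABC118"   # "redo starts at" in the log
consistent = "52F/8CE57490"   # "consistent recovery state reached at"
distance = wal_distance(redo_start, consistent)
print(f"{distance} bytes, about {distance / 2**20:.1f} MiB")
```

For these two locations the distance works out to 288994168 bytes, roughly 276 MiB.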

 

From: Benjamin Krajmalnik
Sent: Monday, January 07, 2013 2:31 PM
To: Benjamin Krajmalnik; pgsql-admin@postgresql.org
Subject: RE: [ADMIN] Possible database corruption - urgent

 

<snip>

 

 

Re: Possible database corruption - urgent

From
Walter Hurry
Date:
On Mon, 07 Jan 2013 14:35:48 -0700, Benjamin Krajmalnik wrote:

<snip>

What do you think you will gain by adding "urgent" to your subject line?
What do you think you will gain by posting in HTML?

Re: Possible database corruption

From
Craig Ringer
Date:
On 01/08/2013 05:22 AM, Benjamin Krajmalnik wrote:

> I have a situation where pg_xlog started growing until it filled up the disk drive.

This should not ever cause corruption. If it has, there's a bug at work.

A crash is reasonable (albeit undesirable; it'd be better to just report errors on connections) - but database corruption is not.

Before doing ANYTHING else, read http://wiki.postgresql.org/wiki/Corruption and act on it.

How big is the DB?

What file system is it on?

PostgreSQL 9.0.[what?] ?

Host OS?

Disk subsystem?


-- 
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Possible database corruption

From
"Benjamin Krajmalnik"
Date:

Thanks for the reply - I posted an update that I had resolved the issue.

When the partition holding the WAL files filled up due to the missing fsm file (I wonder what caused that), the database panicked.

After I moved all 43GB of WAL files to a different partition, the database went into recovery mode, and after about half an hour of processing the WAL files the server came back online.

The only thing still pending is for the system to clean out all of the now-unused WAL files.
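To put the 43GB figure in perspective, here is a back-of-the-envelope sketch of how many segment files that represents. It assumes the default 16 MiB WAL segment size and reads "43GB" as 43 GiB; neither is stated in the thread, and `segments_for` is a name made up for this example.

```python
# Assumption: default WAL segment size of 16 MiB (not confirmed in the thread).
WAL_SEGMENT_BYTES = 16 * 1024 * 1024

def segments_for(total_bytes, segment_bytes=WAL_SEGMENT_BYTES):
    """Number of segment files needed to hold total_bytes, rounded up."""
    return -(-total_bytes // segment_bytes)

wal_bytes = 43 * 1024**3  # "43GB" interpreted as 43 GiB
segment_count = segments_for(wal_bytes)
print(segment_count)
```

Under those assumptions the directory held about 2752 segment files.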

Once this is done, I will move back the WAL files to their own spindle.
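The thread does not say how the directory was relocated. One common approach is to move it while the server is stopped and leave a symbolic link at the old path, and the sketch below shows that idea on throwaway temporary directories only. `relocate_directory` and the segment file name are hypothetical, and nothing here should be run against a live data directory.

```python
import os
import shutil
import tempfile

def relocate_directory(src, dest_parent):
    """Move src under dest_parent and leave a symlink at the old path.

    Assumption: nothing is using src while this runs (for a WAL
    directory that means the server is fully stopped).
    """
    dest = os.path.join(dest_parent, os.path.basename(src))
    if os.path.exists(dest):
        raise FileExistsError(dest)
    shutil.move(src, dest)   # copies and deletes if dest is on another filesystem
    os.symlink(dest, src)    # the old path now points at the new location
    return dest

# Demonstration on temporary directories standing in for the two partitions.
with tempfile.TemporaryDirectory() as data_dir, tempfile.TemporaryDirectory() as roomy:
    xlog = os.path.join(data_dir, "pg_xlog")
    os.mkdir(xlog)
    # Made-up segment file name, used purely as a placeholder.
    open(os.path.join(xlog, "000000010000052F0000007B"), "w").close()

    new_home = relocate_directory(xlog, roomy)
    is_link = os.path.islink(xlog)
    seen_through_link = os.listdir(xlog)
    moved_under_roomy = os.path.dirname(new_home) == roomy

print(is_link, seen_through_link, moved_under_roomy)
```

Moving the files back to their own spindle later would be the same operation in reverse: remove the link, move the directory back, and start the server again.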

 

Since the database would not restart until the WAL files were moved I feared data corruption - which thankfully did not occur.

 

DB was Postgres 9.0.4 running on FreeBSD 8.1/amd64.  Subsystem is dual RAID-1 SAS, OS/WAL on one set of spindles, data on the other.

 

 

 

From: Craig Ringer [mailto:craig@2ndQuadrant.com]
Sent: Monday, January 07, 2013 7:24 PM
To: Benjamin Krajmalnik
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Possible database corruption

 

<snip>