Discussion: Possible database corruption - urgent
I have a situation where pg_xlog started growing until it filled up the disk drive.
I got alerted to the error and started investigating.
Checked the logs and I am seeing the following entry repeatedly:
2013-01-07 01:49:12 GMT ERROR: could not open file "base/16748/181979366_fsm": No such file or directory
2013-01-07 01:49:12 GMT CONTEXT: writing block 1 of relation base/16748/181979366_fsm
2013-01-07 01:49:12 GMT WARNING: could not write block 1 of base/16748/181979366_fsm
I checked the file system, and that file is indeed missing. The relation's main file, base/16748/181979366, does exist.
Is there a way to get the system back up and running?
I stopped the postmaster and am moving the pg_xlog directory to a partition that has room left on it, but I still need to resolve this missing file problem.
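For readers following along, the on-disk check described above can be sketched in a few lines of Python. PostgreSQL stores each relation as a main file plus separate fork files with the suffixes `_fsm` (free space map) and `_vm` (visibility map). The function name `list_forks` and the data directory path are illustrative assumptions, not something from the original post; the sketch also ignores the numbered segment files (`.1`, `.2`, ...) that relations larger than 1 GB are split into.

```python
from pathlib import Path

# Suffixes PostgreSQL appends to a relation's file name:
# "" is the main data fork, "_fsm" the free space map, "_vm" the visibility map.
FORK_SUFFIXES = ("", "_fsm", "_vm")


def list_forks(data_dir, db_oid, relfilenode):
    """Return {file_name: exists_on_disk} for the forks of one relation."""
    base = Path(data_dir) / "base" / str(db_oid)
    return {
        f"{relfilenode}{suffix}": (base / f"{relfilenode}{suffix}").is_file()
        for suffix in FORK_SUFFIXES
    }


if __name__ == "__main__":
    # The data directory is a placeholder (assumption); the two numbers
    # are the database OID and file node taken from the log lines above.
    forks = list_forks("/usr/local/pgsql/data", 16748, 181979366)
    for name, present in forks.items():
        print(name, "present" if present else "MISSING")
```

A missing `_vm` file is not necessarily a problem, since that fork is only created once it is needed; the sketch only reports what is on disk.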
I forgot to mention – PostgreSQL 9.0 – my apologies.
Can I just recreate the file using touch so it exists and then restart PostgreSQL?
The system core dumped and was attempting to go into recovery mode.
2013-01-07 01:49:12 GMT ERROR: could not open file "base/16748/181979366_fsm": No such file or directory
2013-01-07 01:49:12 GMT CONTEXT: writing block 1 of relation base/16748/181979366_fsm
2013-01-07 01:49:12 GMT WARNING: could not write block 1 of base/16748/181979366_fsm
.
.
.
2013-01-07 01:49:12 GMT ERROR: could not open file "base/16748/181979366_fsm": No such file or directory
2013-01-07 01:49:12 GMT CONTEXT: writing block 1 of relation base/16748/181979366_fsm
2013-01-07 01:49:12 GMT WARNING: could not write block 1 of base/16748/181979366_fsm
From: pgsql-admin-owner@postgresql.org [mailto:pgsql-admin-owner@postgresql.org] On Behalf Of Benjamin Krajmalnik
Sent: Monday, January 07, 2013 2:22 PM
To: pgsql-admin@postgresql.org
Subject: [ADMIN] Possible database corruption - urgent
<snip>
Sorry for the cut and paste error.
This is the log entry when the pg_xlog partition ran out of space:
2013-01-07 20:50:22 GMT [local]PANIC: could not write to file "pg_xlog/xlogtemp.49680": No space left on device
2013-01-07 20:50:22 GMT [local]STATEMENT: INSERT INTO tbltmptests (testhash, testtime, statusid, replytxt, replyval, groupid) V
2013-01-07 20:50:23 GMT LOG: server process (PID 49680) was terminated by signal 6: Abort trap
2013-01-07 20:50:23 GMT LOG: terminating any other active server processes
2013-01-07 20:50:23 GMT [local]WARNING: terminating connection because of crash of another server process
2013-01-07 20:50:23 GMT [local]DETAIL: The postmaster has commanded this server process to roll back the current transaction an
2013-01-07 20:50:23 GMT [local]HINT: In a moment you should be able to reconnect to the database and repeat your command.
.
.
.
2013-01-07 20:50:23 GMT [local]FATAL: the database system is in recovery mode
2013-01-07 20:50:23 GMT LOG: all server processes terminated; reinitializing
2013-01-07 20:50:24 GMT LOG: database system was interrupted; last known up at 2013-01-07 00:31:02 GMT
2013-01-07 20:50:24 GMT LOG: database system was not properly shut down; automatic recovery in progress
2013-01-07 20:50:24 GMT LOG: consistent recovery state reached at 52F/8CE57490
2013-01-07 20:50:24 GMT LOG: redo starts at 52F/7BABC118
2013-01-07 20:50:38 GMT [local]FATAL: the database system is in recovery mode
2013-01-07 20:50:53 GMT [local]FATAL: the database system is in recovery mode
2013-01-07 20:51:08 GMT [local]FATAL: the database system is in recovery mode
2013-01-07 20:51:24 GMT [local]FATAL: the database system is in recovery mode
2013-01-07 20:51:39 GMT [local]FATAL: the database system is in recovery mode
2013-01-07 20:51:54 GMT [local]FATAL: the database system is in recovery mode
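As an aside, a log excerpt like the one above can be summarized by severity to find the first PANIC and to see how many connection attempts were rejected during recovery. This is a minimal sketch that assumes the line prefix seen in this excerpt (timestamp, time zone, optional `[local]` tag, then the severity); the actual layout depends on the server's `log_line_prefix` setting. The names `LINE_RE`, `summarize`, and `SAMPLE` are made up for the illustration.

```python
import re
from collections import Counter

# Matches e.g. '2013-01-07 20:50:22 GMT [local]PANIC: message'
# and         '2013-01-07 20:50:23 GMT LOG: message'.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<tz>\S+) "
    r"(?:\[[^\]]*\])?(?P<severity>[A-Z]+):\s*(?P<message>.*)$"
)


def summarize(lines):
    """Count log lines per severity and return the first PANIC, if any."""
    counts = Counter()
    first_panic = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip elision dots and anything not in the expected format
        counts[m["severity"]] += 1
        if m["severity"] == "PANIC" and first_panic is None:
            first_panic = (m["ts"], m["message"])
    return counts, first_panic


# A few lines copied from the excerpt above.
SAMPLE = [
    '2013-01-07 20:50:22 GMT [local]PANIC: could not write to file "pg_xlog/xlogtemp.49680": No space left on device',
    "2013-01-07 20:50:23 GMT LOG: server process (PID 49680) was terminated by signal 6: Abort trap",
    ".",
    "2013-01-07 20:50:23 GMT [local]FATAL: the database system is in recovery mode",
    "2013-01-07 20:50:24 GMT LOG: redo starts at 52F/7BABC118",
]

if __name__ == "__main__":
    counts, first_panic = summarize(SAMPLE)
    print(dict(counts))
    print(first_panic)
```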
From: Benjamin Krajmalnik
Sent: Monday, January 07, 2013 2:31 PM
To: Benjamin Krajmalnik; pgsql-admin@postgresql.org
Subject: RE: [ADMIN] Possible database corruption - urgent
<snip>
On Mon, 07 Jan 2013 14:35:48 -0700, Benjamin Krajmalnik wrote:
<snip>
What do you think you will gain by adding "urgent" to your subject line? What do you think you will gain by posting in HTML?
On 01/08/2013 05:22 AM, Benjamin Krajmalnik wrote:
I have a situation where pg_xlog started growing until it filled up the disk drive.
This should not ever cause corruption. If it has, there's a bug at work.
A crash is reasonable (albeit undesirable; it'd be better to just report errors on connections) - but database corruption is not.
Before doing ANYTHING else, read http://wiki.postgresql.org/wiki/Corruption and act on it.
How big is the DB?
What file system is it on?
PostgreSQL 9.0.[what?] ?
Host OS?
Disk subsystem?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Thanks for the reply - I posted an update that I had resolved the issue.
When the partition with the WAL files filled up due to the missing fsm file (I still wonder what caused that), the database panicked.
After I moved all 43GB of WAL files to a different partition, the database went into recovery mode, and after about half an hour of processing the WAL files the server came back online.
The only thing still pending is for the system to clean out all of the now-unused WAL files.
Once this is done, I will move the WAL files back to their own spindle.
Since the database would not restart until the WAL files were moved, I feared data corruption, which thankfully did not occur.
DB was Postgres 9.0.4 running on FreeBSD 8.1/amd64. Subsystem is dual RAID-1 SAS, OS/WAL on one set of spindles, data on the other.
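To put the 43GB figure in perspective, and to catch a filling WAL partition earlier next time, here is a rough sketch. It assumes the default WAL segment size of 16 MB (the size can be changed when PostgreSQL is built) and reads "43GB" as binary gigabytes. The helper names and the 20% free-space threshold are arbitrary choices for the illustration, not recommendations from the thread.

```python
import shutil

# Default WAL segment size; assumed here, since it can be changed at build time.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def segments_for(total_bytes, segment_bytes=WAL_SEGMENT_BYTES):
    """Number of whole WAL segments that fit in total_bytes."""
    return total_bytes // segment_bytes


def free_fraction(path):
    """Fraction of the file system holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def needs_attention(path, min_free_fraction=0.2):
    """True when free space drops below the (arbitrary) threshold."""
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    # 43 GB of WAL expressed as a count of 16 MB segment files.
    print(segments_for(43 * 1024**3))
    # "." stands in for the pg_xlog directory, whose real path is site-specific.
    print(needs_attention("."))
```

Run from cron or a monitoring agent against the real pg_xlog path, a check like `needs_attention` would raise an alert well before the PANIC shown earlier in the thread.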
From: Craig Ringer [mailto:craig@2ndQuadrant.com]
Sent: Monday, January 07, 2013 7:24 PM
To: Benjamin Krajmalnik
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Possible database corruption
<snip>