Обсуждение: Corrupted disk

Поиск
Список
Период
Сортировка

Corrupted disk

От
Tony Nelson
Дата:
Good morning all,

This morning  my server (ubuntu 10.04 LTS, pg 8.4 from apt) experienced a disk corruption which caused errors like
this:

Feb 18 08:50:07 ihdb1 postgres[13317]: [2-1] user=postgres,db=instihire ERROR:  invalid page header in block 10 of
relationbase/16384/8082305 

I am ok, because I have a dump from last night, and wal since then.

As a test, I tried doing a dump like this from the broken server:

pg_dump -Fc mydb > busted.dump

I restored this dump on a test server, and all of the data looks ok, and my application is able to run the operation
thatwas failing in production just fine. 

My question is  simple, is my dump good?  Is it possible that pg_dump was able to correctly save the data that "select
.." and "update.. " couldn't. 

Can I drop/create/restore from this dump?  Or should I restore from last nights full and apply the WAL?

Thanks for any help.

Tony Nelson
Starpoint  Solutions

Re: Corrupted disk

От
"Kevin Grittner"
Дата:
Tony Nelson  wrote:

> This morning my server (ubuntu 10.04 LTS, pg 8.4 from apt)
> experienced a disk corruption which caused errors like this:
>
> Feb 18 08:50:07 ihdb1 postgres[13317]: [2-1]
> user=postgres,db=instihire ERROR: invalid page header in block 10
> of relation base/16384/8082305

Do you know why?  If not, that machine is not to be trusted.

> I am ok, because I have a dump from last night, and wal since then.

That dump was taken according to PITR instructions, not from pg_dump
or pg_dumpall, I hope.

> As a test, I tried doing a dump like this from the broken server:
>
> pg_dump -Fc mydb > busted.dump
>
> I restored this dump on a test server, and all of the data looks
> ok, and my application is able to run the operation that was
> failing in production just fine.
>
> My question is simple, is my dump good?

Maybe; maybe not.  I would try to restore from that dump you
mentioned, run pg_dump on that, and compare the two files which came
out of pg_dump.  Or just use the dump if you can restore it and
you're sure it's up to date.

> Is it possible that pg_dump was able to correctly save the data
> that "select .. " and "update.. " couldn't.

pg_dump essentially just does a SELECT from each table.  If the
corrupted relation was an index, you might have dodged the problem
mentioned above, but if a page in that index got corrupted, why
assume damage was localized to that one place?

You really have two problems to address here: how best to recover
your data, and how to prevent a recurrence.

What version of PostgreSQL is this exactly?  (It's best to SELECT
version(); and copy/paste the results.)

-Kevin

Re: Corrupted disk

От
Greg Smith
Дата:
Tony Nelson wrote:
> I am ok, because I have a dump from last night, and wal since then.
>

Dumps made with pg_dump are easy to restore from.  But they're fixed in
time:  there is no applying WALs to them in order to update them.
Anything that's happened to the server since then you can't add.

If you made a filesystem copy of the server using
pg_start_backup/pg_stop_backup, and you save all of the WAL files after
the backup began, that's also useful.  You can store from that backup
and apply all the WAL that's happened since then.

> Can I drop/create/restore from this dump?  Or should I restore from last nights full and apply the WAL?
>

If last night's backup with a filesystem one done with
pg_start_backup/pg_stop_backup, and you have WAL since then, I would
favor that set as likely to work fine.  But it sounds like what you have
might instead be a pg_dump backup and some WAL files; you can't apply
the WAL to such a dump.

Whatever you do, you want to make a full filesystem copy of the server's
data directory--with the server shutdown--before you do anything else.
It's possible to recover from page errors and extract the available data
using the right data recovery techniques, especially if there's a
pg_dump available too; we offer some services in this area.  But if any
serious changes are made to the database before we get to it, odds of
successful recovery can drop fast.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books