Thread: db recovery after raid5 failure

db recovery after raid5 failure

From
qcor@vp.pl
Date:
Hello

I have serious problems recovering our db after a recent raid5 failure.
Long story short - no recent dumps, some missing files (like pg_control).

Long version of the story:
Our RAID failed... badly. I was able to recover most of the files, but some (like pg_control) are missing (possibly more; I don't know).
After creating an image of the damaged RAID, I copied all recovered files to a new system with the same version of PG installed (8.2).
I copied the whole DATA folder and tried to run the PG service (Windows XP).
The first error was something about a missing pg_control file. I googled a solution involving "pg_resetxlog -f ..\data". That went well (I guess). The file was created.
The second error was something about 'access denied' to pg_control. It turned out that copying the files had messed up file ownership, so I granted 'rwx' permission to all users. That went well (I guess).

PG service started at last... but...
When I log in using pgAdmin I can see 0 (zero) databases and 0 registered roles. But when I hit 'refresh' after a few seconds, I can see SOME of my old databases back on the list (registered roles as well). Then, after a few more seconds of hitting 'refresh', more and more databases are back on the list.
BUT...
There are no tables inside :( All databases contain only 4 pg_xxxx tables.

Out of pure curiosity I tried to restore one of the databases from a very old dump using "psql dbname <dbfile" and got tons of errors along the lines of "create table blabla... fields etc" -> 'this relation already exists'.
So... is it still there? Why can't I see it? Is there any way to recover it?

One more thing I noticed in pgAdmin: the owner of the database is shown as "unknown (oid 17004)", but in "registered roles" I can see "name: tom, oid: 17004... etc".
So it looks like there is a registered owner with the same oid, but for some reason PG can't find the link.

Oh, I almost forgot, pg_log is full of
" xlog flush request (number) is not satisfied --- flushed only to (another_number)
writing block 1 of (another_number)
multiple failures --- write error may be permanent"


Can anyone help? Two days of using Google didn't help much :( YOU are my last hope...

qc

Re: db recovery after raid5 failure

From
"Kevin Grittner"
Date:
<qcor@vp.pl> wrote:

> I have serious problems recovering our db after recent raid5
> failure.  Long story short - no recent dumps, some missing files
> (like pg_control).

Been there -- at least on the end of helping with recovery for
people in that position with a different database product.  It can
be very painful.  :-(  (I'm assuming from your post that a second
drive failed before recovery from failure of the first was
complete.)

First a word of advice -- don't discard anything.  Keep any backups,
keep the bad drives, keep any logs, exports, reports, or anything
else which might contain fragments of the data.  Make a backup of
what you have now, if you haven't already.  Keep these things for a
long time.
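One way to act on the "make a backup of what you have now" advice is to record a checksum for every recovered file before any tool touches the data directory, so later changes can be detected. The following is only a minimal sketch of that idea in Python; the function names `build_manifest` and `changed_files` are hypothetical helpers invented for illustration and are not part of PostgreSQL.

```python
import hashlib
import os


def build_manifest(root):
    """Return {relative_path: sha256_hex} for every regular file under root."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as fh:
                # Read in 1 MiB chunks so large table files do not fill memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(full, root)] = digest.hexdigest()
    return manifest


def changed_files(before, after):
    """List paths whose hash differs, or that exist in only one manifest."""
    paths = set(before) | set(after)
    return sorted(p for p in paths if before.get(p) != after.get(p))
```

Building one manifest right after copying the files off the damaged array and another after each recovery attempt shows which files an attempt modified, created, or removed.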

Second, a word of encouragement -- given all these scattered
fragments and enough time and money, you might be surprised at how
much data can be recovered.  But time and money are required --
someone has to make the hard call of how much money it is worth to
recover how much of the data.

Based on your description, it sounds like you will probably need the
assistance of an outside company to recover very much.  Possibly one
company specializing in recovery of data off of damaged disks, and
another for PostgreSQL internals expertise.

I don't suppose there's another source for the data to avoid
attempting this recovery?

-Kevin

Re: db recovery after raid5 failure

From
Balkrishna Sharma
Date:
If the database is not extremely huge, it makes you wonder what RAID actually gives us. A robust near-real-time replication setup (say PITR + cloud) may be good enough against a disk failure once every few years.
At least you don't add another point of failure that you (your database/OS) can't do anything about.


> Date: Mon, 21 Jun 2010 15:30:45 -0500
> From: Kevin.Grittner@wicourts.gov
> To: pgsql-admin@postgresql.org; qcor@vp.pl
> Subject: Re: [ADMIN] db recovery after raid5 failure

Re: db recovery after raid5 failure

From
"Kevin Grittner"
Date:
Balkrishna Sharma <b_ki@hotmail.com> wrote:

> If the database is not extremely huge, it makes you wonder what
> RAID actually gives us.

Well, RAID5 gives you a situation where you must have a second
drive fail before recovery from the first failure is complete, versus
being instantly dead on a single-drive failure.  RAID6 requires
three drives to fail in close succession (assuming a hot spare which
initiates recovery on failure).  RAID10 requires that two paired
drives fail.  We have about 100 database servers, and probably
average about two drive failures a month; having any down time from
them is rare because of RAID (and that's with us primarily using
RAID5).
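The failure conditions described above can be written down as a small predicate. This is only an illustrative sketch in Python, assuming that no rebuild completes between the failures; the function `array_survives` and its parameters are hypothetical names, and it models nothing beyond the drive counts mentioned in this message.

```python
def array_survives(level, failed, mirror_pairs=None):
    """Return True if the array still holds all data after the drives in `failed` are lost.

    level: "raid5", "raid6" or "raid10"
    failed: iterable of failed drive indices, with no rebuild completed in between
    mirror_pairs: for raid10 only, list of (a, b) tuples of mirrored drives
    """
    failed = set(failed)
    if level == "raid5":
        return len(failed) <= 1   # single parity: a second failure is fatal
    if level == "raid6":
        return len(failed) <= 2   # double parity: a third failure is fatal
    if level == "raid10":
        if mirror_pairs is None:
            raise ValueError("raid10 needs the list of mirror pairs")
        # Data is lost only when both halves of some mirror pair are gone.
        return not any(a in failed and b in failed for a, b in mirror_pairs)
    raise ValueError("unknown RAID level: %r" % (level,))
```

For example, with mirror pairs `[(0, 1), (2, 3)]`, losing drives 0 and 2 leaves a RAID10 array intact, while losing drives 0 and 1 does not.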

> A robust near-real-time replication setup (say PITR + cloud)
> may be good enough against a disk failure once every few
> years. At least you don't add another point of failure that you
> (your database/OS) can't do anything about.

You've totally lost me there.  "The cloud" still uses similar
techniques, just out of your sight and control.  If you assume that
whoever is running it can do it better than you can, that's one
thing; just don't assume it's magic.  The machines in my shop are
what I *can* do something about.  Management here insists on near-
real-time backup using at least two completely independent
techniques to multiple machines in multiple buildings, with
continuous testing that all backups actually restore.  If we were to
float data off into a cloud somewhere, I can guarantee we wouldn't
count on it without an alternative.  As a place to put "one more
copy" it might make sense, as long as it had strong encryption.
(Again, you've lost all control over who has what access once you
send it into the cloud.)

-Kevin

Re: db recovery after raid5 failure

From
Balkrishna Sharma
Date:
> average about two drive failures a month
You must have a really huge Postgres setup, with several hundred drives, to have such a high frequency of failure.


> As a place to put "one more
> copy" it might make sense, as long as it had strong encryption.
I didn't expand, but that's what I meant: the copy in the cloud is your last resort in case the LAN and WAN copies both fail. You get an extra copy in a different geographic location for some catastrophic event.



> Date: Tue, 22 Jun 2010 09:46:46 -0500
> From: Kevin.Grittner@wicourts.gov
> To: b_ki@hotmail.com; pgsql-admin@postgresql.org; qcor@vp.pl
> Subject: Re: [ADMIN] db recovery after raid5 failure

Re: db recovery after raid5 failure

From
"Kevin Grittner"
Date:
Balkrishna Sharma wrote:

>> average about two drive failures a month

> You must have a really huge Postgres setup, with several hundred
> drives, to have such a high frequency of failure.

About 100 database servers with over 1000 drives spinning 24/7.  Also
probably significant, management-set policy is to replace machines
after four years, and we don't always hit that.  I haven't tried to
run numbers on it, but the pattern sure seems to match published
reports that we have some initial failures within the first few
months when a set of machines go in, it settles down for about three
years, then the failure rate starts to edge inexorably upward.
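As a rough check on these figures, the quoted numbers can be turned into an annualized failure rate. The sketch below is a back-of-the-envelope calculation in Python, assuming a steady two failures per month and exactly 1000 drives; since the message says "over 1000", the result is an upper estimate. The function name is hypothetical.

```python
def annualized_failure_rate(failures_per_month, drive_count):
    """Fraction of the drive population that fails per year, assuming a steady rate."""
    return failures_per_month * 12 / drive_count


rate = annualized_failure_rate(2, 1000)
print("%.1f%% of drives per year" % (rate * 100))  # prints "2.4% of drives per year"
```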

>> As a place to put "one more copy" it might make sense, as long as
>> it had strong encryption.

> I didn't expand, but that's what I meant: the copy in the cloud is
> your last resort in case the LAN and WAN copies both fail. You
> get an extra copy in a different geographic location for some
> catastrophic event.

OK, that makes sense.

-Kevin