Обсуждение: hardware failure - data recovery

Поиск
Список
Период
Сортировка

hardware failure - data recovery

От
Rick Gigger
Дата:
To make a long story short lets just say that I had a bit of a
hardware failure recently.

If I got an error like this when trying to dump a db from the mangled
data directory is it safe to say it's totally hosed or is there some
chance of recovery?

pg_dump: ERROR:  could not open relation 1663/18392/18400: No such
file or directory
pg_dump: SQL command to dump the contents of table "file" failed:
PQendcopy() failed.
pg_dump: Error message from server: ERROR:  could not open relation
1663/18392/18400: No such file or directory
pg_dump: The command was: COPY public.file (vfs_id, vfs_type,
vfs_path, vfs_name, vfs_modified, vfs_owner, vfs_data) TO stdout;

Thanks,

Rick

Re: hardware failure - data recovery

От
Ron Johnson
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/18/06 19:57, Rick Gigger wrote:
> To make a long story short lets just say that I had a bit of a hardware
> failure recently.
>
> If I got an error like this when trying to dump a db from the mangled
> data directory is it safe to say it's totally hosed or is there some
> chance of recovery?
>
> pg_dump: ERROR:  could not open relation 1663/18392/18400: No such file
> or directory
> pg_dump: SQL command to dump the contents of table "file" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: ERROR:  could not open relation
> 1663/18392/18400: No such file or directory
> pg_dump: The command was: COPY public.file (vfs_id, vfs_type, vfs_path,
> vfs_name, vfs_modified, vfs_owner, vfs_data) TO stdout;

What happens when you fsck the relevant partitions?

- --
Ron Johnson, Jr.
Jefferson LA  USA

Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQFFNtOdS9HxQb37XmcRAqswAJ0X6up02yGXkky6YBrVHIK3DXuO1ACgqbI/
//TGGkVOIe9o4sKzuUHyuXI=
=fZuZ
-----END PGP SIGNATURE-----

Re: hardware failure - data recovery

От
Rick Gigger
Дата:
Ron Johnson wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 10/18/06 19:57, Rick Gigger wrote:
>> To make a long story short lets just say that I had a bit of a hardware
>> failure recently.
>>
>> If I got an error like this when trying to dump a db from the mangled
>> data directory is it safe to say it's totally hosed or is there some
>> chance of recovery?
>>
>> pg_dump: ERROR:  could not open relation 1663/18392/18400: No such file
>> or directory
>> pg_dump: SQL command to dump the contents of table "file" failed:
>> PQendcopy() failed.
>> pg_dump: Error message from server: ERROR:  could not open relation
>> 1663/18392/18400: No such file or directory
>> pg_dump: The command was: COPY public.file (vfs_id, vfs_type, vfs_path,
>> vfs_name, vfs_modified, vfs_owner, vfs_data) TO stdout;
>
> What happens when you fsck the relevant partitions?

Errors about a bunch of duplicate inodes, missing inodes, etc.  Should I
do it again and get some of the exact text for you?

Re: hardware failure - data recovery

От
Rick Gigger
Дата:
Rick Gigger wrote:
> Ron Johnson wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> On 10/18/06 19:57, Rick Gigger wrote:
>>> To make a long story short lets just say that I had a bit of a hardware
>>> failure recently.
>>>
>>> If I got an error like this when trying to dump a db from the mangled
>>> data directory is it safe to say it's totally hosed or is there some
>>> chance of recovery?
>>>
>>> pg_dump: ERROR:  could not open relation 1663/18392/18400: No such file
>>> or directory
>>> pg_dump: SQL command to dump the contents of table "file" failed:
>>> PQendcopy() failed.
>>> pg_dump: Error message from server: ERROR:  could not open relation
>>> 1663/18392/18400: No such file or directory
>>> pg_dump: The command was: COPY public.file (vfs_id, vfs_type, vfs_path,
>>> vfs_name, vfs_modified, vfs_owner, vfs_data) TO stdout;
>>
>> What happens when you fsck the relevant partitions?
>
> Errors about a bunch of duplicate inodes, missing inodes, etc.  Should I
> do it again and get some of the exact text for you?

Also this is an example of the type of errors that were being logged
before it died:

LOG:  checkpoint record is at 26/41570488
LOG:  redo record is at 26/41570488; undo record is at 0/0; shutdown TRUE



Re: hardware failure - data recovery

От
Ron Johnson
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/18/06 23:52, Rick Gigger wrote:
> Rick Gigger wrote:
>> Ron Johnson wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> On 10/18/06 19:57, Rick Gigger wrote:
>>>> To make a long story short lets just say that I had a bit of a hardware
>>>> failure recently.
>>>>
>>>> If I got an error like this when trying to dump a db from the mangled
>>>> data directory is it safe to say it's totally hosed or is there some
>>>> chance of recovery?
>>>>
>>>> pg_dump: ERROR:  could not open relation 1663/18392/18400: No such file
>>>> or directory
>>>> pg_dump: SQL command to dump the contents of table "file" failed:
>>>> PQendcopy() failed.
>>>> pg_dump: Error message from server: ERROR:  could not open relation
>>>> 1663/18392/18400: No such file or directory
>>>> pg_dump: The command was: COPY public.file (vfs_id, vfs_type, vfs_path,
>>>> vfs_name, vfs_modified, vfs_owner, vfs_data) TO stdout;
>>>
>>> What happens when you fsck the relevant partitions?
>>
>> Errors about a bunch of duplicate inodes, missing inodes, etc.  Should
>> I do it again and get some of the exact text for you?
>
> Also this is an example of the type of errors that were being logged
> before it died:
>
> LOG:  checkpoint record is at 26/41570488
> LOG:  redo record is at 26/41570488; undo record is at 0/0; shutdown TRUE

What does Google say about these error messages and your fs?

- --
Ron Johnson, Jr.
Jefferson LA  USA

Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQFFNwhNS9HxQb37XmcRAksiAJ4zWuLJKcOaAR5PI5QSgZt3NhAQEgCgplJn
p4Xfy5+I/K6uYD+MOg+e6Nc=
=nhTB
-----END PGP SIGNATURE-----

Re: hardware failure - data recovery

От
Rick Gigger
Дата:
Ron Johnson wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 10/18/06 23:52, Rick Gigger wrote:
>> Rick Gigger wrote:
>>> Ron Johnson wrote:
>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>> Hash: SHA1
>>>>
>>>> On 10/18/06 19:57, Rick Gigger wrote:
>>>>> To make a long story short lets just say that I had a bit of a hardware
>>>>> failure recently.
>>>>>
>>>>> If I got an error like this when trying to dump a db from the mangled
>>>>> data directory is it safe to say it's totally hosed or is there some
>>>>> chance of recovery?
>>>>>
>>>>> pg_dump: ERROR:  could not open relation 1663/18392/18400: No such file
>>>>> or directory
>>>>> pg_dump: SQL command to dump the contents of table "file" failed:
>>>>> PQendcopy() failed.
>>>>> pg_dump: Error message from server: ERROR:  could not open relation
>>>>> 1663/18392/18400: No such file or directory
>>>>> pg_dump: The command was: COPY public.file (vfs_id, vfs_type, vfs_path,
>>>>> vfs_name, vfs_modified, vfs_owner, vfs_data) TO stdout;
>>>> What happens when you fsck the relevant partitions?
>>> Errors about a bunch of duplicate inodes, missing inodes, etc.  Should
>>> I do it again and get some of the exact text for you?
>> Also this is an example of the type of errors that were being logged
>> before it died:
>>
>> LOG:  checkpoint record is at 26/41570488
>> LOG:  redo record is at 26/41570488; undo record is at 0/0; shutdown TRUE
>
> What does Google say about these error messages and your fs?

Not much that is useful.  I think this is a little beyond that scope.  A
hardware failure basically left the fs and the db in an inconsistent
state.  There is one table in one database that has a bunch of data in
it that I need to get out.  I'm guessing I'm going to need to find
someone who understands the the internal structure of the files to go in
and pull out whatever data is still in tact.

I have been poking around and as far as I can tell, although one of the
toast indexes is gone the actual table files appear to be in tact.  That
is they are still in the file system.  I don't know if they are ok
internally.

I also get this error when trying to access the non-toasted data:

ERROR:  could not access status of transaction 307904873
DETAIL:  could not open file "pg_clog/0125": No such file or directory

I'm guessing that this means that I may have get someone to pull out all
versions of a given tuple because I have lost some of the visibility
info.  This shouldn't matter as most likely very few tuples would have
had more than one version when the system went down.

I just hope that the relations are need are in tact and that there is
someone out there who can help me get it out.

Re: hardware failure - data recovery

От
Ron Johnson
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 10/19/06 00:46, Rick Gigger wrote:
> Ron Johnson wrote:
>>
>> On 10/18/06 23:52, Rick Gigger wrote:
>>> Rick Gigger wrote:
>>>> Ron Johnson wrote:
>>>>>
>>>>> On 10/18/06 19:57, Rick Gigger wrote:
[snip]
> Not much that is useful.  I think this is a little beyond that scope.  A
> hardware failure basically left the fs and the db in an inconsistent
> state.  There is one table in one database that has a bunch of data in
> it that I need to get out.  I'm guessing I'm going to need to find
> someone who understands the the internal structure of the files to go in
> and pull out whatever data is still in tact.

So, no backup tapes?

> I have been poking around and as far as I can tell, although one of the
> toast indexes is gone the actual table files appear to be in tact.  That
> is they are still in the file system.  I don't know if they are ok
> internally.
>
> I also get this error when trying to access the non-toasted data:
>
> ERROR:  could not access status of transaction 307904873
> DETAIL:  could not open file "pg_clog/0125": No such file or directory
>
> I'm guessing that this means that I may have get someone to pull out all
> versions of a given tuple because I have lost some of the visibility
> info.  This shouldn't matter as most likely very few tuples would have
> had more than one version when the system went down.
>
> I just hope that the relations are need are in tact and that there is
> someone out there who can help me get it out.

What kernel, version & fs?


- --
Ron Johnson, Jr.
Jefferson LA  USA

Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)

iD8DBQFFNxObS9HxQb37XmcRAkaJAKCKqIN76cHjvCqMJN3JqO7LBJZWfQCff2AT
WZWfEy3dWyByN2Q06IAt1kg=
=xmv+
-----END PGP SIGNATURE-----

Re: hardware failure - data recovery

От
Rick Gigger
Дата:
Ron Johnson wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 10/19/06 00:46, Rick Gigger wrote:
>> Ron Johnson wrote:
>>> On 10/18/06 23:52, Rick Gigger wrote:
>>>> Rick Gigger wrote:
>>>>> Ron Johnson wrote:
>>>>>> On 10/18/06 19:57, Rick Gigger wrote:
> [snip]
>> Not much that is useful.  I think this is a little beyond that scope.  A
>> hardware failure basically left the fs and the db in an inconsistent
>> state.  There is one table in one database that has a bunch of data in
>> it that I need to get out.  I'm guessing I'm going to need to find
>> someone who understands the the internal structure of the files to go in
>> and pull out whatever data is still in tact.
>
> So, no backup tapes?
>
>> I have been poking around and as far as I can tell, although one of the
>> toast indexes is gone the actual table files appear to be in tact.  That
>> is they are still in the file system.  I don't know if they are ok
>> internally.
>>
>> I also get this error when trying to access the non-toasted data:
>>
>> ERROR:  could not access status of transaction 307904873
>> DETAIL:  could not open file "pg_clog/0125": No such file or directory
>>
>> I'm guessing that this means that I may have get someone to pull out all
>> versions of a given tuple because I have lost some of the visibility
>> info.  This shouldn't matter as most likely very few tuples would have
>> had more than one version when the system went down.
>>
>> I just hope that the relations are need are in tact and that there is
>> someone out there who can help me get it out.
>
> What kernel, version & fs?

Linux. 2.4.22. and ext3.

Re: hardware failure - data recovery

От
Rick Gigger
Дата:
Rick Gigger wrote:
> Ron Johnson wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> On 10/19/06 00:46, Rick Gigger wrote:
>>> Ron Johnson wrote:
>>>> On 10/18/06 23:52, Rick Gigger wrote:
>>>>> Rick Gigger wrote:
>>>>>> Ron Johnson wrote:
>>>>>>> On 10/18/06 19:57, Rick Gigger wrote:
>> [snip]
>>> Not much that is useful.  I think this is a little beyond that scope.  A
>>> hardware failure basically left the fs and the db in an inconsistent
>>> state.  There is one table in one database that has a bunch of data in
>>> it that I need to get out.  I'm guessing I'm going to need to find
>>> someone who understands the the internal structure of the files to go in
>>> and pull out whatever data is still in tact.
>>
>> So, no backup tapes?

Oh.  Yeah.  Everything was backed up but this one thing. Hence the need
to get the data from the messed up db.

I think we've got it figure out though.  We were able to patch up the db
enough to extract the data with some help from google and old postings
from Tom.

Thanks,

Rick

Re: hardware failure - data recovery

От
Ray Stell
Дата:
On Thu, 19 Oct 2006 06:14:46 -0600, Rick Gigger wrote:
> I think we've got it figure out though. We were able to patch up the
> db enough to extract the data with some help from google and old postings
> from Tom.


It would be really great if you put down the specifics of what you
googled/old postings/Tom together.   You might need it next time.
I know I'd like to be ready.

Re: hardware failure - data recovery

От
Jerry Sievers
Дата:
Rick Gigger <rick@alpinenetworking.com> writes:

> To make a long story short lets just say that I had a bit of a
> hardware failure recently.
>
> If I got an error like this when trying to dump a db from the mangled
> data directory is it safe to say it's totally hosed or is there some
> chance of recovery?

Why don't you try dumping just the critical tables using pg_dump -t?

Also perhaps use psql and try looking up in pg_class for the damaged
tables by OID.  You may be able to drop just a few tables and then
dump the DB normally.  This assumes the damaged table(s) are
non-critical...

I suggest you stop Pg first, take an FS backup of the entire cluster
before this so you cahn try various approaches if needed.

> pg_dump: ERROR:  could not open relation 1663/18392/18400: No such
> file or directory
> pg_dump: SQL command to dump the contents of table "file" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: ERROR:  could not open relation
> 1663/18392/18400: No such file or directory
> pg_dump: The command was: COPY public.file (vfs_id, vfs_type,
> vfs_path, vfs_name, vfs_modified, vfs_owner, vfs_data) TO stdout;
>
> Thanks,
>
> Rick
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq
>

--
-------------------------------------------------------------------------------
Jerry Sievers   305 854-3001 (home)     WWW ECommerce Consultant
                305 321-1144 (mobile    http://www.JerrySievers.com/

Re: hardware failure - data recovery

От
Rick Gigger
Дата:
I could have my developer do this if it would be useful to someone
else.  But in general I think my time would be much better served
fixing my backup situation and monitoring them so that this CAN'T
happen again.  It shouldn't have happened this time.

On Oct 19, 2006, at 8:35 AM, Ray Stell wrote:

> On Thu, 19 Oct 2006 06:14:46 -0600, Rick Gigger wrote:
>> I think we've got it figure out though. We were able to patch up the
>> db enough to extract the data with some help from google and old
>> postings
>> from Tom.
>
>
> It would be really great if you put down the specifics of what you
> googled/old postings/Tom together.   You might need it next time.
> I know I'd like to be ready.
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match
>


Re: hardware failure - data recovery

От
Rick Gigger
Дата:
> Rick Gigger <rick@alpinenetworking.com> writes:
>
>> To make a long story short lets just say that I had a bit of a
>> hardware failure recently.
>>
>> If I got an error like this when trying to dump a db from the mangled
>> data directory is it safe to say it's totally hosed or is there some
>> chance of recovery?
>
> Why don't you try dumping just the critical tables using pg_dump -t?

There was only one table in the db.

> Also perhaps use psql and try looking up in pg_class for the damaged
> tables by OID.  You may be able to drop just a few tables and then
> dump the DB normally.  This assumes the damaged table(s) are
> non-critical...

The table I needed was damaged.  I dropped the indexes from it that
were also damaged but then the table had to be repaired.

> I suggest you stop Pg first, take an FS backup of the entire cluster
> before this so you cahn try various approaches if needed.

That was the first thing I did.

Thanks everyone for the help.  Luckily one of my developers  was able
to patch up the table and get 99% of the data out.  All of the truly
critical data was in another database and was backed up, so the 99%
was enough to get through the crisis.

Thanks,

Rick