Обсуждение: Fix corrupt pg_toast table?

Поиск
Список
Период
Сортировка

Fix corrupt pg_toast table?

От
Michael Clark
Дата:
Hello everyone.

Sorry if this has been covered already, I haven't had any luck searching and finding a solution.

I have a database which appears to have some corruption in a toast value.

When I select a certain table, I get the following error:
ERROR: invalid page header in block 984 of relation "pg_toast_17341_index"

I am new enough to PG that I am having some troubles navigating around the system tables and being able to do anything useful.

This error seems to be preventing a vacuum, reindex, and PGCOPY from going through on the database.

I wrote a little script to fetch this table row by row until I hit an error, and I think this has narrowed it down to a specific row, although trying to fetch a row beyond this also gave this error (I am hoping that is a side effect of the problem and that not the rest of the table is lost).

What is the best course of action to try and fix this and recover the data?  I am not concerned with recovering the particular row that appears to be having problems, but the rest of the data in the table is important.

Thanks a lot for any help on this,
Michael.


Re: Fix corrupt pg_toast table?

От
Tom Lane
Дата:
Michael Clark <codingninja@gmail.com> writes:
> I have a database which appears to have some corruption in a toast value.

> When I select a certain table, I get the following error:
> ERROR: invalid page header in block 984 of relation "pg_toast_17341_index"

If it's only that index that has gotten damaged, REINDEX will fix it, eg
    reindex index pg_toast.pg_toast_17341_index;

The bigger concern is whether there is other damage.  Have you had any
system crashes, indications of flaky hardware, etc on that machine?

            regards, tom lane

Re: Fix corrupt pg_toast table?

От
Michael Clark
Дата:
Hello again.

I have an update, hoping someone can steer me in the right direction here.

After receiving the "could not access transaction" error mentioned in my previous email I did some more digging and found that people have had success resolving this issue using pg_resetxlog.

I gave this a try, first without specifying any options, which did not resolve anything (it did not give an error on the command line).  I then read up on the command and passed a value for every swtich to set new values, and again this had no effect.  It did not give an error on the command line, but trying to reset the index gives the same error.  The response I get at the command line after running the pg_resetxlog command is: Transaction log reset

Here is the error I get when trying to reset this particular index:
2009-07-31 12:58:07.503 EDT [Test] - ERROR: could not access status of transaction 3839923882
2009-07-31 12:58:07.503 EDT [Test] - DETAIL: Could not open file "pg_clog/0E4E": No such file or directory.
2009-07-31 12:58:07.503 EDT [Test] - STATEMENT: reindex index pg_toast.pg_toast_17431_index;

Here is the final pg_resetxlog command I tried to resolve the issue:
pg_resetxlog -f -x 0x100000 -m 0x10000 -o 0x10000 -l 0x1,0x1,0xCD ./

Here are the filenames for the files in the various directories that led me to these switch values, as per the docs on pg_resetxlog:
-x switch - pg_clog folder had 1 file named: 0000
-m switch - pg_multixact/offsets had 1 file named: 0000
-o switch - pg_multixact/members had 1 file named: 0000
-l swtich - pg_xlog had 7 files, the one with the biggest named is: 0000000100000000000000CC


After running the pg_resetxlog command above I get the exact same error about pg_clog/0E4E missing.

Does anyone have any other ideas about how to solve this?

Thanks, 
Michael.


On Thu, Jul 30, 2009 at 6:08 PM, Michael Clark <codingninja@gmail.com> wrote:
Thanks for the reply!

On Thu, Jul 30, 2009 at 5:59 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Clark <codingninja@gmail.com> writes:
> I have a database which appears to have some corruption in a toast value.

> When I select a certain table, I get the following error:
> ERROR: invalid page header in block 984 of relation "pg_toast_17341_index"

If it's only that index that has gotten damaged, REINDEX will fix it, eg
       reindex index pg_toast.pg_toast_17341_index;


I tried this, and might have exposed another issue.
Now I am getting:
ERROR: could not access status of transaction 3839923882
DETAIL: could not open file "pg_clog/0E4E": No such file or directory.

I checked the pg_clog folder, and there is only a 0000 file.

Any ideas on this one?


The bigger concern is whether there is other damage.  Have you had any
system crashes, indications of flaky hardware, etc on that machine?

This happened on a customer machine which I have not had a chance to investigate.  I have copied the PGDATA folder to one of our systems to diagnose.

Thanks again,
Michael.


Re: Fix corrupt pg_toast table?

От
Greg Stark
Дата:
On Fri, Jul 31, 2009 at 8:01 PM, Michael Clark<codingninja@gmail.com> wrote:
> I tried this, and might have exposed another issue.
> Now I am getting:
> ERROR: could not access status of transaction 3839923882
> DETAIL: could not open file "pg_clog/0E4E": No such file or directory.
> I checked the pg_clog folder, and there is only a 0000 file.

How long has this database been in use? That's a very high transaction
number that would only be reached on a heavily used database after a
reasonably long period of use. If that's not likely than a likely
possibility is that your table has been overwritten with garbage.
Possibly by a filesystem bug or hardware failure.

You can work around this problem for that row by creating a file named
0E4E in the clog directory. I think you can fill it with zeros but if
you search the mailing list you'll find instructions for doing this
that are might have better suggestions.

You'll still be best off recovering rows one by one. If you have
trouble using the index you might find it more reliable (but more
tedious) to do it using:

select * from tab where ctid = '(0,0)'
select * from tab where ctid = '(0,1)'
select * from tab where ctid = '(0,2)'
...

The first digit is the page number and the second is the index on the
page which can go as high as about 250.

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: Fix corrupt pg_toast table?

От
Michael Clark
Дата:
Hello,

Thanks for the reply.

On Fri, Jul 31, 2009 at 5:24 PM, Greg Stark <gsstark@mit.edu> wrote:
> I tried this, and might have exposed another issue.
> Now I am getting:
> ERROR: could not access status of transaction 3839923882
> DETAIL: could not open file "pg_clog/0E4E": No such file or directory.
> I checked the pg_clog folder, and there is only a 0000 file.

How long has this database been in use? That's a very high transaction
number that would only be reached on a heavily used database after a
reasonably long period of use. If that's not likely than a likely
possibility is that your table has been overwritten with garbage.
Possibly by a filesystem bug or hardware failure.

Not really that long.  The database was created in early June, and if that is supposed to be a transaction count (3839923882), then yeah that seems way out there.
 

You can work around this problem for that row by creating a file named
0E4E in the clog directory. I think you can fill it with zeros but if
you search the mailing list you'll find instructions for doing this
that are might have better suggestions.

When you say fill it with zeros, is there a certain file size I need to create?
(I will do some more googling to see if I can find more examples of people recovering from this problem)
 

You'll still be best off recovering rows one by one. If you have
trouble using the index you might find it more reliable (but more
tedious) to do it using:

select * from tab where ctid = '(0,0)'
select * from tab where ctid = '(0,1)'
select * from tab where ctid = '(0,2)'
...

The first digit is the page number and the second is the index on the
page which can go as high as about 250.

I am sorry, that is a little over my head with my knowledge gained with PG thus far.
Is there some background information I could read that would explain this to me?  

I tried that select a couple of times, and eventually got the "invalid page header in the block 984 of relation "pg_toast_17431_index" error message again.   ctid = '(0,6)' gave it to me. 
(Although, I don't know what I am doing here! :)


Thanks again for the reply, your help is appreciated!
Michael.

Re: Fix corrupt pg_toast table?

От
Tom Lane
Дата:
Michael Clark <codingninja@gmail.com> writes:
> On Fri, Jul 31, 2009 at 5:24 PM, Greg Stark <gsstark@mit.edu> wrote:
>>> Now I am getting:
>>> ERROR: could not access status of transaction 3839923882

>> How long has this database been in use? That's a very high transaction
>> number that would only be reached on a heavily used database after a
>> reasonably long period of use. If that's not likely than a likely
>> possibility is that your table has been overwritten with garbage.
>> Possibly by a filesystem bug or hardware failure.

> Not really that long.  The database was created in early June, and if that
> is supposed to be a transaction count (3839923882), then yeah that seems way
> out there.

It seems certain that you're looking at corrupt data.  Trashed data
frequently manifests this way, because the transaction ID is the first
field of tuple headers that the database can cross-check with any
amount of rigor.

At this point we know that at least two unrelated disk blocks have been
clobbered by something (first that index page, and now this).  There's
no very good reason to think there are only two :-(.  I'd bet at least
a cheese sandwich on hardware problems.  Test and fix/replace your
hardware, then go back to your last backup (I hope you've got one).

            regards, tom lane

Re: Fix corrupt pg_toast table?

От
Michael Clark
Дата:
Hello Tom and others who replied.

I appreciate the help tracking this down and eventually determining there is no point tracking it down further.  
We will do our best recovering what is accessible from the corrupt DB and an older backup.

Thanks,
Michael.


On Fri, Jul 31, 2009 at 7:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Michael Clark <codingninja@gmail.com> writes:
> On Fri, Jul 31, 2009 at 5:24 PM, Greg Stark <gsstark@mit.edu> wrote:
>>> Now I am getting:
>>> ERROR: could not access status of transaction 3839923882

>> How long has this database been in use? That's a very high transaction
>> number that would only be reached on a heavily used database after a
>> reasonably long period of use. If that's not likely than a likely
>> possibility is that your table has been overwritten with garbage.
>> Possibly by a filesystem bug or hardware failure.

> Not really that long.  The database was created in early June, and if that
> is supposed to be a transaction count (3839923882), then yeah that seems way
> out there.

It seems certain that you're looking at corrupt data.  Trashed data
frequently manifests this way, because the transaction ID is the first
field of tuple headers that the database can cross-check with any
amount of rigor.

At this point we know that at least two unrelated disk blocks have been
clobbered by something (first that index page, and now this).  There's
no very good reason to think there are only two :-(.  I'd bet at least
a cheese sandwich on hardware problems.  Test and fix/replace your
hardware, then go back to your last backup (I hope you've got one).

                       regards, tom lane