Thread: ERROR: XX001 (Critical and Urgent)

ERROR: XX001 (Critical and Urgent)

From: Siddharth Shah

Hello All,

I am getting the following error while vacuuming the database:

    ERROR:  XX001: could not read block 17 of relation base/16386/2619: read only 0 of 8192 bytes

Both manual VACUUM and autovacuum constantly use high CPU. They cannot skip the corrupted table, and they emit this message at regular intervals. fsync is off. From strace, I found that the semop call was in an infinite loop.

I then tried turning fsync on. Now the manual vacuum process uses high CPU, strace shows no output (maybe a deadlock situation), and there is no error or warning from the postgres daemon.

Postgres version: 8.4.3 (migrated data from 8.4.1)

What can the issue be? Is it a consequence of database table corruption? Can turning fsync on prevent such corruption scenarios?

Thanks,
Siddharth
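
The error message above can be read arithmetically. Block numbers start at 0, so with the 8192-byte block size shown in the message, block 17 begins at byte 17 * 8192 = 139264, and the relation file must be at least 18 * 8192 = 147456 bytes long for the read to succeed. "read only 0 of 8192 bytes" suggests the file is shorter than that. A minimal sketch of that size check follows; the function name `block_readable` is hypothetical, and it assumes the block falls inside the first segment file of the relation.

```shell
# Sketch: is a relation file long enough to contain a given block?
# Assumes 8192-byte blocks, as reported in the error message.
BLOCK_SIZE=8192

# block_readable FILE BLOCKNO
# Exit status 0 if FILE holds BLOCKNO completely, 1 if it is too short,
# 2 if FILE does not exist.
block_readable() (
    file=$1
    blockno=$2
    [ -f "$file" ] || { echo "no such file: $file" >&2; exit 2; }
    # wc -c prints the size in bytes; the arithmetic expansion strips padding.
    size=$(( $(wc -c < "$file") ))
    # Blocks are numbered from 0, so block N ends at byte (N + 1) * BLOCK_SIZE.
    needed=$(( (blockno + 1) * BLOCK_SIZE ))
    echo "file size: $size bytes, needed for block $blockno: $needed bytes"
    [ "$size" -ge "$needed" ]
)
```

For the file named in the error, the call would be `block_readable "$PGDATA/base/16386/2619" 17`, where `$PGDATA` stands for the data directory (an assumed variable name).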

Re: ERROR: XX001 (Critical and Urgent)

From: Kevin Grittner

Siddharth Shah <siddharth.shah@elitecore.com> wrote:

> fsync is off

If you are running the database with fsync off and there is any sort
of unusual termination, your database will probably be corrupted.  I
recommend restoring from your last good backup.  If you don't have
one, recovery is going to be painful; I recommend contracting with
one of the many companies which offer PostgreSQL support.  (I'm not
affiliated with any of them.)

> I have tried with making fsync on

That may help prevent further corruption, but will do nothing to
help recover from the damage already done.

> Postgres Version : 8.4.3 (Migrated data from 8.4.1)

What do you mean by that?  You installed 8.4.3 and reindexed hash
indexes?

> What can be issue ? Is it issue coming after database table
> corruption

Yes.

> Can fsync on can prevent such (corruption) scenarios ?

Yes.

-Kevin

Re: ERROR: XX001 (Critical and Urgent)

From: Siddharth Shah

Thanks Kevin.

Yes, I installed 8.4.3. I then found that DDL and DML statements were failing to execute in some distributions, which is why I decided to reindex.

How can I verify that this is database corruption?

xdb=# \dt;
ERROR:  index "pg_class_relname_nsp_index" contains unexpected zero page at block 33
HINT:  Please REINDEX it.
xdb=# analyse verbose pg_class_relname_nsp_index;
ANALYZE
xdb=# \dt;   
ERROR:  index "pg_class_relname_nsp_index" contains unexpected zero page at block 33
HINT:  Please REINDEX it.
xdb=# reindex index pg_class_relname_nsp_index;

Now the reindex is taking high CPU and postgres appears to be stuck.


I don't have any backup available. Is there any way to fix this?



Re: ERROR: XX001 (Critical and Urgent)

From: Kevin Grittner

[rearranged to put the most critical point first]

Siddharth Shah <siddharth.shah@elitecore.com> wrote:

> I don't have any backup available, Is there any way to fix this ?

I *strongly* recommend that you shut down the database and take a
file copy of the whole data tree (everything under what -D points to
on the server startup) which you should keep until long after you
think everything is working OK again.  Before you do anything else.
You are at risk of losing everything in the database, and one
misstep could put you over the edge.  If this is a production
database, tell the users that it is down until further notice.
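
A cold file-level copy along these lines might look like the sketch below. It assumes the server has already been shut down, and it treats the presence of `postmaster.pid` in the data directory as a rough sign that a server may still be using it. The function name `cold_copy` and the destination path in the usage note are hypothetical.

```shell
# cold_copy SRC_DATADIR DEST_DIR
# Copies the whole data tree, preserving permissions and timestamps.
cold_copy() (
    src=$1
    dest=$2
    [ -d "$src" ] || { echo "not a directory: $src" >&2; exit 2; }
    # postmaster.pid exists while a server is running on this data directory
    # (a stale one can also be left behind after a crash).
    if [ -e "$src/postmaster.pid" ]; then
        echo "postmaster.pid found in $src; stop the server first" >&2
        exit 1
    fi
    # Refuse to overwrite an earlier copy.
    if [ -e "$dest" ]; then
        echo "destination already exists: $dest" >&2
        exit 1
    fi
    cp -Rp "$src" "$dest"
)
```

With the data directory used later in this thread, a call could be `cold_copy /var/db /backup/db-copy`. Refusing to overwrite an existing destination keeps the first untouched copy safe if the function is run again after later repair attempts.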

> What can be the method to verify that it's a database corruption ?

> ERROR:  index "pg_class_relname_nsp_index" contains unexpected
> zero page at block 33

Getting an error like that indicates database corruption.

> HINT:  Please REINDEX it.
> xdb=# reindex index pg_class_relname_nsp_index;
>
> Now INDEXing taking High CPU and postgres baffled.

That is an index on the table which describes all your tables and
indexes.  It normally doesn't take a long time to reindex.  You
should consider doing your recovery in single-user mode (*AFTER* you
make that copy):

http://www.postgresql.org/docs/8.4/interactive/app-postgres.html

After trying reindex in that context, please post again.

-Kevin

Re: ERROR: XX001 (Critical and Urgent)

From: Siddharth Shah

Kevin Grittner wrote:

> I *strongly* recommend that you shut down the database and take a
> file copy of the whole data tree (everything under what -D points to
> on the server startup) [...]

Yes Kevin, I have taken a backup of DATADIR.

> You should consider doing your recovery in single-user mode (*AFTER*
> you make that copy)

    postgres --single -P -D $DATADIR -p 5433 xdb

Same behavior in single-user mode.


Re: ERROR: XX001 (Critical and Urgent)

From: Siddharth Shah

One more point: this has been observed twice during product firmware updates, which upgrade Postgres from 8.4.1 to 8.4.3.
Abrupt shutdowns have never led to this type of corruption, even though fsync is always off.

Thanks,
Siddharth



Re: ERROR: XX001 (Critical and Urgent)

From: Kevin Grittner

Siddharth Shah <siddharth.shah@elitecore.com> wrote:

>>> xdb=# reindex index pg_class_relname_nsp_index;
>>>
>>> Now INDEXing taking High CPU and postgres baffled.

>> consider doing your recovery in single-user mode

>     postgres --single -P -D $DATADIR -p 5433 xdb
>     Same behavior in single mode.

How long did you leave it running?  Did you get any messages?  Is
there anything in the log?  What do CPU usage and disk usage look
like during the attempt?
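
One way to put numbers on the CPU question is to sample the process's accumulated CPU time twice and compare. The sketch below assumes a Linux system with `/proc` mounted; the function names `cpu_ticks` and `sample_cpu` are hypothetical. It reads utime and stime (fields 14 and 15 of `/proc/<pid>/stat`, measured in clock ticks).

```shell
# cpu_ticks STAT_LINE
# Prints "utime stime" taken from one line of /proc/<pid>/stat.
cpu_ticks() (
    line=$1
    # Field 2 (the command name) is in parentheses and may contain spaces,
    # so drop everything up to the last ") " before splitting into fields.
    rest=${line##*") "}
    set -- $rest
    # After the cut, $1 is field 3 (state), so fields 14 and 15 are $12 and $13.
    echo "${12} ${13}"
)

# sample_cpu PID SECONDS
# Prints the user and system ticks PID consumed during the interval.
sample_cpu() (
    pid=$1
    secs=$2
    before=$(cat "/proc/$pid/stat") || exit 1
    sleep "$secs"
    after=$(cat "/proc/$pid/stat") || exit 1
    set -- $(cpu_ticks "$before") $(cpu_ticks "$after")
    echo "user: $(( $3 - $1 )) ticks, system: $(( $4 - $2 )) ticks in ${secs}s"
)
```

A call such as `sample_cpu 13419 10` would cover ten seconds of the reindex attempt. If nearly all ticks are user time and system time barely moves, the process is probably spinning in user space rather than waiting on disk, though that is only a hint and not a diagnosis.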

-Kevin

Re: ERROR: XX001 (Critical and Urgent)

From: Siddharth Shah

Kevin Grittner wrote:

> How long did you leave it running?  Did you get any messages?  Is
> there anything in the log?  What do CPU usage and disk usage look
> like during the attempt?

Kevin, it starts normally, and I have successfully retrieved data from a few tables.
But I am not able to run \dS, \dT or \dt. As you said, this is the index for the Postgres tables and indexes.
Now when I run a reindex of pg_class_relname_nsp_index, it takes 99% CPU:

  PID  PPID USER     STAT   VSZ %MEM %CPU COMMAND
13419 13418 nobody   R    39172   8%  99% postgres --single -P -D /var/db -p 5433 xdb

It has been running for 10 minutes and there is still no output or log message.

Re: ERROR: XX001 (Critical and Urgent)

From: Tom Lane

Siddharth Shah <siddharth.shah@elitecore.com> writes:
>   PID  PPID USER     STAT   VSZ %MEM %CPU COMMAND
> 13419 13418 nobody   R    39172   8%  99% postgres --single -P -D
> /var/db -p 5433 xdb
> It's been running from 10 minutes still there is no output or logs.

What does "strace" show that process is doing?

            regards, tom lane