Thread: Corrupted database's files (linux RAID5 + PostgreSQL 8.3.0)

Corrupted database's files (linux RAID5 + PostgreSQL 8.3.0)

From: Peter Petrov
Date:
Hi,

Today one of the disks was marked as failed, and now some files are
corrupted.
I've decided to copy the pgsqldata directory and try to fix the
PG_VERSION files (see below for what PostgreSQL complains about), then
see if the database will come up.
While the files are copying, I'm open to any other ideas on how to deal
with the problem ;)
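
A minimal sketch of the copy step, assuming the server is stopped and
there is scratch space outside the degraded arrays. snapshot_datadir is
a hypothetical helper and both paths are placeholders, not names from
this thread:

```shell
#!/bin/sh
# Copy a stopped cluster's data directory to scratch space and verify the copy.
# Usage: snapshot_datadir SOURCE_DATADIR DESTINATION   (both are placeholders)
snapshot_datadir() {
    src=$1
    dst=$2
    if [ ! -d "$src" ]; then
        echo "no such directory: $src" >&2
        return 1
    fi
    # postmaster.pid is present while the server runs; copying a live
    # cluster this way would give an inconsistent snapshot.
    if [ -e "$src/postmaster.pid" ]; then
        echo "server appears to be running: $src" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "refusing to overwrite: $dst" >&2
        return 1
    fi
    # -a preserves ownership, permissions and timestamps. A read error on a
    # damaged file makes cp return non-zero, which is useful information too.
    cp -a "$src" "$dst" || { echo "copy reported errors" >&2; return 1; }
    # Byte-for-byte comparison of the two trees.
    diff -r "$src" "$dst" >/dev/null
}
```

Any repair attempts can then be made on the copy, leaving the original
files untouched.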

The PostgreSQL log suggests running initdb (the HINT message in the log
file). What will happen if I then try to copy the rest of the structure
into the newly created database cluster?

linux (Slackware 12.0.0), software RAID5 (partition based) + PostgreSQL
8.3.0:

Here's what happened (from dmesg):

---------------------------------------
# uname -a
Linux xeonito 2.6.21.5 #3 SMP Tue Oct 2 16:20:48 EEST 2007 i686 Intel(R)
Xeon(R) CPU           E5335  @ 2.00GHz GenuineIntel GNU/Linux

---------------------------------------
# dmesg
sd 0:0:3:0: SCSI error: return code = 0x08000002
sdd: Current: sense key=0x4
    ASC=0x44 ASCQ=0x0
Info fld=0x0
end_request: I/O error, dev sdd, sector 159620863
sd 0:0:3:0: SCSI error: return code = 0x08000002
sdd: Current: sense key=0x4
    ASC=0x44 ASCQ=0x0
Info fld=0x0
end_request: I/O error, dev sdd, sector 159617119
raid5: Disk failure on sdd1, disabling device. Operation continuing on 4
devices
......

RAID5 conf printout:
 --- rd:5 wd:4
 disk 0, o:1, dev:sdb1
 disk 1, o:1, dev:sdc1
 disk 2, o:0, dev:sdd1
 disk 3, o:1, dev:sde1
 disk 4, o:1, dev:sdf1
RAID5 conf printout:
 --- rd:5 wd:4
 disk 0, o:1, dev:sdb1
 disk 1, o:1, dev:sdc1
 disk 3, o:1, dev:sde1
 disk 4, o:1, dev:sdf1

---------------------------------------

# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5]
[raid4] [multipath] [faulty]
md1 : active raid5 sdb1[0] sdf1[4] sde1[3] sdd1[5](F) sdc1[1]
      585924608 blocks level 5, 8192k chunk, algorithm 2 [5/4] [UU_UU]

md0 : active raid5 sdb2[0] sdf2[4] sde2[3] sdd2[5](F) sdc2[1]
      390053888 blocks level 5, 1024k chunk, algorithm 2 [5/4] [UU_UU]

unused devices: <none>
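
The [UU_UU] field above is what marks both arrays as degraded: each "U"
is a working member and "_" is a missing or failed one. A minimal
sketch of checking for that automatically; degraded_arrays is a
hypothetical helper that reads /proc/mdstat-formatted text on stdin:

```shell
#!/bin/sh
# List md arrays whose member-status field (e.g. [UU_UU]) contains "_",
# i.e. arrays running with at least one missing or failed member.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }              # remember the array name
        name != "" && match($0, /\[[U_]+\]/) {   # status field such as [UU_UU]
            if (index(substr($0, RSTART, RLENGTH), "_") > 0)
                print name
            name = ""
        }
    '
}

# Typical use on a live system:
#   degraded_arrays < /proc/mdstat
```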

---------------------------------------

And here's what the partitions look like:

# fdisk  -l /dev/sdb

Disk /dev/sdb: 249.8 GB, 249865175040 bytes
255 heads, 63 sectors/track, 30377 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       18237   146488671   83  Linux
/dev/sdb2           18238       30377    97514550   83  Linux
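
As a sanity check, the block counts in /proc/mdstat can be reproduced
from the partition sizes above. A minimal sketch, assuming all five
members are partitioned like sdb and that the arrays use version 0.90
metadata (neither is stated in this post); raid5_blocks is a
hypothetical helper:

```shell
#!/bin/sh
# Usable size of an md RAID5 array in 1 KiB blocks, ASSUMING version 0.90
# metadata (superblock kept in the last 64 KiB-aligned 64 KiB of each member).
# Usage: raid5_blocks MEMBER_KIB CHUNK_KIB MEMBERS
raid5_blocks() {
    member=$1
    chunk=$2
    n=$3
    usable=$(( member / 64 * 64 - 64 ))    # strip the superblock area
    usable=$(( usable / chunk * chunk ))   # round down to whole chunks
    echo $(( usable * (n - 1) ))           # one member's worth holds parity
}
```

With the sdb1 and sdb2 sizes from the fdisk output,
"raid5_blocks 146488671 8192 5" and "raid5_blocks 97514550 1024 5" give
585924608 and 390053888, the sizes reported for md1 and md0.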

---------------------------------------
Kernel parameters:

echo 4200000000 > /proc/sys/kernel/shmmax
echo 4200000000 > /proc/sys/kernel/shmall
sysctl -w vm.overcommit_memory=2

echo 8192 >  /sys/block/md0/md/stripe_cache_size
echo 8192 >  /sys/block/md1/md/stripe_cache_size
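
One detail in the settings above: kernel.shmmax is a byte count, while
kernel.shmall is counted in pages, so writing the same number to both
sets shmall far higher than was probably intended. This is most likely
unrelated to the corruption. A minimal sketch of the conversion;
bytes_to_pages is a hypothetical helper and the 4096-byte page size in
the example is an assumption:

```shell
#!/bin/sh
# Convert a byte count to the number of pages needed to hold it (rounded up).
# Usage: bytes_to_pages BYTES [PAGE_SIZE]
# PAGE_SIZE defaults to the running system's page size.
bytes_to_pages() {
    bytes=$1
    page=${2:-$(getconf PAGE_SIZE)}
    echo $(( (bytes + page - 1) / page ))
}

# Example: a shmall value matching a 4200000000-byte shmmax, 4096-byte pages
#   bytes_to_pages 4200000000 4096
```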

---------------------------------------


Both md0 and md1 are used by PostgreSQL. Initially it was not designed
to use the whole of disks sdb-sdf, but due to size requirements I also
added the remaining unused space for PostgreSQL to use.


And here's the PostgreSQL log (the FATAL message appears when I try to
connect to the database; of course this is the case for the most
interesting database ... some other small databases are working fine):

LOG:  received smart shutdown request
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
LOG:  could not create IPv6 socket: Address family not supported by protocol
LOG:  database system was shut down at 2008-05-20 17:54:17 EEST
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
FATAL:  "base/16399" is not a valid data directory
DETAIL:  File "base/16399/PG_VERSION" does not contain valid data.
HINT:  You might need to initdb.

Of course base/16399/PG_VERSION contains something strange rather than
the version information:

# cat base/16399/PG_VERSION
X
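
On an 8.3 cluster each PG_VERSION file normally holds just the major
version, "8.3". A minimal sketch for listing every PG_VERSION file that
does not; check_pg_version is a hypothetical helper, and a damaged
PG_VERSION suggests that other files in the same directory may be
damaged as well, so rewriting it is not a repair by itself:

```shell
#!/bin/sh
# Print every PG_VERSION file under a data directory whose content is not
# exactly the expected major version. Returns 0 if all match, 1 otherwise.
# Usage: check_pg_version DATADIR [EXPECTED]   (EXPECTED defaults to 8.3)
check_pg_version() {
    datadir=$1
    expected=${2:-8.3}
    # -L follows symlinks, so tablespaces linked from pg_tblspc are included.
    bad=$(find -L "$datadir" -type f -name PG_VERSION |
        while IFS= read -r f; do
            # Command substitution drops the trailing newline before comparing.
            [ "$(cat "$f" 2>/dev/null)" = "$expected" ] || printf '%s\n' "$f"
        done)
    [ -z "$bad" ] && return 0
    printf '%s\n' "$bad"
    return 1
}
```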


---------------------------------------




Re: Corrupted database's files (linux RAID5 + PostgreSQL 8.3.0)

From: Sim Zacks
Date:
If you have a backup, the easiest way would be to restore it. There is
also a way to replay the database's transaction log into the database
from a point in time (i.e. from the time of the last backup), known as
point-in-time recovery, so that you can get your data back.
I've never actually seen it work though.
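
For reference, in 8.3 this kind of replay is driven by a recovery.conf
file placed in a restored base backup. It only works if WAL archiving
(archive_command) was already set up and a base backup was taken before
the failure; the thread does not say whether that was the case. A
minimal sketch; write_recovery_conf is a hypothetical helper, and the
archive directory and target time are placeholders:

```shell
#!/bin/sh
# Write a recovery.conf for point-in-time recovery into a restored base backup.
# Usage: write_recovery_conf DATADIR WAL_ARCHIVE_DIR 'YYYY-MM-DD HH:MM:SS'
write_recovery_conf() {
    datadir=$1
    archive=$2
    target=$3
    cat > "$datadir/recovery.conf" <<EOF
# Fetch archived WAL segments; %f is the file name, %p the destination path.
restore_command = 'cp $archive/%f "%p"'
# Stop replay at this moment instead of at the end of the archive.
recovery_target_time = '$target'
EOF
}
```

Starting the server on that directory then replays the archived WAL up
to the target time.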


