Discussion: (Again) Datacorruption using 7.4.2 on XFS/raid1
Hi,

We have again experienced data corruption using 7.4.2 on an XFS filesystem on top of a software-raid (md) raid-1.

After a server crash last night (it was a rather strange crash - the machine was still pingable, but no login was possible, and postgres and apache didn't respond to requests any more) we hard-reset the machine. It came up again nicely, but a few hours later the following errors occurred when trying to access certain tables. (Those tables are updated heavily - each day about 2 million tuples are inserted, and the old versions of those tuples deleted.)

ERROR: could not access status of transaction 34048
DETAIL: could not open file "/var/lib/postgres/data/pg_clog/0000": No such file or directory

While reading linux-kernel today, I stumbled upon a description of a rather strange XFS behaviour. It seems to zero a block if the block was updated and the corresponding metadata update was flushed to disk, but the data itself was not. It does not happen if the file is fsync()ed after the update - but I was wondering what would happen if the machine crashed between the write() and the fsync().

The lkml thread about this can be found here:
http://www.ussg.iu.edu/hypermail/linux/kernel/0407.1/0359.html

Could this XFS behaviour cause the postgres problems we are seeing?

greetings, Florian Pflug
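For readers following the write()/fsync() question above, here is a minimal sketch of the pattern being discussed. It is not PostgreSQL's code; the function name `durable_write` is hypothetical and chosen only for illustration. The point it shows is that data is only guaranteed to be on disk once fsync() has returned, so a crash between the write and the fsync leaves the on-disk contents up to the filesystem.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and return only after fsync() has succeeded.

    Hypothetical helper for illustration; not taken from PostgreSQL.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write() may write fewer bytes than requested, so loop.
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Until this call returns, the data may exist only in the page
        # cache. A crash at this point is the window the question is about.
        os.fsync(fd)
    finally:
        os.close(fd)
```

A caller that needs the data to survive a crash should treat the write as incomplete until `durable_write` returns without raising.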
FYI, I have seen the Linux software RAID fail to detect failed drives and cause filesystem corruption on many occasions. I would recommend staying away from it. Maybe what you describe is a problem with PG, but I doubt it.

On Jul 12, 2004, at 12:31 PM, Florian G. Pflug wrote:
> Hi
>
> We have again experienced data-corruption using 7.4.2 on an XFS Filesystem
> on top of a software-raid (md) raid-1.
>
> After a server crash last night (It was a rather strange crash - The machine
> was still pingable, but no login was possible, and postgres and apache
> didn't respond to requests any more) we hard-reset the machine. It came up
> again nicely, but a few hours later the following errors occurred when trying
> to access certain tables. (Those tables are updated heavily - each day about
> 2 million tuples are inserted, and the old versions of those tuples
> deleted).
>
> ERROR: could not access status of transaction 34048
> DETAIL: could not open file "/var/lib/postgres/data/pg_clog/0000": No such
> file or directory
>
> While reading linux-kernel today, I stumbled upon a description of a rather
> strange XFS behaviour. It seems to zero a block if the block was updated,
> and the corresponding metadata-update was flushed to disk, but not the data
> itself.
> It does not happen if the file is fsync()ed after the update - but I was
> wondering what would happen if the machine crashed between the write() and
> the fsync().
>
> The lkml thread about this can be found here:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0407.1/0359.html
>
> Could this XFS behaviour cause the postgres problems we are seeing?
>
> greetings, Florian Pflug
On Mon, 12 Jul 2004 20:31:15 +0200, Florian G. Pflug <fgp@phlo.org> wrote:
> Hi
>
> We have again experienced data-corruption using 7.4.2 on an XFS Filesystem
> on top of a software-raid (md) raid-1.
>
> After a server crash last night (It was a rather strange crash - The machine
> was still pingable, but no login was possible, and postgres and apache
> didn't respond to requests any more) we hard-reset the machine. It came up
> again nicely, but a few hours later the following errors occurred when trying
> to access certain tables. (Those tables are updated heavily - each day about
> 2 million tuples are inserted, and the old versions of those tuples
> deleted).
>
> ERROR: could not access status of transaction 34048
> DETAIL: could not open file "/var/lib/postgres/data/pg_clog/0000": No such
> file or directory

You don't say what kind of disks you are using. It sounds very much like a hardware problem, though. I had a PostgreSQL installation on a pair of IDE disks with software RAID1 / Ext3 die very nastily with similar error messages. It turned out that one of the disks was very defective and the RAID wasn't handling it.

On the other hand, after copying the files from the good disk, PostgreSQL started with barely a complaint and I couldn't detect any corruption.

Ian Barwick
On Mon, Jul 12, 2004 at 01:22:02PM -0600, Brian Hirt wrote:
> FYI, I have seen the SW linux raid not detect failed drives and cause
> filesystem corruption on many occasions. I would recommend staying
> away from it. Maybe what you describe is a problem with PG but, i
> doubt it.

Hi,

I was under the impression that this only applies to the ataraid drivers (the drivers for Promise and HPT raid controllers that don't really provide hardware raid, but do provide a BIOS that is capable of booting from raid1 and raid0 arrays). Well, I guess I'll have to figure out some way to test the software raid.

greetings, Florian Pflug
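As one possible starting point for checking the software raid, the sketch below scans text in the format of /proc/mdstat for arrays that md itself reports as degraded (a `_` in the `[UU]` status field). The function name `degraded_arrays` and the sample text are made up for illustration. It can only show failures md has already noticed, so it would not catch the undetected drive failures described earlier in the thread.

```python
import re
from typing import List


def degraded_arrays(mdstat_text: str) -> List[str]:
    """Return names of md arrays whose status field contains '_'.

    Hypothetical helper; expects text in the format of /proc/mdstat.
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like [UU] (all members up) or [U_] (one missing).
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded


# Illustrative sample only, not output from the machine in this thread.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md1 : active raid1 sda2[0]
      979840 blocks [2/1] [U_]
unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of `SAMPLE`.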