On Mon, 2007-12-31 at 18:35 -0500, Tom Lane wrote:
> "Mason Hale" <masonhale@gmail.com> writes:
> >> This could be the kernel's fault, but I'm wondering whether the
> >> RAID controller is going south.
>
> > To clarify a bit further -- on the production server, the data is written to
> > a 10-disk RAID 1+0, but the pg_xlog directory is symlinked to a separate,
> > dedicated SATA II disk.
>
> > There is a similar setup on the standby server, except that in addition to
> > the RAID for the data, and a separate SATA II disk for the pg_xlog, there is
> > another disk (also SATA II) dedicated for the archive of wal files copied
> > over from the production server.
>
> Oh. Maybe it's one of those disks' fault then. Although WAL corruption
> would not lead to corruption of the primary DB as long as there were no
> crash/replay events. Maybe there is more than one issue here, or maybe
> it's the kernel's fault after all.

The standby replays from the archive drive, whereas the primary does
crash recovery from the pg_xlog. We know that the primary is corrupted
in some way, and so is the standby, plus we know the standby corruption
occurred after it was copied to the archive and restored/recovered. So
we must have problems on at least two drives.

If the primary has been through at least one recent crash recovery,
then a common issue related to the SATA II drives might explain all of
the corruptions. That might be the device driver, but it could be other
things as well.
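
One way to narrow down which drive is at fault would be to compare
checksums of WAL segment files that exist in two places, for example a
copy of the primary's pg_xlog and the standby's archive directory. The
following is only a rough sketch under some assumptions: both
directories are readable from one host, only completed segments are
compared (the segment currently being written will legitimately
differ), and the function names and paths are hypothetical, not part of
PostgreSQL.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_segments(dir_a, dir_b):
    """Return sorted names of files present in both directories
    whose contents differ."""
    a = Path(dir_a)
    b = Path(dir_b)
    # Only files that exist under the same name on both sides are compared.
    common = sorted(
        p.name for p in a.iterdir()
        if p.is_file() and (b / p.name).is_file()
    )
    return [n for n in common if sha256_of(a / n) != sha256_of(b / n)]


# Hypothetical usage; both paths are placeholders:
# print(mismatched_segments("/path/to/pg_xlog_copy", "/path/to/wal_archive"))
```

A mismatch would only show that the two copies differ, not which copy
is the damaged one, so it is a starting point rather than a diagnosis.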
--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com