Discussion: Recover from corrupted database due to failing disk
Dear all,

some days ago I was tasked to recover a production database from a failing single-disk (!) system. I initially planned to restore from backups but, due to the bad disk, the backups (done via pg_dumpall) were failing and nobody cared to notice (!!). Bottom line: the system was failing and no backups were in place (!!!). I perfectly understand this is (very) bad management, but I am now tasked to somehow recover from this situation.

This is a very old installation. Some details:
- O.S.: CentOS 5.6 x86_64
- PostgreSQL version: postgresql-8.1.22-1.el5_5.1

By using ddrescue, I successfully migrated the entire old failing disk to a healthy one, solving the first problem (the failing disk). However, backups continue to fail with an "invalid page header in block" message. Moreover, I am very near the xid wraparound limit and, as vacuum fails due to the invalid blocks, I expect a database shutdown (triggered by the 1M-transactions-before-wraparound protection) within some days.

From my understanding, both problems *should* be solved by enabling "zero_damaged_pages" and executing a "vacuumdb -a". Is this expectation correct? Will a "reindexdb -a" be necessary?

This is my current action plan:
- stop postgresql;
- make a backup copy of the entire /var/lib/pgsql directory;
- set zero_damaged_pages=on and increase maintenance_work_mem (to let vacuum operate faster);
- start postgresql;
- execute a "vacuumdb -a";
- if necessary, execute a "reindexdb -a".

Does that sound reasonable? Do you have any suggestions/warnings?

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
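For the record, the action plan above could be sketched as a shell script. Paths and service names are assumptions based on a stock CentOS 5 / PostgreSQL 8.1 install, and each step is only printed (dry-run), so nothing touches a live cluster:

```shell
#!/bin/sh
# Dry-run sketch of the recovery plan above; every step is echoed, not executed.
PGHOME=/var/lib/pgsql                     # assumed default location on CentOS 5
BACKUP=/root/pgsql-copy-$(date +%Y%m%d)   # hypothetical destination for the cold copy

run() { echo "WOULD RUN: $*"; }           # replace the echo with "$@" to execute for real

run service postgresql stop
run cp -a "$PGHOME" "$BACKUP"             # cold copy of the whole cluster first
# Then edit postgresql.conf (8.1 takes a plain integer in KB, no unit suffixes):
#   zero_damaged_pages = on               # read damaged pages as all-zeroes instead of erroring
#   maintenance_work_mem = 262144         # ~256MB, to let vacuum operate faster
run service postgresql start
run vacuumdb -a                           # vacuums every database, pushing back xid wraparound
run reindexdb -a                          # rebuild indexes that may reference zeroed pages
```

Swapping the `echo` for `"$@"` turns the sketch into the real procedure once the cold copy destination is confirmed to be on a healthy disk.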
On 11/2/16 2:02 PM, Gionatan Danti wrote:
> However, backups continue to fail with an "invalid page header in block"
> message. Moreover, I am very near the xid wraparound limit and, as vacuum
> fails due to the invalid blocks, I expect a database shutdown (triggered
> by the 1M-transactions-before-wraparound protection) within some days.

That means at least some of the Postgres files have been damaged (possibly due to the failing disk). Postgres will complain when it sees internal data structures that don't make sense, but it has no way to know if any of the user data has been screwed up.

> From my understanding, both problems *should* be solved by enabling
> "zero_damaged_pages" and executing a "vacuumdb -a". Is this expectation
> correct? Will a "reindexdb -a" be necessary?

I wouldn't trust the existing cluster that far. Since it sounds like you have no better options, you could use zero_damaged_pages to allow a pg_dumpall to complete, but you're going to end up with missing data. So what I'd suggest would be:

stop Postgres
make a copy of the cluster
start with zero_damaged_pages
pg_dumpall
stop and remove the cluster (make sure you've got that backup)
create a new cluster and load the dump

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
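The dump-and-reload sequence suggested above might look like this as commands. File paths are hypothetical, and the helper only prints each step, so the sketch is safe to read or run anywhere:

```shell
#!/bin/sh
# Dry-run sketch of the suggested dump-and-reload recovery; prints steps only.
step() { echo "STEP: $*"; }

step service postgresql stop
step cp -a /var/lib/pgsql /root/pgsql-cold-copy   # keep this until the new cluster is verified
step "edit postgresql.conf: zero_damaged_pages = on"
step service postgresql start
step "pg_dumpall > /root/cluster.sql"             # zeroed pages mean some rows will be missing
step service postgresql stop
step rm -rf /var/lib/pgsql/data                   # only once the cold copy is confirmed good
step "initdb -D /var/lib/pgsql/data"              # or let the CentOS init script initdb on first start
step service postgresql start
step "psql -f /root/cluster.sql postgres"         # reload the dump into the fresh cluster
```

Note that in 8.1 `pg_dumpall` writes to stdout, hence the shell redirection rather than an output-file flag.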
On 11/2/16 6:21 PM, Jim Nasby wrote:
> I wouldn't trust the existing cluster that far. Since it sounds like you
> have no better options, you could use zero_damaged_pages to allow a
> pg_dumpall to complete, but you're going to end up with missing data. So
> what I'd suggest would be:
>
> stop Postgres
> make a copy of the cluster
> start with zero_damaged_pages
> pg_dumpall
> stop and remove the cluster (make sure you've got that backup)
> create a new cluster and load the dump

Oh, and while you're at it, upgrade to a version that's supported. 8.1 has been out of support for 5+ years.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
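One way to fold the upgrade into the reload is to take the dump from a second machine running supported client tools, since a newer pg_dumpall can generally dump an older server (the 9.x-era tools still supported 8.x servers). Host names here are hypothetical, and the helper only prints the commands:

```shell
#!/bin/sh
# Sketch: dump the old 8.1 cluster over the network from a machine with
# current PostgreSQL client tools, then load into a fresh modern cluster.
plan() { echo "PLAN: $*"; }

# On the new server, pull the dump from the old box ("oldbox" is hypothetical):
plan "pg_dumpall -h oldbox -p 5432 -U postgres > /srv/old-cluster.sql"
# ...then load it into the freshly initialized, current-version cluster:
plan "psql -U postgres -f /srv/old-cluster.sql postgres"
```

Using the newer pg_dumpall also means the dump is written in a form the new server's psql is known to accept.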
On 03-11-2016 00:21, Jim Nasby wrote:
> On 11/2/16 2:02 PM, Gionatan Danti wrote:
>
> That means at least some of the Postgres files have been damaged
> (possibly due to the failing disk). Postgres will complain when it
> sees internal data structures that don't make sense, but it has no way
> to know if any of the user data has been screwed up.

I understand that (unfortunately) user data *will* be corrupted/lost. However, having no backup, I think the customer *must* accept that...

> I wouldn't trust the existing cluster that far. Since it sounds like
> you have no better options, you could use zero_damaged_pages to allow
> a pg_dumpall to complete, but you're going to end up with missing
> data. So what I'd suggest would be:
>
> stop Postgres
> make a copy of the cluster
> start with zero_damaged_pages
> pg_dumpall
> stop and remove the cluster (make sure you've got that backup)
> create a new cluster and load the dump

The whole dump/restore approach is surely the most sensible one. However, I am concerned that if the dump has some undetected problem leading to a failed restore, I would have to recover from the raw files (which I would like to avoid). Moreover, the expected remaining lifetime of this database is only 2-3 months, as a new production system should be installed shortly. This is why I would prefer to use vacuum/reindex and avoid a full dump/restore.

Thank you very much, Jim.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
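The "undetected problems in the dump" worry can be addressed without touching production: test-restore the dump into a throwaway cluster on a spare port before trusting it. Paths, port, and table names below are hypothetical, and the helper only prints the steps:

```shell
#!/bin/sh
# Sketch: verify a pg_dumpall dump by restoring it into a scratch cluster on
# port 5433, so a bad dump is discovered while the old cluster still exists.
try() { echo "TODO: $*"; }

try "initdb -D /tmp/pg-verify"
try "pg_ctl -D /tmp/pg-verify -o '-p 5433' -w start"
try "psql -p 5433 -f /root/cluster.sql postgres"                      # watch stderr for load errors
try "psql -p 5433 -d mydb -c 'SELECT count(*) FROM important_table'"  # spot-check row counts
try "pg_ctl -D /tmp/pg-verify stop"
```

Comparing a few row counts and key aggregates between the scratch restore and production gives at least some evidence the dump is restorable.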
On 11/02/2016 11:18 PM, Gionatan Danti wrote:
> On 03-11-2016 00:21, Jim Nasby wrote:
>> On 11/2/16 2:02 PM, Gionatan Danti wrote:
>>
>> That means at least some of the Postgres files have been damaged
>> (possibly due to the failing disk). Postgres will complain when it
>> sees internal data structures that don't make sense, but it has no way
>> to know if any of the user data has been screwed up.
>
> I understand that (unfortunately) user data *will* be corrupted/lost.
> However, having no backup, I think the customer *must* accept that...
>
>> I wouldn't trust the existing cluster that far. Since it sounds like
>> you have no better options, you could use zero_damaged_pages to allow
>> a pg_dumpall to complete, but you're going to end up with missing
>> data. So what I'd suggest would be:
>>
>> stop Postgres
>> make a copy of the cluster
>> start with zero_damaged_pages
>> pg_dumpall
>> stop and remove the cluster (make sure you've got that backup)
>> create a new cluster and load the dump
>
> The whole dump/restore approach is surely the most sensible one.
> However, I am concerned that if the dump has some undetected problem
> leading to a failed restore, I would have to recover from the raw files
> (which I would like to avoid). Moreover, the expected remaining lifetime
> of this database is only 2-3 months, as a new production system should be
> installed shortly. This is why I would prefer to use vacuum/reindex and
> avoid a full dump/restore.

The above does not make sense. You are having to recover because there was no backup, and now you want to go forward without doing a backup?

> Thank you very much, Jim.

--
Adrian Klaver
adrian.klaver@aklaver.com
On 03/11/2016 14:20, Adrian Klaver wrote:
>
> The above does not make sense. You are having to recover because there
> was no backup, and now you want to go forward without doing a backup?
>

Hi Adrian, no, I don't want to go forward without backups ;)
Actually, the *first* thing I did after the vacuum completed was a full cluster backup (via pg_dumpall), and I scheduled nightly backups as well.

The problem is that this customer does not have another server where backups can be restored and the entire production database migrated. In short, the two possibilities I have are:

1) execute the vacuum (done), schedule regular dumps (done) and, if something goes wrong, recover from backups;

2) execute the vacuum (done), do a manual backup (done), reinit (remove/recreate) the entire cluster (not done) and restore from backups (not done).

I strongly prefer to execute n.2 on another machine, so that production is not impacted while the recovered backup can be thoroughly tested. If/when the backups are validated, I want to migrate all clients to the new server (with RAID1 in place) and dismiss the old one.

Unfortunately, I am working under incredible constraints from the customer side; even buying two SAS disks seems to be a problem. Moreover, as an external consultant, I have basically no decision/buying power :|
What I can do (and did) is raise a very big red flag and let others decide what to do.

The good thing is that zero_damaged_pages and vacuum did their work, as the database can now be dumped and vacuumed with no (apparent) problems.

Thanks.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@assyoma.it - info@assyoma.it
GPG public key ID: FF5F32A8
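Since the original cron job failed silently for months, the scheduled nightly dump mentioned above is worth making fail loudly. A minimal sketch (paths and the mail address are hypothetical; the check is split into a small function so it can be exercised without a running server):

```shell
#!/bin/sh
# Sketch of a nightly pg_dumpall job with an explicit success check, so a
# failing dump is reported instead of going unnoticed.
report() {  # $1 = pg_dumpall exit status, $2 = dump file path
    if [ "$1" -eq 0 ]; then
        echo "backup OK: $2"
    else
        echo "backup FAILED: $2" >&2
        return 1
    fi
}

DUMP=/var/backups/pgsql/all-$(date +%F).sql
# Real nightly job (commented out so the sketch runs anywhere):
#   pg_dumpall > "$DUMP"
#   report $? "$DUMP" || mail -s "pg backup FAILED" admin@example.com </dev/null
report 0 "$DUMP"
```

Checking pg_dumpall's own exit status (not the status of a later pipeline stage) is the key point; a `gzip` at the end of a pipe will happily succeed on truncated input.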
On 11/04/2016 03:20 AM, Gionatan Danti wrote:
>
> On 03/11/2016 14:20, Adrian Klaver wrote:
>>
>> The above does not make sense. You are having to recover because there
>> was no backup, and now you want to go forward without doing a backup?
>>
>
> Hi Adrian, no, I don't want to go forward without backups ;)
> Actually, the *first* thing I did after the vacuum completed was a full
> cluster backup (via pg_dumpall), and I scheduled nightly backups as well.
>
> The problem is that this customer does not have another server where
> backups can be restored and the entire production database migrated. In
> short, the two possibilities I have are:
>
> 1) execute the vacuum (done), schedule regular dumps (done) and, if
> something goes wrong, recover from backups;
>
> 2) execute the vacuum (done), do a manual backup (done), reinit
> (remove/recreate) the entire cluster (not done) and restore from backups
> (not done).
>
> I strongly prefer to execute n.2 on another machine, so that production
> is not impacted while the recovered backup can be thoroughly tested.
> If/when the backups are validated, I want to migrate all clients to the
> new server (with RAID1 in place) and dismiss the old one.
>
> Unfortunately, I am working under incredible constraints from the
> customer side; even buying two SAS disks seems to be a problem. Moreover,
> as an external consultant, I have basically no decision/buying power :|
> What I can do (and did) is raise a very big red flag and let others
> decide what to do.

Ouch, understood. Good luck!

> The good thing is that zero_damaged_pages and vacuum did their work, as
> the database can now be dumped and vacuumed with no (apparent) problems.
>
> Thanks.

--
Adrian Klaver
adrian.klaver@aklaver.com
On 4 November 2016 at 11:20, Gionatan Danti <g.danti@assyoma.it> wrote:
> Unfortunately, I am working under incredible constraints from the
> customer side; even buying two SAS disks seems to be a problem. Moreover,
> as an external consultant, I have basically no decision/buying power :|
> What I can do (and did) is raise a very big red flag and let others
> decide what to do.

It seems to me that your customer doesn't realise how expensive it would be if their server were unavailable for any length of time, or if they actually lost the data it contains. That, or your customer's data isn't valuable enough to be worth your time.

We've been fighting a somewhat similar fight internally here, where management wasn't prepared to spend € 30,000 once on a server plus software licenses, while they pay that much to one of our new managers monthly.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.