Thread: [GENERAL] pg_dump 8.3.3 ERROR: invalid page header in block 2264419 of relation "pg_largeobject"

We have not noted any issues, but when I ran a pg_dump on an 8.3.3
database, it failed after an hour or so with the error:

ERROR:  invalid page header in block 2264419 of relation "pg_largeobject"
pg_dump: The command was: FETCH 1000 IN bloboid

As we seem to have some data corruption issue, the question is how can I
either fix this, or have pg_dump ignore it and continue doing the best
dump it can?  That is, I'd like to create a new clean database that has
whatever data I can recover.

Because the large objects are mostly for storing uploaded files (that
have been encrypted, so the DB contents will likely be meaningless), if
we are missing any, it's not too bad, well, no less bad than whatever we
have now.

Thanks,
David

The OS it is running on shows:

cat /proc/version
Linux version 2.6.18-92.1.10.el5.xs5.0.0.39xen (root@pondo-2) (gcc
version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Thu Aug 7 14:58:14 EDT
2008

uname -a
Linux example.com 2.6.18-92.1.10.el5.xs5.0.0.39xen #1 SMP Thu Aug 7
14:58:14 EDT 2008 i686 i686 i386 GNU/Linux



David Wall <d.wall@computer.org> writes:
> We have not noted any issues, but when I ran a pg_dump on an 8.3.3
> database, it failed after an hour or so with the error:

8.3.3?

> cat /proc/version
> Linux version 2.6.18-92.1.10.el5.xs5.0.0.39xen (root@pondo-2) (gcc
> version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Thu Aug 7 14:58:14 EDT
> 2008

Egad.  I take it this server has been entirely unmaintained for ~ 8 years.

> ERROR:  invalid page header in block 2264419 of relation "pg_largeobject"
> pg_dump: The command was: FETCH 1000 IN bloboid

> As we seem to have some data corruption issue, the question is how can I
> either fix this, or have pg_dump ignore it and continue doing the best
> dump it can?  That is, I'd like to create a new clean database that has
> whatever data I can recover.

Setting zero_damaged_pages would be a brute-force answer, but bear in mind
that that's irreversible and it's hard to predict how much you'll lose.
If possible, I'd take a physical backup (e.g. with tar) of the entire
$PGDATA directory, preferably with the server stopped, before you do
that.  Then you'll at least know you can get back to where you are.
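A sketch of that physical-backup step, demonstrated here on a throwaway directory standing in for the real $PGDATA (the paths are placeholders; the real copy should be taken with the server stopped, and written to a different volume in case the hardware is failing):

```shell
# Stand-in for the real cluster directory; on the real system this would be
# $PGDATA after stopping the server (pg_ctl -D "$PGDATA" stop -m fast).
PGDATA=$(mktemp -d)/data
mkdir -p "$PGDATA/base"
echo '8.3' > "$PGDATA/PG_VERSION"

# Write the archive somewhere *other* than the possibly-bad volume.
BACKUP=$(mktemp -d)/pgdata-backup.tgz
tar -czf "$BACKUP" -C "$(dirname "$PGDATA")" "$(basename "$PGDATA")"

# Confirm the archive is readable before touching the damaged cluster.
tar -tzf "$BACKUP" >/dev/null && echo "backup archive OK"

# Only once that copy is safe would one retry the dump with damaged pages
# zeroed, e.g. (connecting as a superuser, since the setting requires it):
#   PGOPTIONS='-c zero_damaged_pages=on' pg_dump -Fc dbname > dbname.dump
```

Setting zero_damaged_pages via PGOPTIONS keeps it scoped to that one session rather than enabling it cluster-wide.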

I hope this is being done as part of migration to more up-to-date
software.

            regards, tom lane


On 5/24/17 4:18 PM, Tom Lane wrote:
> David Wall <d.wall@computer.org> writes:
>> We have not noted any issues, but when I ran a pg_dump on an 8.3.3
>> database, it failed after an hour or so with the error:
> 8.3.3?
Yes, it's old.

>> cat /proc/version
>> Linux version 2.6.18-92.1.10.el5.xs5.0.0.39xen (root@pondo-2) (gcc
>> version 4.1.2 20071124 (Red Hat 4.1.2-42)) #1 SMP Thu Aug 7 14:58:14 EDT
>> 2008
> Egad.  I take it this server has been entirely unmaintained for ~ 8 years.
Indeed!  We are just the software vendor, hence our surprise too. They
didn't even know their backups were failing due to this error.


>> ERROR:  invalid page header in block 2264419 of relation "pg_largeobject"
>> pg_dump: The command was: FETCH 1000 IN bloboid
>> As we seem to have some data corruption issue, the question is how can I
>> either fix this, or have pg_dump ignore it and continue doing the best
>> dump it can?  That is, I'd like to create a new clean database that has
>> whatever data I can recover.
> Setting zero_damaged_pages would be a brute-force answer, but bear in mind
> that that's irreversible and it's hard to predict how much you'll lose.
> If possible, I'd take a physical backup (e.g. with tar) of the entire
> $PGDATA directory, preferably with the server stopped, before you do
> that.  Then you'll at least know you can get back to where you are.
>
> I hope this is being done as part of migration to more up-to-date
> software.
Yes, this was discovered as part of a migration to all new servers. We
have just put PG 9.3.4 into place and were looking to load it when their
last pg_dump was restored and produced an error because it was an
incomplete dump.  We then went to run our own pg_dump, which is when we
found that it crashed with the invalid page header.

Good idea on the physical backup first, while the system is stopped.

They do have a slave DB running via WAL shipping.  Would that likely
help us in any way?  Because the DBs are big (they have two at 191GB and
127GB), it takes a fair bit of time to do backups, transfers and
restores.  I'm trying to find options as it likely means downtime for
them that they are not expecting (yet).

Thanks for your help!


On Wed, May 24, 2017 at 07:18:14PM -0400, Tom Lane wrote:

> If possible, I'd take a physical backup (e.g. with tar) of the entire $PGDATA directory,

Make sure the backup goes directly to a different physical
volume in case the underlying hardware is bad.

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


On Wed, May 24, 2017 at 04:45:51PM -0700, David Wall wrote:

> They do have a slave DB running via WAL shipping.  Would that likely help us
> in any way?

If you can find out which blobs are afflicted you may be able
to extract those from the slave and re-insert them into the
new DB.
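One hedged sketch of that path, with the loids, host names, and paths all placeholders: rather than running anything blindly against the slave, generate the export/import command lines for review first. The two-argument server-side lo_import(), available in the 9.3 target, re-creates each blob under its original OID so application references stay valid.

```shell
# Placeholder loids; in reality these would come from working out which
# rows of pg_largeobject fall in the damaged block(s).
DAMAGED_LOIDS="16401 16542"

for loid in $DAMAGED_LOIDS; do
  # Export each blob from the slave, where the page may still be intact...
  echo "psql -h slave-host dbname -c '\\lo_export $loid /tmp/blob_$loid'"
  # ...then re-import into the new 9.3 database under the same OID.
  # (Server-side lo_import reads the file on the *server's* filesystem
  # and requires superuser.)
  echo "psql -h new-host newdb -c \"SELECT lo_import('/tmp/blob_$loid', $loid)\""
done
```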

> Because the DBs are big (they have two at 191GB and 127GB), it
> takes a fair bit of time to do backups, transfers and restores.  I'm trying
> to find options as it likely means downtime for them that they are not
> expecting (yet).

Oh, they likely have downtime already, because what you first
reported smells of bad hardware?

Karsten


David Wall <d.wall@computer.org> writes:
> They do have a slave DB running via WAL shipping.  Would that likely
> help us in any way?

Have you tried taking a backup from the slave?  It's possible that
the corruption exists only on the master.

            regards, tom lane


On 5/25/17 6:30 AM, Tom Lane wrote:
> David Wall <d.wall@computer.org> writes:
>> They do have a slave DB running via WAL shipping.  Would that likely
>> help us in any way?
> Have you tried taking a backup from the slave?  It's possible that
> the corruption exists only on the master.
We will give this a try once the customer lets us.  It appears
we'll need to allot considerable downtime to try our options.

Because the warm standby was resynced in October (it was down because the
OS had gone into a read-only filesystem for an unknown length of time,
which we only noticed when the primary's disk was filling up with WALs),
we believe we may have 'tar' copied the corrupted data too.  But we will
first stop the web apps, then take a 'tar' backup of the database, then
stop recovery on the warm standby and ensure its table counts appear to
match production (in case it has any issues of its own), and see if the
warm-standby DB is any better.  If so, we'll restore from there.  If not,
we'll try zeroing out the bad blocks and see what happens.
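If it comes to zeroing, it can help to first work out where block 2264419 physically lives on disk. A minimal sketch of the arithmetic, assuming the default 8 kB block size and 1 GB segment files:

```shell
BLOCK=2264419                                # block number from the error
BLKSZ=8192                                   # default PostgreSQL block size
SEG_BLOCKS=$((1024 * 1024 * 1024 / BLKSZ))   # 131072 blocks per 1 GB segment

SEG=$((BLOCK / SEG_BLOCKS))   # which segment file (suffix .N) holds it
OFF=$((BLOCK % SEG_BLOCKS))   # block offset within that segment
echo "pg_largeobject segment .$SEG, block $OFF (byte offset $((OFF * BLKSZ)))"
```

That puts the damaged page at block 36195 of the `.17` segment file under pg_largeobject's relfilenode; with the server stopped, `dd` with `bs=8192` could copy that single page out for inspection before anything is zeroed.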

Fortunately, this only appears in one of their two DBs.  Once we can
successfully dump the DBs, we will migrate to the new hardware and OS
and upgraded PG.

Thanks to you and Karsten Hilbert for your help.