Обсуждение: BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342

Поиск
Список
Период
Сортировка

BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342

От
"Tambet Matiisen"
Дата:
The following bug has been logged online:

Bug reference:      5929
Logged by:          Tambet Matiisen
Email address:      tambet.matiisen@gmail.com
PostgreSQL version: 8.4.7
Operating system:   PostgreSQL 8.4.7 on i486-pc-linux-gnu, compiled by GCC
gcc-4.4.real (Debian 4.4.5-10) 4.4.5, 32-bit
Description:        ERROR: found toasted toast chunk for toast value
260340218 in pg_toast_260339342
Details:

For a few days I've been getting this error from my nightly backup script:

Warning: pg_dump: SQL command failed pg_dump: Error message from server:
ERROR: found toasted toast chunk for toast value 260340218 in
pg_toast_260339342 pg_dump: The command was: COPY public.yhistud_urlcache
(id, url, params, sess_id, content) TO stdout; pg_dumpall: pg_dump failed on
database "yhistud", exiting
Warning: Failed to dump pgsql cluster

Yesterday I upgraded the server from 8.4.5 to 8.4.7, hoping that this error
will go away, but no success.

I've been getting occasional errors from backup script for several months,
but this time the error is recurring. I have upgraded Linux kernel to
2.6.32, hoping that maybe the problem is in software RAID driver, but no
changes, occasionally I still get errors. I still have to do memory test on
the server, but I doubt faulty memories are the problem, because otherwise
the server behaves well.

Re: BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342

От
"Kevin Grittner"
Дата:
"Tambet Matiisen" <tambet.matiisen@gmail.com> wrote:

> For a few days I've been getting this error from my nightly backup
> script:
>
> Warning: pg_dump: SQL command failed pg_dump: Error message from
> server: ERROR: found toasted toast chunk for toast value 260340218
> in pg_toast_260339342 pg_dump: The command was: COPY
> public.yhistud_urlcache (id, url, params, sess_id, content) TO
> stdout; pg_dumpall: pg_dump failed on database "yhistud", exiting
> Warning: Failed to dump pgsql cluster

So you don't have a current backup, and your database is corrupted.

(1)  If you still have a backup from before you started getting
backup failures, keep it safe until everything has settled down and
is running well for several months.

(2)  Stop PostgreSQL and do a full copy of the data directory and
everything under it to a backup medium or another machine.  Keep
this copy safe for months, too.

> Yesterday I upgraded the server from 8.4.5 to 8.4.7, hoping that
> this error will go away, but no success.

Newer versions with more bug fixes may be less likely to contain
bugs which could cause corruption, but an upgrade like that is
unlikely to "heal" data which is already corrupted.

> I've been getting occasional errors from backup script for several
> months,

Do you know what those were?

> I have upgraded Linux kernel to 2.6.32, hoping that maybe the
> problem is in software RAID driver, but no changes, occasionally I
> still get errors.

Occasionally get what errors?

> I still have to do memory test on the server, but I doubt faulty
> memories are the problem, because otherwise the server behaves
> well.

So, no problems other than months of errors on backups?  Never any
OS lockups, power losses, or other abrupt terminations of
operations?

Also, do you now or have you ever run the database with fsync = off
or full_page_writes = off?

It is very important to figure out how your data got corrupted;
otherwise you can't really trust this machine..

-Kevin

Re: BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342

От
Tambet Matiisen
Дата:
Hi Kevin!

Thanks for your reply. You make me feel that this is more serious than I
thought.

This is development server, that is also used as pre-live server.
Pre-live database is restored from live database dump every night. So
far the errors have been in pre-live database, that's why I didn't worry
too much - it is anyway overwritten every night from backup. Usually the
next day error was gone. I mostly blamed badly timed backup and restore
scripts, although this shouldn't result in errors.

The errors started from 07.09.2010, when I was still running PostgreSQL
8.1. Few examples:

07.09.2010:
Warning: pg_dump: ERROR: could not open relation with OID 339815468
pg_dump: SQL command to dump the contents of table "kannete_read"
failed: PQendcopy() failed. pg_dump: Error message from server: ERROR:
could not open relation with OID 339815468 pg_dump: The command was:
COPY public.kannete_read (yhistu_id, kande_rea_id, kande_id, konto_nr,
alamkonto_nr, deebetsumma, kreeditsumma, deebetsaldo, kreeditsaldo,
alamkonto_deebetsaldo, alamkonto_kreeditsaldo, looja, loomise_aeg,
muutja, muutmise_aeg, kuupaev, kande_nr, kinnitatud, deebetprotsent,
kreeditprotsent) TO stdout; pg_dumpall: pg_dump failed on database
"korteriy_histu", exiting

19.09.2010:
Warning: pg_dump: ERROR: unexpected chunk number 926884437 (expected
514) for toast value 1736426835 pg_dump: SQL command to dump the
contents of table "failid" failed: PQendcopy() failed. pg_dump: Error
message from server: ERROR: unexpected chunk number 926884437 (expected
514) for toast value 1736426835 pg_dump: The command was: COPY
public.failid (faili_id, yhistu_id, perioodi_id, arve_id, dokumendi_id,
tyyp, sisu, laius, korgus, pikkus, faili_nimi, sisu_tyyp, looja,
loomise_aeg, muutja, muutmise_aeg) TO stdout; pg_dumpall: pg_dump failed
on database "arvetest", exiting

24.09.2010:
Warning: pg_dump: socket not open pg_dump: SQL command to dump the
contents of table "failid" failed: PQendcopy() failed. pg_dump: Error
message from server: socket not open pg_dump: The command was: COPY
public.failid (faili_id, yhistu_id, perioodi_id, arve_id, dokumendi_id,
tyyp, sisu, laius, korgus, pikkus, faili_nimi, sisu_tyyp, looja,
loomise_aeg, muutja, muutmise_aeg) TO stdout; pg_dumpall: pg_dump failed
on database "arvetest", exiting

9.11.2010:
Warning: pg_dump: Dumping the contents of table "maaramised" failed:
PQgetCopyData() failed. pg_dump: Error message from server: server
closed the connection unexpectedly This probably means the server
terminated abnormally before or while processing the request. pg_dump:
The command was: COPY public.maaramised (maaramise_id, kululiigi_id,
perioodi_id, yhistu_id, korteri_id, kogus, yhik, hind, summa, looja,
loomise_aeg, muutja, muutmise_aeg) TO stdout; pg_dumpall: pg_dump failed
on database "arvetest", exiting

More recently after I upgraded to 8.4, 11.02.2010:
Warning: pg_dump: SQL command failed pg_dump: Error message from server:
ERROR: compressed data is corrupt pg_dump: The command was: COPY
public.failid (faili_id, yhistu_id, perioodi_id, arve_id, dokumendi_id,
tyyp, sisu, laius, korgus, pikkus, faili_nimi, sisu_tyyp, looja,
loomise_aeg, muutja, muutmise_aeg) TO stdout; pg_dumpall: pg_dump failed
on database "korteriy_histu", exiting

The current error has occurred 3 days in a row - 13-15.03.2011:
Warning: pg_dump: SQL command failed pg_dump: Error message from server:
ERROR: found toasted toast chunk for toast value 260340218 in
pg_toast_260339342 pg_dump: The command was: COPY
public.yhistud_urlcache (id, url, params, sess_id, content) TO stdout;
pg_dumpall: pg_dump failed on database "yhistud", exiting

This time the error is not in pre-live database and therefore it doesn't
go away.

I have not noticed any unusual errors in other services. The server is
also running Subversion, Trac, Apache, Samba, MySQL, Oracle, Tomcat and
so on. PostgreSQL, Subversion, Trac and Apache+PHP are used actively
every day.

Both fsync and full_page_writes are on. OK, I don't have UPS for this
machine, but power has been stable. Current uptime is 32 days, which I
bet is from the last kernel update. I run Debian testing on that machine.

Currently I blame either faulty memory or faulty software RAID driver. I
can easily eliminate the memory cause by running memtest86 for few
hours. But how do I eliminate the software RAID driver? PostgreSQL has
always been solid for me, so I suspect it least, but you never know...

Now, off to buy UPS...

   Tambet

On 15.03.2011 19:47, Kevin Grittner wrote:
> "Tambet Matiisen"<tambet.matiisen@gmail.com>  wrote:
>
>> For a few days I've been getting this error from my nightly backup
>> script:
>>
>> Warning: pg_dump: SQL command failed pg_dump: Error message from
>> server: ERROR: found toasted toast chunk for toast value 260340218
>> in pg_toast_260339342 pg_dump: The command was: COPY
>> public.yhistud_urlcache (id, url, params, sess_id, content) TO
>> stdout; pg_dumpall: pg_dump failed on database "yhistud", exiting
>> Warning: Failed to dump pgsql cluster
>
> So you don't have a current backup, and your database is corrupted.
>
> (1)  If you still have a backup from before you started getting
> backup failures, keep it safe until everything has settled down and
> is running well for several months.
>
> (2)  Stop PostgreSQL and do a full copy of the data directory and
> everything under it to a backup medium or another machine.  Keep
> this copy safe for months, too.
>
>> Yesterday I upgraded the server from 8.4.5 to 8.4.7, hoping that
>> this error will go away, but no success.
>
> Newer versions with more bug fixes may be less likely to contain
> bugs which could cause corruption, but an upgrade like that is
> unlikely to "heal" data which is already corrupted.
>
>> I've been getting occasional errors from backup script for several
>> months,
>
> Do you know what those were?
>
>> I have upgraded Linux kernel to 2.6.32, hoping that maybe the
>> problem is in software RAID driver, but no changes, occasionally I
>> still get errors.
>
> Occasionally get what errors?
>
>> I still have to do memory test on the server, but I doubt faulty
>> memories are the problem, because otherwise the server behaves
>> well.
>
> So, no problems other than months of errors on backups?  Never any
> OS lockups, power losses, or other abrupt terminations of
> operations?
>
> Also, do you now or have you ever run the database with fsync = off
> or full_page_writes = off?
>
> It is very important to figure out how your data got corrupted;
> otherwise you can't really trust this machine..
>
> -Kevin

Re: BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342

От
"Kevin Grittner"
Дата:
Tambet Matiisen <tambet.matiisen@gmail.com> wrote:

> Pre-live database is restored from live database dump every night.

How is that done?  A single pg_dump of the entire live database
restored using psql?  Are both database servers at the same
PostgreSQL version?

> So far the errors have been in pre-live database,

You're running pg_dump against a database you just restored from a
pg_dump image?

> Usually the next day error was gone. I mostly blamed badly timed
> backup and restore scripts, although this shouldn't result in
> errors.

No it shouldn't -- if you're following any of the documented backup
and restore techniques.  I have a suspicion that you're just doing a
file copy without stopping the live database or properly following
the documented PITR backup and recovery techniques.

> The errors started from 07.09.2010, when I was still running
> PostgreSQL 8.1. Few examples:
>
> 07.09.2010:
> Warning: pg_dump: ERROR: could not open relation with OID
> 339815468

> [additional errors which could be caused by copying a database
> while running without proper PITR techniques]

> The current error has occurred 3 days in a row - 13-15.03.2011:
> Warning: pg_dump: SQL command failed pg_dump: Error message from
> server:
> ERROR: found toasted toast chunk for toast value 260340218 in
> pg_toast_260339342

> This time the error is not in pre-live database and therefore it
> doesn't go away.

If I understand you, this sounds like corruption in the live
database; nothing on the pre-live database is part of causing this
problem.

> The server is also running [...] Samba [...]

I hope you're not trusting Samba too far.  For a while we were using
it in backups across our WAN, and it mangled at least one file
almost every day.  We had to take to running md5sum against both
ends for each file to ensure we didn't get garbage (until we
converted everything to use TCP communications, which have never
mangled anything for us).

> Both fsync and full_page_writes are on.

Good.  Without those an OS or hardware crash can corrupt your
database.

> OK, I don't have UPS for this machine, but power has been stable.
> Current uptime is 32 days, which I bet is from the last kernel
> update.

OK.  A power outage wouldn't be too likely to matter if you have
fsync and full_page_writes on.

> Currently I blame either faulty memory or faulty software RAID
> driver.  I can easily eliminate the memory cause by running
> memtest86 for few hours

Is this ECC memory?  If not, even a good test doesn't prove that a
RAM problem didn't cause the corruption.

> Now, off to buy UPS...

Not a bad idea, but it doesn't sound like lack of that is likely to
have caused the corruption in your live database, based on the
settings you mentioned.  (Assuming those settings are in use on the
live server.)

-Kevin

Re: BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342

От
"Kevin Grittner"
Дата:
Tambet Matiisen <tambet.matiisen@gmail.com> wrote:
> On 16.03.2011 17:09, Kevin Grittner wrote:
>> Tambet Matiisen<tambet.matiisen@gmail.com>  wrote:
>>
>>> Pre-live database is restored from live database dump every
>>> night.
>>
>> How is that done?  A single pg_dump of the entire live database
>> restored using psql?  Are both database servers at the same
>> PostgreSQL version?
>
> Yes, I use pg_dump on live server and the result is
> rdiff-backupped into development server. Whole SQL dump is 12G
> without compression and the rdiff delta is about 10-20MB every
> day. Then I drop pre-live database on development server and
> recreate it using createdb and psql.

createdb, not initdb?  I suggest you backup and delete everything in
the data directory, and start with initdb, and see whether the
problem still exists.  If it goes away, the problem was in your
shared system tables.  If it persists, the problem is in your backup
files, and I would try a delete and a fresh copy.  If *that* fixes
it you know the problem was with rdiff-backup.  (Of course, keeping
copies of things before the delete might provide useful forensic
information.)

> For a while development server was running 8.4 and live server
> 8.1. Now both are 8.4, but this shouldn't matter, as I do backup
> and restore via SQL.

I hope you were using the 8.4 version of pg_dump when you were in
the dual-version situation.  Using the earlier version of pg_dump is
not guaranteed to provide a backup which can be cleanly installed on
a later version.  That could *possibly* be related to current
problems.

>> You're running pg_dump against a database you just restored from
>> a pg_dump image?
>
> Hmm, yeah. This sounds rather dumb, but haven't got to that yet.

Well, it might not be as dumb as you think, if it uncovered a
problem with your dump/restore process from live to pre-live.

> Development server contains some additional databases as well,
> that do not exist on live server.

So are you really using pg_dumpall or pg_dump?

>>> Both fsync and full_page_writes are on.
>>
>> Good.  Without those an OS or hardware crash can corrupt your
>> database.
>
> Actually they are commented out, but I suppose this means "on".

Yeah, they default to the safe setting.

> It's not ECC memory.

Well, then there has been proven to be a non-negligible possibility
of occasional random bit-flips.  Seriously, next time you upgrade,
make sure any database server has ECC RAM.

> It is possible, that restore of pre-live database using psql lasts
> so long, that backup of the same database using pg_dump is already
> kicking in.

Hmmm...  You might want to do enough logging of the processes to be
able to confirm or eliminate that possibility.  Dumping an
incompletely-restored database might generate some odd errors.

-Kevin

Re: BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342

От
Tambet Matiisen
Дата:
On 16.03.2011 17:09, Kevin Grittner wrote:
> Tambet Matiisen<tambet.matiisen@gmail.com>  wrote:
>
>> Pre-live database is restored from live database dump every night.
>
> How is that done?  A single pg_dump of the entire live database
> restored using psql?  Are both database servers at the same
> PostgreSQL version?

Yes, I use pg_dump on live server and the result is rdiff-backupped into
development server. Whole SQL dump is 12G without compression and the
rdiff delta is about 10-20MB every day. Then I drop pre-live database on
development server and recreate it using createdb and psql.

For a while development server was running 8.4 and live server 8.1. Now
both are 8.4, but this shouldn't matter, as I do backup and restore via SQL.

>
>> So far the errors have been in pre-live database,
>
> You're running pg_dump against a database you just restored from a
> pg_dump image?

Hmm, yeah. This sounds rather dumb, but haven't got to that yet.
Development server contains some additional databases as well, that do
not exist on live server.

>
>> Usually the next day error was gone. I mostly blamed badly timed
>> backup and restore scripts, although this shouldn't result in
>> errors.
>
> No it shouldn't -- if you're following any of the documented backup
> and restore techniques.  I have a suspicion that you're just doing a
> file copy without stopping the live database or properly following
> the documented PITR backup and recovery techniques.

No, I don't do any advanced backup tricks. Just plain pg_dump and psql.

>
> This time the error is not in pre-live database and therefore it
>> doesn't go away.
>
> If I understand you, this sounds like corruption in the live
> database; nothing on the pre-live database is part of causing this
> problem.

This would be the case when I do filesystem level copy, but I do not.

>
>> The server is also running [...] Samba [...]
>
> I hope you're not trusting Samba too far.  For a while we were using
> it in backups across our WAN, and it mangled at least one file
> almost every day.  We had to take to running md5sum against both
> ends for each file to ensure we didn't get garbage (until we
> converted everything to use TCP communications, which have never
> mangled anything for us).

As I said, I'm using rdiff-backup to transfer pure SQL files.

>
>> Both fsync and full_page_writes are on.
>
> Good.  Without those an OS or hardware crash can corrupt your
> database.

Actually they are commented out, but I suppose this means "on".

>
>> OK, I don't have UPS for this machine, but power has been stable.
>> Current uptime is 32 days, which I bet is from the last kernel
>> update.
>
> OK.  A power outage wouldn't be too likely to matter if you have
> fsync and full_page_writes on.

That's a relief :).

>
>> Currently I blame either faulty memory or faulty software RAID
>> driver.  I can easily eliminate the memory cause by running
>> memtest86 for few hours
>
> Is this ECC memory?  If not, even a good test doesn't prove that a
> RAM problem didn't cause the corruption.

It's not ECC memory.

>
>> Now, off to buy UPS...
>
> Not a bad idea, but it doesn't sound like lack of that is likely to
> have caused the corruption in your live database, based on the
> settings you mentioned.  (Assuming those settings are in use on the
> live server.)

Checked live server, it has also fsync=on and full_page_writes=on. But
it shouldn't matter, because backup of live server doesn't give any errors.

It is possible, that restore of pre-live database using psql lasts so
long, that backup of the same database using pg_dump is already kicking
in. But again, this shouldn't matter and it doesn't explain why the last
error is in another database, that hasn't changed for months.

Now I have to find time to run memtest.

   Tambet

Re: BUG #5929: ERROR: found toasted toast chunk for toast value 260340218 in pg_toast_260339342

От
Tambet Matiisen
Дата:
On 16.03.2011 22:29, Kevin Grittner wrote:
> Tambet Matiisen<tambet.matiisen@gmail.com>  wrote:
>> Yes, I use pg_dump on live server and the result is
>> rdiff-backupped into development server. Whole SQL dump is 12G
>> without compression and the rdiff delta is about 10-20MB every
>> day. Then I drop pre-live database on development server and
>> recreate it using createdb and psql.
>
> createdb, not initdb?  I suggest you backup and delete everything in
> the data directory, and start with initdb, and see whether the
> problem still exists.  If it goes away, the problem was in your
> shared system tables.  If it persists, the problem is in your backup
> files, and I would try a delete and a fresh copy.  If *that* fixes
> it you know the problem was with rdiff-backup.  (Of course, keeping
> copies of things before the delete might provide useful forensic
> information.)

Yes, I use createdb to recreate just one database. I doubt backup files
could cause such an error, they are plain SQL files.

Today I got another error, so it seems to get worse:

Warning: pg_dump: WARNING: could not write block 188224 of base/2802415579/2802416218 DETAIL: Multiple failures ---
writeerror might be permanent. pg_dump: SQL command failed pg_dump: Error message from server: ERROR: xlog flush
request200EB/9E4CD48 is not satisfied --- flushed only to CC/3F22EFB4 LINE 1: ...LECT tableoid, oid, nspname, (SELECT
rolnameFROM pg_catalog... ^ CONTEXT: writing block 188224 of relation base/2802415579/2802416218 pg_dump: The command
was:SELECT tableoid, oid, nspname, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = nspowner) AS rolname, nspacl
FROMpg_namespace pg_dumpall: pg_dump failed on database "hekotekerp", exiting 
Warning: Failed to dump pgsql cluster


Strange that I have no problems actually using that database.

>
>> For a while development server was running 8.4 and live server
>> 8.1. Now both are 8.4, but this shouldn't matter, as I do backup
>> and restore via SQL.
>
> I hope you were using the 8.4 version of pg_dump when you were in
> the dual-version situation.  Using the earlier version of pg_dump is
> not guaranteed to provide a backup which can be cleanly installed on
> a later version.  That could *possibly* be related to current
> problems.

I used 8.1 version of pg_dump previously, but had no problems with it.
Currently both are 8.4, so this is not a problem.

>> Development server contains some additional databases as well,
>> that do not exist on live server.
>
> So are you really using pg_dumpall or pg_dump?

I'm using pg_dump on live server and pg_dumpall on development server.

>
>> It's not ECC memory.
>
> Well, then there has been proven to be a non-negligible possibility
> of occasional random bit-flips.  Seriously, next time you upgrade,
> make sure any database server has ECC RAM.

Thanks for a tip, will do that.

>
>> It is possible, that restore of pre-live database using psql lasts
>> so long, that backup of the same database using pg_dump is already
>> kicking in.
>
> Hmmm...  You might want to do enough logging of the processes to be
> able to confirm or eliminate that possibility.  Dumping an
> incompletely-restored database might generate some odd errors.
>

Thanks Kevin for suggestions, investigating further...

   Tambet