Обсуждение: missing chunk number 0 for toast value

Поиск

Список

Период

Сортировка

missing chunk number 0 for toast value

От

"BJ Taylor"

Дата:

24 сентября 2008 г., 19:03:28

Our database seems to have been corrupted. It is a heavily used database, and went several months without any type of vacuuming. When we finally realized that it wasn't being vacuumed, we started the process, but the process never successfully completed, and our database has never been the same since.

The exact error that we receive now is as follows:

postgres@server:~> pg_dumpall -p 5433 > dbmail_dumpall_23092008.sql
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: missing chunk number 0 for toast value 554339
pg_dump: The command was: COPY public.dbmail_messageblks (messageblk_idnr, physmessage_id, blocksize, is_header, messageblk) TO stdout;
pg_dumpall: pg_dump failed on database "dbmail", exiting

We have tried using the -d option of the pg_dumpall so we could get a full dump of the database, and just start over from that, but that fails as well. We have also tried reindexing the table, but although the reindex didn't fail, it didn't solve our problem either. Our next option is to do a full vacuum, but we are reluctant to take our mail server down for that long, especially when we do not know for sure that it will either succeed or fix our problem. I have searched the forums, but was unable to find a solution that we have not already tried. The solutions didn't appear to help others who had this problem either.

Any suggestions?

Thanks,
BJ

Re: missing chunk number 0 for toast value

От

"Scott Whitney"

Дата:

24 сентября 2008 г., 19:26:51

I ran into this issue awhile ago. Here's my long internal tech note to my
dev guys on what I did. A bit modified for more genericism:

I'm in the process of migrating our internal db server, and I decided to use
the helpdesk as
my test database. It backed up fine last night. Something went horribly
wrong today, since pg_dump tells me:

> pg_dump: ERROR:  missing chunk number 0 for toast value 110439697
> pg_dump: SQL command to dump the contents of table "attachments" failed:
> PQendcopy() failed.
> pg_dump: Error message from server: ERROR:  missing chunk number 0 for
toast
> value 110439697
> pg_dump: The command was: COPY public.attachments (id, transactionid,
> parent, messageid, subject, filename, contenttype, contentencoding,
content,
> headers, creator, created) TO stdout;


I reindexed attachments. I reindexed the db. I retoasted the index. Or
reindexed the toast. Or toasted the index. Or something:

rt3=# select reltoastrelid::regclass from pg_class where relname =
'attachments';
       reltoastrelid
---------------------------
 pg_toast.pg_toast_8507627
(1 row)

rt3=# reindex table pg_toast.pg_toast_8507627;
REINDEX
rt3=# \q
-bash-2.05b$ pg_dump rt3 > /tmp/rt3

pg_dump: ERROR:  missing chunk number 0 for toast value 110439697
pg_dump: SQL command to dump the contents of table "attachments" failed:
PQendcopy() failed.
pg_dump: Error message from server: ERROR:  missing chunk number 0 for toast
value 110439697
pg_dump: The command was: COPY public.attachments (id, transactionid,
parent, messageid, subject, filename, contenttype, contentencoding, content,
headers, creator, created) TO stdout;


That didn't work. So...I figured I could find out what the bad rec was.
Doing this:

Select * from attachments limit 5000 offset 0
Select * from attachments limit 5000 offset 5000
Select * from attachments limit 5000 offset 10000
Select * from attachments limit 5000 offset 15000

quickly showed me that record 16179 was causing this issue.


So, in order to resolve, this I did this:

pg_dump -s rt3 > /tmp/rt3schema

followed by:

for each in `psql rt3 -c "\d" | awk {'print $3'} | grep -vw attachments`;do
pg_dump rt3 -t $each >> /tmp/rt3data; done

(This second one removes only the table named attachments)

Then I used pg.py to do this:

import pg
olddb=pg.connect('rt3','myolddbserver')
new=pg.connect('rt3','localhost')
first=olddb.query("""select * from attachments limit 16169""").getresult()
last=olddb.query("""select * from attachments limit 100000 offset 16170""")
for eachRec in first + last:
    new.query("""insert into attachments values %r""" % (eachRec,))


*** I did have to deal with some quotification issues, but you get the
point.


-----Original Message-----
From: pgsql-admin-owner@postgresql.org
[mailto:pgsql-admin-owner@postgresql.org] On Behalf Of BJ Taylor
Sent: Sep 24, 2008 1:32 PM
To: pgsql-admin@postgresql.org
Subject: [ADMIN] missing chunk number 0 for toast value

Our database seems to have been corrupted.  It is a heavily used database,
and went several months without any type of vacuuming.  When we finally
realized that it wasn't being vacuumed, we started the process, but the
process never successfully completed, and our database has never been the
same since.

The exact error that we receive now is as follows:


postgres@server:~> pg_dumpall -p 5433 > dbmail_dumpall_23092008.sql
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: missing chunk number 0 for toast
value 554339
pg_dump: The command was: COPY public.dbmail_messageblks (messageblk_idnr,
physmessage_id, blocksize, is_header, messageblk) TO stdout;
pg_dumpall: pg_dump failed on database "dbmail", exiting


We have tried using the -d option of the pg_dumpall so we could get a full
dump of the database, and just start over from that, but that fails as well.
We have also tried reindexing the table, but although the reindex didn't
fail, it didn't solve our problem either.  Our next option is to do a full
vacuum, but we are reluctant to take our mail server down for that long,
especially when we do not know for sure that it will either succeed or fix
our problem.  I have searched the forums, but was unable to find a solution
that we have not already tried.  The solutions didn't appear to help others
who had this problem either.

Any suggestions?

Thanks,
BJ

Re: missing chunk number 0 for toast value

От

Tom Lane

Дата:

24 сентября 2008 г., 20:26:45

"BJ Taylor" <btaylor@propertysolutions.com> writes:
> Our database seems to have been corrupted.  It is a heavily used database,
> and went several months without any type of vacuuming.  When we finally
> realized that it wasn't being vacuumed, we started the process, but the
> process never successfully completed, and our database has never been the
> same since.

What PG version is this?  Exactly what do you mean by "never
successfully completed"?

            regards, tom lane

Re: missing chunk number 0 for toast value

От

"BJ Taylor"

Дата:

25 сентября 2008 г., 12:37:13

We are using version 8.3.1. And to be precise, when I started the vacuum (analyze), I started it as a cron job to run daily around midnight. The next day I came in and checked on it and it was still running. Not thinking that it would take more than a full 24 hours to run, I let it be, and the next day I came in and the server started acting weird. I believe the vacuum process continued to run, and a second vacuum process was started. The server became unstable, and refused incoming connections. At which point, I killed all vacuum processes, and restarted postgresql. I believe it was somewhere during this process that the database became corrupted. I am not certain what happens when two vacuum processes run at the same time. That may have been the problem, or it may not have. Or it may have been that I killed the vacuum process in the middle of what it was doing. One way or another, the problem that we have now, is that we are unable to get a dump of the database for backups, and the database seems less stable than it was previously (dropping connections, and refusing connections seemingly at random).

BJ

On Wed, Sep 24, 2008 at 2:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"BJ Taylor" <btaylor@propertysolutions.com> writes:
> Our database seems to have been corrupted. It is a heavily used database,
> and went several months without any type of vacuuming. When we finally
> realized that it wasn't being vacuumed, we started the process, but the
> process never successfully completed, and our database has never been the
> same since.

What PG version is this? Exactly what do you mean by "never
successfully completed"?

regards, tom lane

Re: missing chunk number 0 for toast value

От

Tom Lane

Дата:

25 сентября 2008 г., 13:24:21

"BJ Taylor" <btaylor@propertysolutions.com> writes:
> We are using version 8.3.1.  And to be precise, when I started the vacuum
> (analyze), I started it as a cron job to run daily around midnight.  The
> next day I came in and checked on it and it was still running.  Not thinking
> that it would take more than a full 24 hours to run, I let it be, and the
> next day I came in and the server started acting weird.  I believe the
> vacuum process continued to run, and a second vacuum process was started.
> The server became unstable, and refused incoming connections.

Unstable how?  What error did you get on the refused connections?  What
was showing up in the postmaster log?

> At which
> point, I killed all vacuum processes, and restarted postgresql.

How did you do that killing exactly?

> I believe
> it was somewhere during this process that the database became corrupted.  I
> am not certain what happens when two vacuum processes run at the same time.

Nothing of interest, it's done all the time.

> That may have been the problem, or it may not have.  Or it may have been
> that I killed the vacuum process in the middle of what it was doing.  One
> way or another, the problem that we have now, is that we are unable to get a
> dump of the database for backups, and the database seems less stable than it
> was previously (dropping connections, and refusing connections seemingly at
> random).

Again, what errors are you getting exactly, and what shows up in the
postmaster log?

            regards, tom lane

Re: missing chunk number 0 for toast value

От

"BJ Taylor"

Дата:

25 сентября 2008 г., 16:15:10

Hey Tom,

Here are some recent logs from our system. Unfortunately, I didn't think to grab the logs at the time I killed those processes, and now they are gone. I found those processes by using ps, and then I killed them with a simple kill processid. Here are samples of our current log files:

FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
PANIC: right sibling's left-link doesn't match: block 175337 links to 243096 instead of expected 29675 in index "dbmail_headervalue_3"
STATEMENT: INSERT INTO dbmail_headervalue (headername_id, physmessage_id, headervalue) VALUES (4,12335778,'from [76.13.13.25] by n6.bullet.mail.ac4.yahoo.com with NNFMP; 25 Sep 2008 04:01:36 -0000')
LOG: server process (PID 13888) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
FATAL: the database system is in recovery mode
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2008-09-25 09:12:41 MDT
LOG: database system was not properly shut down; automatic recovery in progress
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode

...

FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
LOG: redo starts at 3A/2D0DEA78
LOG: record with zero length at 3A/2D1B8D68
LOG: redo done at 3A/2D1B8D3C
LOG: last completed transaction was at log time 2008-09-25 09:12:45.204162-06
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode

...

FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
LOG: redo starts at 3A/2D1B8DA8
LOG: unexpected pageaddr 3A/2520A000 in log file 58, segment 45, offset 2138112
LOG: redo done at 3A/2D208660
LOG: last completed transaction was at log time 2008-09-25 09:12:47.971207-06
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode

...

LOG: unexpected EOF on client connection
LOG: unexpected EOF on client connection
ERROR: missing chunk number 0 for toast value 554365
STATEMENT: SELECT messageblk, is_header FROM dbmail_messageblks WHERE physmessage_id = 12111760 ORDER BY messageblk_idnr
LOG: unexpected EOF on client connection
LOG: unexpected EOF on client connection

To be honest, I don't know if all of these logs are relevant or not. I half way suspect that nagios causes the "unexpected EOF on client connection" notices, but I can't be certain.

You also asked how it is being unstable. It drops connections seemingly at random. The error received when a connection is dropped is the following:

WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.

Please let me know if there are any other questions I can answer for you.

Thanks,
BJ

On Thu, Sep 25, 2008 at 7:24 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

"BJ Taylor" <btaylor@propertysolutions.com> writes:
> We are using version 8.3.1. And to be precise, when I started the vacuum
> (analyze), I started it as a cron job to run daily around midnight. The
> next day I came in and checked on it and it was still running. Not thinking
> that it would take more than a full 24 hours to run, I let it be, and the
> next day I came in and the server started acting weird. I believe the
> vacuum process continued to run, and a second vacuum process was started.
> The server became unstable, and refused incoming connections.

Unstable how? What error did you get on the refused connections? What
was showing up in the postmaster log?

> At which
> point, I killed all vacuum processes, and restarted postgresql.

How did you do that killing exactly?

> I believe
> it was somewhere during this process that the database became corrupted. I
> am not certain what happens when two vacuum processes run at the same time.

Nothing of interest, it's done all the time.

> That may have been the problem, or it may not have. Or it may have been
> that I killed the vacuum process in the middle of what it was doing. One
> way or another, the problem that we have now, is that we are unable to get a
> dump of the database for backups, and the database seems less stable than it
> was previously (dropping connections, and refusing connections seemingly at
> random).

Again, what errors are you getting exactly, and what shows up in the
postmaster log?

regards, tom lane

--
Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin

Re: missing chunk number 0 for toast value

От

Tom Lane

Дата:

25 сентября 2008 г., 17:30:39

"BJ Taylor" <btaylor@propertysolutions.com> writes:
> Here are some recent logs from our system.  Unfortunately, I didn't think to
> grab the logs at the time I killed those processes, and now they are gone.
> I found those processes by using ps, and then I killed them with a simple
> kill *processid*.  Here are samples of our current log files:

Huh.  That would have generated SIGTERM, which should at least mostly be
safe.  We did recently fix a couple of problems with cleanup after
SIGTERM but they were both pretty low-probability risks.  Anyway,
you've definitely got issues now:

> PANIC:  right sibling's left-link doesn't match: block 175337 links to
> 243096 instead of expected 29675 in index "dbmail_headervalue_3"

Try reindexing that index, and any other ones that you see complaints
about.

> You also asked how it is being unstable.  It drops connections seemingly at
> random.  The error received when a connection is dropped is the following:

> WARNING:  terminating connection because of crash of another server process

Well, that's just the fallout from a PANIC in some other session; if you
weren't watching the session that actually crashed, you need to look in
the postmaster log to see what went wrong.

            regards, tom lane

Re: missing chunk number 0 for toast value

От

"Scott Marlowe"

Дата:

25 сентября 2008 г., 19:04:46

On Thu, Sep 25, 2008 at 10:09 AM, BJ Taylor
<btaylor@propertysolutions.com> wrote:
> PANIC:  right sibling's left-link doesn't match: block 175337 links to
> 243096 instead of expected 29675 in index "dbmail_headervalue_3"
> STATEMENT:  INSERT INTO dbmail_headervalue (headername_id, physmessage_id,
> headervalue) VALUES (4,12335778,'from [76.13.13.25] by
> n6.bullet.mail.ac4.yahoo.com with NNFMP; 25 Sep 2008 04:01:36 -0000')
> LOG:  server process (PID 13888) was terminated by signal 6: Aborted
> LOG:  terminating any other active server processes
> WARNING:  terminating connection because of crash of another server process

Tom, does postgres generate abort signal?  Or would this be an external signal?

Re: missing chunk number 0 for toast value

От

Tom Lane

Дата:

25 сентября 2008 г., 19:34:46

"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> On Thu, Sep 25, 2008 at 10:09 AM, BJ Taylor
> <btaylor@propertysolutions.com> wrote:
>> PANIC:  right sibling's left-link doesn't match: block 175337 links to
>> 243096 instead of expected 29675 in index "dbmail_headervalue_3"
>> STATEMENT:  INSERT INTO dbmail_headervalue (headername_id, physmessage_id,
>> headervalue) VALUES (4,12335778,'from [76.13.13.25] by
>> n6.bullet.mail.ac4.yahoo.com with NNFMP; 25 Sep 2008 04:01:36 -0000')
>> LOG:  server process (PID 13888) was terminated by signal 6: Aborted
>> LOG:  terminating any other active server processes
>> WARNING:  terminating connection because of crash of another server process

> Tom, does postgres generate abort signal?  Or would this be an external signal?

Yeah, we call abort() after reporting a PANIC error, so as to get a core
dump.

            regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: missing chunk number 0 for toast value