Обсуждение: Missing Toast Chunk

Поиск
Список
Период
Сортировка

Missing Toast Chunk

От
Sam Nelson
Дата:
Good morning, list.

We've got a bit of a problem on a customer's production box.  We got a "missing chunk number 0 for toast value N" (N being a number) this week on their production box.  We verified that it was only a problem with one row, tried to fix it with updates, and ended up deleting the row.

To check for similar issues in other tables, we set up a script to run at midnight and do a pg_dump on each individual table in the database where the original error happened, sending stderr to a log file.  Since the original problem was discovered while running pg_dump, we figured this would show us any tables that have similar issues.

We found the same problem in a couple of other tables, but the big problem is that the same table that we just fixed had that error again, in a different row this time.

Some information on the customer's box:  It's an Amazon EC2 box running debian (I believe debian 5, but I'm not sure).  They are using postgres 8.3.11, installed from apt.  They are mainly using ruby on rails for their application(s).

Here's the full error from the log file, minus (mildly) sensitive info:

--| Table schema.table dump start: Wed Aug 18 04:54:34 UTC 2010 |--
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  missing chunk number 0 for toast value N in pg_toast_M
pg_dump: The command was: COPY schema.table (id, foreign_key, some_text_stuff, timestamp1, timestamp2) TO stdout;
--| Table schema.table dump end: Wed Aug 18 04:54:44 UTC 2010 |--

So the question is, what could be causing this?  It's not so terrible a deal that we found that error in their database once, but this happened again right after we fixed it.  Could it be ruby?  The customer's application(s)?  Some weirdness with Amazon EC2 and/or debian?  A bug in postgres, itself?  Any ideas?

-Sam

Re: Missing Toast Chunk

От
Tom Lane
Дата:
Sam Nelson <samn@consistentstate.com> writes:
> We've got a bit of a problem on a customer's production box.  We got a
> "missing chunk number 0 for toast value N" (N being a number) this week on
> their production box.  We verified that it was only a problem with one row,
> tried to fix it with updates, and ended up deleting the row.
> ...
> We found the same problem in a couple of other tables, but the big problem
> is that the same table that we just fixed had that error again, in a
> different row this time.

Did you try reindexing that table's toast table?  If the problem is a
corrupt index rather than an actually missing toast row, then this would
fix it.  Index corruption could also explain multiple such failures in the
same table.

If it becomes clear that there are multiple/continuing occurrences of
corruption, then you've most likely got an unreliable storage system.
It would be a wise idea to go looking for other indicators of
inconsistency and lost rows (dangling foreign-key references for instance).

> Some information on the customer's box:  It's an Amazon EC2 box running
> debian (I believe debian 5, but I'm not sure).  They are using postgres
> 8.3.11, installed from apt.  They are mainly using ruby on rails for their
> application(s).

A number of people around here view EC2 with deep suspicion.  It's great
if you only need semi-reliable computing cycles ...

            regards, tom lane

Re: Missing Toast Chunk

От
Scott Marlowe
Дата:
On Thu, Aug 19, 2010 at 11:26 AM, Sam Nelson <samn@consistentstate.com> wrote:
> Good morning, list.
> We've got a bit of a problem on a customer's production box.  We got a
> "missing chunk number 0 for toast value N" (N being a number) this week on

SNIP

> So the question is, what could be causing this?  It's not so terrible a deal
> that we found that error in their database once, but this happened again
> right after we fixed it.  Could it be ruby?  The customer's application(s)?
>  Some weirdness with Amazon EC2 and/or debian?  A bug in postgres, itself?

It's almost certainly not ruby's fault.  Have they done anything
strange like kill the instance and restart it without letting the db
shut down?  I'd tend to suspect Amazon's fsyncing is amiss and they
did something that triggered it.

--
To understand recursion, one must first understand recursion.

Re: Missing Toast Chunk

От
Sam Nelson
Дата:
Sorry, I forgot to mention that we also tried reindexing the toast table.

On Thu, Aug 19, 2010 at 1:20 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
SNIP

It's almost certainly not ruby's fault.  Have they done anything
strange like kill the instance and restart it without letting the db
shut down?  I'd tend to suspect Amazon's fsyncing is amiss and they
did something that triggered it.

--
To understand recursion, one must first understand recursion.

They haven't done anything like that, that we know of.  However, they do have a process that kills off all waiting (and only waiting) postgres processes if there are more than 1000 locks.  Could that be an issue?

If Amazon's fsyncing is the problem and they're doing something to trigger it, how would we go about debugging that?

Re: Missing Toast Chunk

От
Tom Lane
Дата:
Sam Nelson <samn@consistentstate.com> writes:
>> It's almost certainly not ruby's fault.  Have they done anything
>> strange like kill the instance and restart it without letting the db
>> shut down?  I'd tend to suspect Amazon's fsyncing is amiss and they
>> did something that triggered it.

> They haven't done anything like that, that we know of.  However, they do
> have a process that kills off all waiting (and only waiting) postgres
> processes if there are more than 1000 locks.  Could that be an issue?

What do you mean by "kills off"?  If they randomly kill -9 backend
processes, I would describe that as taking pot-shots at one's own toes
(and hoping that you can't aim well enough to hit them).  In theory
Postgres should survive that without data corruption but it's surely
playing with fire.  And it's most definitely not solving whatever their
real problem is.

Anyway, the known issues in this area have to do with the filesystem not
honoring writes in the correct order.  I'd agree with Scott's suspicion
that a dirty instance shutdown, rather than a dirty database shutdown,
is the more likely cause.

            regards, tom lane

Re: Missing Toast Chunk

От
Sam Nelson
Дата:
I'm honestly not sure what I mean by that -- we didn't write that process.  I don't even know the intent of that process; they first told us that it exists yesterday.  I'll ask my boss to ask them for more details about it in the morning.

We'll talk with 'em and see if they've done anything weird with the database in the last while.

Thanks for your help.

-Sam

On Thu, Aug 19, 2010 at 5:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Sam Nelson <samn@consistentstate.com> writes:
>> It's almost certainly not ruby's fault.  Have they done anything
>> strange like kill the instance and restart it without letting the db
>> shut down?  I'd tend to suspect Amazon's fsyncing is amiss and they
>> did something that triggered it.

> They haven't done anything like that, that we know of.  However, they do
> have a process that kills off all waiting (and only waiting) postgres
> processes if there are more than 1000 locks.  Could that be an issue?

What do you mean by "kills off"?  If they randomly kill -9 backend
processes, I would describe that as taking pot-shots at one's own toes
(and hoping that you can't aim well enough to hit them).  In theory
Postgres should survive that without data corruption but it's surely
playing with fire.  And it's most definitely not solving whatever their
real problem is.

Anyway, the known issues in this area have to do with the filesystem not
honoring writes in the correct order.  I'd agree with Scott's suspicion
that a dirty instance shutdown, rather than a dirty database shutdown,
is the more likely cause.

                       regards, tom lane