Discussion: Memory Errors

Memory Errors

From: Sam Nelson
Hey, a client of ours has been having some data corruption in their database.  We got the data corruption fixed and we believe we've discovered the cause (they had a script killing any waiting queries if the locks on their database hit 1000), but they're still getting errors from one table:

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  invalid memory alloc request size 18446744073709551613
pg_dump: The command was: COPY public.foo (<columns>) TO stdout;

That seems like an incredibly large memory allocation request - it shouldn't be possible for the table to really be that large, should it?  Any idea what may be wrong if it's actually trying to allocate that much memory for a copy command?

Re: Memory Errors

From: Scott Marlowe
On Wed, Sep 8, 2010 at 12:56 PM, Sam Nelson <samn@consistentstate.com> wrote:
> Hey, a client of ours has been having some data corruption in their
> database.  We got the data corruption fixed and we believe we've discovered
> the cause (they had a script killing any waiting queries if the locks on
> their database hit 1000), but they're still getting errors from one table:

Not sure that's really the underlying problem. Depending on how they
killed the processes there's a slight chance of corruption, but more
likely they've got bad hardware.  Can they take their machine down for
testing?  memtest86+ is a good tool to get an idea of whether you've got
a good CPU/motherboard/RAM combination or not.

The last bit you included definitely looks like something's corrupted
in the database.

Re: Memory Errors

From: Tom Lane
Sam Nelson <samn@consistentstate.com> writes:
> pg_dump: Error message from server: ERROR:  invalid memory alloc request
> size 18446744073709551613
> pg_dump: The command was: COPY public.foo (<columns>) TO stdout;

> That seems like an incredibly large memory allocation request - it shouldn't
> be possible for the table to really be that large, should it?  Any idea what
> may be wrong if it's actually trying to allocate that much memory for a copy
> command?

What that looks like is data corruption; specifically, a bogus length
word for a variable-length field.
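
Consistent with that: 18446744073709551613 is 2^64 - 3, i.e. a small
negative number such as -3 reinterpreted as an unsigned 64-bit allocation
size, which is just what a garbage length word would produce.  A quick
sanity check of the arithmetic from psql:

select (2::numeric ^ 64) - 18446744073709551613::numeric;
-- returns 3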

            regards, tom lane

Re: Memory Errors

From: Sam Nelson
It figures I'd have an idea right after posting to the mailing list.

Yeah, running COPY foo TO stdout; gets me a list of data before erroring out, so I did a copy (select * from foo order by id asc) to stdout; to see if I could make some kind of guess as to whether this was related to a single row or something else.

I got the id of the last row the copy to command was able to grab normally and tried to figure out the next id.  The following started to make me think along the lines of some kind of bad corruption (even before getting responses that agreed with that):

Assuming that the last id copied was 1500:

1) select * from foo where id = (select min(id) from foo where id > 1500);
Results in 0 rows

2) select min(id) from foo where id > 1500;
Results in, for example, 200000

3) select max(id) from foo where id > 1500;
Results in, for example, 90000 (a much lower number than returned by min)

4) select id from foo where id > 1500 order by id asc limit 10;
Results in (for example):

200000
202000
210273
220980
15005
15102
15104
15110
15111
15113

So ... yes, it seems that those four ids are somehow part of the problem.
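
Those results (a min() larger than the max(), and ordered output that jumps backward) suggest the index and the heap disagree.  One cross-check we may try, sketched here: force a sequential scan and see whether it returns different answers than the index scan did; if so, the index on id is corrupt and a REINDEX is probably in order.

set enable_indexscan = off;
set enable_bitmapscan = off;
select min(id), max(id) from foo where id > 1500;  -- compare with the index-scan results above
reset enable_indexscan;
reset enable_bitmapscan;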

They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes either), so memtest isn't available, but no new corruption has cropped up since they stopped killing the waiting queries (I just double checked - they were getting corrupted rows constantly, and we haven't gotten one since that script stopped killing queries).

We're going to have them attempt to delete the rows with those ids (even though the rows don't appear to exist), and if that fails, we're going to COPY (SELECT * FROM foo WHERE id NOT IN (<list>)) TO a file, drop and recreate the table, and COPY the data back in from the file.  I'll try to remember to write back with whether or not any of those things worked.
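
As a sketch, an equivalent salvage that clones into a new table instead of round-tripping through a file (the four ids are the placeholder values from above, and this assumes the filter on id keeps the damaged varlena columns from ever being read):

begin;
create table foo_clean (like foo including defaults including constraints including indexes);
insert into foo_clean
  select * from foo
  where id not in (200000, 202000, 210273, 220980);  -- skip the suspect rows
alter table foo rename to foo_corrupt;               -- keep the evidence around
alter table foo_clean rename to foo;
commit;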

On Wed, Sep 8, 2010 at 1:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Sam Nelson <samn@consistentstate.com> writes:
> pg_dump: Error message from server: ERROR:  invalid memory alloc request
> size 18446744073709551613
> pg_dump: The command was: COPY public.foo (<columns>) TO stdout;

> That seems like an incredibly large memory allocation request - it shouldn't
> be possible for the table to really be that large, should it?  Any idea what
> may be wrong if it's actually trying to allocate that much memory for a copy
> command?

What that looks like is data corruption; specifically, a bogus length
word for a variable-length field.

                       regards, tom lane

Re: Memory Errors

From: Merlin Moncure
On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson <samn@consistentstate.com> wrote:
> It figures I'd have an idea right after posting to the mailing list.
> Yeah, running COPY foo TO stdout; gets me a list of data before erroring
> out, so I did a copy (select * from foo order by id asc) to stdout; to see
> if I could make some kind of guess as to whether this was related to a
> single row or something else.
> I got the id of the last row the copy to command was able to grab normally
> and tried to figure out the next id.  The following started to make me think
> along the lines of some kind of bad corruption (even before getting responses
> that agreed with that):
> Assuming that the last id copied was 1500:
> 1) select * from foo where id = (select min(id) from foo where id > 1500);
> Results in 0 rows
> 2) select min(id) from foo where id > 1500;
> Results in, for example, 200000
> 3) select max(id) from foo where id > 1500;
> Results in, for example, 90000 (a much lower number than returned by min)
> 4) select id from foo where id > 1500 order by id asc limit 10;
> Results in (for example):
> 200000
> 202000
> 210273
> 220980
> 15005
> 15102
> 15104
> 15110
> 15111
> 15113
> So ... yes, it seems that those four ids are somehow part of the problem.
> They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
> either), so memtest isn't available, but no new corruption has cropped up
> since they stopped killing the waiting queries (I just double checked - they
> were getting corrupted rows constantly, and we haven't gotten one since that
> script stopped killing queries).

That's actually a startling indictment of ec2 -- how were you killing
your queries exactly?  You say this is repeatable?  What's your
setting of full_page_writes?

One way to identify and potentially nuke bad records of this kind is
to do something like:

select length(field1) from foo order by 1 desc limit 5;

where field1 is the first variable-length field (text, bytea, etc.) in
left-to-right order.  Look for bogusly high values and move on to the
next field if you don't find any.  Once you hit a bad value, try
deleting the record by its key.

Once you've found/deleted them all, immediately pull off a dump, then
rebuild the table.  As always, take a filesystem-level backup before
doing this type of surgery...
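
If length() itself errors out before showing anything, a more surgical approach is to probe row by row by ctid and trap the error, something like the sketch below (DO blocks need 9.0+; on 8.3 the same loop can be wrapped in a plpgsql function, and foo/field1 are placeholder names):

do $$
declare
  npages integer := (select relpages from pg_class where relname = 'foo');
  p integer;
  l integer;
begin
  -- relpages is an estimate; ANALYZE first, or pad the page count
  for p in 0 .. npages - 1 loop
    for l in 1 .. 291 loop  -- 291 = max heap tuples per 8k page
      begin
        -- force the varlena field to be read; a bogus length word errors here
        perform length(field1) from foo where ctid = ('(' || p || ',' || l || ')')::tid;
      exception when others then
        raise notice 'bad tuple at ctid (%,%): %', p, l, sqlerrm;
      end;
    end loop;
  end loop;
end $$;

Once the bad ctids are known, the rows can be deleted directly with delete from foo where ctid = '(page,line)'::tid.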

merlin

Re: Memory Errors

From: Tom Lane
Merlin Moncure <mmoncure@gmail.com> writes:
> On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson <samn@consistentstate.com> wrote:
>> So ... yes, it seems that those four ids are somehow part of the problem.
>> They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
>> either), so memtest isn't available, but no new corruption has cropped up
>> since they stopped killing the waiting queries (I just double checked - they
>> were getting corrupted rows constantly, and we haven't gotten one since that
>> script stopped killing queries).

> That's actually a startling indictment of ec2 -- how were you killing
> your queries exactly?  You say this is repeatable?  What's your
> setting of full_page_writes?

I think we'd established that they were doing kill -9 on backend
processes :-(.  However, PG has a lot of track record that says that
backend crashes don't result in corrupt data.  What seems more likely
to me is that the corruption is the result of some shortcut taken while
shutting down or migrating the ec2 instance, so that some writes that
Postgres thought got to disk didn't really.

            regards, tom lane

Re: Memory Errors

From: Sam Nelson
My (our) complaints about EC2 aren't particularly extensive, but last time I posted to the mailing list saying they were using EC2, the first reply was someone saying that the corruption was the fault of EC2.

It's not that we have no complaints (there are some aspects that are very frustrating), but I was just trying to stave off anyone who was going to reply saying "Tell them to stop using EC2".

 -- More detail about the script that kills queries:

Honestly, we (or, at least, I) haven't discovered which type of kill they were doing, but it does seem to be the culprit in some way.  I don't talk to the customers (that's my boss's job), so I didn't get to ask specifics.  If my boss did ask specifics, he didn't tell me.

The previous issue involved toast corruption showing up very regularly (e.g. once a day, in some cases), the end result being that we had to delete the corrupted rows.  Coming back the next day to see the same corruption on different rows was not very encouraging.

We found out afterward that they had a script running as a daemon that would, every ten minutes (I believe), check the number of locks on the table and kill all waiting queries if there were >= 1000 locks.

Even if the corruption wasn't a result of that, we weren't too excited about the process being there to begin with; we thought there had to be a better solution than just killing the processes.  So we had a discussion about the intent of that script, my boss came up with something that solved the same problem without killing queries, and then had them stop that daemon.  We have been working with that database since to make sure it doesn't go screwy again, and no new corruption has shown up since stopping that daemon.
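
For the record, if one did want to cancel waiting queries from inside Postgres rather than with a shell kill, something like this would be the gentler 8.3-era equivalent (a sketch; procpid and waiting are the pg_stat_activity columns of that era, and pg_terminate_backend only appeared in 8.4):

select pg_cancel_backend(procpid)
from pg_stat_activity
where waiting
  and (select count(*) from pg_locks) >= 1000;

pg_cancel_backend sends SIGINT, which cancels the running query without taking the whole backend down.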

That memory allocation issue looked drastically different from the toast value errors, though, so it seemed like a separate problem.  But now it's looking like more corruption.

---

We're requesting that they do a few things (this is their production database, so we usually don't alter any data unless they ask us to), including deleting those rows.  My memory is insufficient, so there's a good chance that I'll forget to post back to the mailing list with the results, but I'll try to remember to do so.

Thank you for the help - I'm sure I'll be back soon with many more questions.

-Sam

On Wed, Sep 8, 2010 at 2:58 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Merlin Moncure <mmoncure@gmail.com> writes:
> On Wed, Sep 8, 2010 at 4:03 PM, Sam Nelson <samn@consistentstate.com> wrote:
>> So ... yes, it seems that those four ids are somehow part of the problem.
>> They're on amazon EC2 boxes (yeah, we're not too fond of the EC2 boxes
>> either), so memtest isn't available, but no new corruption has cropped up
>> since they stopped killing the waiting queries (I just double checked - they
>> were getting corrupted rows constantly, and we haven't gotten one since that
>> script stopped killing queries).

> That's actually a startling indictment of ec2 -- how were you killing
> your queries exactly?  You say this is repeatable?  What's your
> setting of full_page_writes?

I think we'd established that they were doing kill -9 on backend
processes :-(.  However, PG has a lot of track record that says that
backend crashes don't result in corrupt data.  What seems more likely
to me is that the corruption is the result of some shortcut taken while
shutting down or migrating the ec2 instance, so that some writes that
Postgres thought got to disk didn't really.

                       regards, tom lane

Re: Memory Errors

From: Merlin Moncure
On Wed, Sep 8, 2010 at 6:55 PM, Sam Nelson <samn@consistentstate.com> wrote:
> Even if the corruption wasn't a result of that, we weren't too excited about
> the process being there to begin with; we thought there had to be a better
> solution than just killing the processes.  So we had a discussion about the
> intent of that script, my boss came up with something that solved the same
> problem without killing queries, and then had them stop that daemon.  We have
> been working with that database since to make sure it doesn't go screwy
> again, and no new corruption has shown up since stopping that daemon.
> That memory allocation issue looked drastically different from the toast
> value errors, though, so it seemed like a separate problem.  But now it's
> looking like more corruption.
> ---
> We're requesting that they do a few things (this is their production
> database, so we usually don't alter any data unless they ask us to),
> including deleting those rows.  My memory is insufficient, so there's a good
> chance that I'll forget to post back to the mailing list with the results,
> but I'll try to remember to do so.
> Thank you for the help - I'm sure I'll be back soon with many more
> questions.

Any information on repeatable data corruption, whether it is ec2
improperly flushing data on instance resets, postgres misbehaving
under atypical conditions, or bad interactions between ec2 and
postgres is highly valuable.  The only cases of 'understandable' data
corruption are hardware failures, sync issues (either fsync off, or
fsync not honored by hardware), torn pages on non-journaling file
systems, etc.
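
A first bit of evidence worth capturing here is the durability-related settings on that instance, e.g.:

select name, setting
from pg_settings
where name in ('fsync', 'full_page_writes', 'synchronous_commit');

If fsync is off, or the underlying storage acknowledges writes it hasn't actually persisted, all bets are off across a hard reset.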

Naturally people are going to be skeptical of ec2 since you are so
abstracted from the hardware.  Maybe all your problems stem from a
single explainable incident -- but we definitely want to get to the
bottom of this...please keep us updated!

merlin

Re: Memory Errors

From: Sam Nelson
Okay, we're finally getting the last bits of corruption fixed, and I finally remembered to ask my boss about the kill script.

The only details I have are these:

1) The script does nothing if there are fewer than 1000 locks on tables in the database

2) If there are 1000 or more locks, it will grab the processes in pg_stat_activity that are in a waiting state

3) For each of those processes, it will do a system kill $pid call

The kill is not pg_terminate_backend or pg_cancel_backend, and it's also not a kill -9.  Just a normal kill.
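
From that description, the daemon presumably did something along these lines (a reconstruction -- the actual script wasn't posted; procpid and waiting are the 8.3-era pg_stat_activity columns):

select procpid
from pg_stat_activity
where waiting
  and (select count(*) from pg_locks) >= 1000;
-- ...followed by a plain "kill <pid>" (SIGTERM) from the shell for each pid returned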

As far as the postgres and EC2 instances go, we're not really sure if anyone shut down, created, or migrated them in a weird way, but Kevin (my boss) said that it wouldn't surprise him.

All I can say is that where we were getting 1 new row of corruption every day when the kill script was running, we haven't gotten any new corruption since we stopped it.

As far as the table with memory errors goes, we had asked them to rebuild the table, and they came back saying that they no longer need that table.  So they're just going to drop it.

We'll try to keep digging, but I'm not sure we'll get much more info than that.  We're quite busy and my ability to remember things is ... questionable.

-Sam

On Thu, Sep 9, 2010 at 8:14 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Wed, Sep 8, 2010 at 6:55 PM, Sam Nelson <samn@consistentstate.com> wrote:
> Even if the corruption wasn't a result of that, we weren't too excited about
> the process being there to begin with; we thought there had to be a better
> solution than just killing the processes.  So we had a discussion about the
> intent of that script, my boss came up with something that solved the same
> problem without killing queries, and then had them stop that daemon.  We have
> been working with that database since to make sure it doesn't go screwy
> again, and no new corruption has shown up since stopping that daemon.
> That memory allocation issue looked drastically different from the toast
> value errors, though, so it seemed like a separate problem.  But now it's
> looking like more corruption.
> ---
> We're requesting that they do a few things (this is their production
> database, so we usually don't alter any data unless they ask us to),
> including deleting those rows.  My memory is insufficient, so there's a good
> chance that I'll forget to post back to the mailing list with the results,
> but I'll try to remember to do so.
> Thank you for the help - I'm sure I'll be back soon with many more
> questions.

Any information on repeatable data corruption, whether it is ec2
improperly flushing data on instance resets, postgres misbehaving
under atypical conditions, or bad interactions between ec2 and
postgres is highly valuable.  The only cases of 'understandable' data
corruption are hardware failures, sync issues (either fsync off, or
fsync not honored by hardware), torn pages on non-journaling file
systems, etc.

Naturally people are going to be skeptical of ec2 since you are so
abstracted from the hardware.  Maybe all your problems stem from a
single explainable incident -- but we definitely want to get to the
bottom of this...please keep us updated!

merlin

Re: Memory Errors

From: Merlin Moncure
On Tue, Sep 21, 2010 at 12:57 PM, Sam Nelson <samn@consistentstate.com> wrote:
>> On Thu, Sep 9, 2010 at 8:14 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> Naturally people are going to be skeptical of ec2 since you are so
>> abstracted from the hardware.  Maybe all your problems stem from a
>> single explainable incident -- but we definitely want to get to the
>> bottom of this...please keep us updated!
>
> As far as the postgres and EC2 instances go, we're not really sure if anyone
> shut down, created, or migrated them in a weird way, but Kevin (my boss)
> said that it wouldn't surprise him.

<please try to avoid top-posting -- it destroys the context of the conversation>

The shutdown/migration point is key, along with fsync settings and a
description of whatever durability guarantees EC2 gives on the storage
you are using.  It's the difference between this being a non-event and
something much more interesting.  The correct way, btw, to kill backends
is with pg_ctl, but what you did is not related to data corruption.

merlin

Re: Memory Errors

From: Tom Lane
Sam Nelson <samn@consistentstate.com> writes:
> Okay, we're finally getting the last bits of corruption fixed, and I finally
> remembered to ask my boss about the kill script.

> The only details I have are these:

> 1) The script does nothing if there are fewer than 1000 locks on tables in
> the database

> 2) If there are 1000 or more locks, it will grab the processes in
> pg_stat_activity that are in a waiting state

> 3) For each of those processes, it will do a system kill $pid call

> The kill is not pg_terminate_backend or pg_cancel_backend, and it's also not
> a kill -9.  Just a normal kill.

SIGTERM then.  Since (according to the other thread) this was 8.3.11,
that should in theory be safe; but it's not something I'd consider
tremendously well tested before 8.4.x.

I'd still lean to the theory of data lost during an EC2 instance
shutdown.

            regards, tom lane