Thread: invalid memory alloc request size

invalid memory alloc request size

From: Janning Vygen
Date:
Hi,

my cron job, which dumps the database, failed last night. I got:

pg_dump: ERROR:  invalid memory alloc request size 18446744073709551614
pg_dump: SQL command to dump the contents of table "spieletipps" failed:
PQendcopy() failed.
pg_dump: Error message from server: ERROR:  invalid memory alloc request size
18446744073709551614
pg_dump: The command was: COPY public.spieletipps (tr_kurzname, mg_name,
sp_id, stip_heimtore, stip_gasttore) TO stdout;

I am running
  postgresql-server-8.0.3-1.2
  on SuSE Linux 9.3 (x86-64)

I had this a few days ago and decided to use a recent backup. That worked fine
for only two days. Maybe my hard disk is broken? Maybe 64-bit is broken? I
have no clue and no idea what to do. I've searched the archives and found a
recent thread on HACKERS, but sorry guys: I don't know how to produce a
backtrace.

select count(*) from spieletipps;
  count
----------
 11612957
(1 Zeile)

works fine. When i do something like this:

$ select * from spieletipps where sp_id > 10000000;

Server beendete die Verbindung unerwartet
        Das heißt wahrscheinlich, daß der Server abnormal beendete
        bevor oder während die Anweisung bearbeitet wurde.
Die Verbindung zum Server wurde verloren.  Versuche Reset: Fehlgeschlagen.

(It means: the server closed the connection unexpectedly. ... Attempting
reset: failed.)

Please help me!

kind regards,
janning

Re: invalid memory alloc request size

From: Tom Lane
Date:
Janning Vygen <vygen@gmx.de> writes:
> pg_dump: ERROR:  invalid memory alloc request size 18446744073709551614
> pg_dump: SQL command to dump the contents of table "spieletipps" failed:
> PQendcopy() failed.

This looks more like a corrupt-data problem than anything else.  Have
you tried the usual memory and disk testing programs?

> recent thread on HACKERS, but sorry guys: I don't know how to produce a
> backtrace.

Time to learn ;-)

    gdb /path/to/postgres_executable /path/to/core_file
    gdb> bt
    gdb> q

The core file will be somewhere under $PGDATA, named either "core" or
"core.nnnnn" depending on your kernel settings.  If you don't see one
then it's probable that the postmaster was started under "ulimit -c 0".
Put "ulimit -c unlimited" in your postgres startup script, restart,
trigger the crash again.
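
A minimal sketch for checking the core-file limit, assuming a Unix system with Python available. Resource limits are inherited by child processes, so running this from the same environment as the startup script shows what the postmaster would get. The function names are illustrative, not part of PostgreSQL.

```python
import resource


def core_limit():
    """Return the (soft, hard) core-file size limits of this process."""
    return resource.getrlimit(resource.RLIMIT_CORE)


def core_dumps_enabled():
    """True unless the soft limit is 0, which is what "ulimit -c 0" sets."""
    soft, _hard = core_limit()
    return soft != 0


def describe_core_limit():
    """Render the limits the way the shell does, with "unlimited" spelled out."""
    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    soft, hard = core_limit()
    return "soft=%s hard=%s" % (fmt(soft), fmt(hard))


if __name__ == "__main__":
    print(describe_core_limit())
    if not core_dumps_enabled():
        print('no core file will be written; add "ulimit -c unlimited" to the startup script')
```

This only reports the limit of the process that runs it; it says nothing about a postmaster that was started earlier under different settings.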

It's also a good idea to look in the postmaster log to see if any
unusual messages appeared before the crash.

            regards, tom lane

Re: invalid memory alloc request size

From: Janning Vygen
Date:
On Monday, 23 January 2006 17:05, Tom Lane wrote:
> Janning Vygen <vygen@gmx.de> writes:
> > pg_dump: ERROR:  invalid memory alloc request size 18446744073709551614
> > pg_dump: SQL command to dump the contents of table "spieletipps" failed:
> > PQendcopy() failed.
>
> This looks more like a corrupt-data problem than anything else.  Have
> you tried the usual memory and disk testing programs?

No, I didn't. What are the usual memory and disk testing programs? (A few
weeks ago I wanted to start a troubleshooting guide for guys like me, but I
haven't started yet. This needs to be documented.) I am not a system
administrator, and a hard disk is a black box to me.

By the way: the database is still running and serving requests.

> > recent thread on HACKERS, but sorry guys: I don't know how to produce a
> > backtrace.
>
> Time to learn ;-)
>
>     gdb /path/to/postgres_executable /path/to/core_file
>     gdb> bt
>     gdb> q

I shouldn't run gdb while my database is up and running, should I?

I tried to find and delete the corrupted row (as you mentioned in
http://archives.postgresql.org/pgsql-admin/2006-01/msg00117.php)

I found it:

$ select sp_id from spieletipps limit 1 offset 387583;
Server beendete die Verbindung unerwartet
        Das heißt wahrscheinlich, daß der Server abnormal beendete
        bevor oder während die Anweisung bearbeitet wurde.
Die Verbindung zum Server wurde verloren.  Versuche Reset: Fehlgeschlagen.
!> \q

and i can get the ctid:

$ select ctid from spieletipps limit 1 offset 387583;
   ctid
-----------
 (3397,49)
(1 Zeile)


but when i want to delete it:
$ delete from spieletipps where ctid = '(3397,49)';
Server beendete die Verbindung unerwartet
        Das heißt wahrscheinlich, daß der Server abnormal beendete
        bevor oder während die Anweisung bearbeitet wurde.
Die Verbindung zum Server wurde verloren.  Versuche Reset: Fehlgeschlagen.

How can I get rid of it? (I don't have OIDs in the table; I created it
without OIDs.)

> The core file will be somewhere under $PGDATA, named either "core" or
> "core.nnnnn" depending on your kernel settings.  If you don't see one
> then it's probable that the postmaster was started under "ulimit -c 0".
> Put "ulimit -c unlimited" in your postgres startup script, restart,
> trigger the crash again.
>
> It's also a good idea to look in the postmaster log to see if any
> unusual messages appeared before the crash.

this is from the postmaster log:

LOG:  server process (PID 14756) was terminated by signal 11
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing
FATAL:  the database system is starting up
LOG:  database system was interrupted at 2006-01-23 09:46:03 CET
LOG:  checkpoint record is at 1/D890C0E0
LOG:  redo record is at 1/D88F93E8; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 485068; next OID: 16882321
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 1/D88F93E8
LOG:  record with zero length at 1/D8953988
LOG:  redo done at 1/D8953920
LOG:  database system is ready
LOG:  server process (PID 15198) was terminated by signal 11
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat
your command.
FATAL:  the database system is in recovery mode
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2006-01-23 09:46:15 CET
LOG:  checkpoint record is at 1/D8953988
LOG:  redo record is at 1/D8953988; undo record is at 0/0; shutdown TRUE
LOG:  next transaction ID: 485130; next OID: 16882321
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 1/D89539D0
LOG:  record with zero length at 1/D8966BF8
LOG:  redo done at 1/D8966BC8
LOG:  database system is ready
LOG:  server process (PID 15400) was terminated by signal 11
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat
your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2006-01-23 09:46:24 CET
LOG:  checkpoint record is at 1/D8966BF8
LOG:  redo record is at 1/D8966BF8; undo record is at 0/0; shutdown TRUE
LOG:  next transaction ID: 485183; next OID: 16882321
LOG:  database system was not properly shut down; automatic recovery in
progress
FATAL:  the database system is starting up
LOG:  redo starts at 1/D8966C40
LOG:  record with zero length at 1/D8991CC8
LOG:  redo done at 1/D8991C98
LOG:  database system is ready

any further help is much appreciated,

kind regards
janning


Re: invalid memory alloc request size

From: Tom Lane
Date:
Janning Vygen <vygen@gmx.de> writes:
> I shouldn't run gdb while my database is up and running, should I?

Sure you can.  Especially against a core dump --- that mode doesn't have
anything to do with the running processes.

> $ delete from spieletipps where ctid = '(3397,49)';
> Server beendete die Verbindung unerwartet

Hmm ... as far as I can think at the moment, this suggests a problem
with a toasted field; DELETE wouldn't need to look at the contents of
a target row except if it has to find and delete subsidiary toast rows.
But looking at the gdb backtrace would help to confirm or deny that.

Another thing that would be useful at this point is to get a dump of the
page containing the corrupted tuple, which we now know is block 3397 of
that table.  See pg_filedump from
http://sources.redhat.com/rhdb/utilities.html
Something like "pg_filedump -i -f -R 3397 $PGDATA/base/XXXX/YYYY", where
XXXX is the database OID and YYYY is the table's relfilenode.
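
A small sketch of how the pieces fit together, assuming the default 8192-byte block size and a table small enough to live in the first segment file. The database OID comes from pg_database.oid and the relfilenode from pg_class.relfilenode; the data directory and the two numbers in the usage lines are the ones reported later in this thread. The helper names are made up for illustration.

```python
import posixpath
import re

BLOCK_SIZE = 8192  # assumption: the default PostgreSQL block size (BLCKSZ)


def parse_ctid(ctid):
    """Split a ctid such as '(3397,49)' into (block number, item number)."""
    match = re.fullmatch(r"\((\d+),(\d+)\)", ctid.strip())
    if match is None:
        raise ValueError("not a ctid: %r" % (ctid,))
    return int(match.group(1)), int(match.group(2))


def relation_path(pgdata, db_oid, relfilenode):
    """Path of the table's first segment file: $PGDATA/base/XXXX/YYYY."""
    return posixpath.join(pgdata, "base", str(db_oid), str(relfilenode))


def filedump_command(pgdata, db_oid, relfilenode, block):
    """Argument list for the pg_filedump call suggested above."""
    return ["pg_filedump", "-i", "-f", "-R", str(block),
            relation_path(pgdata, db_oid, relfilenode)]


def block_byte_offset(block, block_size=BLOCK_SIZE):
    """Byte offset of a block inside the segment file."""
    return block * block_size


if __name__ == "__main__":
    block, item = parse_ctid("(3397,49)")
    print(" ".join(filedump_command("/home/postgres8/data", 12934120, 12934361, block)))
    print("block starts at byte", block_byte_offset(block))
```

The byte offset is only useful for locating the page in a raw hex view; pg_filedump itself takes the block number.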

            regards, tom lane

Re: invalid memory alloc request size

From: Tom Lane
Date:
Janning Vygen <vygen@gmx.de> writes:
> Ok, i got the relfilenode from pg_class and compiled pg_filedump. result of
> ./pg_filedump -i -f -R 3397 /home/postgres8/data/base/12934120/12934361 >
> filedump.txt is attached

OK, what's the schema of this table exactly?  It looks like there are
a couple of text or varchar columns to start, but I'm not sure about the
last three columns.

> but i guess its item 49 which makes trouble
>   1258: 01000000 616c7465 68656964 65000000  ....alteheide...

> But it doesn't look very different from item 48:
>   12a0: 0d000000 616c7465 68656964 65000000  ....alteheide...

If these are both supposed to be strings 'alteheide', then the problem
is the bogus length word on the first one: instead of starting with
01000000 it should start with 0d000000, like the second one does.
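
The arithmetic behind this can be sketched as follows. The assumptions are: a little-endian machine (the report says x86-64), a 4-byte length word that counts itself (the varlena layout of this PostgreSQL generation), and an output routine that allocates the payload length plus one byte for a terminator. Under those assumptions the bogus word reproduces the size in the error message.

```python
import struct

VARHDRSZ = 4  # the length word is 4 bytes and its value includes those 4 bytes


def length_word(hex_word):
    """Decode a length word as printed by pg_filedump, e.g. '0d000000'."""
    return struct.unpack("<I", bytes.fromhex(hex_word))[0]


def payload_length(hex_word):
    """Number of data bytes that follow the length word (may come out negative)."""
    return length_word(hex_word) - VARHDRSZ


def alloc_request(hex_word, size_t_bits=64):
    """Payload plus one terminator byte, seen as an unsigned size_t.

    Assumption: this mirrors what the text output code asks the allocator for.
    """
    return (payload_length(hex_word) + 1) % (1 << size_t_bits)


if __name__ == "__main__":
    for word in ("0d000000", "01000000"):
        print(word, length_word(word), payload_length(word), alloc_request(word))
```

For item 48, 0d000000 decodes to 13, leaving 9 payload bytes, exactly the length of 'alteheide'. For item 49, 01000000 decodes to 1, the payload length becomes -3, and the allocation request wraps around to 18446744073709551614.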

It's conceivable that this stems from a software problem, but I'm
wondering about hardware problems causing dropped bits, myself.

Another point is that AFAICS this tuple could not pose a problem for
DELETE all by itself, because it doesn't have any toasted fields.
Perhaps there is more corruption elsewhere.  Could you get a stack
trace from the crashed DELETE, rather than a crashed SELECT?

            regards, tom lane

Re: invalid memory alloc request size

From: Janning Vygen
Date:
On Monday, 23 January 2006 20:30, Tom Lane wrote:
> Janning Vygen <vygen@gmx.de> writes:
> > Ok, i got the relfilenode from pg_class and compiled pg_filedump. result
> > of ./pg_filedump -i -f -R 3397
> > /home/postgres8/data/base/12934120/12934361 > filedump.txt is attached
>
> OK, what's the schema of this table exactly?  It looks like there are
> a couple of text or varchar columns to start, but I'm not sure about the
> last three columns.

kicktipp.de=> \d spieletipps
     Tabelle »public.spieletipps«
    Spalte     |   Typ    | Attribute
---------------+----------+-----------
 tr_kurzname   | text     | not null
 mg_name       | text     | not null
 sp_id         | integer  | not null
 stip_heimtore | smallint | not null
 stip_gasttore | smallint | not null
Indexe:
    »pk_spieletipps« PRIMARY KEY, btree (tr_kurzname, mg_name, sp_id)
    »ix_stip_fk_spiele« btree (tr_kurzname, sp_id) CLUSTER
Fremdschlüssel-Constraints:
    »fk_mitglieder« FOREIGN KEY (tr_kurzname, mg_name) REFERENCES
mitglieder(tr_kurzname, mg_name) ON UPDATE CASCADE ON DELETE CASCADE
DEFERRABLE INITIALLY DEFERRED
    »fk_tippspieltage2spiele« FOREIGN KEY (tr_kurzname, sp_id) REFERENCES
tippspieltage2spiele(tr_kurzname, sp_id) ON UPDATE CASCADE ON DELETE CASCADE
DEFERRABLE INITIALLY DEFERRED
Regeln:
    cache_stip_delete AS
    ON DELETE TO spieletipps DO  UPDATE tsptcache SET tc_cache = -2
   FROM tippspieltage2spiele tspt2sp, spiele sp
  WHERE tsptcache.tr_kurzname = old.tr_kurzname AND tspt2sp.tr_kurzname =
old.tr_kurzname AND tspt2sp.sp_id = old.sp_id AND tspt2sp.sp_id = sp.sp_id
AND sp.sp_abpfiff = true AND tsptcache.tspt_sort >= tspt2sp.tspt_sort AND
sign((old.stip_heimtore - old.stip_gasttore)::double precision) =
sign((sp.sp_heimtore - sp.sp_gasttore)::double precision) AND
tsptcache.tc_cache <> -2
    cache_stip_insert AS
    ON INSERT TO spieletipps DO  UPDATE tsptcache SET tc_cache = -2
   FROM tippspieltage2spiele tspt2sp, spiele sp
  WHERE tsptcache.tr_kurzname = new.tr_kurzname AND tspt2sp.tr_kurzname =
new.tr_kurzname AND tspt2sp.sp_id = new.sp_id AND tspt2sp.sp_id = sp.sp_id
AND sp.sp_abpfiff = true AND tsptcache.tspt_sort >= tspt2sp.tspt_sort AND
sign((new.stip_heimtore - new.stip_gasttore)::double precision) =
sign((sp.sp_heimtore - sp.sp_gasttore)::double precision) AND
tsptcache.tc_cache <> -2
    cache_stip_update AS
    ON UPDATE TO spieletipps DO  UPDATE tsptcache SET tc_cache = -2
   FROM tippspieltage2spiele tspt2sp, spiele sp
  WHERE tsptcache.tr_kurzname = new.tr_kurzname AND tspt2sp.tr_kurzname =
new.tr_kurzname AND tspt2sp.sp_id = new.sp_id AND tspt2sp.sp_id = sp.sp_id
AND sp.sp_abpfiff = true AND tsptcache.tspt_sort >= tspt2sp.tspt_sort AND
(sign((new.stip_heimtore - new.stip_gasttore)::double precision) =
sign((sp.sp_heimtore - sp.sp_gasttore)::double precision) OR
sign((old.stip_heimtore - old.stip_gasttore)::double precision) =
sign((sp.sp_heimtore - sp.sp_gasttore)::double precision)) AND
tsptcache.tc_cache <> -2

> > but i guess its item 49 which makes trouble
> >   1258: 01000000 616c7465 68656964 65000000  ....alteheide...
> >
> > But it doesn't look very different from item 48:
> >   12a0: 0d000000 616c7465 68656964 65000000  ....alteheide...
>
> If these are both supposed to be strings 'alteheide', then the problem
> is the bogus length word on the first one: instead of starting with
> 01000000 it should start with 0d000000, like the second one does.

yes, they should both be "alteheide". Is it possible to open the file and just
fix the bit?

> It's conceivable that this stems from a software problem, but I'm
> wondering about hardware problems causing dropped bits, myself.

I have no clue why it happens. But I changed my schema a few months ago to use
a materialized view (you can see all the rules in the schema above). I need a
complicated ranking algorithm to calculate the materialized view. Everything
is implemented inside PostgreSQL with rules and functions (plperl and
plpgsql). Another factor may be temp tables: I use lots of them for one
specific task (reusing the calculation algorithm mentioned above for a
different data view). With lots of temp tables I got problems with pg_type,
where some old temp entries linger, and I have to delete some of them manually
a few times per month. Overall, my "feeling" is that I encounter problems like
this one too often to believe in hardware problems. But this one seems to
be new, and I have no clue whether it is hardware or software related. Right
now I just want to fix it. But if you want to take a closer look at it, I will
send you all you need.

> Another point is that AFAICS this tuple could not pose a problem for
> DELETE all by itself, because it doesn't have any toasted fields.
> Perhaps there is more corruption elsewhere.  Could you get a stack
> trace from the crashed DELETE, rather than a crashed SELECT?

Maybe the rule is a problem?

here you are. I did:

select ctid from spieletipps limit 1 offset 387439;
   ctid
-----------
 (3397,49)
(1 Zeile)

kicktipp.de=> delete from spieletipps where ctid = '(3397,49)';
Server beendete die Verbindung unerwartet
        Das heißt wahrscheinlich, daß der Server abnormal beendete
        bevor oder während die Anweisung bearbeitet wurde.
Die Verbindung zum Server wurde verloren.  Versuche Reset: Fehlgeschlagen.
!> \q



gdb output:

Loaded symbols for /usr/lib64/libkrb5support.so.0
#0  0x00000000004373f0 in nocachegetattr ()
(gdb) bt
#0  0x00000000004373f0 in nocachegetattr ()
#1  0x00000000004d614d in ExecInitExprInitPlan ()
#2  0x00000000004d3c1d in ExecProject ()
#3  0x00000000004d7c38 in ExecScan ()
#4  0x00000000004d33ad in ExecProcNode ()
#5  0x00000000004dfde1 in ExecNestLoop ()
#6  0x00000000004d337d in ExecProcNode ()
#7  0x00000000004dfde1 in ExecNestLoop ()
#8  0x00000000004d337d in ExecProcNode ()
#9  0x00000000004dfde1 in ExecNestLoop ()
#10 0x00000000004d337d in ExecProcNode ()
#11 0x00000000004d1e9c in ExecutorRun ()
#12 0x0000000000549b32 in CreateQueryDesc ()
#13 0x000000000054a10a in PortalRun ()
#14 0x0000000000546382 in pg_parse_query ()
#15 0x0000000000547eba in PostgresMain ()
#16 0x000000000051ef44 in ClosePostmasterPorts ()
#17 0x000000000051fce1 in PostmasterMain ()
#18 0x00000000004ef5c3 in main ()
(gdb) q


kind regards,
janning

Re: invalid memory alloc request size

From: Tom Lane
Date:
Janning Vygen <vygen@gmx.de> writes:
>> OK, what's the schema of this table exactly?

> ...
> Regeln:
>     cache_stip_delete AS
>     ON DELETE TO spieletipps DO  UPDATE tsptcache SET tc_cache = -2
>    FROM tippspieltage2spiele tspt2sp, spiele sp
>   WHERE tsptcache.tr_kurzname = old.tr_kurzname AND tspt2sp.tr_kurzname =
> old.tr_kurzname AND tspt2sp.sp_id = old.sp_id AND tspt2sp.sp_id = sp.sp_id
> AND sp.sp_abpfiff = true AND tsptcache.tspt_sort >= tspt2sp.tspt_sort AND
> sign((old.stip_heimtore - old.stip_gasttore)::double precision) =
> sign((sp.sp_heimtore - sp.sp_gasttore)::double precision) AND
> tsptcache.tc_cache <> -2

Oh, I should have thought of that: the bare DELETE operation doesn't
care what's in the tuple, but this ON DELETE rule sure does.  That's
why the delete crashes, it's trying to extract the field contents so
it can execute the rule.

> yes, they should both be "alteheide". Is it possible to open the file and just
> fix the bit?

Yeah, if you have a suitable hex editor.  You'll probably need to shut
down the postmaster first, as it may have a cached copy of the page.

> I have no clue why it happens. But I changed my schema a few months
> ago to use a materialized view (you can see all the rules in the
> schema above). I need a complicated ranking algorithm to calculate
> the materialized view. Everything is implemented inside PostgreSQL
> with rules and functions (plperl and plpgsql). Another factor may be
> temp tables: I use lots of them for one specific task (reusing the
> calculation algorithm mentioned above for a different data view).
> With lots of temp tables I got problems with pg_type, where some old
> temp entries linger, and I have to delete some of them manually a few
> times per month.

Hmm ... the one part of that that jumps out at me is plperl.  We already
know that plperl can screw up the locale settings; I wonder whether
there are other bugs.  Anyway, if you are using plperl I *strongly*
recommend updating to the latest PG release ASAP (8.0.6 in your case).
If you cannot, at least make sure the postmaster is launched with the
same LC_XXX settings in its environment as are embedded in the database.
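
One way to compare the two sides, as a sketch only: the database's values can be read with SHOW lc_collate and SHOW lc_ctype, and the environment side follows the POSIX precedence LC_ALL, then the category variable, then LANG. The locale names in the usage lines are hypothetical examples, and the comparison is a plain string match, so aliases of the same locale would be reported as different.

```python
import os


def effective_locale(category, env=None):
    """Resolve one locale category from environment variables.

    POSIX precedence: LC_ALL first, then the category variable itself
    (for example LC_COLLATE), then LANG; empty values count as unset.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


def locale_mismatches(db_settings, env=None):
    """Map each differing category to (database value, environment value)."""
    result = {}
    for category, expected in db_settings.items():
        actual = effective_locale(category, env)
        if actual != expected:
            result[category] = (expected, actual)
    return result


if __name__ == "__main__":
    # Hypothetical values; substitute the output of SHOW lc_collate / SHOW lc_ctype.
    database = {"LC_COLLATE": "de_DE.UTF-8", "LC_CTYPE": "de_DE.UTF-8"}
    print(locale_mismatches(database) or "environment matches the database")
```

As with the core-file limit, this has to run in the environment that actually launches the postmaster to be meaningful.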

            regards, tom lane

Re: invalid memory alloc request size

From: Janning Vygen
Date:
TOM! Ich will ein Kind von Dir!!
(it means 'something like': thank you so much. you just saved my life!)

On Monday, 23 January 2006 21:16, Tom Lane wrote:
> Janning Vygen <vygen@gmx.de> writes:
> >> OK, what's the schema of this table exactly?
> >
> > ...
> > Regeln:
> >     cache_stip_delete AS
> >     ON DELETE TO spieletipps DO  UPDATE tsptcache SET tc_cache = -2
> >     [...]
>
> Oh, I should have thought of that: the bare DELETE operation doesn't
> care what's in the tuple, but this ON DELETE rule sure does.  That's
> why the delete crashes, it's trying to extract the field contents so
> it can execute the rule.

I dropped the rule and deleted the row successfully with the ctid. Thanks a
lot for the great support! This problem will be my first article in my
PostgreSQL Troubleshooting Guide for Dummies. "We" really need it for guys
like me.

> > yes, they should both be "alteheide". Is it possible to open the file and
> > just fix the bit?
>
> Yeah, if you have a suitable hex editor.  You'll probably need to shut
> down the postmaster first, as it may have a cached copy of the page.

I decided not to poke around in PostgreSQL's internal file storage.

> > I have no clue why it happens. But I changed my schema a few months
> > ago to use a materialized view (you can see all the rules in the
> > schema above). I need a complicated ranking algorithm to calculate
> > the materialized view. Everything is implemented inside PostgreSQL
> > with rules and functions (plperl and plpgsql). Another factor may be
> > temp tables: I use lots of them for one specific task (reusing the
> > calculation algorithm mentioned above for a different data view).
> > With lots of temp tables I got problems with pg_type, where some old
> > temp entries linger, and I have to delete some of them manually a few
> > times per month.
>
> Hmm ... the one part of that that jumps out at me is plperl.  We already
> know that plperl can screw up the locale settings; I wonder whether
> there are other bugs.  Anyway, if you are using plperl I *strongly*
> recommend updating to the latest PG release ASAP (8.0.6 in your case).

ok, shouldn't i upgrade to 8.1 instead of 8.0.6 if i can?

> If you cannot, at least make sure the postmaster is launched with the
> same LC_XXX settings in its environment as are embedded in the database.

i will look at it!

kind regards
janning


Re: invalid memory alloc request size

From: Tom Lane
Date:
Janning Vygen <vygen@gmx.de> writes:
>> Hmm ... the one part of that that jumps out at me is plperl.  We already
>> know that plperl can screw up the locale settings; I wonder whether
>> there are other bugs.  Anyway, if you are using plperl I *strongly*
>> recommend updating to the latest PG release ASAP (8.0.6 in your case).

> ok, shouldn't i upgrade to 8.1 instead of 8.0.6 if i can?

Up to you --- you have more risk of compatibility issues if you do that,
whereas within-branch updates are supposed to be painless.  Depends
whether you have the time right now to deal with testing your applications
against 8.1.

            regards, tom lane

Re: invalid memory alloc request size

From: Janning Vygen
Date:
On Monday, 23 January 2006 21:57, Tom Lane wrote:
> Janning Vygen <vygen@gmx.de> writes:
> > ok, shouldn't i upgrade to 8.1 instead of 8.0.6 if i can?
>
> Up to you --- you have more risk of compatibility issues if you do that,
> whereas within-branch updates are supposed to be painless.  Depends
> whether you have the time right now to deal with testing your applications
> against 8.1.

ok, i will think about it.

one more question: You mentioned standard disk and memory checks. Can you
point to some link where i can find more about it or which software do you
mean? I guess i have to start looking at it.

kind regards,
janning


hardware checks (was Re: invalid memory alloc request size)

From: Tom Lane
Date:
Janning Vygen <vygen@gmx.de> writes:
> one more question: You mentioned standard disk and memory checks. Can you
> point to some link where i can find more about it or which software do you
> mean? I guess i have to start looking at it.

The stuff I've heard recommended is memtest86 for memory checks and
badblocks for disk checks.  But perhaps someone on the list has better
ideas.

            regards, tom lane

Re: hardware checks (was Re: invalid memory alloc request

From: Terry Fielder
Date:
I second Tom:

badblocks and memtest86 are what I use, and they work great on all kinds of
hardware.  You don't even need a specific OS for memtest86 because you
can make a bootable floppy and test any old piece of hardware it recognizes.

Terry


--
Terry Fielder
terry@greatgulfhomes.com
Associate Director Software Development and Deployment
Great Gulf Homes / Ashton Woods Homes
Fax: (416) 441-9085

Re: hardware checks (was Re: invalid memory alloc request size)

From: Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Janning Vygen <vygen@gmx.de> writes:
> > one more question: You mentioned standard disk and memory checks. Can you
> > point to some link where i can find more about it or which software do you
> > mean? I guess i have to start looking at it.
>
> The stuff I've heard recommended is memtest86 for memory checks and
> badblocks for disk checks.  But perhaps someone on the list has better
> ideas.

I second memtest86, though even the author says memory errors can be tricksy
things. Sometimes a large compile finds memory errors that even memtest86
doesn't find (the symptom is gcc crashing).

However I fear using badblocks alone is pretty useless these days. Modern IDE
drives detect bad blocks and remap them to other locations. If you just use
badblocks you'll see mysterious errors that disappear or might not see any
errors at all. You need to use tools like smartctl to query the drive's SMART
firmware about errors. It's not easy to interpret but if you watch the numbers
for a while you can tell if a drive is going bad and continually remapping bad
blocks. badblocks is useful still as a way of ensuring that every block is
read and written to, but then you have to look at the SMART data to see what
happened.

--
greg

Re: hardware checks (was Re: invalid memory alloc request

From: Bruce Momjian
Date:
Greg Stark wrote:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
> > Janning Vygen <vygen@gmx.de> writes:
> > > one more question: You mentioned standard disk and memory checks. Can you
> > > point to some link where i can find more about it or which software do you
> > > mean? I guess i have to start looking at it.
> >
> > The stuff I've heard recommended is memtest86 for memory checks and
> > badblocks for disk checks.  But perhaps someone on the list has better
> > ideas.
>
> I second memtest86, though even the author says memory errors can be tricksy
> things. Sometimes a large compile finds memory errors that even memtest86
> doesn't find (the symptom is gcc crashing).
>
> However I fear using badblocks alone is pretty useless these days. Modern IDE
> drives detect bad blocks and remap them to other locations. If you just use
> badblocks you'll see mysterious errors that disappear or might not see any
> errors at all. You need to use tools like smartctl to query the drive's SMART
> firmware about errors. It's not easy to interpret but if you watch the numbers
> for a while you can tell if a drive is going bad and continually remapping bad
> blocks. badblocks is useful still as a way of ensuring that every block is
> read and written to, but then you have to look at the SMART data to see what
> happened.

It is my experience that SCSI drive controllers will beep if they have a
bad block that can't be read cleanly.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073