Обсуждение: Bad records in table
Friends, I have a table with bad data in it. Whenever the backend hits the tupples with bad data I get: Jul 2 07:44:18 db postgres[2652]: [22] DEBUG: server process (pid 2869) was terminated by signal 11 Jul 2 07:44:18 db postgres[2652]: [23] DEBUG: terminating any other active server processes Jul 2 07:44:18 db postgres[2652]: [24] DEBUG: all server processes terminated; reinitializing shared memory and semaphores Jul 2 07:44:18 db postgres[2871]: [25] FATAL 1: The database system is starting up Jul 2 07:44:18 db postgres[2870]: [25] DEBUG: database system was interrupted at 2002-07-02 07:42:30 PDT Jul 2 07:44:18 db postgres[2870]: [26] DEBUG: checkpoint record is at 1/94E54768 Jul 2 07:44:18 db postgres[2870]: [27] DEBUG: redo record is at 1/94E54768; undo record is at 0/0; shutdown TRUE Jul 2 07:44:18 db postgres[2870]: [28] DEBUG: next transaction id: 55508992; next oid: 13249044 Jul 2 07:44:18 db postgres[2870]: [29] DEBUG: database system was not properly shut down; automatic recovery in progress Jul 2 07:44:18 db postgres[2870]: [30] DEBUG: redo starts at 1/94E547A8 Jul 2 07:44:18 db postgres[2870]: [31] DEBUG: ReadRecord: record with zero length at 1/94E547CC Jul 2 07:44:18 db postgres[2870]: [32] DEBUG: redo done at 1/94E547A8 Jul 2 07:44:20 db postgres[2870]: [33] DEBUG: database system is ready I can identify the bad records with a simple perl script that reads all the records sequentially and lets me know which ones crash the back end. Now that I know which ones they are, how can I get rid of them? Once I get this all fixed, what kinds of tests should I do on the machine to see if it's got some kind of hardware problem? It's RedHat Linux 2.4.9-31smp. Thanks for your help! Peter Darley
"Peter Darley" <pdarley@kinesis-cem.com> writes: > I can identify the bad records with a simple perl script that reads all the > records sequentially and lets me know which ones crash the back end. Now > that I know which ones they are, how can I get rid of them? DELETE? That might not work very well if the corruption is in toasted fields, but you haven't given any details that would let us know (not even such basic info as your PG version). But before doing that I'd suggest eyeballing the bad data using a tool such as pg_filedump (see http://sources.redhat.com/rhdb/tools.html). Look at the "ctid" column of the bad tuples to see what part of the file you need to dump (ctid is block number and item number). The pattern of the corruption might give some clue what happened. regards, tom lane
On Tue, Jul 02, 2002 at 08:15:01AM -0700, Peter Darley wrote: > I can identify the bad records with a simple perl script that reads all the > records sequentially and lets me know which ones crash the back end. Now > that I know which ones they are, how can I get rid of them? If you know which tuple it is that kills it, you can delete it. That should be sufficient. If you know the ctid or oid, they're good keys for this sort of thing. > Once I get this all fixed, what kinds of tests should I do on the machine > to see if it's got some kind of hardware problem? It's RedHat Linux > 2.4.9-31smp. I strongly recommend memtest86[1]. It runs from a floppy and is very good at picking up memory errors. After that check out your hard disk with whatever program your HDD manufacturer provides. But memory is the most likely. Just in case, open the case and check for dust buildup and loose cables. [1] http://www.memtest86.com/ HTH, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > There are 10 kinds of people in the world, those that can do binary > arithmetic and those that can't.
Tom, Sorry about that. It's 7.2.1 on RedHat 7.2. I wasn't able to delete; whenever the database touched those records (even to do a delete) the backend would crash. It turns out that my problems were much deeper than I thought. There was a bad dimm in the machine wreaking havoc. Thanks for the pointer to pg_filedump. Thanks, Peter Darley -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Tom Lane Sent: Tuesday, July 02, 2002 12:00 PM To: Peter Darley Cc: Pgsql-General Subject: Re: [GENERAL] Bad records in table "Peter Darley" <pdarley@kinesis-cem.com> writes: > I can identify the bad records with a simple perl script that reads all the > records sequentially and lets me know which ones crash the back end. Now > that I know which ones they are, how can I get rid of them? DELETE? That might not work very well if the corruption is in toasted fields, but you haven't given any details that would let us know (not even such basic info as your PG version). But before doing that I'd suggest eyeballing the bad data using a tool such as pg_filedump (see http://sources.redhat.com/rhdb/tools.html). Look at the "ctid" column of the bad tuples to see what part of the file you need to dump (ctid is block number and item number). The pattern of the corruption might give some clue what happened. regards, tom lane ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to majordomo@postgresql.org so that your message can get through to the mailing list cleanly
Martijn, What I didn't think to mention in my original post is that I was unable to delete the rows because the backend would crash. It turned that the table was way beyond help (at least by someone at my skill level), because there was some bad memory in the machine. Thanks for the pointer to memtest86. It did the trick. Thanks, Peter Darley -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org]On Behalf Of Martijn van Oosterhout Sent: Tuesday, July 02, 2002 5:58 PM To: Peter Darley Cc: Pgsql-General Subject: Re: [GENERAL] Bad records in table On Tue, Jul 02, 2002 at 08:15:01AM -0700, Peter Darley wrote: > I can identify the bad records with a simple perl script that reads all the > records sequentially and lets me know which ones crash the back end. Now > that I know which ones they are, how can I get rid of them? If you know which tuple it is that kills it, you can delete it. That should be sufficient. If you know the ctid or oid, they're good keys for this sort of thing. > Once I get this all fixed, what kinds of tests should I do on the machine > to see if it's got some kind of hardware problem? It's RedHat Linux > 2.4.9-31smp. I strongly recommend memtest86[1]. It runs from a floppy and is very good at picking up memory errors. After that check out your hard disk with whatever program your HDD manufacturer provides. But memory is the most likely. Just in case, open the case and check for dust buildup and loose cables. [1] http://www.memtest86.com/ HTH, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > There are 10 kinds of people in the world, those that can do binary > arithmetic and those that can't. ---------------------------(end of broadcast)--------------------------- TIP 2: you can get off all lists at once with the unregister command (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)