Discussion: Database Recovery Procedures
It looks like, for the first time in six years, I'm experiencing some database table corruption. This was due to the space filling up on a server (you don't want to know how that happened). I have 3 corrupt tables; the others are fine (which I dumped to be safe). I have a backup I could use, but then I realized that maybe there might be some "surgery" I could perform to get the tables repaired. Note that the normal recovery the database does on its own did not work in this case.

I looked through the documentation (Admin 7.3.2) and I thought there was a disaster recovery section, but "recovery" is only discussed as part of backup/restore. If this information is out there somewhere else, a link would be a great help as well.

My question/statement is that I think this is something that is important to have: at least a discussion of the different strategies one could try to surgically recover data BEFORE using the broadsword method of going to a backup. One of the successful selling points I use with my clients is how resilient Linux/Unix filesystems are, as well as Pg on Linux. In this case, though, I don't have FS corruption, so I'd like to know what I should and could do here. Suggestions?

Oh, and here is the output of a "select *" on one of the corrupt tables... (saved as draft email here on 9/12/03)

...Ok, I was going to paste that into the email, but now the database isn't coming up at all.
Here is the startup message:

~~~
DEBUG:  FindExec: found "/usr/local/pgsql/bin/postgres" using argv[0]
DEBUG:  invoking IpcMemoryCreate(size=1466368)
DEBUG:  FindExec: found "/usr/local/pgsql/bin/postmaster" using argv[0]
LOG:  database system shutdown was interrupted at 2003-09-16 15:11:36 EDT
LOG:  checkpoint record is at 5/2D497FC0
LOG:  redo record is at 5/2D497FC0; undo record is at 0/0; shutdown TRUE
LOG:  next transaction id: 5287090; next oid: 26471
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  ReadRecord: unexpected pageaddr 5/27498000 in log file 5, segment 45, offset 4816896
LOG:  redo is not required
PANIC:  XLogWrite: write request 5/2D498000 is past end of log 5/2D498000
DEBUG:  reaping dead processes
LOG:  startup process (pid 17031) was terminated by signal 6
LOG:  aborting startup due to startup process failure
DEBUG:  proc_exit(1)
DEBUG:  shmem_exit(1)
DEBUG:  exit(1)
~~~

Thanks in advance to all

--
Keith C. Perry
Director of Networks & Applications
VCSN, Inc.
http://vcsn.com

____________________________________
This email account is being hosted by:
VCSN, Inc : http://vcsn.com
Network Administrator <netadmin@vcsn.com> writes:
> PANIC: XLogWrite: write request 5/2D498000 is past end of log 5/2D498000

I'll bet you are running 7.3.3. You need to update to 7.3.4 to escape this startup bug.

			regards, tom lane
Ahhh, well, it just so happens that I have 7.3.4 waiting to be built. I was trying to wait until 7.4 was released, but other than this startup bug, does 7.3.4 do a better job of recovery?

*pause* Ok, wait- at this point I have to go to 7.3.4 anyway, because I wouldn't be able to go to 7.4 without the dump/restore process we've been talking about first... so, I'll do that, and when/if I have a problem, I'll repost :)

Quoting Tom Lane <tgl@sss.pgh.pa.us>:

> Network Administrator <netadmin@vcsn.com> writes:
> > PANIC: XLogWrite: write request 5/2D498000 is past end of log 5/2D498000
>
> I'll bet you are running 7.3.3. You need to update to 7.3.4 to escape
> this startup bug.
>
> 			regards, tom lane
Ok, the upgrade is done and the system is up, but when I try to hit the database through the app (a browser front end that just does selects), on the debug screen (level 3) I get this...

~~~
DEBUG:  child process (pid 21248) was terminated by signal 6
LOG:  server process (pid 21248) was terminated by signal 6
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing shared memory and semaphores
DEBUG:  shmem_exit(0)
DEBUG:  invoking IpcMemoryCreate(size=1466368)
LOG:  database system was interrupted at 2003-09-17 10:52:16 EDT
LOG:  checkpoint record is at 5/2D498110
LOG:  redo record is at 5/2D498110; undo record is at 0/0; shutdown TRUE
LOG:  next transaction id: 5287090; next oid: 26471
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  ReadRecord: record with zero length at 5/2D498150
LOG:  redo is not required
LOG:  database system is ready
DEBUG:  proc_exit(0)
DEBUG:  shmem_exit(0)
DEBUG:  exit(0)
~~~

...and if I use the psql client and issue a "select * from <corrupt table name> limit 5", I get this...

~~~
PANIC:  read of clog file 5, offset 16384 failed: Success
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!#
~~~

...I've never seen the prompt go to "!#" before.

So, back to my original question: what are the recovery procedures (if any) that should be tried before I grab my PGDATA path from tape?

Quoting Tom Lane <tgl@sss.pgh.pa.us>:

> Network Administrator <netadmin@vcsn.com> writes:
> > PANIC: XLogWrite: write request 5/2D498000 is past end of log 5/2D498000
>
> I'll bet you are running 7.3.3. You need to update to 7.3.4 to escape
> this startup bug.
>
> 			regards, tom lane
Network Administrator <netadmin@vcsn.com> writes:
> ...and if I use the psql client and issue a "select * from <corrupt table name>
> limit 5", I get this...

> PANIC: read of clog file 5, offset 16384 failed: Success

Hm, not good :-(. What files actually exist in $PGDATA/pg_clog/ (names and sizes)?

> So, back to my original question: what are the recovery procedures (if
> any) that should be tried before I grab my PGDATA path from tape?

You may be up against having to do that, but some investigation first seems called for.

			regards, tom lane
Quoting Tom Lane <tgl@sss.pgh.pa.us>:

> Hm, not good :-(. What files actually exist in $PGDATA/pg_clog/ (names
> and sizes)?

Here you go...

~~~
-rw-------    1 postgres users      262144 Jul 20 15:53 0000
-rw-------    1 postgres users      262144 Jul 31 12:57 0001
-rw-------    1 postgres users      262144 Aug 12 17:32 0002
-rw-------    1 postgres users      262144 Aug 26 00:15 0003
-rw-------    1 postgres users      262144 Sep  9 23:44 0004
-rw-------    1 postgres users       16384 Sep 10 21:21 0005
~~~

> > So, back to my original question: what are the recovery procedures (if
> > any) that should be tried before I grab my PGDATA path from tape?
>
> You may be up against having to do that, but some investigation first
> seems called for.

Yep, it's ready to go, when and if nothing surgical can be done.
Network Administrator <netadmin@vcsn.com> writes:
>>> PANIC: read of clog file 5, offset 16384 failed: Success

>> Hm, not good :-(. What files actually exist in $PGDATA/pg_clog/ (names
>> and sizes)?

> -rw------- 1 postgres users 262144 Jul 20 15:53 0000
> -rw------- 1 postgres users 262144 Jul 31 12:57 0001
> -rw------- 1 postgres users 262144 Aug 12 17:32 0002
> -rw------- 1 postgres users 262144 Aug 26 00:15 0003
> -rw------- 1 postgres users 262144 Sep  9 23:44 0004
> -rw------- 1 postgres users  16384 Sep 10 21:21 0005

Okay, it's trying to read off the end of the clog, no doubt looking for a transaction number just slightly larger than what's known to clog. This probably indicates more serious problems (because WAL replay really should have prevented such an inconsistency), but you can get past the immediate panic relatively easily: just append an 8k page of zeroes to clog. Assuming your system has /dev/zero, something like this should do it:

~~~
dd bs=8k count=1 < /dev/zero >> $PGDATA/pg_clog/0005
~~~

(do this with care of course, and you should probably shut down the postmaster first). You might possibly have to add more than one page, if you then get similar PANICs with larger offsets, but try one page for starters.

If this does suppress the failure messages, you are still not really out of the woods; you should do what you can to check for data consistency. A paranoid person would probably take a complete pg_dump and try to diff it against the last known good dump. At the very least, I'd treat the table involved in the problem with great suspicion.

			regards, tom lane
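[Editor's note: to make the size arithmetic in Tom's advice concrete, here is a sketch of that padding step rehearsed on a throwaway file. The /tmp path is made up for illustration; on the real system you would stop the postmaster first and append to $PGDATA/pg_clog/0005 itself.]

~~~shell
# Rehearsal of the clog-padding step on a stand-in file (hypothetical path).
clog=/tmp/pg_clog_0005_demo

# Stand-in for the 16384-byte 0005 segment (two 8k pages):
dd if=/dev/zero of="$clog" bs=8k count=2 2>/dev/null

# Tom's command: append one zeroed 8k page, extending clog by one page.
# A zeroed clog page reads as "transaction not yet committed" for every
# xid it covers, which is why this is a safe way to extend the file.
dd bs=8k count=1 < /dev/zero >> "$clog" 2>/dev/null

wc -c < "$clog"   # prints 24576: three 8k pages
~~~

If the server then PANICs again with a larger offset (24576), the same append can be repeated, as Tom notes.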
Quoting Tom Lane <tgl@sss.pgh.pa.us>:

> Okay, it's trying to read off the end of the clog, no doubt looking for
> a transaction number just slightly larger than what's known to clog.
> This probably indicates more serious problems (because WAL replay really
> should have prevented such an inconsistency), but you can get past the
> immediate panic relatively easily: just append an 8k page of zeroes to
> clog. Assuming your system has /dev/zero, something like this should
> do it:
>
>    dd bs=8k count=1 < /dev/zero >> $PGDATA/pg_clog/0005
>
> (do this with care of course, and you should probably shut down the
> postmaster first). You might possibly have to add more than one page,
> if you then get similar PANICs with larger offsets, but try one page
> for starters.

Well, whatdoyaknow! That did it- EVERYTHING is there! I only needed the one page.

I'm going to have to read up on pg_clog (and the WAL) so that I understand what it does a little better. It makes total sense, too, because obviously if there is no more space to write to, anything on disk gets "frozen", whereas the application is just going to keep moving along. At least for a time.

> If this does suppress the failure messages, you are still not really out
> of the woods; you should do what you can to check for data consistency.
> A paranoid person would probably take a complete pg_dump and try to diff
> it against the last known good dump.
> At the very least, I'd treat the
> table involved in the problem with great suspicion.
>
> 			regards, tom lane

Well, I fortunately was on the side of the road and not in the woods in this case. I'm assuming the next thing to do is run a vacuum analyze and then a dump-all. I'll see how things perform over the next 48 hours or so.

Now that we've done that, how should this surgery be documented? I would think a "tip" like this should be somewhere in the docs (not just the archives). I'd be more than willing to write up this case, but I'm trying to establish a long-term goal here- if I may be so bold as to suggest one.

Thanks again.
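[Editor's note: the follow-up steps discussed above (Tom's "paranoid diff" and Keith's vacuum-then-dump plan) can be sketched as below. The database name "mydb" and all file paths are hypothetical, and the vacuumdb/pg_dump lines need a running server, so they are shown as comments; the diff is illustrated with two tiny stand-in files.]

~~~shell
# Post-surgery sanity pass, as a hedged sketch (hypothetical names/paths).
# Against the live server you would run something like:
#
#   vacuumdb --analyze mydb                 # VACUUM ANALYZE the repaired db
#   pg_dump mydb > /tmp/post-repair.sql     # fresh dump for comparison
#   pg_dumpall > /tmp/post-repair-all.sql   # full-cluster dump once satisfied
#
# The "paranoid diff" against the last known-good dump, with stand-in files:
printf 'good row 1\ngood row 2\n' > /tmp/last-good.sql
printf 'good row 1\ngood row 2\n' > /tmp/post-repair.sql

if diff -q /tmp/last-good.sql /tmp/post-repair.sql >/dev/null; then
    echo "dumps match"
else
    echo "dumps differ; inspect with diff before trusting the table"
fi
~~~

A clean diff is only as good as the last-known-good dump, of course; rows written after that dump still deserve a close look, since zeroed clog pages mark the transactions they cover as uncommitted.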