Discussion: Hosed PostgreSQL Installation
As a result of some disk errors on another drive, an admin in our group brought down the server hosting our pgsql databases with a kill -KILL after having gone to runlevel 1 and finding the postmaster process still running. No surprise, our installation was hosed in the process. After talking on #postgresql with klamath for about an hour to work through the issue (many thanks!), it was suggested that I send the info to this list.

Currently, PostgreSQL will no longer start, and gives this error:

bash-2.05$ /usr/bin/pg_ctl -D $PGDATA -p /usr/bin/postmaster start
postmaster successfully started
bash-2.05$ DEBUG:  database system shutdown was interrupted at 2002-09-19 22:59:54 EDT
DEBUG:  open(logfile 0 seg 0) failed: No such file or directory
DEBUG:  Invalid primary checkPoint record
DEBUG:  open(logfile 0 seg 0) failed: No such file or directory
DEBUG:  Invalid secondary checkPoint record
FATAL 2:  Unable to locate a valid CheckPoint record
/usr/bin/postmaster: Startup proc 11735 exited with status 512 - abort

Our setup is vanilla Red Hat 7.2, with pretty much all of the postgresql-*-7.1.3-2 packages installed.
Klamath asked if I had disabled fsync in postgresql.conf; the only non-default (read: non-commented) setting in the file is `tcpip_socket = true`. Klamath then suggested that I run pg_controldata:

bash-2.05$ ./pg_controldata
pg_control version number:            71
Catalog version number:               200101061
Database state:                       SHUTDOWNING
pg_control last modified:             Thu Sep 19 22:59:54 2002
Current log file id:                  0
Next log file segment:                1
Latest checkpoint location:           0/1739A0
Prior checkpoint location:            0/1718F0
Latest checkpoint's REDO location:    0/1739A0
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        21
Latest checkpoint's NextXID:          615
Latest checkpoint's NextOID:          18720
Time of latest checkpoint:            Thu Sep 19 22:49:42 2002
Database block size:                  8192
Blocks per segment of large relation: 131072
LC_COLLATE:                           en_US
LC_CTYPE:                             en_US

If I look into the pg_xlog directory, I see this:

bash-2.05$ cd pg_xlog/
bash-2.05$ ls -l
total 32808
-rw-------    1 postgres postgres 16777216 Sep 20 23:13 0000000000000002
-rw-------    1 postgres postgres 16777216 Sep 19 22:09 000000020000007E

There is one caveat: the installation resides on a partition of its own:

/dev/hda3   17259308   6531140   9851424   40%   /var/lib/pgsql/data

fsck did not report errors for this partition at boot time after the forced shutdown, however.

This installation serves a university research project, and although most of the code / schemas are in development (and should be in CVS by rights), I can't confirm that all projects have indeed done that. So any advice, ideas or suggestions on how the data and / or schemas can be recovered would be greatly appreciated. Many thanks!

-- pete

P.S.: I've been using pgsql for about four years now, and it played a big role during my grad work. In fact, the availability of pgsql was one of the reasons I was able to complete and graduate. Many thanks for such a great database!

--
Pete St. Onge
Research Associate, Computational Biologist, UNIX Admin
Banting and Best Institute of Medical Research
Program in Bioinformatics and Proteomics
University of Toronto
http://www.utoronto.ca/emililab/
pete@seul.org
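An aside on reading those pg_xlog filenames: in 7.1-era PostgreSQL, the 16 hex digits of a segment name split into a log-file id (first 8 digits) and a segment number (last 8 digits), so 000000020000007E is log 2, segment 0x7E = 126. A small bash sketch of the decoding (the filename is the newest one from the listing above):

```shell
#!/bin/bash
# Decode a 7.1-era WAL segment filename: the first 8 hex digits are the
# log file id, the last 8 are the segment number within that log file.
seg=000000020000007E            # newest segment in the pg_xlog listing
log=$((16#${seg:0:8}))          # -> 2
segno=$((16#${seg:8:8}))        # -> 126 (0x7E)
echo "log file $log, segment $segno; first unused position is segment $((segno + 1))"
```

This is why a sane "next write position" for this installation has to lie beyond log 2, segment 126.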
"Pete St. Onge" <pete@seul.org> writes:
> As a result of some disk errors on another drive, an admin in our group
> brought down the server hosting our pgsql databases with a kill -KILL
> after having gone to runlevel 1 and finding the postmaster process still
> running. No surprise, our installation was hosed in the process.

That should not have been a catastrophic mistake in any version >= 7.1. I suspect you had disk problems or other problems.

> Klamath suggested that I run pg_controldata:
> ...
> Latest checkpoint's StartUpID:        21
> Latest checkpoint's NextXID:          615
> Latest checkpoint's NextOID:          18720

These numbers are suspiciously small for an installation that's been in production awhile. I suspect you have not told us the whole story; in particular I suspect you already tried "pg_resetxlog -f", which was probably not a good idea.

> If I look into the pg_xlog directory, I see this:
> -rw-------    1 postgres postgres 16777216 Sep 20 23:13 0000000000000002
> -rw-------    1 postgres postgres 16777216 Sep 19 22:09 000000020000007E

Yeah, your xlog positions should be a great deal higher than they are, if segment 2/7E was previously in use.

It is likely that you can recover (with some uncertainty about the integrity of recent transactions) if you proceed as follows:

1. Get contrib/pg_resetxlog/pg_resetxlog.c from the 7.2.2 release (you can't use 7.1's pg_resetxlog because it doesn't offer the switches you'll need). Compile it *against your 7.1 headers*.
It should compile, except you'll have to remove this change:

***************
*** 853,858 ****
--- 394,403 ----
  	page->xlp_magic = XLOG_PAGE_MAGIC;
  	page->xlp_info = 0;
  	page->xlp_sui = ControlFile.checkPointCopy.ThisStartUpID;
+ 	page->xlp_pageaddr.xlogid =
+ 		ControlFile.checkPointCopy.redo.xlogid;
+ 	page->xlp_pageaddr.xrecoff =
+ 		ControlFile.checkPointCopy.redo.xrecoff - SizeOfXLogPHD;
  	record = (XLogRecord *) ((char *) page + SizeOfXLogPHD);
  	record->xl_prev.xlogid = 0;
  	record->xl_prev.xrecoff = 0;

Test it using its -n switch to make sure it reports sane values.

2. Run the hacked-up pg_resetxlog like this:

	pg_resetxlog -l 2 127 -x 1000000000 $PGDATA

(The -l position is the next beyond what we see in pg_xlog, and the 1-billion XID is just a guess at something past where you were. Actually, can you give us the size of pg_log, ie, $PGDATA/global/1269? That would allow computing a correct next-XID to use. Figure 4 XIDs per byte; thus if pg_log is 1 million bytes you need -x at least 4 million.)

3. The postmaster should start now.

4. *Immediately* attempt to do a pg_dumpall. Do not pass GO, do not collect $200, do not let in any interactive clients until you've done it. (I'd suggest tweaking pg_hba.conf to disable all logins but your own.)

5. If pg_dumpall succeeds and produces sane-looking output, then you've survived. initdb, reload the dump file, re-open for business, go have a beer. (Recommended: install 7.2.2 and reload into that, not 7.1.*.) You will probably still need to check for partially-applied recent transactions, but for the most part you should be OK.

6. If pg_dumpall fails, then let us know what the symptoms are, and we'll see if we can figure out a workaround for whatever the corruption is.

			regards, tom lane
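The "4 XIDs per byte" arithmetic above is simple enough to script. A sketch, using 11870208 bytes as an example pg_log size (substitute the actual size of $PGDATA/global/1269 from ls -l):

```shell
#!/bin/bash
# Estimate the minimum -x (next transaction ID) for pg_resetxlog from the
# size of pg_log: each byte of pg_log records the status of 4 XIDs, so any
# XID below size*4 may already be in use and must not be handed out again.
pg_log_bytes=11870208                 # example value; from ls -l $PGDATA/global/1269
min_next_xid=$((pg_log_bytes * 4))
echo "pass -x of at least $min_next_xid (round up for safety)"
```

Rounding the result up a little costs nothing, since unused XIDs are simply skipped.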
Just following up on Tom Lane's email. A couple of things that I hadn't mentioned: after bringing up the machine, the first thing I did before mucking about with PostgreSQL was to tarball $PGDATA so that I'd have a second chance if I messed up. I then ran "pg_resetxlog -f" the first time, as Tom surmised, with the unwanted results. That done, I sent out the email, then followed Tom's instructions (yay backups!) and did it properly.

On Sat, Sep 21, 2002 at 11:13:44AM -0400, Tom Lane wrote:
> "Pete St. Onge" <pete@seul.org> writes:
>
> That should not have been a catastrophic mistake in any version >= 7.1.
> I suspect you had disk problems or other problems.

We did, but according to the logs these were on a different disk, AFAIK.

> These numbers are suspiciously small for an installation that's been
> in production awhile. I suspect you have not told us the whole story;
> in particular I suspect you already tried "pg_resetxlog -f", which was
> probably not a good idea.

*raises hand* Yep. Here's the contents of the pg_xlog directory (this is from the backup). PostgreSQL has only been used here for approximately 4 months of fairly light use, so perhaps the numbers aren't as strange as they could be:

-rw-------    1 postgres postgres 16777216 Sep 19 22:09 000000020000007E

> Yeah, your xlog positions should be a great deal higher than they are,
> if segment 2/7E was previously in use.
>
> It is likely that you can recover (with some uncertainty about integrity
> of recent transactions) if you proceed as follows:
>
> 1. Get contrib/pg_resetxlog/pg_resetxlog.c from the 7.2.2 release ...
[Chomp]

The compile worked without a hitch after doing ./configure in the top-level directory. I just downloaded the src for both trees, made the changes manually, copied the file into the 7.1.3 tree and compiled it there.

> 2. Run the hacked-up pg_resetxlog like this:
>
>	pg_resetxlog -l 2 127 -x 1000000000 $PGDATA
>
> (the -l position is next beyond what we see in pg_xlog, the 1-billion
> XID is just a guess at something past where you were. Actually, can
> you give us the size of pg_log, ie, $PGDATA/global/1269? That would
> allow computing a correct next-XID to use. Figure 4 XIDs per byte,
> thus if pg_log is 1 million bytes you need -x at least 4 million.)

-rw-------    1 postgres postgres 11870208 Sep 19 17:00 1269

This gives a minimum next-XID of 47480832. I used 47500000.

> 3. The postmaster should start now.

I had to use pg_resetxlog's force option, but yeah, it worked like you said it would.

> 4. *Immediately* attempt to do a pg_dumpall. Do not pass GO, do not
> collect $200, do not let in any interactive clients until you've done
> it. (I'd suggest tweaking pg_hba.conf to disable all logins but your
> own.)

I did not pass GO, I did not collect $200. I *did* do a pg_dumpall right there and then, and was able to dump everything I needed. One of the projects uses large objects - image files and HTML files (don't ask, I've already tried to dissuade the Powers-That-Be) - and these didn't come out. However, since this stuff is entered via script, the project leader was fine with re-running the scripts tomorrow.

> 5. If pg_dumpall succeeds and produces sane-looking output, then you've
> survived. initdb, reload the dump file, re-open for business, go have
> a beer. (Recommended: install 7.2.2 and reload into that, not 7.1.*.)
> You will probably still need to check for partially-applied recent
> transactions, but for the most part you should be OK.

rpm -Uvh'ed the 7.2.2 RPMs, initdb'd and reloaded data into the new installation. Pretty painless. I've just sent out an email to folks here to let them know the situation, and we should know in the next day or so what is up.

> 6. If pg_dumpall fails then let us know what the symptoms are, and we'll
> see if we can figure out a workaround for whatever the corruption is.

I've kept the tarball with the corrupted data. I'll hold onto it for a bit, just in case, but will likely expunge it in the next week or so. If it can have a use for the project (whatever that may be), let me know and I can burn it to DVD.

Of course, without your help, Tom, there would be a lot of Very Unhappy People here, me being only one of them. Many thanks for your help and advice!

Cheers,

Pete

--
Pete St. Onge
Research Associate, Computational Biologist, UNIX Admin
Banting and Best Institute of Medical Research
Program in Bioinformatics and Proteomics
University of Toronto
http://www.utoronto.ca/emililab/
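For anyone retracing this recovery later: the "disable all logins but your own" tweak from step 4 might look roughly like this in a 7.1/7.2-era pg_hba.conf. This is a sketch only - it assumes the dump is taken over the local Unix socket as the postgres account, and the exact column layout is version-specific:

```
# pg_hba.conf during the rescue dump (sketch; entries are matched top to bottom)

# Allow local (Unix-socket) connections, i.e. the admin running the dump.
local   all                                     trust

# Refuse every TCP/IP connection while pg_dumpall runs.
host    all     0.0.0.0     0.0.0.0             reject
```

After signalling the postmaster with SIGHUP to re-read the file, the dump itself is just `pg_dumpall > everything.sql`, run as the postgres user; revert pg_hba.conf once the dump is verified.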