Thread: Hosed PostGreSQL Installation


Hosed PostGreSQL Installation

From:
"Pete St. Onge"
Date:
As a result of some disk errors on another drive, an admin in our group
brought down the server hosting our pgsql databases with a kill -KILL
after having gone to runlevel 1 and finding the postmaster process still
running. No surprise, our installation was hosed in the process. 

After talking on #postgresql with klamath for about an hour or so to
work through the issue (many thanks!), it was suggested that I send
the info to this list.

Currently, PostgreSQL will no longer start, and gives this error:

bash-2.05$ /usr/bin/pg_ctl  -D $PGDATA -p /usr/bin/postmaster start
postmaster successfully started
bash-2.05$ DEBUG:  database system shutdown was interrupted at
2002-09-19 22:59:54 EDT
DEBUG:  open(logfile 0 seg 0) failed: No such file or directory
DEBUG:  Invalid primary checkPoint record
DEBUG:  open(logfile 0 seg 0) failed: No such file or directory
DEBUG:  Invalid secondary checkPoint record
FATAL 2:  Unable to locate a valid CheckPoint record
/usr/bin/postmaster: Startup proc 11735 exited with status 512 - abort


Our setup is vanilla Red Hat 7.2, having pretty much all of the
postgresql-*-7.1.3-2 packages installed. Klamath asked if I had disabled
fsync in postgresql.conf, and the only non-default (read: non-commented)
setting in the file is: `tcpip_socket = true`


Klamath suggested that I run pg_controldata:

bash-2.05$ ./pg_controldata 
pg_control version number:            71
Catalog version number:               200101061
Database state:                       SHUTDOWNING
pg_control last modified:             Thu Sep 19 22:59:54 2002
Current log file id:                  0
Next log file segment:                1
Latest checkpoint location:           0/1739A0
Prior checkpoint location:            0/1718F0
Latest checkpoint's REDO location:    0/1739A0
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        21
Latest checkpoint's NextXID:          615
Latest checkpoint's NextOID:          18720
Time of latest checkpoint:            Thu Sep 19 22:49:42 2002
Database block size:                  8192
Blocks per segment of large relation: 131072
LC_COLLATE:                           en_US
LC_CTYPE:                             en_US


If I look into the pg_xlog directory, I see this:

sh-2.05$ cd pg_xlog/
bash-2.05$ ls -l
total 32808
-rw-------    1 postgres postgres 16777216 Sep 20 23:13 0000000000000002
-rw-------    1 postgres postgres 16777216 Sep 19 22:09 000000020000007E


There is one caveat. The installation resides on a partition of its own:
/dev/hda3        17259308   6531140   9851424  40% /var/lib/pgsql/data

fsck did not report errors for this partition at boot time after the
forced shutdown, however.

This installation serves a university research project, and although
most of the code / schemas are in development (and should be in cvs by
rights), I can't confirm that all projects have indeed done that. So any
advice, ideas or suggestions on how the data and / or schemas can be
recovered would be greatly appreciated.

Many thanks!

-- pete

P.S.: I've been using pgsql for about four years now, and it played a
big role during my grad work. In fact, the availability of pgsql was one
of the reasons why I was able to complete and graduate. Many thanks for
such a great database!


-- 
Pete St. Onge
Research Associate, Computational Biologist, UNIX Admin
Banting and Best Institute of Medical Research
Program in Bioinformatics and Proteomics
University of Toronto
http://www.utoronto.ca/emililab/       pete@seul.org


Re: Hosed PostGreSQL Installation

From:
Tom Lane
Date:
"Pete St. Onge" <pete@seul.org> writes:
> As a result of some disk errors on another drive, an admin in our group
> brought down the server hosting our pgsql databases with a kill -KILL
> after having gone to runlevel 1 and finding the postmaster process still
> running. No surprise, our installation was hosed in the process. 

That should not have been a catastrophic mistake in any version >= 7.1.
I suspect you had disk problems or other problems.

> Klamath suggested that I run pg_controldata:

> ...
> Latest checkpoint's StartUpID:        21
> Latest checkpoint's NextXID:          615
> Latest checkpoint's NextOID:          18720

These numbers are suspiciously small for an installation that's been
in production awhile.  I suspect you have not told us the whole story;
in particular I suspect you already tried "pg_resetxlog -f", which was
probably not a good idea.

> If I look into the pg_xlog directory, I see this:

> -rw-------    1 postgres postgres 16777216 Sep 20 23:13 0000000000000002
> -rw-------    1 postgres postgres 16777216 Sep 19 22:09 000000020000007E

Yeah, your xlog positions should be a great deal higher than they are,
if segment 2/7E was previously in use.

It is likely that you can recover (with some uncertainty about integrity
of recent transactions) if you proceed as follows:

1. Get contrib/pg_resetxlog/pg_resetxlog.c from the 7.2.2 release (you
can't use 7.1's pg_resetxlog because it doesn't offer the switches
you'll need).  Compile it *against your 7.1 headers*.  It should compile
except you'll have to remove this change:

***************
*** 853,858 ****
--- 394,403 ----
      page->xlp_magic = XLOG_PAGE_MAGIC;
      page->xlp_info = 0;
      page->xlp_sui = ControlFile.checkPointCopy.ThisStartUpID;
+     page->xlp_pageaddr.xlogid =
+         ControlFile.checkPointCopy.redo.xlogid;
+     page->xlp_pageaddr.xrecoff =
+         ControlFile.checkPointCopy.redo.xrecoff - SizeOfXLogPHD;
      record = (XLogRecord *) ((char *) page + SizeOfXLogPHD);
      record->xl_prev.xlogid = 0;
      record->xl_prev.xrecoff = 0;

Test it using its -n switch to make sure it reports sane values.
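
For instance, a dry run might look like this (just a sketch; where the
freshly built binary ends up is an assumption, and -n only reports the
values it would write without changing anything):

bash-2.05$ ./pg_resetxlog -n $PGDATA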

2. Run the hacked-up pg_resetxlog like this:
pg_resetxlog -l 2 127 -x 1000000000 $PGDATA

(the -l position is next beyond what we see in pg_xlog, the 1-billion
XID is just a guess at something past where you were.  Actually, can
you give us the size of pg_log, ie, $PGDATA/global/1269?  That would
allow computing a correct next-XID to use.  Figure 4 XIDs per byte,
thus if pg_log is 1 million bytes you need -x at least 4 million.)
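
For concreteness, here is that arithmetic as shell one-liners (the 1 MB
pg_log size is only the illustrative figure from the sentence above, not
your cluster's actual size):

bash-2.05$ echo $((0x7E))        # newest segment file ...0000007E = log 2, segment 126
126
bash-2.05$ expr 126 + 1          # so "-l 2 127" points just past it
127
bash-2.05$ expr 1000000 \* 4     # 4 XIDs per byte of pg_log => -x of at least 4000000
4000000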

3. The postmaster should start now.

4. *Immediately* attempt to do a pg_dumpall.  Do not pass GO, do not
collect $200, do not let in any interactive clients until you've done
it. (I'd suggest tweaking pg_hba.conf to disable all logins but your
own.)
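
One hedged way to do that last bit (7.1's pg_hba.conf has no per-user
column, so "nobody but my own login" roughly becomes "Unix socket only,
no TCP/IP"; the exact lines below are a sketch, and you'll want the
original file back afterwards):

bash-2.05$ cp $PGDATA/pg_hba.conf $PGDATA/pg_hba.conf.orig
bash-2.05$ cat > $PGDATA/pg_hba.conf <<'EOF'
local   all                                     trust
host    all     0.0.0.0         0.0.0.0         reject
EOF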

5. If pg_dumpall succeeds and produces sane-looking output, then you've
survived.  initdb, reload the dump file, re-open for business, go have
a beer.  (Recommended: install 7.2.2 and reload into that, not 7.1.*.)
You will probably still need to check for partially-applied recent
transactions, but for the most part you should be OK.
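
In shell terms the whole cycle is roughly this (file names and the new
data directory are placeholders, not gospel):

bash-2.05$ pg_dumpall > /tmp/everything.sql          # step 4: dump while nobody else can connect
bash-2.05$ pg_ctl -D $PGDATA stop                    # stop the damaged cluster
bash-2.05$ initdb -D /var/lib/pgsql/data.new         # fresh cluster, ideally under 7.2.2
bash-2.05$ pg_ctl -D /var/lib/pgsql/data.new start
bash-2.05$ psql -d template1 -f /tmp/everything.sql  # reload the dump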

6. If pg_dumpall fails then let us know what the symptoms are, and we'll
see if we can figure out a workaround for whatever the corruption is.
        regards, tom lane


Re: Hosed PostGreSQL Installation

From:
"Pete St. Onge"
Date:
Just following up on Tom Lane's email - 

A couple of things that I hadn't mentioned: After bringing up the
machine, the first thing I did before mucking about with PostgreSQL was
to tarball $PGDATA so that I'd have a second chance if I messed up. I
then ran pg_resetxlog -f the first time, as Tom surmised, with the
unwanted results.

That done, I sent out the email, and then followed Tom's instructions
(yay backups!) to do it properly.

On Sat, Sep 21, 2002 at 11:13:44AM -0400, Tom Lane wrote:
> "Pete St. Onge" <pete@seul.org> writes:
> 
> That should not have been a catastrophic mistake in any version >= 7.1.
> I suspect you had disk problems or other problems.

    We did, but these were on a different disk according to the logs, AFAIK.

> These numbers are suspiciously small for an installation that's been
> in production awhile.  I suspect you have not told us the whole story;
> in particular I suspect you already tried "pg_resetxlog -f", which was
> probably not a good idea.

    *raises hand* Yep.

Here's the contents of the pg_xlog directory. PostgreSQL has only seen
approximately four months of fairly light use here, so perhaps the
numbers aren't as strange as they could be (this is from the backup).

-rw-------    1 postgres postgres 16777216 Sep 19 22:09 000000020000007E


> Yeah, your xlog positions should be a great deal higher than they are,
> if segment 2/7E was previously in use.
> 
> It is likely that you can recover (with some uncertainty about integrity
> of recent transactions) if you proceed as follows:
> 
> 1. Get contrib/pg_resetxlog/pg_resetxlog.c from the 7.2.2 release ...
[Chomp]

The compile worked without a hitch after doing ./configure in the
top-level directory. I just downloaded the src for both trees, made the
changes manually, copied the file into the 7.1.3 tree and compiled it
there. 

> 2. Run the hacked-up pg_resetxlog like this:
> 
>     pg_resetxlog -l 2 127 -x 1000000000 $PGDATA
> 
> (the -l position is next beyond what we see in pg_xlog, the 1-billion
> XID is just a guess at something past where you were.  Actually, can
> you give us the size of pg_log, ie, $PGDATA/global/1269?  That would
> allow computing a correct next-XID to use.  Figure 4 XIDs per byte,
> thus if pg_log is 1 million bytes you need -x at least 4 million.)
    -rw-------    1 postgres postgres 11870208 Sep 19 17:00 1269
    This gives a minimum next-XID of 47480832 (11870208 bytes x 4 XIDs
per byte); I used -x 47500000.


> 3. The postmaster should start now.

    I had to use pg_resetxlog's force option, but yeah, it worked like
you said it would.
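
Something like the following is what it amounted to (the exact command
line here is a reconstruction, not a paste from my shell history):

bash-2.05$ ./pg_resetxlog -f -l 2 127 -x 47500000 $PGDATA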

> 4. *Immediately* attempt to do a pg_dumpall.  Do not pass GO, do not
> collect $200, do not let in any interactive clients until you've done
> it. (I'd suggest tweaking pg_hba.conf to disable all logins but your
> own.)

    I did not pass go, I did not collect $200. I *did* do a pg_dumpall
right there and then, and was able to dump everything I needed. One
of the projects uses large objects - image files and html files (don't
ask, I've already tried to dissuade the Powers-That-Be) - and these
didn't come out. However, since this stuff is entered via script, the
project leader was fine with re-running the scripts tomorrow.


> 5. If pg_dumpall succeeds and produces sane-looking output, then you've
> survived.  initdb, reload the dump file, re-open for business, go have
> a beer.  (Recommended: install 7.2.2 and reload into that, not 7.1.*.)
> You will probably still need to check for partially-applied recent
> transactions, but for the most part you should be OK.

    rpm -Uvh'ed the 7.2.2 RPMs, initdb'd and reloaded data into the new
installation. Pretty painless. I've just sent out an email to folks here
to let them know the situation, and we should know in the next day or so
what is up.


> 6. If pg_dumpall fails then let us know what the symptoms are, and we'll
> see if we can figure out a workaround for whatever the corruption is.

    I've kept the tarball with the corrupted data. I'll hold onto it
for a bit, in case, but will likely expunge it in the next week or so.
If this can have a use for the project (whatever it may be), let me know
and I can burn it to DVD.
    Of course, without your help, Tom, there would be a lot of Very
Unhappy People here, me only being one of them. Many thanks for your
help and advice!
   Cheers,
   Pete 


-- 
Pete St. Onge
Research Associate, Computational Biologist, UNIX Admin
Banting and Best Institute of Medical Research
Program in Bioinformatics and Proteomics
University of Toronto
http://www.utoronto.ca/emililab/