Обсуждение: How to recover when can't start database

Поиск
Список
Период
Сортировка

How to recover when can't start database

От
"L.Boldareva"
Дата:
Hi!
(Hope this is the right place to post)

I crashed the postmaster and cannot start it anymore, with the error

 LOG:  database system was interrupted while in recovery at 2005-04-01 11:04:33 CEST
HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
LOG:  checkpoint record is at 5/6F00C540
LOG:  redo record is at 5/6F000ABC; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 599824; next OID: 147679259
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 5/6F000ABC
PANIC:  btree_split_redo: lost left sibling
LOG:  startup process (PID 5603) was terminated by signal 6
LOG:  aborting startup due to startup process failure

Is there a way to recover from that?

I don't have a fresh backup, but loosing some couple of days won't be a
problem.

I use PG 8.0 on a linux box, with standard postgresq.conf (except some
increased memory settings).

thanks,
L.B.

Re: How to recover when can't start database

От
Tom Lane
Дата:
"L.Boldareva" <pg@pierro.dds.nl> writes:
> LOG:  database system was not properly shut down; automatic recovery in
> progress
> LOG:  redo starts at 5/6F000ABC
> PANIC:  btree_split_redo: lost left sibling
> LOG:  startup process (PID 5603) was terminated by signal 6

Hmm.  AFAICS that could only happen if a page split record is pointing
at an "original" page that's not there anymore; that is, the page is
past what the kernel says is the end of the file.  Exactly how did you
get into this state ... was there a system-level crash involved?  ISTM
this cannot happen unless the filesystem has dropped data.

You could probably get it to start by changing the "false" to "true"
in this call of XLogReadBuffer

    /* Left (original) sibling */
    buffer = XLogReadBuffer(false, reln, leftsib);
    if (!BufferIsValid(buffer))
        elog(PANIC, "btree_split_%s: lost left sibling", op);

in src/backend/access/nbtree/nbtxlog.c (it's line 261 in CVS tip,
possibly a little different in 8.0).  Let us know if that helps.

I'd be a bit suspicious of the contents of the index, if not the
whole database, so an immediate dump,reinitdb,reload might be your
most prudent course of action after you get it to start.

Plan B would be to wipe out the WAL log with pg_resetxlog.  This will
allow you to start but the odds of having corrupt data afterwards would
be about 100% ... you *must* dump and reload if you go that way.

            regards, tom lane

Re: How to recover when can't start database

От
Scott Marlowe
Дата:
On Fri, 2005-04-01 at 04:02, L.Boldareva wrote:
> Hi!
> (Hope this is the right place to post)
>
> I crashed the postmaster and cannot start it anymore, with the error

In addition to what everyone's posted already, I would suggest you spend
some time figuring out what got you here in the first place.  Bad
hardware, misconfigured postgresql.conf (specifically fsync=false) or
other possibilities.

Under normal circumstances, a crashing postmaster should not cause these
problems.

Re: How to recover when can't start database

От
"L.Boldareva"
Дата:
I have kind of fixed the problem (hopefully)

I turned out I used plan B, dump/reload will be my next step then.
What is meant by the "corrupt data", is this about the data or the things
around it, like indexes, system tables?

Here is how I got it crashed:
I compiled a c-procedure and copied the .so file to its place exactly at a
time when (quite unfortunately) another query was running, that used that
library.

I usually do pg_ctl reload right after that, and it seems to be enough,
but not this time.

The c-function contained code that would send a query to populate a table
(most likely one of the 2 bad ones), but I am not sure this matters since
there were 2 tables out of order, and
only one at a time is touched by my script.

That's it. I just checked, if that matters, that fsync = true on the
config file.


On Fri, 1 Apr 2005, Tom Lane wrote:

> Hmm.  AFAICS that could only happen if a page split record is pointing
> at an "original" page that's not there anymore; that is, the page is
> past what the kernel says is the end of the file.

Something like that was mentioned  in the WARNING message when I tried to
drop the table, but I got  that warning  only once, further actions just
raised the error about relid.

Thank you for your help,
LB

> You could probably get it to start by changing the "false" to "true"
> in this call of XLogReadBuffer
>
>     /* Left (original) sibling */
>     buffer = XLogReadBuffer(false, reln, leftsib);
>     if (!BufferIsValid(buffer))
>         elog(PANIC, "btree_split_%s: lost left sibling", op);
>
> in src/backend/access/nbtree/nbtxlog.c (it's line 261 in CVS tip,
> possibly a little different in 8.0).  Let us know if that helps.
>
> I'd be a bit suspicious of the contents of the index, if not the
> whole database, so an immediate dump,reinitdb,reload might be your
> most prudent course of action after you get it to start.
>

> Plan B would be to wipe out the WAL log with pg_resetxlog.  This will
> allow you to start but the odds of having corrupt data afterwards would
> be about 100% ... you *must* dump and reload if you go that way.
>
>             regards, tom lane
>

Re: How to recover when can't start database

От
Tom Lane
Дата:
"L.Boldareva" <pg@pierro.dds.nl> writes:
> Here is how I got it crashed:
> I compiled a c-procedure and copied the .so file to its place exactly at a
> time when (quite unfortunately) another query was running, that used that
> library.

Hmm, at worst that should only crash one backend, not result in bad data
getting into the WAL log.  If there was no filesystem crash then I have
to suppose that the page number entered in the WAL page split record was
wrong ... but how could that happen if the same page number had just
been used successfully to read the page?

            regards, tom lane

Re: How to recover when can't start database

От
"L.Boldareva"
Дата:
On Fri, 1 Apr 2005, Tom Lane wrote:

> "L.Boldareva" <pg@pierro.dds.nl> writes:
> > Here is how I got it crashed:
> > I compiled a c-procedure and copied the .so file to its place exactly at a
> > time when (quite unfortunately) another query was running, that used that
> > library.
>

> Hmm, at worst that should only crash one backend, not result in bad data
> getting into the WAL log.  If there was no filesystem crash then I have
> to suppose that the page number entered in the WAL page split record was
> wrong ... but how could that happen if the same page number had just
> been used successfully to read the page?
>
>             regards, tom lane

I have no idea
Tell me where to look, I'll have a look in the tar. I'd
rather not reproduce the thing, but if someone is interested, I can try
that, too (later).

regards,
LB

>

Re: How to recover when can't start database

От
"Colin E. Freas"
Дата:
Sorry if this post winds up as a duplicate on the list...

---

In an effort to disable all of the foreign key restraints on a set of
tables in my installation of PostgreSQL 8.0.1, I was fiddling with the
pg_class table and managed to elicit behavior similar to what L.
Boldareva ran into.

(This is Red Hat AS 4; Opteron w/4GB...)

pg_ctl can start and stop my postmaster, but I can't use psql to access
a database.  When I try, I get this on the command line:
---
psql: server closed the connection unexpectedly
      This probably means the server terminated abnormally
      before or while processing the request.
---
...and this in the log:
---
2005-06-07 16:44:37 EDTLOG:  connection received: host=[local] port=
2005-06-07 16:44:37 EDTLOG:  connection authorized: user=postgres
database=caarray
2005-06-07 16:44:43 EDTLOG:  server process (PID 14278) was terminated
by signal 11
2005-06-07 16:44:43 EDTLOG:  terminating any other active server processes
2005-06-07 16:44:43 EDTLOG:  all server processes terminated; reinitializing
2005-06-07 16:44:43 EDTLOG:  database system was interrupted at
2005-06-07 16:34:46 EDT
2005-06-07 16:44:43 EDTLOG:  checkpoint record is at 0/81F18770
2005-06-07 16:44:43 EDTLOG:  redo record is at 0/81F18770; undo record
is at 0/0; shutdown TRUE
2005-06-07 16:44:43 EDTLOG:  next transaction ID: 6196113; next OID: 6226772
2005-06-07 16:44:43 EDTLOG:  database system was not properly shut down;
automatic recovery in progress
2005-06-07 16:44:43 EDTLOG:  record with zero length at 0/81F187AC
2005-06-07 16:44:43 EDTLOG:  redo is not required
2005-06-07 16:44:43 EDTLOG:  database system is ready
---
I think I know exactly what caused this.  Postmaster seemed to crash
after I ran these two updates on pg_class:
---
2005-06-07 16:02:38 EDTLOG:  statement: update pg_class set
reltriggers=foo.c from (select relname,count(tgrelid) as c from
pg_class,pg_trigger where pg_class.oid=tgrelid and relnamespace=2200
group by relname) foo;
2005-06-07 16:03:21 EDTLOG:  statement: update pg_class set
reltriggers=foo.c from (select relname,count(tgrelid) as c from
pg_class,pg_trigger where pg_class.oid=tgrelid and relnamespace=2200
group by relname) foo where pg_class.relnamespace=2200;
2005-06-07 16:03:22 EDTLOG:  connection received: host=[local] port=
2005-06-07 16:03:22 EDTLOG:  connection authorized: user=postgres
database=caarray
2005-06-07 16:03:22 EDTLOG:  server process (PID 12845) was terminated
by signal 11
2005-06-07 16:03:22 EDTLOG:  terminating any other active server processes
---

If I can't ever get this database back up and running, it's no big
deal...  this isn't something in production, and I wouldn't do this on a
production machine.  However, it would save me enough time to get this
working that I'm trying this list.  The only possible saving grace I
seem to have is that immediately before I ran the updates, I made a copy
of the pg_class table.  So, if the only problem is the pg_class table,
then I'm thinking I could just copy the valid data back in, and be more
careful next time.  But, right now, I can't even get the server to talk
to me.

Also, if there's an easier way to disable and re-enable the foreign key
constraints, I'd certainly appreciate the tip as well.

Any help appreciated.

Colin Freas






Re: How to recover when can't start database

От
Tom Lane
Дата:
"Colin E. Freas" <cef6@georgetown.edu> writes:
> 2005-06-07 16:02:38 EDTLOG:  statement: update pg_class set
> reltriggers=foo.c from (select relname,count(tgrelid) as c from
> pg_class,pg_trigger where pg_class.oid=tgrelid and relnamespace=2200
> group by relname) foo;

I'm afraid that database is toast :-(.  You managed to overwrite
reltriggers in every row of pg_class, including all the system
catalogs, including some that absolutely positively cannot have
triggers ... like pg_trigger for instance.

-> try to open pg_class
   ... hmm, it says it has triggers
   -> try to open pg_trigger to read triggers
      ... hmm, it says it has triggers
      -> try to open pg_trigger to read triggers
         ... hmm, it says it has triggers
         -> try to open pg_trigger to read triggers

... recurse until out of stack space.

            regards, tom lane

Re: How to recover when can't start database

От
Alvaro Herrera
Дата:
On Fri, Jun 10, 2005 at 12:34:15PM -0400, Tom Lane wrote:
> "Colin E. Freas" <cef6@georgetown.edu> writes:
> > 2005-06-07 16:02:38 EDTLOG:  statement: update pg_class set
> > reltriggers=foo.c from (select relname,count(tgrelid) as c from
> > pg_class,pg_trigger where pg_class.oid=tgrelid and relnamespace=2200
> > group by relname) foo;
>
> I'm afraid that database is toast :-(.  You managed to overwrite
> reltriggers in every row of pg_class, including all the system
> catalogs, including some that absolutely positively cannot have
> triggers ... like pg_trigger for instance.
>
> -> try to open pg_class
>    ... hmm, it says it has triggers
>    -> try to open pg_trigger to read triggers
>       ... hmm, it says it has triggers
>       -> try to open pg_trigger to read triggers
>          ... hmm, it says it has triggers
>          -> try to open pg_trigger to read triggers
>
> ... recurse until out of stack space.

IIRC Joe Conway had this problem some time ago, and he managed to get
out of it.  I don't recall details, maybe searching the archives ...

--
Alvaro Herrera (<alvherre[a]surnet.cl>)
"El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)