Discussion: db recovery (FATAL 2)


db recovery (FATAL 2)

From:
Bojan Belovic
Date:
My database apparently crashed - I don't know how or why. It happened in the
middle of the night, so I wasn't around to troubleshoot it at the time. It
looks like it died during the scheduled vacuum.

Here's the log that gets generated when I attempt to bring it back up:

postmaster successfully started
DEBUG:  database system shutdown was interrupted at 2002-05-07 09:35:35 EDT
DEBUG:  CheckPoint record at (10, 1531023244)
DEBUG:  Redo record at (10, 1531023244); Undo record at (10, 1531022908);
Shutdown FALSE
DEBUG:  NextTransactionId: 29939385; NextOid: 9729307
DEBUG:  database system was not properly shut down; automatic recovery in
progress...
DEBUG:  redo starts at (10, 1531023308)
DEBUG:  ReadRecord: record with zero len at (10, 1575756128)
DEBUG:  redo done at (10, 1575756064)
FATAL 2:  write(logfile 10 seg 93 off 15474688) failed: Success
/usr/bin/postmaster: Startup proc 1339 exited with status 512 - abort

Any suggestions? What are my options, other than doing a complete restore of
the DB from a dump (which is not really an option as the backup is not as
recent as it should be).

Thanks!

Bojan


Re: db recovery (FATAL 2)

From:
Tom Lane
Date:
Bojan Belovic <bbelovic@usa.net> writes:
> FATAL 2:  write(logfile 10 seg 93 off 15474688) failed: Success

This would be what PG version?

Broad-jumping to conclusions, I'm going to guess (a) 7.1.0-7.1.2
and (b) you are out of disk space for the WAL logs.

If so, you'll need to free up 16MB or so to restart the postmaster,
and you'd be well advised to update to 7.1.3 before trying another
VACUUM.

            regards, tom lane
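[Editor's note: Tom's diagnosis - no room left for WAL - can be checked before restarting with a quick sketch like the one below. The PGDATA path and the /tmp fallback are assumptions for illustration, not from the thread.]

```shell
# Sketch: check free space on the filesystem holding pg_xlog before
# restarting the postmaster. PGDATA default is an assumed path.
PGDATA="${PGDATA:-/usr/local/pgsql/data}"
XLOGDIR="$PGDATA/pg_xlog"
# Fall back to /tmp so the sketch still runs where PGDATA doesn't exist.
[ -d "$XLOGDIR" ] || XLOGDIR=/tmp

# Free kilobytes on that filesystem (POSIX df -P, field 4 of the data row).
free_kb=$(df -P "$XLOGDIR" | awk 'NR==2 {print $4}')

if [ "$free_kb" -lt 16384 ]; then
    echo "WARNING: less than the ~16MB a new WAL segment needs in $XLOGDIR"
else
    echo "OK: ${free_kb}KB free for WAL in $XLOGDIR"
fi
```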

data loss due to improper handling of postmaster ....

From:
"Rajesh Kumar Mallah."
Date:
Hi folks,

this is a long email.
I too experienced a data loss of 11 hrs recently.
I have the most recent PostgreSQL 7.2.1 on RedHat 6.2,

but my case was a bit different, and I feel my wrong handling
of the situation was also responsible for it.

I would be grateful if someone could tell me what should have
been done *instead* to prevent the data loss.

As far as I remember, the following is the post mortem:

The load average of my database server had reached 5.15 and my website had
become sluggish, so I decided to stop the postmaster and start it again
(I don't know if it was the right thing to do, but it was intuitive to me).

so I did:

# su - postgres
# pg_ctl stop   <-- did not work out;
it said the postmaster could not be stopped.

# pg_ctl stop -m immediate
it said the postmaster was stopped,
but that was wrong: ps auxwww still showed some processes running.

# pg_ctl -l /var/log/pgsql start
said it started successfully (but in reality it had not).

At this point the postmaster was neither dead nor running; essentially my live
website was down, so under pressure I decided to reboot the system
and told my ISP to do so.

But even the reboot was not smooth; the unix admin at my ISP said
some process would not let the system reboot (and it was the postmaster),
so he had to power-cycle the machine, and it fscked
on startup.

As a result I got messages similar to the ones Bojan has given below,
and my website was not connecting to the database;
it kept saying "database in recovery mode....".

Then I did "pg_ctl stop" and then start, but nothing worked out.

Since it was my production database I had to restore it
in minimum time, so I used my old backup, which was 11 hrs old -
hence a major data loss.

I strongly believe PostgreSQL is the best open source database
around and is *safe* unless fiddled with in a wrong manner.

But there are problems in using it.

Due to the current lack of built-in failover and replication solutions in
PostgreSQL, people like me tend to become desperate, because
one cannot keep a webserver down for long, and as a result we take wrong steps.

For mere mortals like me there should be a set of guidelines for safe
handling of the server (DOs and DON'Ts) to prevent
DATA LOSS.
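[Editor's note: a minimal DOs-and-DON'Ts sketch of the shutdown sequence described above. The DRY_RUN guard and log path are illustrative assumptions; with DRY_RUN=0 the real commands would run as the postgres user.]

```shell
# Sketch: a safer shutdown/restart sequence than jumping to "-m immediate".
DRY_RUN=1
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

# DO: try a fast shutdown first; it rolls back open transactions and
# writes a clean shutdown checkpoint.
run pg_ctl stop -m fast

# DO: verify every backend is really gone before restarting.
run sh -c 'ps auxwww | grep "[p]ostgres"'

# DON'T: use "kill -9" on the postmaster, and avoid "-m immediate"
# unless fast shutdown fails; immediate skips the checkpoint and
# forces WAL recovery on the next start.
run pg_ctl -l /var/log/pgsql start
```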


Also I would like suggestions on how to live with PostgreSQL,
with its current limitations in replication (or failover solutions), and
without data loss.

What I currently do is back up my database with pg_dump, but there are
problems with it.

Because of the large size of my database, pg_dump takes
20-30 mins and the server load increases; this means
I cannot do it very frequently on my production server,
so in the worst case I still lose data for a duration ranging from 1-24 hrs,
depending on the frequency of pg_dump.
And for many of us even 1 hour of data is *quite* a loss.

I would also like comments on the usability of the USOGRES / RSERV
replication systems with PostgreSQL 7.2.1.
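[Editor's note: a sketch of the periodic pg_dump approach described above, suitable for cron. The database name, backup directory, and 24-dump retention are assumptions; the pg_dump line is commented out because it needs a running server.]

```shell
# Sketch: compressed, timestamped pg_dump for an hourly cron entry.
DB="mydb"
BACKUP_DIR="${BACKUP_DIR:-/tmp/pgsql-backups}"
mkdir -p "$BACKUP_DIR"
STAMP=$(date +%Y%m%d-%H%M)
DUMPFILE="$BACKUP_DIR/$DB-$STAMP.sql.gz"

# Streaming through gzip keeps the file (and the write I/O) small.
# pg_dump "$DB" | gzip > "$DUMPFILE"
echo "would dump $DB to $DUMPFILE"

# Keep only the newest 24 dumps so the directory doesn't grow unbounded.
ls -1t "$BACKUP_DIR"/"$DB"-*.sql.gz 2>/dev/null | tail -n +25 | xargs -r rm -f
```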

Hoping to get some tips from the experts out here.

regds
mallah.


On Tuesday 07 May 2002 07:52 pm, Bojan Belovic wrote:
> My database apparently crashed - don't know how or why. It happend in the
> middle of the night so I wasn't around to troubleshoot it at the time. It
> looks like it died during the scheduled vacuum.
>
> Here's the log that gets generated when I attempt to bring it back up:
>
> postmaster successfully started
> DEBUG:  database system shutdown was interrupted at 2002-05-07 09:35:35 EDT
> DEBUG:  CheckPoint record at (10, 1531023244)
> DEBUG:  Redo record at (10, 1531023244); Undo record at (10, 1531022908);
> Shutdown FALSE
> DEBUG:  NextTransactionId: 29939385; NextOid: 9729307
> DEBUG:  database system was not properly shut down; automatic recovery in
> progress...
> DEBUG:  redo starts at (10, 1531023308)
> DEBUG:  ReadRecord: record with zero len at (10, 1575756128)
> DEBUG:  redo done at (10, 1575756064)
> FATAL 2:  write(logfile 10 seg 93 off 15474688) failed: Success
> /usr/bin/postmaster: Startup proc 1339 exited with status 512 - abort
>
> Any suggestions? What are my options, other than doing a complete restore
> of the DB from a dump (which is not really an option as the backup is not
> as recent as it should be).
>
> Thanks!
>
> Bojan

Re: db recovery (FATAL 2)

From:
Bojan Belovic
Date:
Addition to my previous question...

If postgres has trouble recovering the database from the log, is it possible
to skip the recovery step and lose some recent data changes? If the loss is
relatively small, I think it would be acceptable (say we lose all the changes
in the last few hours). I'm not sure if the vacuum makes this a bigger problem.

Thanks!

--------

My database apparently crashed - don't know how or why. It happend in the
middle of the night so I wasn't around to troubleshoot it at the time. It
looks like it died during the scheduled vacuum.

Here's the log that gets generated when I attempt to bring it back up:

postmaster successfully started
DEBUG: database system shutdown was interrupted at 2002-05-07 09:35:35 EDT
DEBUG: CheckPoint record at (10, 1531023244)
DEBUG: Redo record at (10, 1531023244); Undo record at (10, 1531022908);
Shutdown FALSE
DEBUG: NextTransactionId: 29939385; NextOid: 9729307
DEBUG: database system was not properly shut down; automatic recovery in
progress...
DEBUG: redo starts at (10, 1531023308)
DEBUG: ReadRecord: record with zero len at (10, 1575756128)
DEBUG: redo done at (10, 1575756064)
FATAL 2: write(logfile 10 seg 93 off 15474688) failed: Success
/usr/bin/postmaster: Startup proc 1339 exited with status 512 - abort

Any suggestions? What are my options, other than doing a complete restore of
the DB from a dump (which is not really an option as the backup is not as
recent as it should be).

Thanks!

Bojan



Re: db recovery (FATAL 2)

From:
Bojan Belovic
Date:
You are correct, it's 7.1.2. However, the problem is not with disk space
(there are several gigs available), but there could be a problem with a bad
sector on one of the log files. If this is the case, and the log file is
corrupted, is there any way of recovering, even with a certain data loss?

Thanks!


Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bojan Belovic <bbelovic@usa.net> writes:
> > FATAL 2:  write(logfile 10 seg 93 off 15474688) failed: Success
>
> This would be what PG version?
>
> Broad-jumping to conclusions, I'm going to guess (a) 7.1.0-7.1.2
> and (b) you are out of disk space for the WAL logs.
>
> If so, you'll need to free up 16MB or so to restart the postmaster,
> and you'd be well advised to update to 7.1.3 before trying another
> VACUUM.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster


Re: db recovery (FATAL 2)

From:
Tom Lane
Date:
Bojan Belovic <bbelovic@usa.net> writes:
> You are correct, it's 7.1.2 . However, the problem is not with disk space
> (there's several gigs available), but there could be a problem with a bad
> sector on one of the log files. If this is the case, and the log file is
> corrupted, is there any way of recovering, even with a certain data loss?

Hm.  It's complaining about a write, not a read, so there is no lost
data (yet), even if your theory is correct.  You might first try copying
the entire $PGDATA/pg_xlog directory somewhere else.

If nothing else avails, see contrib/pg_resetxlog.  But that should be
your last resort not first.

            regards, tom lane
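[Editor's note: Tom's "copy pg_xlog somewhere else first" advice, sketched as a script. The PGDATA and backup paths are assumptions; the point is to snapshot the WAL before anything destructive like pg_resetxlog.]

```shell
# Sketch: snapshot pg_xlog before attempting any destructive recovery step.
PGDATA="${PGDATA:-/usr/local/pgsql/data}"
BACKUP="${BACKUP:-/tmp/pg_xlog-backup-$(date +%Y%m%d)}"

if [ -d "$PGDATA/pg_xlog" ]; then
    # cp -rp preserves timestamps and modes; a tar archive works equally well.
    cp -rp "$PGDATA/pg_xlog" "$BACKUP"
    echo "pg_xlog copied to $BACKUP"
else
    echo "no pg_xlog at $PGDATA (nothing copied)"
fi
```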

Re: db recovery (FATAL 2)

From:
Bojan Belovic
Date:
It turns out it was a bad sector, and once the data was retrieved and the
drive replaced, postgres was able to go through the startup process
successfully. Given that it did not report any errors on recovery, I suppose
there was no data loss, but even if there was some damage to the log file, it
should be minor - the crash happened at 6am, with almost no activity, so I'm not
going to worry about that at all at this point.
Anyway, thank you very much for your help.
One quick question - you mentioned I should upgrade to 7.1.3 before I run
vacuum again. What are the known problems that "ask" for this?

Thanks again!


Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bojan Belovic <bbelovic@usa.net> writes:
> > You are correct, it's 7.1.2 . However, the problem is not with disk space
> > (there's several gigs available), but there could be a problem with a bad
> > sector on one of the log files. If this is the case, and the log file is
> > corrupted, is there any way of recovering, even with a certain data loss?
>
> Hm.  It's complaining about a write, not a read, so there is no lost
> data (yet), even if your theory is correct.  You might first try copying
> the entire $PGDATA/pg_xlog directory somewhere else.
>
> If nothing else avails, see contrib/pg_resetxlog.  But that should be
> your last resort not first.
>
>             regards, tom lane


Re: db recovery (FATAL 2)

From:
Tom Lane
Date:
Bojan Belovic <bbelovic@usa.net> writes:
> One quick question - you mentioned I should upgrade to 7.1.3 beofre I run
> vacuum again. What are the known problems that "ask" for this?

WAL growth.  My original theory was that you'd run out of disk space
because of a VACUUM trying to do a huge amount of work.  In 7.1.2
the WAL can grow arbitrarily large during a long transaction...

There are some other not-unimportant bug fixes in 7.1.3 too, but that's
the one I was thinking of.

            regards, tom lane

Re: db recovery (FATAL 2)

From:
Bojan Belovic
Date:
Just to make sure - given that there is plenty of space available (the database
is slightly larger than 1GB and there is almost 10GB free), there should be no
problem with vacuum?
Or should I upgrade regardless? (I generally like to keep a stable system
stable, unless I know there is a specific reason to change things in
the environment.)

Thanks a lot,

Bojan


Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Bojan Belovic <bbelovic@usa.net> writes:
> > One quick question - you mentioned I should upgrade to 7.1.3 beofre I run
> > vacuum again. What are the known problems that "ask" for this?
>
> WAL growth.  My original theory was that you'd run out of disk space
> because of a VACUUM trying to do a huge amount of work.  In 7.1.2
> the WAL can grow arbitrarily large during a long transaction...
>
> There are some other not-unimportant bug fixes in 7.1.3 too, but that's
> the one I was thinking of.
>
>             regards, tom lane


Missing pg_clog file

From:
Brian McCane
Date:
My database ran out of disk space the other day (I usually monitor better,
but my wife is having chemo and that is more important).  Anyway, I
shut down the server using the 'pg_ctl stop' command, moved about 6GB of
indexes to another drive and fixed all the links.  After I restarted the
server everything looked fine until I got:

FATAL 2:  open of /usr/local/pgsql/data/pg_clog/0000 failed: No such file
or directory

This happens about every 30-40 minutes, then the server comes up and
behaves okay for a while until *POOF*.

I tried to dump my databases to one of my alternate servers, but it fails
because I have duplicate records in a primary key in miscellaneous tables.
I tried to get smart and changed the primary key to include the oid,
figuring that would make it unique, and it would also be easier to delete
one of the conflicting records.  When I started the dump, I got the same
error.  After many hours running the following on a table with about 200M
records:

SELECT oid, cnt FROM (SELECT oid, count(oid) AS cnt FROM foo GROUP BY oid)
AS bar WHERE cnt > 1;

I discovered that I have a bunch (20,000+) of records that have duplicated
oid numbers.  Is this because of the disk running out of space, or is it
some deeper, more evil problem?  Also, how the %!$#! do I fix it without
losing the associated data?  Most of the records are identical (maybe
all), and when I delete any one, all of them disappear.  I guess I could
select distinct into a temporary table, delete from the current one, and then
insert from the temporary table, but this is gonna take a long time.
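[Editor's note: the distinct-into-temp-table route described above, sketched as a psql-ready script. The table name "foo" follows the example in the message; wrapping it in one transaction is an assumption so readers never see the table empty. Rows rewritten this way receive fresh oids, which also clears the duplicated-oid state.]

```shell
# Sketch: write the dedup SQL to a file for review, then run it via psql.
cat > /tmp/dedup_foo.sql <<'SQL'
BEGIN;
-- Keep one representative of each distinct row (oid is a system column,
-- so it is not part of "*" and does not block DISTINCT).
SELECT DISTINCT * INTO TEMP TABLE foo_dedup FROM foo;
-- Remove everything, then put the unique rows back (they get new oids).
DELETE FROM foo;
INSERT INTO foo SELECT * FROM foo_dedup;
COMMIT;
SQL
echo "review /tmp/dedup_foo.sql, then: psql mydb -f /tmp/dedup_foo.sql"
```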

- brian



Wm. Brian McCane                    | Life is full of doors that won't open
Search http://recall.maxbaud.net/   | when you knock, equally spaced amid those
Usenet http://freenews.maxbaud.net/ | that open when you don't want them to.
Auction http://www.sellit-here.com/ | - Roger Zelazny "Blood of Amber"