Thread: Disk full and WALs

Disk full and WALs

From: John Krasnay

Hi folks,

I recently ran into an issue with PostgreSQL 8.3 on a disk that became
full. We freed up some space and restarted PostgreSQL, but startup
failed with the following error:

2010-08-01 08:21:19 EDT FATAL:  invalid data in file "00000001000002BD00000072.00000020.backup"

The indicated file has zero bytes.

We decided to do a point-in-time recovery, but that failed too, since
the archived WAL file 00000001000002BD00000072 was zero-length. Looking
at the logs, the archive command for this file failed at about 6:29am,
but the server continued on until later in the evening when we noticed
there was a disk space problem.

Now our problem is that we appear to have lost a whole day's worth of
data, since we can't do a PITR past the failed archive log.
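
A truncated archive file like this one can be spotted before it is needed for recovery. The following is a minimal sketch, not something from the thread: it assumes the archive holds uncompressed segments of the default 16 MB size, and the function name find_truncated_segments is hypothetical.

```python
import os
import re

# Assumption: default WAL segment size of 16 MB, stored uncompressed.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024

# A WAL segment file name is exactly 24 hexadecimal digits.
SEGMENT_NAME = re.compile(r"^[0-9A-F]{24}$")


def find_truncated_segments(archive_dir, expected_size=WAL_SEGMENT_SIZE):
    """Return the sorted names of archived segments whose size is wrong."""
    bad = []
    for name in os.listdir(archive_dir):
        if not SEGMENT_NAME.match(name):
            # Skip .backup and .history files, which are small by design.
            continue
        size = os.path.getsize(os.path.join(archive_dir, name))
        if size != expected_size:
            bad.append(name)
    return sorted(bad)
```

Run periodically against the archive directory, a check of this kind would have flagged 00000001000002BD00000072 on the morning the archive command failed rather than at restore time.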

The documentation says that if the archive command fails, the server
retries until it's successful, but that appears not to have happened. It
looks like the zero-length file that PostgreSQL complained about,
00000001000002BD00000072.00000020.backup, might be related to this.
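
For reference, a name of this shape is that of a backup history file: the 24 hex digits name the first WAL segment a base backup needs, and the second part is a position within that segment. The sketch below only decodes the name; the function parse_backup_history_name is hypothetical and is not part of PostgreSQL.

```python
import re

# <timeline><log><segment>.<offset>.backup, each field 8 hex digits.
BACKUP_HISTORY_NAME = re.compile(
    r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})\.([0-9A-F]{8})\.backup$"
)


def parse_backup_history_name(name):
    """Split a backup history file name into its numeric components."""
    match = BACKUP_HISTORY_NAME.match(name)
    if match is None:
        raise ValueError("not a backup history file name: %r" % name)
    timeline, log, segment, offset = (int(g, 16) for g in match.groups())
    return {
        "segment_file": name[:24],  # the WAL segment the backup starts in
        "timeline": timeline,
        "log": log,
        "segment": segment,
        "offset": offset,           # byte position within that segment
    }


info = parse_backup_history_name("00000001000002BD00000072.00000020.backup")
print(info["segment_file"], info["offset"])  # 00000001000002BD00000072 32
```

So the zero-byte file refers to the same segment, 00000001000002BD00000072, that was zero-length in the archive.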

Does anyone have any idea how we might recover from this? Could this be
a bug in how PostgreSQL deals with archive logging?

Thanks.

jk

Re: Disk full and WALs

From: Tom Lane

John Krasnay <john@krasnay.ca> writes:
> We decided to do a point-in-time recovery, but that failed too, since
> the archived WAL file 00000001000002BD00000072 was zero-length. Looking
> at the logs, the archive command for this file failed at about 6:29am,
> but the server continued on until later in the evening when we noticed
> there was a disk space problem.

> Now our problem is that we appear to have lost a whole day's worth of
> data, since we can't do a PITR past the failed archive log.

> The documentation says that if the archive command fails, the server
> retries until it's successful, but that appears not to have happened.

The archiver will retry, *if the archive command returns non-zero exit
status*.  It sounds to me like you're using an archive command script
that dutifully logs a failure but is careless about returning the proper
exit status.
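
To illustrate the point about exit status, here is a minimal sketch of an archive script that reports every failure through its return code. It is an assumption-laden example, not the script used in this thread: the names archive_segment and main are hypothetical, and it copies to a temporary name first so that a full disk cannot leave a truncated file under the final name.

```python
import os
import shutil
import sys


def archive_segment(src_path, archive_dir):
    """Copy one WAL file into archive_dir; return 0 on success, 1 on failure."""
    dest = os.path.join(archive_dir, os.path.basename(src_path))
    tmp = dest + ".tmp"
    if os.path.exists(dest):
        # Never overwrite an already archived file; report failure instead.
        print("refusing to overwrite %s" % dest, file=sys.stderr)
        return 1
    try:
        with open(src_path, "rb") as src, open(tmp, "wb") as out:
            shutil.copyfileobj(src, out)
            out.flush()
            os.fsync(out.fileno())  # force the data to disk before renaming
        os.rename(tmp, dest)        # the final name appears only when complete
    except OSError as exc:          # includes "No space left on device"
        print("archiving %s failed: %s" % (src_path, exc), file=sys.stderr)
        try:
            os.unlink(tmp)
        except OSError:
            pass
        return 1
    return 0


def main(argv):
    """Entry point: argv is [script, path_to_wal_file, archive_directory]."""
    if len(argv) != 3:
        return 2
    return archive_segment(argv[1], argv[2])
```

Saved as a script ending in sys.exit(main(sys.argv)), it could be wired in with something like archive_command = 'python3 /path/to/archive_wal.py %p /mnt/archive', where both paths are placeholders. Any non-zero return makes the archiver keep the segment and try again later.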

> Does anyone have any idea how we might recover from this?

I'm afraid you're probably screwed as far as replaying any data beyond
the lost WAL segment goes.  Even if you forced the system to try to
replay it, you'd have corrupted database state because of the omission
of the changes that were in the lost segment.  If you still have the
original $PGDATA tree (ie you didn't blow it away while trying the PITR
idea) then you might be able to get a closer approximation to current
time by doing resetxlog and starting up --- though the consistency of
the DB would still be questionable, so a dump and reload would be
advisable.

            regards, tom lane

Re: Disk full and WALs

From: John Krasnay

On 10-08-01 03:03 PM, Tom Lane wrote:
> The archiver will retry, *if the archive command returns non-zero exit
> status*.  It sounds to me like you're using an archive command script
> that dutifully logs a failure but is careless about returning the proper
> exit status.

That was my first thought, too, but the PostgreSQL log says this...

2010-07-31 06:29:11 EDT LOG:  archive command failed with exit code 1

...so it definitely knew about it. It was also suspicious that
00000001000002BD00000072.00000020.backup hung around in the pg_xlog
directory; if the server thought the archive command was successful it
would presumably have cleaned it up.
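
The server's own view of what still needs archiving can be read from the marker files in pg_xlog/archive_status: a file named NAME.ready means NAME is waiting to be archived, and it is renamed to NAME.done once the archive command has succeeded. A minimal sketch for listing the pending files follows; the function name pending_archive_files is hypothetical.

```python
import os


def pending_archive_files(pgdata):
    """Return sorted names of WAL files the server still considers unarchived."""
    status_dir = os.path.join(pgdata, "pg_xlog", "archive_status")
    suffix = ".ready"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(status_dir)
        if name.endswith(suffix)
    )
```

A list from this function that keeps growing is a sign that the archive command is failing repeatedly and pg_xlog is filling up.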

> I'm afraid you're probably screwed as far as replaying any data beyond
> the lost WAL segment goes.  Even if you forced the system to try to
> replay it, you'd have corrupted database state because of the omission
> of the changes that were in the lost segment.  If you still have the
> original $PGDATA tree (ie you didn't blow it away while trying the PITR
> idea) then you might be able to get a closer approximation to current
> time by doing resetxlog and starting up --- though the consistency of
> the DB would still be questionable, so a dump and reload would be
> advisable.
>
>             regards, tom lane

Luckily, we were able to rebuild our data from out-of-band data, but
it's good to know about resetxlog.

Thanks for your help.

jk