Обсуждение: Attempt to re-archive existing WAL logs after restoring from backup

Поиск
Список
Период
Сортировка

Attempt to re-archive existing WAL logs after restoring from backup

От
Jon Colverson
Дата:
Hello.

I've been testing my DB backup and restore procedure and I've run into
something I can't figure out. After recovering from a PITR backup, when
I do another pg_start_backup PostgreSQL attempts to re-archive the last
log again, which fails because it already exists in the archive. I'm
running the Ubuntu 7.04 version of PostgreSQL 8.2.4. Here's an example
log interspersed with what I was doing in the gaps (sorry for posting
the whole thing; I'm new to PostgreSQL so I'm not sure what's
significant and what isn't):

--------
2007-05-29 02:18:33 UTC LOG:  archived transaction log file
"000000010000000000000012"
2007-05-29 02:18:34 UTC LOG:  incomplete startup packet
2007-05-29 02:18:34 UTC LOG:  received fast shutdown request
2007-05-29 02:18:34 UTC LOG:  shutting down
2007-05-29 02:18:34 UTC LOG:  database system is shut down

Deleted data directory and restored from base backup. Restarted DB:

2007-05-29 02:20:33 UTC LOG:  could not load root certificate file
"root.crt": no SSL error reported
2007-05-29 02:20:33 UTC DETAIL:  Will not verify client certificates.
2007-05-29 02:20:33 UTC LOG:  database system was interrupted at
2007-05-29 02:16:06 UTC
2007-05-29 02:20:33 UTC LOG:  starting archive recovery
2007-05-29 02:20:33 UTC LOG:  restore_command =
"/usr/local/sbin/xlogrestore %f %p"
2007-05-29 02:20:33 UTC LOG:  incomplete startup packet
2007-05-29 02:20:34 UTC LOG:  restored log file
"000000010000000000000011.000118A0.backup" from archive
2007-05-29 02:20:34 UTC FATAL:  the database system is starting up
2007-05-29 02:20:34 UTC FATAL:  the database system is starting up
2007-05-29 02:20:35 UTC LOG:  restored log file
"000000010000000000000011" from archive
2007-05-29 02:20:35 UTC LOG:  checkpoint record is at 0/110118A0
2007-05-29 02:20:35 UTC LOG:  redo record is at 0/110118A0; undo record
is at 0/0; shutdown FALSE
2007-05-29 02:20:35 UTC LOG:  next transaction ID: 0/975; next OID: 24576
2007-05-29 02:20:35 UTC LOG:  next MultiXactId: 1; next MultiXactOffset: 0
2007-05-29 02:20:35 UTC LOG:  automatic recovery in progress
2007-05-29 02:20:35 UTC LOG:  redo starts at 0/110118E8
2007-05-29 02:20:35 UTC FATAL:  the database system is starting up
2007-05-29 02:20:35 UTC LOG:  restored log file
"000000010000000000000012" from archive
2007-05-29 02:20:36 UTC FATAL:  the database system is starting up
2007-05-29 02:20:36 UTC LOG:  could not open file
"pg_xlog/000000010000000000000013" (log file 0, segment 19): No such
file or directory
2007-05-29 02:20:36 UTC LOG:  redo done at 0/120000C8
2007-05-29 02:20:36 UTC FATAL:  the database system is starting up
2007-05-29 02:20:37 UTC FATAL:  the database system is starting up
2007-05-29 02:20:37 UTC FATAL:  the database system is starting up
2007-05-29 02:20:38 UTC FATAL:  the database system is starting up
2007-05-29 02:20:38 UTC FATAL:  the database system is starting up
2007-05-29 02:20:39 UTC FATAL:  the database system is starting up
2007-05-29 02:20:39 UTC FATAL:  the database system is starting up
2007-05-29 02:20:39 UTC LOG:  incomplete startup packet
2007-05-29 02:20:39 UTC LOG:  restored log file
"000000010000000000000012" from archive
2007-05-29 02:20:39 UTC LOG:  archive recovery complete
2007-05-29 02:20:40 UTC LOG:  database system is ready

SELECT pg_switch_xlog(); (It doesn't make any difference if I do this
command or not)

2007-05-29 02:22:04 UTC LOG:  archived transaction log file
"000000010000000000000013"

SELECT pg_start_backup('testbackup');


gzip: stdout: Broken pipe
2007-05-29 02:22:46 UTC LOG:  archive command
"/usr/local/sbin/xlogarchive pg_xlog/000000010000000000000012
000000010000000000000012" failed: return code 256

gzip: stdout: Broken pipe
2007-05-29 02:22:48 UTC LOG:  archive command
"/usr/local/sbin/xlogarchive pg_xlog/000000010000000000000012
000000010000000000000012" failed: return code 256

gzip: stdout: Broken pipe
2007-05-29 02:22:50 UTC LOG:  archive command
"/usr/local/sbin/xlogarchive pg_xlog/000000010000000000000012
000000010000000000000012" failed: return code 256
2007-05-29 02:22:50 UTC WARNING:  transaction log file
"000000010000000000000012" could not be archived: too many failures
--------

Thanks for any advice.

--
Jon

Re: Attempt to re-archive existing WAL logs after restoring from backup

От
Jon Colverson
Дата:
Jon Colverson wrote:
> I've been testing my DB backup and restore procedure and I've run into
> something I can't figure out. After recovering from a PITR backup, when
> I do another pg_start_backup PostgreSQL attempts to re-archive the last
> log again, which fails because it already exists in the archive.

Hello again. I've been investigating this some more and I've got some
more details to add:

In xlog.c, exitArchiveRecovery() moves the last restored log file into
XLOGDIR.

Later (when shutting down the cluster, for instance),
XLogArchiveCheckDone() is called on the log file. This function looks
for a ".done" or ".ready" file and finds neither, so it calls
XLogArchiveNotify() to create the ".ready" file. This causes the
archiver to come along later and attempt to archive the log and this
fails because the log already exists in the archive (that's where it
came from).

So, I think the problem is that exitArchiveRecovery() should be marking
the restored log file as ".done" (or perhaps not bothering to restore
the log file at all?).

This looks like a bug to me. Any thoughts?

--
Jon

Re: Attempt to re-archive existing WAL logs after restoring from backup

От
Tom Lane
Дата:
Jon Colverson <pgsql@vcxz.co.uk> writes:
> So, I think the problem is that exitArchiveRecovery() should be marking
> the restored log file as ".done" (or perhaps not bothering to restore
> the log file at all?).

> This looks like a bug to me. Any thoughts?

I don't think that will fix it.  The problem is that after recovery
we try to use the rest of the last WAL segment for fresh WAL data, and
therefore need to modify the segment, and therefore need to (re) archive
it.  We can't fix that by doing a forced switch to a new segment, because
that still requires writing a switch record at the right place in the
last segment.  There's really no way to make this work without
abandoning the principle that the archive process should never overwrite
files.  (Which is certainly not a logically *necessary* thing, but it
seems like a mighty good idea from a safety perspective.)

I believe that this situation doesn't come up in ordinary scenarios
because a WAL segment doesn't get archived until it's really filled.
I take it you are using a backup procedure that undertakes to force the
last partial segment into the archive after a postmaster shutdown.  I'd
recommend not doing it that way --- instead, if you want a deterministic
time that the backup is good up to, don't stop the postmaster at all,
just apply pg_switch_xlog() and let the regular archiving process
archive the thereby-completed segment file.

            regards, tom lane

Re: Attempt to re-archive existing WAL logs after restoring from backup

От
Jon Colverson
Дата:
Thanks for your reply.

Tom Lane wrote:
> Jon Colverson <pgsql@vcxz.co.uk> writes:
>> So, I think the problem is that exitArchiveRecovery() should be marking
>> the restored log file as ".done" (or perhaps not bothering to restore
>> the log file at all?).
>
>> This looks like a bug to me. Any thoughts?
>
> I don't think that will fix it.  The problem is that after recovery
> we try to use the rest of the last WAL segment for fresh WAL data, and
> therefore need to modify the segment, and therefore need to (re) archive
> it.  We can't fix that by doing a forced switch to a new segment, because
> that still requires writing a switch record at the right place in the
> last segment.  There's really no way to make this work without
> abandoning the principle that the archive process should never overwrite
> files.  (Which is certainly not a logically *necessary* thing, but it
> seems like a mighty good idea from a safety perspective.)

I'm not sure this is quite right. Any log in the archive is, by
definition, a complete log that will have the switch record already
(because it will only have been marked for archiving after being
completed). Hence I don't really understand why the last one is being
copied back to pg_xlog after being used in the recovery. The long
comment in exitArchiveRecovery() explains that this is to replace any
existing version of the file in pg_xlog, but I think it would be just as
valid to simply remove any existing version since it is definitely out
of date (iff restoredFromArchive).

> I believe that this situation doesn't come up in ordinary scenarios
> because a WAL segment doesn't get archived until it's really filled.
> I take it you are using a backup procedure that undertakes to force the
> last partial segment into the archive after a postmaster shutdown.  I'd

No; nothing is touching the archive except for the archive_command being
run by PostgreSQL.

--
Jon

Re: Attempt to re-archive existing WAL logs afterrestoring from backup

От
"Simon Riggs"
Дата:
On Thu, 2007-05-31 at 16:09 +0100, Jon Colverson wrote:
> >
> >> This looks like a bug to me. Any thoughts?
> >

> No; nothing is touching the archive except for the archive_command being
> run by PostgreSQL.

Jon,

You're still troubled by what you've seen?

Can you start from the beginning again and describe exactly what set of
circumstances you're in. I want to remove any bug or doubt that there is
one lurking.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Attempt to re-archive existing WAL logs afterrestoringfrom backup

От
"Simon Riggs"
Дата:
On Fri, 2007-06-01 at 12:02 +0100, Simon Riggs wrote:
> On Thu, 2007-05-31 at 16:09 +0100, Jon Colverson wrote:
> > >
> > >> This looks like a bug to me. Any thoughts?
> > >
>
> > No; nothing is touching the archive except for the archive_command being
> > run by PostgreSQL.
>
> Jon,
>
> You're still troubled by what you've seen?
>
> Can you start from the beginning again and describe exactly what set of
> circumstances you're in. I want to remove any bug or doubt that there is
> one lurking.

I've spent some time re-examining the code and I can't see a route for
the discussed problem to occur. All the main code paths do not exhibit
this problem in 8.2, so I'm not sure where to go from here.

If you can help further, please give me a shout.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Attempt to re-archive existing WAL logs afterrestoringfrom backup

От
Jon Colverson
Дата:
Simon Riggs wrote:
> I've spent some time re-examining the code and I can't see a route for
> the discussed problem to occur. All the main code paths do not exhibit
> this problem in 8.2, so I'm not sure where to go from here.
>
> If you can help further, please give me a shout.

Thanks a lot for looking into this.

I've attached a self-contained minimal shell script which sets up a
cluster and demonstrates the problem. I hope that will provide an
unambiguous description. I'm running it on Ubuntu 7.04 (i386) and the
script includes all the commands to build PostgreSQL from scratch, but
obviously you can snip that stuff out and change the paths to suit your
environment.

The script ends with the server log showing the archiver attempting to
re-archive the last log that was restored in the recovery procedure.

Thanks again.

--
Jon
#!/bin/sh

set -e

apt-get -y install curl
cd /tmp
curl -O ftp://ftp5.us.postgresql.org/pub/PostgreSQL//source/v8.2.4/postgresql-8.2.4.tar.gz
cat <<EOF | sha512sum -c

571fc42abd7b06e0fcc33a326796bb28bdcb7e3260d3e4b56e40f0e22517784460829900e2941e978ca253cac7c8e00397c342ba6397a0657b4b68157aa425e9
*postgresql-8.2.4.tar.gz
EOF
adduser --system --home /var/lib/pgdata --no-create-home --gecos "PostgreSQL administrator,,," --group --disabled-login
postgres
apt-get -y install libc6-dev gcc make libreadline5-dev zlib1g-dev libssl-dev
tar -zxof postgresql-8.2.4.tar.gz
cd postgresql-8.2.4
./configure
make
make install
# apt-get -y autoremove --purge libc6-dev gcc make libreadline5-dev zlib1g-dev libssl-dev
cd /

#----------------------
# End of installation.
#----------------------

PGSQL_BIN=/usr/local/pgsql/bin

DB=/tmp/duplicatexlogdemo
XLOG_ARCHIVE=/tmp/xlogarchive
PGDATA_BACKUP=/tmp/pgdata_backup.tar.gz
MAKE_BACKUP=/tmp/makepgdatabackup
RESTORE_BACKUP=/tmp/restorepgdatabackup
SERVER_LOG=/tmp/serverlog

mkdir -p $DB
chown postgres:postgres $DB
sudo -u postgres $PGSQL_BIN/initdb -D $DB

mkdir -p $XLOG_ARCHIVE
chown postgres:postgres $XLOG_ARCHIVE

cat <<EOF >$MAKE_BACKUP
#!/bin/sh
sudo -u postgres $PGSQL_BIN/psql -c "SELECT pg_start_backup('testbackup')" >/dev/null
tar -C $DB --exclude=pg_xlog -c . | gzip -c >$PGDATA_BACKUP
sudo -u postgres $PGSQL_BIN/psql -Atc "SELECT pg_xlogfile_name(pg_stop_backup())"
EOF
chmod 755 $MAKE_BACKUP

cat <<EOF >$RESTORE_BACKUP
#!/bin/sh
mkdir -p $DB/pg_xlog/archive_status
echo "restore_command = 'cp $XLOG_ARCHIVE/%f %p'" >$DB/recovery.conf
chown -R postgres:postgres $DB
chmod -R g=,o= $DB
tar -C $DB -zxf $PGDATA_BACKUP
EOF
chmod 755 $RESTORE_BACKUP

echo "archive_command = 'test ! -f $XLOG_ARCHIVE/%f && cp %p $XLOG_ARCHIVE/%f'" >>$DB/postgresql.conf

sudo -u postgres $PGSQL_BIN/pg_ctl start -w -D $DB -l $SERVER_LOG
sudo -u postgres $PGSQL_BIN/psql -c "CREATE DATABASE blah1"
$MAKE_BACKUP
# Create some logs to recover from:
sudo -u postgres $PGSQL_BIN/psql -c "CREATE DATABASE blah2; SELECT pg_switch_xlog()"
sudo -u postgres $PGSQL_BIN/psql -c "CREATE DATABASE blah3; SELECT pg_switch_xlog()"
sudo -u postgres $PGSQL_BIN/psql -c "CREATE DATABASE blah4; SELECT pg_switch_xlog()"
sudo -u postgres $PGSQL_BIN/psql -c "CREATE DATABASE blah5; SELECT pg_switch_xlog()"
sudo -u postgres $PGSQL_BIN/psql -c "CREATE DATABASE blah6; SELECT pg_switch_xlog()"
sudo -u postgres $PGSQL_BIN/pg_ctl stop -D $DB
cd /
rm -Rf $DB
$RESTORE_BACKUP
sudo -u postgres $PGSQL_BIN/pg_ctl start -w -D $DB -l $SERVER_LOG
$MAKE_BACKUP
tail $SERVER_LOG

Re: Attempt to re-archive existing WAL logsafterrestoringfrom backup

От
"Simon Riggs"
Дата:
On Sat, 2007-06-02 at 01:07 +0100, Jon Colverson wrote:
> Simon Riggs wrote:
> > I've spent some time re-examining the code and I can't see a route for
> > the discussed problem to occur. All the main code paths do not exhibit
> > this problem in 8.2, so I'm not sure where to go from here.
> >
> > If you can help further, please give me a shout.
>
> Thanks a lot for looking into this.
>
> I've attached a self-contained minimal shell script which sets up a
> cluster and demonstrates the problem.

OK, thanks. I'll run with it from here.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Attempt to re-archive existing WAL logsafterrestoringfrom backup

От
"Simon Riggs"
Дата:
On Sat, 2007-06-02 at 01:07 +0100, Jon Colverson wrote:
> Simon Riggs wrote:
> > I've spent some time re-examining the code and I can't see a route for
> > the discussed problem to occur. All the main code paths do not exhibit
> > this problem in 8.2, so I'm not sure where to go from here.
> >
> > If you can help further, please give me a shout.
>
> Thanks a lot for looking into this.
>
> I've attached a self-contained minimal shell script which sets up a
> cluster and demonstrates the problem. I hope that will provide an
> unambiguous description. I'm running it on Ubuntu 7.04 (i386) and the
> script includes all the commands to build PostgreSQL from scratch, but
> obviously you can snip that stuff out and change the paths to suit your
> environment.
>
> The script ends with the server log showing the archiver attempting to
> re-archive the last log that was restored in the recovery procedure.

Jon,

Your script is correct and there is a problem, as you demonstrate. Thank
you for bringing this to the list.

You have followed the manual in full. In Step 6 of the recovery
procedure outlined here
http://developer.postgresql.org/pgdocs/postgres/continuous-archiving.html#BACKUP-PITR-RECOVERY
we say that if you have partially completed WAL logs these should be
copied back into pg_xlog. This is correct and if we do this we avoid the
error you have highlighted. In the case of a true disaster recovery, as
you have attempted to test, no files are available and so an attempt is
made to archive the last restored file. That attempt fails iff you have
done as the manual suggests and used an archive_command that prevents
duplicate file insertions.

I'll repost to pgsql-hackers to discuss solutions. Fixes are simple, but
require some discussion.

Initially I thought you'd fallen foul of another problem which is known
to exist, which is caused by immediately shutting down a server after
having taken a hot backup. This isn't a problem in real usage though is
a problem in a scripted test. I've already proposed solutions there and
will pick up that thread again also.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com