Re: Online Backup and WAL archives

Поиск
Список
Период
Сортировка
От Morus Walter
Тема Re: Online Backup and WAL archives
Дата
Msg-id 16896.48660.415157.247818@tanto-xipolis.de
обсуждение исходный текст
Ответ на Re: Online Backup and WAL archives  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-admin
Tom Lane writes:

Thanks for your answer Tom, and also thanks to the others, that answered.

> > The documentation says
> > ' To make use of this backup, you will need to keep around all the
> > WAL segment files generated at or after the starting time of the backup. '
> > Now I'm wondering how much of these WAL segment files do I really need
> > in order to recover the databases to a consistent state.
>
> If you are satisfied with recovering to the state shortly after you
> completed the backup, then it would be sufficient to have a set of WAL
> files spanning the time period in which the backup is done.

Hmm. But that set is probably empty if there are few transaction during
the backup.
So I guess either I have to save non archived WALs as well (which seems
to be problematic since they might change during saving) or wait until
there is at least one WAL archived after the end of the backup.
Is it save to backup non archived WALs? Probably it is, since the postmaster
has to deal with incomplete writes to WALs for crash recovery anyway.

> I'm dubious
> that this is necessarily an improvement over a pg_dump backup, though.
>
It's better in the case I don't loose the WAL archive at least.
Of course it's best not to need the backup at all, and second best to be
able to restore to a state as close to the failure as possible.
OTOH one of the worst cases I can think of, is to think to have a daily
backup and find that it's unusable at the moment you need it, e.g. because
you lost an WAL that's essential for recovery.

> > I expect the online backup to faster on recovery than an SQL dump, since
> > the latter would imply recreation of indexes during recovery.
>
> Is that assumption founded on any hard evidence?
>
It wasn't. But I did some tests and found it to be true.

Backup using pg_dumpall:
------------------------
time pg_dumpall -U postgres | gzip > pg.dump.gz
37.818u 7.697s 9:44.62 7.7%     0+0k 0+0io 2108pf+0w
266.722u 15.111s 9:44.61 48.2%  0+0k 0+0io 2198pf+0w
ll pg.dump.gz
-rw-rw-r--    1 morus    tanto    308194343 Feb  2 09:08 pg.dump.gz

Recovery:
---------
- create directories (data, tablespaces)
- initdb
- copy config/access config from backup
- start postmaster
- recover using
  time gunzip -c pg.dump.gz | psql -U postgres template1
  -> 1:08:17
- analyze all tables
  time vacuumdb -U postgres -a -v -z
  -> 0:06:13

So over all it's 1 1/4 to 1 1/2 hours to recover (once the machine is
up again and you got the dump from tape).

BTW: shouldn't pg_dumpall add an apropriate encoding command to avoid
encoding trouble during restore? Just to make sure people won't shoot
themselfs into the foot by having some encoding command in ~/.psqlrc.

Online Backup:
--------------
All data directories are copied by a 'tar -czf'
After pg_stop_backup() is issued all archived WAL files are saved to
another tar.gz file. The test was done without any transactions going on.

real    9m11.644s
user    7m28.789s
sys     0m33.730s

Backup-Size:
-rw-r--r--    1 root     root     782785152 Feb  2 11:18 backup.20050202_110935.tar.gz
-rw-r--r--    1 root     root          302 Feb  2 11:18 backup.wal.20050202_110935.tar.gz

(the second file contains the archives WALs, so it's just the .backup file)

Recovery:
---------
There are no archived WALs after the 000000010000000100000039.000F51E8.backup
file.
I kept the non archived WALs (they are linked from data/pg_xlog)

- stop postmaster
- remove all files/dirs in data, tablespaces
- unpack backup (tar xvzf ...)
  took 0:05:23
- prepare recovery.conf (note: I erroneously created i in pg_xlog, see below)
- start the server
- since there was nothing to redo, server came up immediatly:
  from the log:
2005-02-02 11:35:25 CET % LOG:  database system was interrupted at 2005-02-02 11:09:35 CET
2005-02-02 11:35:25 CET % LOG:  checkpoint record is at 1/390F51E8
2005-02-02 11:35:25 CET % LOG:  redo record is at 1/390F51E8; undo record is at 0/0; shutdown FALSE
2005-02-02 11:35:25 CET % LOG:  next transaction ID: 3679; next OID: 27566926
2005-02-02 11:35:25 CET % LOG:  database system was not properly shut down; automatic recovery in progress
2005-02-02 11:35:25 CET % LOG:  redo starts at 1/390F5224
2005-02-02 11:35:25 CET % LOG:  record with zero length at 1/390F5260
2005-02-02 11:35:25 CET % LOG:  redo done at 1/390F5224
2005-02-02 11:35:25 CET % LOG:  database system is ready

So the recovery time is ~ 10 minutes (ok, theoretically I would have to
repeat that test because of the  erroneous recovery.conf; but I don't think
it would make a difference).

I did a second test doing some data manipulation during the backup.
I followed the recovery steps above, with the addition of upacking the
archived WALs, which takes just a few seconds.
Again I kept the pg_xlog. After moved the recovery.conf to the data directory
the server recovered in ~ 15 seconds.
2005-02-02 12:20:12 CET % LOG:  database system was interrupted at 2005-02-02 11:50:34 CET
2005-02-02 12:20:12 CET % LOG:  starting archive recovery
2005-02-02 12:20:12 CET % LOG:  restore_command = "cp /d1/postgresql8.0/archive/%f %p"
cp: Aufruf von stat für »/d1/postgresql8.0/archive/00000001.history« nicht möglich: Datei oder Verzeichnis nicht
gefunden
2005-02-02 12:20:12 CET % LOG:  restored log file "000000010000000100000039" from archive
2005-02-02 12:20:12 CET % LOG:  checkpoint record is at 1/390F52D8
2005-02-02 12:20:12 CET % LOG:  redo record is at 1/390F52D8; undo record is at 0/0; shutdown FALSE
2005-02-02 12:20:12 CET % LOG:  next transaction ID: 3687; next OID: 27562212
2005-02-02 12:20:12 CET % LOG:  automatic recovery in progress
2005-02-02 12:20:14 CET % LOG:  redo starts at 1/390F5314
2005-02-02 12:20:15 CET % LOG:  restored log file "00000001000000010000003A" from archive
2005-02-02 12:20:16 CET % LOG:  restored log file "00000001000000010000003B" from archive
2005-02-02 12:20:17 CET % LOG:  restored log file "00000001000000010000003C" from archive
2005-02-02 12:20:18 CET % LOG:  restored log file "00000001000000010000003D" from archive
2005-02-02 12:20:21 CET % LOG:  restored log file "00000001000000010000003E" from archive
2005-02-02 12:20:22 CET % LOG:  restored log file "00000001000000010000003F" from archive
cp: Aufruf von stat für »/d1/postgresql8.0/archive/000000010000000100000040« nicht möglich: Datei oder Verzeichnis
nichtgefunden 
2005-02-02 12:20:23 CET % LOG:  record with zero length at 1/407793A4
2005-02-02 12:20:23 CET % LOG:  redo done at 1/40779368
cp: Aufruf von stat für »/d1/postgresql8.0/archive/000000010000000100000040« nicht möglich: Datei oder Verzeichnis
nichtgefunden 
2005-02-02 12:20:23 CET % LOG:  archive recovery complete
2005-02-02 12:20:27 CET % LOG:  database system is ready

(the german error message of cp means `cannot stat file' / `file or directory
not found')

So recovery time still is ~ 10 minutes. This might change if there are much
more WALs to redo but this should only happen if one recovers to a state
significant time after the backup.

The server is a 2 CPU (Xeon 1.8 GHz with Hyperthreading) machine running
linux (RedHat 8.0, Kernel 2.4.18smp) with 1 GB RAM.
The disk is a two disk raid 1.
All data including the dumps were on this disk. Everything was done locally.
postgresql version is 8.0.0.

Of course the test leaves out the time it takes to load the dump from a
tape or any other external storage. Since the online backup has more than
2.5 times the size of the dump this will presumably be slower for the
online backup.
Recovering from a dump has the advantage that one gets fresh tables and
indices (so you save the next vacuuming and reindexing) but I doubt that's
worth an aditional hour offline.

Morus

В списке pgsql-admin по дате отправления:

Предыдущее
От: "Vishal Kashyap @ [SaiHertz]"
Дата:
Сообщение: PostgreSQL Multiple Vulnerabilities
Следующее
От: Pallav Kalva
Дата:
Сообщение: Re: Online Backup and WAL archives