[RFC] Incremental backup v3: incremental PoC

Hi Hackers,

following the advice gathered on the list, I've prepared a third partial
patch on the way to implementing incremental pg_basebackup as described
here: https://wiki.postgresql.org/wiki/Incremental_backup


== Changes

Compared to the previous version I've made the following changes:

* The backup_profile is not optional anymore. Generating it is cheap
enough not to bother the user with such a choice.

* I've isolated the code which detects the maxLSN of a segment into a
separate getMaxLSN function. At the moment it works by scanning the whole
file, but I'm looking to replace it in the next versions (a rough sketch
of such a scan is shown after this list).

* I've made it possible to request an incremental backup by passing a "-I
<LSN>" option to pg_basebackup. It is probably too "raw" to remain as
is, but it is useful at this stage to test the code.

* I've modified the backup label to report the fact that the backup was
taken with the incremental option. The result will be something like:

START WAL LOCATION: 0/52000028 (file 000000010000000000000052)
CHECKPOINT LOCATION: 0/52000060
INCREMENTAL FROM LOCATION: 0/51000028
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2014-10-14 16:05:04 CEST
LABEL: pg_basebackup base backup
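
For illustration only, here is a minimal sketch of the kind of whole-file
scan the getMaxLSN item above refers to. The function name, signature and
error handling are made up and are not the patch's actual code; BLCKSZ,
Page, PageGetLSN, XLogRecPtr and InvalidXLogRecPtr are PostgreSQL's own
definitions.

----
#include "postgres.h"

#include "access/xlogdefs.h"
#include "storage/bufpage.h"

/*
 * Hypothetical sketch: scan one relation segment block by block and
 * return the highest page LSN it contains.
 */
static XLogRecPtr
scan_segment_max_lsn(const char *path)
{
	char		buf[BLCKSZ];
	XLogRecPtr	maxlsn = InvalidXLogRecPtr;
	FILE	   *fp = fopen(path, "rb");

	if (fp == NULL)
		return InvalidXLogRecPtr;

	/* Read one BLCKSZ page at a time and keep the largest pd_lsn seen. */
	while (fread(buf, 1, BLCKSZ, fp) == BLCKSZ)
	{
		XLogRecPtr	lsn = PageGetLSN((Page) buf);

		if (lsn > maxlsn)
			maxlsn = lsn;
	}

	fclose(fp);
	return maxlsn;
}
----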


== Testing it

At this stage you can make an incremental file-level backup using this
procedure:

pg_basebackup -v -F p -D /tmp/x -x
LSN=$(awk '/^START WAL/{print $4}' /tmp/x/backup_profile)
pg_basebackup -v -F p -D /tmp/y -I $LSN -x

The result will be an incremental backup in /tmp/y based on the full
backup in /tmp/x.

You can "reintegrate" the incremental backup in the /tmp/z directory
with the following little python script, calling it as

./recover.py /tmp/x /tmp/y /tmp/z

----
#!/usr/bin/env python
# recover.py

import os
import shutil
import sys

if len(sys.argv) != 4:
    print >> sys.stderr, "usage: %s base incremental destination" % sys.argv[0]
    sys.exit(1)

base = sys.argv[1]
incr = sys.argv[2]
dest = sys.argv[3]

if os.path.exists(dest):
    print >> sys.stderr, "error: destination must not exist (%s)" % dest
    sys.exit(1)

profile = open(os.path.join(incr, 'backup_profile'), 'r')

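# Skip the profile header; the list of files starts after the
# 'FILE LIST' marker line.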
for line in profile:
    if line.strip() == 'FILE LIST':
        break

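# Start from a verbatim copy of the incremental backup.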
shutil.copytree(incr, dest)
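# For every profile entry that was not shipped in the incremental backup
# (and that has a recorded LSN), take the file from the base backup.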
for line in profile:
    tblspc, lsn, sent, date, size, path = line.strip().split('\t')
    if sent == 't' or lsn == '\\N':
        continue
    base_file = os.path.join(base, path)
    dest_file = os.path.join(dest, path)
    shutil.copy2(base_file, dest_file)
----

It obviously has to be replaced by a full-fledged user tool, but it is
enough to test the concept.

== What next

I would like to replace the getMaxLSN function with a more-or-less
persistent structure which contains the maxLSN for each data segment.

To make it work I would hook into the ForwardFsyncRequest() function in
src/backend/postmaster/checkpointer.c and update an in-memory hash every
time a block is going to be fsynced. The structure could be persisted to
disk at some point (probably at checkpoint time).

I think a good key for the hash would be a BufferTag with blocknum
"rounded" to the start of the segment.

I'm asking here for comments and advice on how to implement it in an
acceptable way.

== Disclaimer

The code here is an intermediate step: it does not contain any
documentation besides the code comments, and it will be subject to deep
and radical changes. However, I believe it can be a base on which
PostgreSQL can get file-based incremental backup, and block-based
incremental backup after it.

Regards,
Marco

--
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciarini@2ndQuadrant.it | www.2ndQuadrant.it

