Re: block-level incremental backup

From: Robert Haas
Subject: Re: block-level incremental backup
Date:
Msg-id: CA+Tgmoa1SeWgzE1G9nRHuh7=H=ryGS3=bhGximhdOwPaF7AWwg@mail.gmail.com
In reply to: Re: block-level incremental backup (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, Apr 9, 2019 at 12:35 PM Andres Freund <andres@anarazel.de> wrote:
> Hm. But that means that files that are shipped nearly in their entirety,
> need to be fully rewritten. Wonder if it's better to ship them as files
> with holes, and have the metadata in a separate file. That'd then allow
> to just fill in the holes with data from the older version.  I'd assume
> that there's a lot of workloads where some significantly sized relations
> will get updated in nearly their entirety between backups.

I don't want to rely on holes at the FS level.  I don't want to have
to worry about what Windows does and what every Linux filesystem does
and what NetBSD and FreeBSD and Dragonfly BSD and MacOS do.  And I
don't want to have to write documentation for the fine manual
explaining to people that they need to use a hole-preserving tool when
they copy an incremental backup around.  And I don't want to have to
listen to complaints from $USER that their backup tool, $THING, is not
hole-aware.  Just - no.

But what we could do is have some threshold (as git does), beyond
which you just send the whole file.  For example, if >90% of the
blocks have changed, or >80% or whatever, then you just send
everything.  That way, if you have a database with lots and lots of
1GB segments with low churn (so that you can't just use full backups)
and lots and lots of 1GB segments with high churn (creating the
problem you're describing), you'll still be OK.
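
To make that concrete, here is a minimal C sketch of the cutoff
decision, given a bitmap of modified blocks.  The names and constants
(choose_send_mode, CHANGED_BLOCK_THRESHOLD, and so on) are purely
illustrative assumptions, not anything from an actual patch:

    /*
     * Illustrative sketch only: decide whether to ship a segment whole
     * or block-by-block.  All identifiers here are hypothetical.
     */
    #include <stdbool.h>

    #define BLOCKS_PER_SEGMENT      131072  /* 1GB segment / 8kB blocks */
    #define CHANGED_BLOCK_THRESHOLD 0.90    /* git-style cutoff, tunable */

    typedef enum { SEND_WHOLE_FILE, SEND_CHANGED_BLOCKS } SendMode;

    static SendMode
    choose_send_mode(const bool *block_changed, int nblocks)
    {
        int     nchanged = 0;

        for (int i = 0; i < nblocks; i++)
            if (block_changed[i])
                nchanged++;

        /*
         * Past the threshold, per-block bookkeeping costs more than it
         * saves, so just send the whole file.
         */
        if ((double) nchanged / nblocks > CHANGED_BLOCK_THRESHOLD)
            return SEND_WHOLE_FILE;

        return SEND_CHANGED_BLOCKS;
    }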

> > 3. There should be a new tool that knows how to merge a full backup
> > with any number of incremental backups and produce a complete data
> > directory with no remaining partial files.
>
> Could just be part of server startup?

Yes, but I think that sucks.  You might not want to start the server
but rather just create a new synthetic backup.  And realistically,
it's hard to imagine the server doing anything but synthesizing the
backup first and then proceeding as normal.  In theory there's no
reason why it couldn't be smart enough to construct the files it needs
"on demand" in the background, but that sounds really hard and I don't
think there's enough value to justify that level of effort.  YMMV, of
course.
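
For illustration, the core of such a merge tool might look roughly
like the sketch below, which overlays an incremental file's changed
blocks onto a copy of the full backup's segment.  The partial-file
layout assumed here (a uint32 block count followed by (blkno, 8kB
block) records) is entirely hypothetical:

    /*
     * Illustrative sketch of a backup-merge loop: start from the full
     * backup's copy of a segment and overlay each newer incremental's
     * changed blocks.  The on-disk format assumed here is hypothetical.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    #define BLCKSZ 8192

    static void
    apply_partial_file(FILE *segment, FILE *partial)
    {
        uint32_t    nblocks;
        char        buf[BLCKSZ];

        if (fread(&nblocks, sizeof(nblocks), 1, partial) != 1)
            exit(1);

        for (uint32_t i = 0; i < nblocks; i++)
        {
            uint32_t    blkno;

            if (fread(&blkno, sizeof(blkno), 1, partial) != 1 ||
                fread(buf, BLCKSZ, 1, partial) != 1)
                exit(1);

            /* Overwrite the stale block in the merged copy in place. */
            if (fseeko(segment, (off_t) blkno * BLCKSZ, SEEK_SET) != 0 ||
                fwrite(buf, BLCKSZ, 1, segment) != 1)
                exit(1);
        }
    }

Applying the incrementals in chronological order means the newest copy
of each block is the one that ends up in the synthesized directory.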

> > - I imagine that the server would offer this functionality through a
> > new replication command or a syntax extension to an existing command,
> > so it could also be used by tools other than pg_basebackup if they
> > wished.
>
> Would this logic somehow be usable from tools that don't want to copy
> the data directory via pg_basebackup (e.g. for parallelism, to directly
> send to some backup service / SAN / whatnot)?

Well, I'm imagining it as a piece of server-side functionality that
can figure out what has changed using one of several possible methods,
and then send that stuff to you.  So I think if you don't have a
server connection you are out of luck.  If you have a server
connection but just want to be told what has changed rather than
actually being given that data, that might be something that could be
worked into the design.  I'm not sure whether that's a real need,
though, or just extra work.
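
If a "report only" mode did get worked into the design, the
server-side entry point might distinguish the two kinds of request
along these lines.  This is purely a hypothetical interface sketch,
not proposed protocol syntax:

    /*
     * Hypothetical interface sketch: the same change-detection pass
     * either streams the changed blocks or just reports their numbers,
     * depending on what the client asked for.
     */
    #include <stdint.h>

    typedef enum
    {
        INCR_SEND_DATA,     /* ship the changed blocks themselves */
        INCR_LIST_ONLY      /* just tell the client what changed */
    } IncrRequestMode;

    typedef struct IncrRequest
    {
        uint64_t        since_lsn;  /* send changes after this point */
        IncrRequestMode mode;
    } IncrRequest;

    /*
     * Callback invoked once per modified block, however the server
     * finds them (WAL scan, per-block LSN check, ...).  In list-only
     * mode, data would be NULL.
     */
    typedef void (*changed_block_cb) (uint32_t blkno, const char *data,
                                      void *arg);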

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


