Re: block-level incremental backup

From Stephen Frost
Subject Re: block-level incremental backup
Date
Msg-id 20190420203232.GB6197@tamriel.snowman.net
In reply to Re: block-level incremental backup  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: block-level incremental backup  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Sat, Apr 20, 2019 at 12:19 AM Stephen Frost <sfrost@snowman.net> wrote:
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> > > What I'm NOT willing to
> > > do is build a whole bunch of infrastructure that will help pgbackrest
> > > do amazing things but will not provide a simple and convenient way of
> > > taking incremental backups using only core tools.  I do care about
> > > having something that's good for pgbackrest and other out-of-core
> > > tools.  I just care about it MUCH LESS than I care about making
> > > PostgreSQL core awesome.
> >
> > Then I misunderstood your original proposal where you talked about
> > providing something that the various external tools could use.  If you'd
> > like to *just* provide a mechanism for pg_basebackup to be able to do a
> > trivial incremental backup, great, but it's not going to be useful or
> > used by the external tools, just like the existing base backup protocol
> > isn't used by the external tools because it can't be used in a parallel
> > fashion.
>
> Well, what I meant - and perhaps I wasn't clear enough about this - is
> that it could be used by an external solution for *managing* backups,
> not so much an external engine for *taking* backups.  But actually, I
> really don't see any reason why the latter wouldn't also be possible.
> It was already suggested upthread by Anastasia that there should be a
> way to ask the server to give only the identity of the modified blocks
> without the contents of those blocks; if we provide that, then a tool
> can get those and do whatever it likes with them, including fetching
> them in parallel by some other means.  Another obvious extension would
> be to add a command that says 'give me this file' or 'give me this
> file but only this list of blocks' which would give clients lots of
> options: they could provide their own lists of blocks to fetch
> computed by whatever internal magic they have, or they could request
> the server's modified-block map information first and then schedule
> fetching those blocks in parallel using this new command.  So it seems
> like with some pretty straightforward extensions this can be made
> usable by and valuable to people wanting to build external backup
> engines, too.  I do not necessarily feel obliged to implement every
> feature that might help with that kind of thing just because I've
> expressed an interest in this general area, but I might do some of
> them, and maybe people like you or Anastasia who want to make these
> facilities available to external tools can help with some of the work,
> too.

Yes, if we spend a bit of time thinking about how this could be
implemented in a way that could be used by multiple connections
concurrently then we could provide something that both pg_basebackup and
the external tools could use.  Getting a list first and then supporting
a 'give me this file' API, or 'give me these blocks from this file'
would be very similar to what many of the external tools today.  I agree
that I don't think it'd be hard to do.  I'm suggesting that we do that
instead of, at a protocol level, something similar to what was done with
pg_basebackup which prevents that.
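To make the concurrency argument concrete, here's a small sketch of how a client could drive a list-then-fetch protocol in parallel.  The command names and manifest shape are my own invention for illustration, not an existing API; the point is just that once "give me a list" and "give me this file" are separate requests, each fetch can run on its own connection:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical server state, simulated locally: relation file -> number
# of 8kB blocks.  In a real protocol these would be server responses.
MANIFEST = {
    "base/16384/2619": 4,
    "base/16384/2620": 2,
    "base/16384/1259": 3,
}

def list_files():
    # Stands in for a hypothetical "give me a list of files" command.
    return list(MANIFEST.items())

def fetch_file(path):
    # Stands in for "give me this file".  Because each call is an
    # independent request, each worker could use its own connection --
    # which is exactly what makes the protocol parallelizable.
    return [(path, blkno) for blkno in range(MANIFEST[path])]

def parallel_backup(workers=3):
    files = list_files()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda f: fetch_file(f[0]), files)
    # Flatten the per-file results into one list of fetched blocks.
    return [blk for blocks in results for blk in blocks]

blocks = parallel_backup()
```

A single tar stream, by contrast, has no seam at which a second connection could pick up work, which is the limitation being described above.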

I don't agree that implementing "give me a list of files" and "give me
this file" is somehow an 'extension' to the tar-based approach that
pg_basebackup uses today; it's really a rather different thing, and I
mention that as a parallel (hah!) to what we're discussing here
regarding the incremental backup approach.

Having been around for a while working on backup-related things, if I
was to implement the protocol for pg_basebackup today, I'd definitely
implement "give me a list" and "give me this file" rather than the
tar-based approach, because I've learned that people want to be
able to do parallel backups and that's a decent way to do that.  I
wouldn't set out and implement something new that there's just no hope
of making parallel.  Maybe the first cut of pg_basebackup would still
be simple and serial, since it's certainly more work to make a frontend
tool like that work in parallel, but at least the protocol would be
ready to support a parallel option being added later without being
rewritten.

And that's really what I was trying to get at here- if we've got the
choice now to decide what this is going to look like from a protocol
level, it'd be great if we could make it able to support being used in a
parallel fashion, even if pg_basebackup is still single-threaded.

> That being said, as long as there is significant demand for
> value-added backup features over and above what is in core, there are
> probably going to be non-core backup tools that do things their own
> way instead of just leaning on whatever the server provides natively.
> In a certain sense that's regrettable, because it means that somebody
> - or perhaps multiple somebodys - goes to the trouble of doing
> something outside core and then somebody else puts something in core
> that obsoletes it and therein lies duplication of effort.  On the
> other hand, it also allows people to innovate way faster than can be
> done in core, it allows competition among different possible designs,
> and it's just kinda the way we roll around here.  I can't get very
> worked up about it.

Yes, that's largely the tack we've taken with it- build something
outside of core, where we can move a lot faster with the implementation
and innovate quickly, until we reach a stable system that's portable
and written in a language compatible with what's in core today.  I don't have any
problem with new things going into core, in fact, I'm all for it, but if
someone asks me "I'd like to do this thing in core and I'd like it to be
useful for external tools" then I'll do my best to share my experiences
with what's been done in core vs. what's been done in this space outside
of core, what some of the lessons learned from that have been, and ways that
we could at least try to make it so that external tools will be able to
use whatever is implemented in core.

> One thing I'm definitely not going to do here is abandon my goal of
> producing a *simple* incremental backup solution that can be deployed
> *easily* by users. I understand from your remarks that such a solution
> will not suit everybody.  However, unlike you, I do not believe that
> pg_basebackup was a failure.  I certainly agree that it has some
> limitations that mean that it is hard to use in large deployments, but
> it's also *extremely* convenient for people with a fairly small
> database when they just need a quick and easy backup.  Adding some
> more features to it - such as incremental backup - will make it useful
> to more people in more cases.  There will doubtless still be people
> who need more, and that's OK: those people can use a third-party tool.
> I will not get anywhere trying to solve every problem at once.

I don't get this at all.  What I've really been focused on has been the
protocol-level questions of what this is going to look like, because
that's what I see the external tools potentially using.  pg_basebackup
itself could remain single-threaded and could provide exactly the same
interface, no matter if the protocol is "give me all the blocks across
the entire cluster as a single compressed stream" or the protocol is
"give me a list of files that changed" and "give me a list of these
blocks in this file" or even "give me all the blocks that changed in
this file".

I also don't think pg_basebackup is a failure, and I didn't mean to
imply that, and I'm sorry for some of the hyperbole which led to that
impression coming across.  pg_basebackup is great, for what it is, and I
regularly recommend it in certain use-cases as being a simple tool that
does one thing and does it pretty well, for smaller clusters.  The
protocol it uses is unfortunately only useful in a single-threaded
manner though and it'd be great if we could avoid implementing similar
things in the protocol in the future.

Thanks,

Stephen

