Discussion: where should I stick that backup?
There are a couple of things that pg_basebackup can't do that might be an issue for some users. One of them is that you might want to do something like encrypt your backup. Another is that you might want to store it someplace other than in the filesystem, like maybe S3. We could certainly teach pg_basebackup how to do specifically those things, and maybe that is worthwhile. However, I wonder if it would be useful to provide a more general capability, either instead of or in addition to doing those more specific things.

What I'm thinking about is: suppose we add an option to pg_basebackup with a name like --pipe-output. This would be mutually exclusive with -D, but would work at least with -Ft and maybe also with -Fp. The argument to --pipe-output would be a shell command to be executed once per output file. Any instance of %f in the shell command would be replaced with the name of the file that would have been written (and %% would turn into a single %). The shell command itself would be executed via system(). So if you want to compress, but using some other compression program instead of gzip, you could do something like:

pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'

And if you want to encrypt, you could do something like:

pg_basebackup -Ft --pipe-output 'gpg -e -o %f.gpg'

And if you want to ship it off to be stored in a concrete bunker deep underground, you can just do something like:

pg_basebackup -Ft --pipe-output 'send-to-underground-storage.sh backup-2020-04-03 %f'

You still have to write send-to-underground-storage.sh, of course, and that may involve some work, and maybe also some expensive construction. But what you don't have to do is first copy the entire backup to your local filesystem and then, as a second step, figure out how to put it through whatever post-processing it needs. Instead, you can simply take your backup and stick it anywhere you like.

Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Apr 03, 2020 at 10:19:21AM -0400, Robert Haas wrote:
> What I'm thinking about is: suppose we add an option to pg_basebackup
> with a name like --pipe-output. This would be mutually exclusive with
> -D, but would work at least with -Ft and maybe also with -Fp. The
> argument to --pipe-output would be a shell command to be executed once
> per output file. Any instance of %f in the shell command would be
> replaced with the name of the file that would have been written (and
> %% would turn into a single %). The shell command itself would be
> executed via system(). So if you want to compress, but using some
> other compression program instead of gzip, you could do something
> like:
>
> pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'

Seems good to me. I agree -Fp is a "maybe" since the overhead will be high
for small files.
Greetings,

* Noah Misch (noah@leadboat.com) wrote:
> On Fri, Apr 03, 2020 at 10:19:21AM -0400, Robert Haas wrote:
> > What I'm thinking about is: suppose we add an option to pg_basebackup
> > with a name like --pipe-output. This would be mutually exclusive with
> > -D, but would work at least with -Ft and maybe also with -Fp. The
> > argument to --pipe-output would be a shell command to be executed once
> > per output file. Any instance of %f in the shell command would be
> > replaced with the name of the file that would have been written (and
> > %% would turn into a single %). The shell command itself would be
> > executed via system(). So if you want to compress, but using some
> > other compression program instead of gzip, you could do something
> > like:
> >
> > pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'
>
> Seems good to me. I agree -Fp is a "maybe" since the overhead will be high
> for small files.

For my 2c, at least, introducing more shell commands into critical parts of the system is absolutely the wrong direction to go in. archive_command continues to be a mess that we refuse to clean up or even properly document, and the project would be much better off trying to eliminate it rather than adding new ways for users to end up with bad or invalid backups.

Further, having a generic shell script approach like this would result in things like "well, we don't need to actually add support for X, Y or Z, because we have this wonderful generic shell script thing and you can write your own, and therefore we won't accept patches which do add those capabilities because then we'd have to actually maintain that support."

In short, -1 from me.

Thanks,

Stephen
On Mon, Apr 6, 2020 at 10:45 AM Stephen Frost <sfrost@snowman.net> wrote:
> For my 2c, at least, introducing more shell commands into critical parts
> of the system is absolutely the wrong direction to go in.
> archive_command continues to be a mess that we refuse to clean up or
> even properly document and the project would be much better off by
> trying to eliminate it rather than add in new ways for users to end up
> with bad or invalid backups.
>
> Further, having a generic shell script approach like this would result
> in things like "well, we don't need to actually add support for X, Y or
> Z, because we have this wonderful generic shell script thing and you can
> write your own, and therefore we won't accept patches which do add those
> capabilities because then we'd have to actually maintain that support."
>
> In short, -1 from me.

I'm not sure that there's any point in responding to this, because I believe that the wording of this email suggests that you've made up your mind that it's bad and that position is not subject to change no matter what anyone else may say. However, I'm going to try to reply anyway, on the theory that (1) I might be wrong and (2) even if I'm right, it might influence the opinions of others who have not spoken yet, and whose opinions may be less settled.

First of all, while I agree that archive_command has some problems, I don't think that means that every case where we use a shell command for anything is a hopeless mess. The only problem I really see in this case is that if you route to a local file via an intermediate program you wouldn't get an fsync() any more. But we could probably figure out some clever things to work around that problem, if that's the issue. If there's some other problem, what is it?

Second, PostgreSQL is not realistically going to link pg_basebackup against every compression, encryption, and remote storage library out there. One, yeah, we don't want to maintain that. Two, we don't want PostgreSQL to have build-time dependencies on a dozen or more libraries that people might want to use for stuff like this. We might well want to incorporate support for a few of the more popular things in this area, but people will always want support for newer things than what existing server releases feature, and for more of them.

Third, I am getting pretty tired of being told, every time I try to do something that is related in any way to backup, that it's wrong. If your experience with pgbackrest motivated you to propose ways of improving backup and restore functionality in the community, that would be great. But in my experience so far, it seems to mostly involve making a lot of negative comments that make it hard to get anything done. I would appreciate it if you would adopt a more constructive tone.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 6, 2020 at 4:45 PM Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Noah Misch (noah@leadboat.com) wrote:
> > On Fri, Apr 03, 2020 at 10:19:21AM -0400, Robert Haas wrote:
> > > What I'm thinking about is: suppose we add an option to pg_basebackup
> > > with a name like --pipe-output. This would be mutually exclusive with
> > > -D, but would work at least with -Ft and maybe also with -Fp. The
> > > argument to --pipe-output would be a shell command to be executed once
> > > per output file. Any instance of %f in the shell command would be
> > > replaced with the name of the file that would have been written (and
> > > %% would turn into a single %). The shell command itself would be
> > > executed via system(). So if you want to compress, but using some
> > > other compression program instead of gzip, you could do something
> > > like:
> > >
> > > pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'
> >
> > Seems good to me. I agree -Fp is a "maybe" since the overhead will be high
> > for small files.
>
> For my 2c, at least, introducing more shell commands into critical parts
> of the system is absolutely the wrong direction to go in.
> archive_command continues to be a mess that we refuse to clean up or
> even properly document and the project would be much better off by
> trying to eliminate it rather than add in new ways for users to end up
> with bad or invalid backups.

I think the bigger problem with archive_command comes more from how it's defined to work, tbh, which leaves a lot of things open.

This sounds to me like a much narrower use-case, which makes it a lot more OK. But I agree we have to be careful not to get back into that whole mess. One thing would be to clearly document such things *from the beginning*, and not try to retrofit it years later like we ended up doing with archive_command.

And as Robert mentions downthread, the fsync() issue is definitely a real one, but if that is documented clearly ahead of time, that's a reasonable foot-gun, I'd say.

> Further, having a generic shell script approach like this would result
> in things like "well, we don't need to actually add support for X, Y or
> Z, because we have this wonderful generic shell script thing and you can
> write your own, and therefore we won't accept patches which do add those
> capabilities because then we'd have to actually maintain that support."

In principle, I agree with "shellscripts suck".

Now, if we were just talking about compression, it would actually be interesting to implement some sort of "postgres compression API", if you will, implemented by a shared library. This library could then be used from pg_basebackup or from anything else that needs compression. And anybody who wants could then do a "<compression X> for PostgreSQL" module, removing the need for us to carry such code upstream.

There have been discussions of that for the backend before, IIRC, but I don't recall the conclusions. And in particular, I don't recall if they included the idea of being able to use it in situations like this as well, and with *run-time loading*.

And that said, then we'd limit ourselves to compression. We'd still need a way to deal with encryption...

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 6, 2020 at 10:45 AM Stephen Frost <sfrost@snowman.net> wrote:
> > For my 2c, at least, introducing more shell commands into critical parts
> > of the system is absolutely the wrong direction to go in.
> > archive_command continues to be a mess that we refuse to clean up or
> > even properly document and the project would be much better off by
> > trying to eliminate it rather than add in new ways for users to end up
> > with bad or invalid backups.
> >
> > Further, having a generic shell script approach like this would result
> > in things like "well, we don't need to actually add support for X, Y or
> > Z, because we have this wonderful generic shell script thing and you can
> > write your own, and therefore we won't accept patches which do add those
> > capabilities because then we'd have to actually maintain that support."
> >
> > In short, -1 from me.
>
> I'm not sure that there's any point in responding to this because I
> believe that the wording of this email suggests that you've made up
> your mind that it's bad and that position is not subject to change no
> matter what anyone else may say. However, I'm going to try to reply
> anyway, on the theory that (1) I might be wrong and (2) even if I'm
> right, it might influence the opinions of others who have not spoken
> yet, and whose opinions may be less settled.

Chances certainly aren't good that you'll convince me that putting more absolutely critical-to-get-perfect shell scripts into the backup path is a good idea.

> First of all, while I agree that archive_command has some problems, I
> don't think that means that every case where we use a shell command
> for anything is a hopeless mess. The only problem I really see in this
> case is that if you route to a local file via an intermediate program
> you wouldn't get an fsync() any more. But we could probably figure out
> some clever things to work around that problem, if that's the issue.
> If there's some other problem, what is it?

We certainly haven't solved the issues with archive_command (at least, not in core), so this "well, maybe we could fix all the issues" claim really doesn't hold any water. Having commands like this ends up just punting on the whole problem and saying "here, user, you deal with it." *Maybe* if we *also* wrote dedicated tools to be used with these commands (as has been proposed multiple times with archive_command, but hasn't actually happened, at least not in core), we could build something where this would work reasonably well and it'd be alright, but that wasn't what seemed to be suggested here; and if we're going to write all that code anyway, it doesn't really seem like a shell interface is the best one to go with.

There's also been something of an expectation that if we're going to provide an interface then we should have an example of something that uses it, but when it comes to something like archive_command, the example we came up with was terrible, and yet it's still in our documentation and is commonly used, much to the disservice of our users. Sure, we can point at our users and say "well, that's not how you should actually use that feature, you should do all this other stuff in that command," punting on this and pushing it back on our users, telling them that they're using the interface we provide wrong; but the only folks who can possibly like that answer are ourselves. Our users aren't happy with it, because they're left with a broken backup that they can't restore from when they need it. That your initial email had more-or-less the exact same kind of "example" certainly doesn't inspire confidence that this would end up being used sensibly by our users.

Yes, fsync() is part of the issue, but it's not the only one: retry logic, and making sure the results are correct, is pretty darn important too, especially with things like s3. (Even dedicated tools have issues in this area. I just saw a report about wal-g failing to archive a WAL file properly because there was an error which resulted in a 0-byte WAL file being stored; wal-g did properly retry, but then it saw the file was there, figured "all is well", and returned success even though the file was 0-byte in s3.) I don't doubt that David could point out a few other issues; he routinely does whenever I chat with him about various ideas I've got.

So, instead of talking about 'bzip2 > %f.bz2', and then writing into our documentation that that's how this feature can be used, what about proposing something that would actually work reliably with this interface? Something which properly fsync's everything, has good retry logic for when failures happen, is able to actually detect when a failure happened, and documents how to restore from a backup taken this way. It'd probably be good to show how pg_verifybackup could be used to make sure the backup is actually correct and valid too.

> Second, PostgreSQL is not realistically going to link pg_basebackup
> against every compression, encryption, and remote storage library out
> there. One, yeah, we don't want to maintain that. Two, we don't want
> PostgreSQL to have build-time dependencies on a dozen or more
> libraries that people might want to use for stuff like this. We might
> well want to incorporate support for a few of the more popular things
> in this area, but people will always want support for newer things
> than what existing server releases feature, and for more of them.

We don't need to link to 'every compression, encryption and remote storage library out there'. In some cases, yes, it makes sense to use an existing library (OpenSSL, zlib, lz4), but in many other cases it makes more sense to build support directly into the system (s3, GCS, probably others) because a good library doesn't exist. It'd also be good to build a nicely extensible system which people can add to, to support other storage or compression options, but I don't think that's reasonable to do with a shell-script based interface. Maybe with shared libraries, as Magnus suggests elsewhere, but even there I have some doubts.

> Third, I am getting pretty tired of being told every time I try to do
> something that is related in any way to backup that it's wrong. If
> your experience with pgbackrest motivated you to propose ways of
> improving backup and restore functionality in the community, that
> would be great. But in my experience so far, it seems to mostly
> involve making a lot of negative comments that make it hard to get
> anything done. I would appreciate it if you would adopt a more
> constructive tone.

pgbackrest is how we're working to improve backup and restore functionality in the community, and we've come a long way and gone through a great deal of fire getting there. I appreciate that it's not in core, and I'd love to discuss how we can change that, but it's absolutely a part of the PG community and ecosystem, with changes being made in core routinely which improve the in-core tools as well as pgbackrest, by the authors contributing back.

As far as my tone goes, I'm afraid that's simply coming from having dealt with and discussed many of these, well, shortcuts to trying to improve backup and recovery. Did David and I discuss using s3cmd? Of course. Did we research various s3 libraries? HTTP libraries? SSL libraries? Compression libraries? Absolutely, which is why we ended up using OpenSSL (PG links to it already, so if you're happy enough with PG's SSL then you'll probably accept pgbackrest using the same one; and yes, we've talked about supporting others as PG is moving in that direction too), and zlib (same reasons). We've now added lz4 (after researching it and deciding it was pretty reasonable to include), but when it came to dealing with s3, we wrote our own HTTP and s3 code; none of the existing libraries were a great answer, and trying to make it work with s3cmd was, well, about like saying that you should just use CSV files and forget about this whole database thing. We're very likely to write our own code for GCS too, but we already have the HTTP code, which means it's not actually all that heavy a lift to do.

I'm not against trying to improve the situation in core, and I've even talked about and tried to give feedback about what would make the most sense for that to look like, but I feel like every time I do, there's a bunch of push-back that I want it to look like pgbackrest, or that I'm being negative about things that don't look like pgbackrest. Guess what? Yes, I do think it should look like pgbackrest, but that's not because I have some not-invented-here-syndrome issue; it's because we've been through this, have learned a great deal, and have taken what we've learned and worked to build the best tool we can, much the way the PG community works to make the best database we can. Yes, we were able to argue and make it clear that a manifest really did make sense, and even that it should be in JSON format, and then argue that checking WAL was a pretty important part of verifying any backup, but each and every one of these ends up being a long and drawn-out argument, and it's draining. The thing is, this stuff isn't new to us.

Thanks,

Stephen
Greetings,

* Magnus Hagander (magnus@hagander.net) wrote:
> On Mon, Apr 6, 2020 at 4:45 PM Stephen Frost <sfrost@snowman.net> wrote:
> > * Noah Misch (noah@leadboat.com) wrote:
> > > On Fri, Apr 03, 2020 at 10:19:21AM -0400, Robert Haas wrote:
> > > > What I'm thinking about is: suppose we add an option to pg_basebackup
> > > > with a name like --pipe-output. This would be mutually exclusive with
> > > > -D, but would work at least with -Ft and maybe also with -Fp. The
> > > > argument to --pipe-output would be a shell command to be executed once
> > > > per output file. Any instance of %f in the shell command would be
> > > > replaced with the name of the file that would have been written (and
> > > > %% would turn into a single %). The shell command itself would be
> > > > executed via system(). So if you want to compress, but using some
> > > > other compression program instead of gzip, you could do something
> > > > like:
> > > >
> > > > pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'
> > >
> > > Seems good to me. I agree -Fp is a "maybe" since the overhead will be high
> > > for small files.
> >
> > For my 2c, at least, introducing more shell commands into critical parts
> > of the system is absolutely the wrong direction to go in.
> > archive_command continues to be a mess that we refuse to clean up or
> > even properly document and the project would be much better off by
> > trying to eliminate it rather than add in new ways for users to end up
> > with bad or invalid backups.
>
> I think the bigger problem with archive_command comes more from how
> it's defined to work, tbh, which leaves a lot of things open.
>
> This sounds to me like a much narrower use-case, which makes it a lot
> more OK. But I agree we have to be careful not to get back into that
> whole mess. One thing would be to clearly document such things *from
> the beginning*, and not try to retrofit it years later like we ended
> up doing with archive_command.

This sounds like a much broader use-case to me, not a narrower one. I agree that we don't want to try to retrofit things years later.

> And as Robert mentions downthread, the fsync() issue is definitely a
> real one, but if that is documented clearly ahead of time, that's a
> reasonable foot-gun, I'd say.

Documented how..?

> > Further, having a generic shell script approach like this would result
> > in things like "well, we don't need to actually add support for X, Y or
> > Z, because we have this wonderful generic shell script thing and you can
> > write your own, and therefore we won't accept patches which do add those
> > capabilities because then we'd have to actually maintain that support."
>
> In principle, I agree with "shellscripts suck".
>
> Now, if we were just talking about compression, it would actually be
> interesting to implement some sort of "postgres compression API" if
> you will, that is implemented by a shared library. This library could
> then be used from pg_basebackup or from anything else that needs
> compression. And anybody who wants could then do a "<compression X>
> for PostgreSQL" module, removing the need for us to carry such code
> upstream.

Getting a bit off-track here, but I actually think we should absolutely figure out a way to support custom compression options in PG. I had been thinking of something along the lines of per-datatype, actually, where each data type could define its own compression method, since we know that different data has different characteristics and therefore might benefit from different ways of compressing it. Though it's also true that, generically, there are tradeoffs between CPU time, memory size, resulting size on disk, etc., and having ways to pick between those could also be interesting.

> There's been discussions of that for the backend before IIRC, but I
> don't recall the conclusions. And in particular, I don't recall if it
> included the idea of being able to use it in situations like this as
> well, and with *run-time loading*.

Run-time loading brings in the fun that maybe we aren't able to load the library when we need to, and what then? :)

> And that said, then we'd limit ourselves to compression. We'd still
> need a way to deal with encryption...

And shipping stuff off to some remote server too, at least if we're going to tell users that they can use this approach to send their backups to s3... (And that reminds me: there are other things to think about there too, like maybe you don't want to ship 0-byte files to s3, or maybe you don't want to ship tiny files, because there are costs associated with these things...)

Thanks,

Stephen
On Mon, Apr 6, 2020 at 2:23 PM Stephen Frost <sfrost@snowman.net> wrote:
> So, instead of talking about 'bzip2 > %f.bz2', and then writing into our
> documentation that that's how this feature can be used, what about
> proposing something that would actually work reliably with this
> interface? Which properly fsync's everything, has good retry logic for
> when failures happen, is able to actually detect when a failure
> happened, how to restore from a backup taken this way, and it'd probably
> be good to show how pg_verifybackup could be used to make sure the
> backup is actually correct and valid too.

I don't really understand the problem here. Suppose I do:

mkdir ~/my-brand-new-empty-directory
cd ~/my-brand-new-empty-directory
pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'
initdb -S --dont-expect-that-this-is-a-data-directory . # because right now it would complain about pg_wal and pg_tblspc being missing

I think if all that works, my backup should be good and durably on disk. If it's not, then either pg_basebackup or bzip2 or initdb didn't report errors that they should have reported. If you're worried about that, say because you suspect those programs are buggy or because you think the kernel may not be reporting errors properly, you can use tar -jxvf + pg_validatebackup to check.

What *exactly* do you think can go wrong here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 6, 2020 at 1:32 PM Magnus Hagander <magnus@hagander.net> wrote:
> Now, if we were just talking about compression, it would actually be
> interesting to implement some sort of "postgres compression API" if
> you will, that is implemented by a shared library. This library could
> then be used from pg_basebackup or from anything else that needs
> compression. And anybody who wants could then do a "<compression X>
> for PostgreSQL" module, removing the need for us to carry such code
> upstream.

I think it could be more general than a compression library. It could be a store-my-stuff-and-give-it-back-to-me library, which might do compression or encryption or cloud storage or any combination of the three, and probably other stuff too. Imagine that you first call an init function with a namespace that is basically a string provided by the user. Then you open a file, either for read or for write (but not both). Then you read or write a series of chunks (depending on the file mode). Then you close the file. Then you can do the same with more files. Finally, at the end, you close the namespace. You don't really need to care where or how the functions you are calling store the data. You just need them to return proper error indicators if by chance they fail.

As compared with my previous proposal, this would work much better for pg_basebackup -Fp, because you wouldn't launch a new bzip2 process for every file. You'd just bzopen(), which is presumably quite lightweight by comparison. The reasons I didn't propose it are:

1. Running bzip2 on every file in a plain-format backup seems a lot sillier than running it on every tar file in a tar-format backup.
2. I'm not confident that the command specified here actually needs to be anything very complicated (unlike archive_command).
3. The barrier to entry for a loadable module is a lot higher than for a shell command.
4. I think that all of our existing infrastructure for loadable modules is backend-only.

Now all of these are up for discussion. I am sure we can make the loadable module stuff work in frontend code; it would just take some work. A C interface for extensibility is very significantly harder to use than a shell interface, but it's still way better than no interface. The idea that this shell command can be something simple is my current belief, but it may turn out to be wrong. And I'm sure somebody can propose a good reason to do something with every file in a plain-format backup rather than using tar format.

All that being said, I still find it hard to believe that we will want to add dependencies on the libraries we'd need for encryption or S3 cloud storage to PostgreSQL itself. So if we go with this more integrated approach, we should consider the possibility that, when the dust settles, PostgreSQL will only have pg_basebackup --output-plugin=lz4 and Aurora will also have pg_basebackup --output-plugin=s3. From my point of view, that would be less than ideal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 6, 2020 at 2:23 PM Stephen Frost <sfrost@snowman.net> wrote:
> > So, instead of talking about 'bzip2 > %f.bz2', and then writing into our
> > documentation that that's how this feature can be used, what about
> > proposing something that would actually work reliably with this
> > interface? Which properly fsync's everything, has good retry logic for
> > when failures happen, is able to actually detect when a failure
> > happened, how to restore from a backup taken this way, and it'd probably
> > be good to show how pg_verifybackup could be used to make sure the
> > backup is actually correct and valid too.
>
> I don't really understand the problem here. Suppose I do:
>
> mkdir ~/my-brand-new-empty-directory
> cd ~/my-brand-new-empty-directory
> pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'
> initdb -S --dont-expect-that-this-is-a-data-directory . # because
> right now it would complain about pg_wal and pg_tblspc being missing
>
> I think if all that works, my backup should be good and durably on
> disk. If it's not, then either pg_basebackup or bzip2 or initdb didn't
> report errors that they should have reported. If you're worried about
> that, say because you suspect those programs are buggy or because you
> think the kernel may not be reporting errors properly, you can use tar
> -jxvf + pg_validatebackup to check.
>
> What *exactly* do you think can go wrong here?

What if %f.bz2 already exists? How about if %f has a space in it? What about if I'd like to verify that the backup looks reasonably valid without having to find space to store it entirely decompressed?

Also, this argument feels disingenuous to me. That isn't the only thing you're promoting this feature be used for, as you say below.

If the only thing this feature is *actually* intended for is to add bzip2 support, then we should just add bzip2 support directly and call it a day, but what you're really talking about here is a generic interface that you'll want to push users to for things like "how do I back up to s3" or "how do I back up to GCS", and so we should be thinking about those cases and not just a relatively simple use case.

This is the same kind of slippery slope that our archive_command is built on: sure, if everything "works" then it's "fine", even with our documented example, but we know that not everything works in the real world, and just throwing an 'initdb -S' in there isn't a general solution, because users want to do things like send WAL to s3 or GCS or such.

I don't think there's any doubt that there'll be no shortage of shell scripts and various other things used with this that, yes, will be provided by our users, and therefore we can blame them for doing it wrong, but then they'll complain on our lists and we'll spend time educating them as to how to write proper software for this, or pointing them to a solution that someone writes specifically for it. I don't view that as, ultimately, a good solution.

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 6, 2020 at 1:32 PM Magnus Hagander <magnus@hagander.net> wrote:
> > Now, if we were just talking about compression, it would actually be
> > interesting to implement some sort of "postgres compression API" if
> > you will, that is implemented by a shared library. This library could
> > then be used from pg_basebackup or from anything else that needs
> > compression. And anybody who wants could then do a "<compression X>
> > for PostgreSQL" module, removing the need for us to carry such code
> > upstream.
>
> I think it could be more general than a compression library. It could
> be a store-my-stuff-and-give-it-back-to-me library, which might do
> compression or encryption or cloud storage or any combination of the
> three, and probably other stuff too. Imagine that you first call an
> init function with a namespace that is basically a string provided by
> the user. Then you open a file either for read or for write (but not
> both). Then you read or write a series of chunks (depending on the
> file mode). Then you close the file. Then you can do the same with
> more files. Finally at the end you close the namespace. You don't
> really need to care where or how the functions you are calling store
> the data. You just need them to return proper error indicators if by
> chance they fail.

Yes, having a storage layer makes a lot of sense here, with features that are understood by the core system and which each driver understands, and then having a filter system which is also pluggable and can support things like compression and hashing would also be great. I can point you to examples of all of the above, already implemented, in C, all OSS. Sure seems like a pretty good and reasonable approach to take when other folks are doing it.

> As compared with my previous proposal, this would work much better for
> pg_basebackup -Fp, because you wouldn't launch a new bzip2 process for
> every file. You'd just bzopen(), which is presumably quite lightweight
> by comparison. The reasons I didn't propose it are:
>
> 1. Running bzip2 on every file in a plain-format backup seems a lot
> sillier than running it on every tar file in a tar-format backup.

This is circular, isn't it? It's silly because you're launching new bzip2 processes for every file, but if you were using bzopen() then you wouldn't have that issue, and therefore compressing every file in a plain-format backup would be entirely reasonable.

> 2. I'm not confident that the command specified here actually needs to
> be anything very complicated (unlike archive_command).

This just doesn't make sense to me. The above talks about pushing things to cloud storage and such, which is definitely much more complicated than what had really been contemplated when archive_command was introduced.

> 3. The barrier to entry for a loadable module is a lot higher than for
> a shell command.

Sure.

> 4. I think that all of our existing infrastructure for loadable
> modules is backend-only.

That certainly doesn't seem a terrible hurdle, but I'm not convinced that we'd actually need or want this to be done through loadable modules. I'd argue that we should, instead, be thinking about building a system where we could accept patches that add new drivers and new filters to core, where they're reviewed and well written.

> Now all of these are up for discussion. I am sure we can make the
> loadable module stuff work in frontend code; it would just take some
> work. A C interface for extensibility is very significantly harder to
> use than a shell interface, but it's still way better than no
> interface. The idea that this shell command can be something simple is
> my current belief, but it may turn out to be wrong. And I'm sure
> somebody can propose a good reason to do something with every file in
> a plain-format backup rather than using tar format.

I've already tried to point out that the shell command you're talking about isn't going to be able to be just a simple command if the idea is that it'd be used to send things to s3 or gcs or anything like that. *Maybe* it could be simple if the only thing it's used for is a simple compression filter (though we'd have to deal with the whole fsync thing, as discussed), but it seems very likely that everyone would be a lot happier if we just built in support for bzip2, lz4, gzip, whatever, and that certainly doesn't strike me as a large ask in terms of code complexity or level of effort.
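The store-my-stuff-and-give-it-back-to-me interface quoted above (init a namespace, open a file for read or write but never both, move chunks, close, report errors) is concrete enough to sketch. Below is a rough illustration in Python rather than C, purely to show the shape of the driver contract; the class and method names are invented, and nothing like this exists in pg_basebackup today.

```python
import os
import tempfile

class LocalDirectoryDriver:
    """Hypothetical storage driver. A real system would hide several of
    these behind one interface: this one maps the user's namespace
    string to a local directory; an s3 driver might map it to a bucket."""

    def __init__(self, namespace: str):
        # the "init function with a namespace that is basically a string"
        self.root = namespace
        os.makedirs(self.root, exist_ok=True)

    def open_file(self, name: str, mode: str):
        # files are opened for read or for write, never both
        if mode not in ("rb", "wb"):
            raise ValueError("open for read or write, not both")
        return open(os.path.join(self.root, name), mode)

    def close(self):
        # a real driver would finalize uploads / fsync here and
        # return a proper error indicator on failure
        pass

# Write a file as a series of chunks, then read it back.
store = LocalDirectoryDriver(tempfile.mkdtemp())
with store.open_file("base.tar", "wb") as f:
    f.write(b"chunk-1")
    f.write(b"chunk-2")
with store.open_file("base.tar", "rb") as f:
    restored = f.read()
store.close()
```

The point of the exercise is that the caller would only ever use the generic methods; whether the bytes land on local disk, in object storage, or pass through a compression filter first is the driver's business.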
> All that being said, I still find it hard to believe that we will want
> to add dependencies for libraries that we'd need to do encryption or
> S3 cloud storage to PostgreSQL itself. So if we go with this more
> integrated approach we should consider the possibility that, when the
> dust settles, PostgreSQL will only have pg_basebackup
> --output-plugin=lz4 and Aurora will also have pg_basebackup
> --output-plugin=s3. From my point of view, that would be less than
> ideal.

We already have libraries for encryption, and we do not have to add libraries for s3 storage to support it as an option, as I mentioned up-thread. I don't find the argument that someone else might extend pg_basebackup (or whatever) to add on new features to be one that concerns me terribly much, provided we give people the opportunity to add those same features into core if they're willing to put in the effort to make it happen.

I'm quite concerned that this generic "you can just write a shell script to do it" approach will be used, over and over again, as an argument or at least a deterrent to having something proper in core, and will ultimately result in us not having any good solution in core for the very common use cases that our users have today. That certainly seems like what's happened with archive_command.

Thanks,

Stephen
On Wed, Apr 8, 2020 at 1:05 PM Stephen Frost <sfrost@snowman.net> wrote:
> What if %f.bz2 already exists?

That cannot occur in the scenario I described.

> How about if %f has a space in it?

For a tar-format backup I don't think that can happen, because the file names will be base.tar and ${tablespace_oid}.tar. For a plain-format backup it's a potential issue.

> What about if I'd like to verify that the backup looks reasonably valid
> without having to find space to store it entirely decompressed?

Then we need to make pg_validatebackup better.

> Also, this argument feels disingenuous to me.
> [ lots more stuff ]

This all just sounds like fearmongering to me. "archive_command doesn't work very well, so maybe your thing won't either." Maybe it won't, but the fact that archive_command doesn't isn't a reason.

> Yes, having a storage layer makes a lot of sense here, with features
> that are understood by the core system and which each driver
> understands, and then having a filter system which is also pluggable and
> can support things like compression and hashing for this would also be
> great.

It's good to know that you prefer a C interface to one based on shell scripting. I hope that we will also get some other opinions on that question, as my own feelings are somewhat divided (but with some bias toward trying to make the shell scripting thing work, because I believe it gives a lot more practical flexibility).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Apr 8, 2020 at 1:05 PM Stephen Frost <sfrost@snowman.net> wrote:
> > What if %f.bz2 already exists?
>
> That cannot occur in the scenario I described.

Of course it can.

> > How about if %f has a space in it?
>
> For a tar-format backup I don't think that can happen, because the
> file names will be base.tar and ${tablespace_oid}.tar. For a plain
> format backup it's a potential issue.

I agree that it might not be an issue for tar-format.

> > What about if I'd like to verify that the backup looks reasonably valid
> > without having to find space to store it entirely decompressed?
>
> Then we need to make pg_validatebackup better.

Sure- but shouldn't the design be contemplating how these various tools will work together?

> > Also, this argument feels disingenuous to me.
> > [ lots more stuff ]
>
> This all just sounds like fearmongering to me. "archive_command
> doesn't work very well, so maybe your thing won't either." Maybe it
> won't, but the fact that archive_command doesn't isn't a reason.

I was trying to explain that we have literally gone down exactly this path before and it's not been a good result, hence we should be really careful before going down it again. I don't consider that to be fearmongering, nor that we should be dismissing that concern out of hand.

> > Yes, having a storage layer makes a lot of sense here, with features
> > that are understood by the core system and which each driver
> > understands, and then having a filter system which is also pluggable and
> > can support things like compression and hashing for this would also be
> > great.
>
> It's good to know that you prefer a C interface to one based on shell
> scripting. I hope that we will also get some other opinions on that
> question, as my own feelings are somewhat divided (but with some bias
> toward trying to make the shell scripting thing work, because I
> believe it gives a lot more practical flexibility).
Yes, I do prefer a C interface. One might even say "been there, done that." Hopefully sharing such experience is still useful to do on these lists.

Thanks,

Stephen
On Wed, Apr 8, 2020 at 2:06 PM Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > On Wed, Apr 8, 2020 at 1:05 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > What if %f.bz2 already exists?
> >
> > That cannot occur in the scenario I described.
>
> Of course it can.

Not really. The steps I described involved creating a new directory. Yeah, in theory, somebody could inject a file into that directory after we created it and before bzip writes any files into it, but pg_basebackup already has the exact same race condition.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Apr 8, 2020 at 2:06 PM Stephen Frost <sfrost@snowman.net> wrote:
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> > > On Wed, Apr 8, 2020 at 1:05 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > > What if %f.bz2 already exists?
> > >
> > > That cannot occur in the scenario I described.
> >
> > Of course it can.
>
> Not really. The steps I described involved creating a new directory.
> Yeah, in theory, somebody could inject a file into that directory
> after we created it and before bzip writes any files into it, but
> pg_basebackup already has the exact same race condition.

With pg_basebackup, at least we could reasonably fix that race condition.

Thanks,

Stephen
On Mon, Apr 6, 2020 at 07:32:45PM +0200, Magnus Hagander wrote:
> On Mon, Apr 6, 2020 at 4:45 PM Stephen Frost <sfrost@snowman.net> wrote:
> > For my 2c, at least, introducing more shell commands into critical parts
> > of the system is absolutely the wrong direction to go in.
> > archive_command continues to be a mess that we refuse to clean up or
> > even properly document, and the project would be much better off by
> > trying to eliminate it rather than add in new ways for users to end up
> > with bad or invalid backups.
>
> I think the bigger problem with archive_command more comes from how
> it's defined to work tbh. Which leaves a lot of things open.
>
> This sounds to me like a much narrower use-case, which makes it a lot
> more OK. But I agree we have to be careful not to get back into that
> whole mess. One thing would be to clearly document such things *from
> the beginning*, and not try to retrofit it years later like we ended
> up doing with archive_command.
>
> And as Robert mentions downthread, the fsync() issue is definitely a
> real one, but if that is documented clearly ahead of time, that's a
> reasonable level foot-gun I'd say.

I think we need to step back and look at the larger issue. The real argument goes back to the Unix command-line API vs. the VMS/Windows API. The former has discrete parts that can be stitched together, while the VMS/Windows API presents a more duplicative but more holistic API for every piece. We have discussed using shell commands for archive_command, and even more recently, for the server pass phrase.

To get more specific, I think we have to understand how the _requirements_ of the job match the shell script API, with stdin, stdout, stderr, return code, and command-line arguments. Looking at archive_command, the command-line arguments allow specification of file names, but quoting can be complex. The error return code and stderr output seem to work fine. There is no clean API for fsync and for testing whether the file exists, so all of that has to be hand-done in one command line. This is why many users use pre-written archive_command shell scripts.

This brings up a few questions:

* Should we have split apart archive_command into file-exists, copy, and fsync-file? Should we add that now?

* How well does this backup requirement match the shell command API?

-- 
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EnterpriseDB                      https://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> I think we need to step back and look at the larger issue. The real
> argument goes back to the Unix command-line API vs the VMS/Windows API.
> The former has discrete parts that can be stitched together, while the
> VMS/Windows API presents a more duplicative but more holistic API for
> every piece. We have discussed using shell commands for
> archive_command, and even more recently, for the server pass phrase.

When it comes to something like the server pass phrase, it seems much more reasonable to consider using a shell script (though still perhaps not ideal) because it's not involved directly in ensuring that the data is reliably stored, and it's pretty clear that if it doesn't work the worst thing that happens is that the database doesn't start up; it won't corrupt any data or destroy it or do other bad things.

> To get more specific, I think we have to understand how the
> _requirements_ of the job match the shell script API, with stdin,
> stdout, stderr, return code, and command-line arguments. Looking at
> archive_command, the command-line arguments allow specification of file
> names, but quoting can be complex. The error return code and stderr
> output seem to work fine. There is no clean API for fsync and testing
> if the file exists, so that all that has to be hand done in one
> command-line. This is why many users use pre-written archive_command
> shell scripts.

We aren't considering all of the use-cases really though. In specific, things like pushing to s3 or gcs require, at least, good retry logic, and that's without starting to think about things like high-rate systems (spawning lots of new processes isn't free, particularly if they're written in shell script, but any interpreted language is expensive) and wanting to parallelize.

> This brings up a few questions:
>
> * Should we have split apart archive_command into file-exists, copy,
> fsync-file? Should we add that now?

No. The right approach to improving on archive_command is to add a way for an extension to take over that job, maybe with a complete background worker of its own, or perhaps a shared library that can be loaded by the archiver process, at least if we're talking about how to allow people to extend it.

Potentially a better answer is to just build this stuff into PG: things like "archive WAL to s3/GCS with these credentials" are what an awful lot of users want. There's then some who want "archive first to this other server, and then archive to s3/GCS", or more complex options.

I'll also point out that there's not one "s3": there are quite a few alternatives, including some which are open source, which talk the s3 protocol (sadly, they don't all do it perfectly, which is why we are talking about building a GCS-specific driver for gcs rather than using their s3 gateway, but still, s3 isn't just 'one thing').

> * How well does this backup requirement match with the shell command
> API?

For my part, it's not just a question of an API, but a question of who is going to implement a good and reliable solution: PG developers, or some admin who is just trying to get PG up and running in their environment? One aspect of that is being knowledgeable about where all the land mines are, like the whole fsync thing. Sure, if you're a PG developer or you've been around long enough, you're going to realize that 'cp' isn't going to fsync() the file and therefore it's a pretty high-risk choice for archive_command, and you'll understand just how important WAL is. But there's certainly an awful lot of folks out there who don't realize that, or at least don't think about it when they're standing up a new system; instead, they're just following our docs with the expectation that those docs are providing good advice.

Thanks,

Stephen
On Thu, Apr 9, 2020 at 04:15:07PM -0400, Stephen Frost wrote:
> Greetings,
>
> * Bruce Momjian (bruce@momjian.us) wrote:
> > I think we need to step back and look at the larger issue. The real
> > argument goes back to the Unix command-line API vs the VMS/Windows API.
> > The former has discrete parts that can be stitched together, while the
> > VMS/Windows API presents a more duplicative but more holistic API for
> > every piece. We have discussed using shell commands for
> > archive_command, and even more recently, for the server pass phrase.
>
> When it comes to something like the server pass phrase, it seems much
> more reasonable to consider using a shell script (though still perhaps
> not ideal) because it's not involved directly in ensuring that the data
> is reliably stored and it's pretty clear that if it doesn't work the
> worst thing that happens is that the database doesn't start up, but it
> won't corrupt any data or destroy it or do other bad things.

Well, the pass phrase relates to security, so it is important too. I don't think the _importance_ of the action is the most determining issue. Rather, I think it is how well the action fits the shell script API.

> > To get more specific, I think we have to understand how the
> > _requirements_ of the job match the shell script API, with stdin,
> > stdout, stderr, return code, and command-line arguments. Looking at
> > archive_command, the command-line arguments allow specification of file
> > names, but quoting can be complex. The error return code and stderr
> > output seem to work fine. There is no clean API for fsync and testing
> > if the file exists, so that all that has to be hand done in one
> > command-line. This is why many users use pre-written archive_command
> > shell scripts.
>
> We aren't considering all of the use-cases really though, in specific,
> things like pushing to s3 or gcs require, at least, good retry logic,
> and that's without starting to think about things like high-rate systems
> (spawning lots of new processes isn't free, particularly if they're
> written in shell script but any interpreted language is expensive) and
> wanting to parallelize.

Good point, but if there are multiple APIs, it makes shell script flexibility even more useful.

> > This brings up a few questions:
> >
> > * Should we have split apart archive_command into file-exists, copy,
> > fsync-file? Should we add that now?
>
> No. The right approach to improving on archive_command is to add a way
> for an extension to take over that job, maybe with a complete background
> worker of its own, or perhaps a shared library that can be loaded by the
> archiver process, at least if we're talking about how to allow people to
> extend it.

That seems quite vague, which is the issue we had years ago when considering doing archive_command as a link to a C library.

> Potentially a better answer is to just build this stuff into PG: things
> like "archive WAL to s3/GCS with these credentials" are what an awful
> lot of users want. There's then some who want "archive first to this
> other server, and then archive to s3/GCS", or more complex options.

Yes, we certainly know how to do a file system copy, but what about copying files to other things like S3? I don't know how we would do that and allow users to change things like file paths or URLs.

-- 
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EnterpriseDB                      https://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote:
> Good point, but if there are multiple APIs, it makes shell script
> flexibility even more useful.

This is really the key point for me. There are so many existing tools that store a file someplace that we really can't ever hope to support them all in core, or even to have well-written extensions that support them all available on PGXN or wherever. We need to integrate with the tools that other people have created, not try to reinvent them all in PostgreSQL.

Now what I understand Stephen to be saying is that a lot of those tools actually suck, and I think that's a completely valid point. But I also think that it's unwise to decide that such problems are our problems rather than problems with those tools. That's a hole with no bottom.

One thing I do think would be realistic would be to invent a set of tools that perform certain local filesystem operations in a "hardened" way. Maybe a single tool with subcommands and options. So you could say, e.g. 'pgfile cp SOURCE TARGET', and it would create a temporary file in the target directory, write the contents of the source into that file, fsync the file, rename it into place, and do more fsyncs to make sure it's all durable in case of a crash. You could have a variant of this that, instead of using the temporary-file-and-rename-into-place approach, does the thing where you open the target file with O_CREAT|O_EXCL, write the bytes, and then close and fsync it. And you could have other things too, like 'pgfile mkdir DIR' to create a directory and fsync it for durability. A toolset like this would probably help people write better archive commands - it would certainly be an improvement over what we have now, anyway - and it could also be used with the feature that I proposed upthread.
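The hardened 'pgfile cp' behavior described above (temporary file, fsync, rename into place, fsync the directory) is the classic crash-safe write pattern. Here is a minimal sketch of it in Python, assuming a POSIX filesystem; 'pgfile' itself is only a proposal, and a real implementation would presumably be in C:

```python
import os
import tempfile

def durable_copy(src: str, dst: str) -> None:
    """Copy src to dst so that, after a crash, dst is either absent or
    complete: write a temp file in the target directory, fsync it,
    rename() it into place, then fsync the directory itself."""
    dst_dir = os.path.dirname(dst) or "."
    tmp = os.path.join(dst_dir, ".tmp." + os.path.basename(dst))
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        fout.write(fin.read())       # a real tool would copy in chunks
        fout.flush()
        os.fsync(fout.fileno())      # data must be durable before the rename
    os.rename(tmp, dst)              # atomic within one filesystem
    dirfd = os.open(dst_dir, os.O_RDONLY)   # POSIX-only directory handle
    try:
        os.fsync(dirfd)              # make the rename itself durable
    finally:
        os.close(dirfd)

# Demonstration: copy a small "backup" into place.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "src.tar")
dst = os.path.join(workdir, "base.tar")
with open(src, "wb") as f:
    f.write(b"backup bytes")
durable_copy(src, dst)
```

If the machine crashes before the rename, the destination never exists; if it crashes after, the destination is complete. That is the property a bare 'cp' does not give you.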
For example, if you're concerned that bzip might overwrite an existing file and that it might not fsync, then instead of saying:

pg_basebackup -Ft --pipe-output 'bzip > %f.bz2'

you could instead write:

pg_basebackup -Ft --pipe-output 'bzip | pgfile create-exclusive - %f.bz2'

or whatever we pick for actual syntax. And that provides a kind of hardening that can be used with any other command-line tool that can be used as a filter. If you want to compress with bzip, encrypt, and then copy the file to a remote system, you could do:

pg_basebackup -Ft --pipe-output 'bzip | gpg -e | ssh someuser@somehost pgfile create-exclusive - /backups/tuesday/%f.bz2'

It is of course not impossible to teach pg_basebackup to do all of that stuff internally, but I have a really difficult time imagining us ever getting it done. There are just too many possibilities, and new ones arise all the time.

A 'pgfile' utility wouldn't help at all for people who are storing to S3 or whatever. They could use 'aws s3' as a target for --pipe-output, but if it turns out that said tool is insufficiently robust in terms of overwriting files or doing fsyncs or whatever, then they might have problems. Now, Stephen or anyone else could choose to provide alternative tools with more robust behavior, and that would be great. But even if he didn't, people could take their chances with what's already out there. To me, that's a good thing. Yeah, maybe they'll do dumb things that don't work, but realistically, they can do dumb stuff without the proposed option too.

> Yes, we certainly know how to do a file system copy, but what about
> copying files to other things like S3? I don't know how we would do
> that and allow users to change things like file paths or URLs.

Right. I think it's key that we provide people with tools that are highly flexible and, ideally, also highly composable.
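A 'pgfile create-exclusive' subcommand, if it existed, would essentially be O_CREAT|O_EXCL plus an fsync: refuse to clobber an existing file, and don't report success until the bytes are durable. A sketch in Python, with the caveat that the name and behavior are hypothetical:

```python
import os
import tempfile

def create_exclusive(path: str, data: bytes) -> None:
    """Write data to path, failing if path already exists, and fsync
    before returning so that success implies durability."""
    # O_EXCL turns "silently overwrite last week's backup" into an error.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)

target = os.path.join(tempfile.mkdtemp(), "base.tar.bz2")
create_exclusive(target, b"compressed backup")

# A second write to the same name must fail rather than overwrite.
try:
    create_exclusive(target, b"oops")
    clobbered = True
except FileExistsError:
    clobbered = False
```

In the pipeline form, the real tool would read the bytes from stdin rather than taking them as an argument, which is what makes it usable as the last stage after any filter.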
(Incidentally, pg_basebackup already has an option to output the entire backup as a tarfile on standard output, and a user can already pipe that into any tool they like. However, it doesn't work with tablespaces. So you could think of this proposal as extending the existing functionality to cover that case.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, Apr 9, 2020 at 04:15:07PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > I think we need to step back and look at the larger issue. The real
> > > argument goes back to the Unix command-line API vs the VMS/Windows API.
> > > The former has discrete parts that can be stitched together, while the
> > > VMS/Windows API presents a more duplicative but more holistic API for
> > > every piece. We have discussed using shell commands for
> > > archive_command, and even more recently, for the server pass phrase.
> >
> > When it comes to something like the server pass phrase, it seems much
> > more reasonable to consider using a shell script (though still perhaps
> > not ideal) because it's not involved directly in ensuring that the data
> > is reliably stored and it's pretty clear that if it doesn't work the
> > worst thing that happens is that the database doesn't start up, but it
> > won't corrupt any data or destroy it or do other bad things.
>
> Well, the pass phrase relates to security, so it is important too. I
> don't think the _importance_ of the action is the most determining
> issue. Rather, I think it is how well the action fits the shell script
> API.

There isn't a single 'shell script API' though, and it's possible to craft a 'shell script API' to fit nearly any use-case, but that doesn't make it a good solution. The amount we depend on the external code for the correct operation of the system is relevant, and important to consider.

> > > To get more specific, I think we have to understand how the
> > > _requirements_ of the job match the shell script API, with stdin,
> > > stdout, stderr, return code, and command-line arguments. Looking at
> > > archive_command, the command-line arguments allow specification of file
> > > names, but quoting can be complex. The error return code and stderr
> > > output seem to work fine. There is no clean API for fsync and testing
> > > if the file exists, so that all that has to be hand done in one
> > > command-line. This is why many users use pre-written archive_command
> > > shell scripts.
> >
> > We aren't considering all of the use-cases really though, in specific,
> > things like pushing to s3 or gcs require, at least, good retry logic,
> > and that's without starting to think about things like high-rate systems
> > (spawning lots of new processes isn't free, particularly if they're
> > written in shell script but any interpreted language is expensive) and
> > wanting to parallelize.
>
> Good point, but if there are multiple APIs, it makes shell script
> flexibility even more useful.

This doesn't seem to answer the concerns that I brought up. Trying to understand it did make me think of another relevant question that was brought up in this discussion: can we really expect users to actually implement a C library for this, if we provided a way for them to? For that, I'd point to FDWs, where we certainly don't have any shortage of external solutions written in C. Another would be logical decoding.

> > > This brings up a few questions:
> > >
> > > * Should we have split apart archive_command into file-exists, copy,
> > > fsync-file? Should we add that now?
> >
> > No. The right approach to improving on archive_command is to add a way
> > for an extension to take over that job, maybe with a complete background
> > worker of its own, or perhaps a shared library that can be loaded by the
> > archiver process, at least if we're talking about how to allow people to
> > extend it.
>
> That seems quite vague, which is the issue we had years ago when
> considering doing archive_command as a link to a C library.

That prior discussion isn't really relevant though, as it was before we had extensions, and before we had background workers that can run as part of an extension.
> > Potentially a better answer is to just build this stuff into PG: things
> > like "archive WAL to s3/GCS with these credentials" are what an awful
> > lot of users want. There's then some who want "archive first to this
> > other server, and then archive to s3/GCS", or more complex options.
>
> Yes, we certainly know how to do a file system copy, but what about
> copying files to other things like S3? I don't know how we would do
> that and allow users to change things like file paths or URLs.

There's a few different ways we could go about this. The simple answer would be to use GUCs, which would simplify things like dealing with the restore side too. Another option would be to have a concept of 'repository' objects in the system, not unlike tablespaces, but they'd have more options. To deal with that during recovery though, we'd need a way to get the relevant information from the catalogs (maybe we write the catalog out to a flat file on update, not unlike what we used to do with pg_shadow), perhaps even in a format that users could modify if they needed to. The nice thing about having actual objects in the system is that it'd be a bit cleaner to be able to define multiple ones and then have SQL-level functions/commands that work with them.

A good deal of this does involve the question about how to deal with recovery though, since you might want to, or need to, use different options when it comes to recovery. Back to the use-case that I was mentioning, you could certainly want something like "try to get the WAL from the local archive, and if that doesn't work, try to get it from the s3 repo". What that implies then is that you'd really like a way to configure multiple repos, which is where we start to see the fragility of our GUC system. Pushing that out to something external doesn't strike me as the right answer though; rather, we should think about how to resolve these issues with the GUC system, or come up with something better.

This isn't the only area where the GUC system isn't really helping us: synchronous_standby_names is getting to be a pretty complicated GUC, for example. Of course, we could start out with just supporting a single repo with just a few new GUCs to configure it; that wouldn't be hard, and there's good examples out there about what's needed to configure an s3 repo.

Thanks,

Stephen
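If the "single repo, a few new GUCs" route were taken, the configuration might end up looking something like the fragment below. To be clear, every parameter name here is invented for illustration; no such GUCs exist:

```
# hypothetical postgresql.conf fragment -- none of these GUCs exist today
archive_repo_type = 's3'                      # or 'posix', 'gcs', ...
archive_repo_path = 's3://my-bucket/wal'      # bucket and prefix for the repo
archive_repo_region = 'us-east-1'
archive_repo_credentials = '/etc/pg/s3.creds' # keep secrets out of the conf file
archive_repo_retries = 3                      # retry logic in core, not in shell
```

The fragility being pointed at shows up as soon as you want a second repo: flat GUCs don't compose into a list of structured objects, which is why the 'repository object' idea keeps coming back.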
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote: > > Good point, but if there are multiple APIs, it makes shell script > > flexibility even more useful. > > This is really the key point for me. There are so many existing tools > that store a file someplace that we really can't ever hope to support > them all in core, or even to have well-written extensions that support > them all available on PGXN or wherever. We need to integrate with the > tools that other people have created, not try to reinvent them all in > PostgreSQL. So, this goes to what I was just mentioning to Bruce independently- you could have made the same argument about FDWs, but it just doesn't actually hold any water. Sure, some of the FDWs aren't great, but there's certainly no shortage of them, and the ones that are particularly important (like postgres_fdw) are well written and in core. > Now what I understand Stephen to be saying is that a lot of those > tools actually suck, and I think that's a completely valid point. But > I also think that it's unwise to decide that such problems are our > problems rather than problems with those tools. That's a hole with no > bottom. I don't really think 'bzip2' sucks as a tool, or that bash does. They weren't designed or intended to meet the expectations that we have for data durability though, which is why relying on them for exactly that ends up being a bad recipe. > One thing I do think would be realistic would be to invent a set of > tools that are perform certain local filesystem operations in a > "hardened" way. Maybe a single tool with subcommands and options. So > you could say, e.g. 'pgfile cp SOURCE TARGET' and it would create a > temporary file in the target directory, write the contents of the > source into that file, fsync the file, rename it into place, and do > more fsyncs to make sure it's all durable in case of a crash. 
You > could have a variant of this that instead of using the temporary file > and rename in place approach, does the thing where you open the target > file with O_CREAT|O_EXCL, writes the bytes, and then closes and fsyncs > it. And you could have other things too, like 'pgfile mkdir DIR' to > create a directory and fsync it for durability. A toolset like this > would probably help people write better archive commands - it would > certainly be an improvement over what we have now, anyway, and it > could also be used with the feature that I proposed upthread. This argument leads in a direction to justify anything as being sensible to implement using shell scripts. If we're open to writing the shell-level tools that would be needed, we could reimplement all of our indexes that way, or FDWs, or TDE, or just about anything else. What we would end up with though is that we'd have more complications changing those interfaces because people will be using those tools, and maybe those tools don't get updated at the same time as PG does, and maybe there are critical changes that need to be made in back branches and we can't really do that with these interfaces. > It is of course not impossible to teach pg_basebackup to do all of > that stuff internally, but I have a really difficult time imagining us > ever getting it done. There are just too many possibilities, and new > ones arise all the time. I agree that it's certainly a fair bit of work, but it can be accomplished incrementally and, with a good design, allow for adding in new options in the future with relative ease. Now is the time to discuss what that design looks like, think about how we can implement it in a way that all of the tools we have are able to work together, and have them all support and be tested together with these different options. 
The concerns about there being too many possibilities and new ones coming up all the time could be applied equally to FDWs, but rather than ending up with a dearth of options and external solutions there, what we've actually seen is an explosion of options and externally written libraries for a large variety of options. > A 'pgfile' utility wouldn't help at all for people who are storing to > S3 or whatever. They could use 'aws s3' as a target for --pipe-output, > but if it turns out that said tool is insufficiently robust in terms > of overwriting files or doing fsyncs or whatever, then they might have > problems. Now, Stephen or anyone else could choose to provide > alternative tools with more robust behavior, and that would be great. > But even if he didn't, people could take their chances with what's > already out there. To me, that's a good thing. Yeah, maybe they'll do > dumb things that don't work, but realistically, they can do dumb stuff > without the proposed option too. How does this solution give them a good way to do the right thing though? In a way that will work with large databases and complex requirements? The answer seems to be "well, everyone will have to write their own tool to do that" and that basically means that, at best, we're only providing half of a solution and expecting all of our users to provide the other half, and to always do it correctly and in a well written way. Acknowledging that most users aren't going to actually do that and instead they'll implement half measures that aren't reliable shouldn't be seen as an endorsement of this approach. Thanks, Stephen
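[Editorial note: the hardened-copy behaviour described for the hypothetical 'pgfile cp SOURCE TARGET' subcommand (write to a temp file, fsync it, rename into place, fsync the directory) can be sketched in a few lines. 'pgfile' is only a proposal, so the function name and temp-file convention below are invented for illustration.]

```python
import os

def durable_copy(source, target):
    """Sketch of 'pgfile cp': crash-safe copy via temp file + rename."""
    directory = os.path.dirname(target) or "."
    tmp = os.path.join(directory, "." + os.path.basename(target) + ".tmp")
    with open(source, "rb") as src, open(tmp, "wb") as dst:
        dst.write(src.read())        # a real tool would copy in chunks
        dst.flush()
        os.fsync(dst.fileno())       # data reaches stable storage
    os.rename(tmp, target)           # atomic rename into place
    dirfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dirfd)              # make the rename itself durable
    finally:
        os.close(dirfd)
```

A crash before the rename leaves only the temp file behind; a crash after the directory fsync leaves the complete target in place, which is the property the 'pgfile' proposal is after.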
On Fri, Apr 10, 2020 at 10:54:10AM -0400, Stephen Frost wrote: > Greetings, > > * Robert Haas (robertmhaas@gmail.com) wrote: > > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote: > > > Good point, but if there are multiple APIs, it makes shell script > > > flexibility even more useful. > > > > This is really the key point for me. There are so many existing tools > > that store a file someplace that we really can't ever hope to support > > them all in core, or even to have well-written extensions that support > > them all available on PGXN or wherever. We need to integrate with the > > tools that other people have created, not try to reinvent them all in > > PostgreSQL. > > So, this goes to what I was just mentioning to Bruce independently- you > could have made the same argument about FDWs, but it just doesn't > actually hold any water. Sure, some of the FDWs aren't great, but > there's certainly no shortage of them, and the ones that are > particularly important (like postgres_fdw) are well written and in core. No, no one made that argument. It isn't clear how a shell script API would map to relational database queries. The point is how well the APIs match, and then if they are close, does it give us the flexibility we need. You can't just look at flexibility without an API match. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Fri, Apr 10, 2020 at 10:54 AM Stephen Frost <sfrost@snowman.net> wrote: > So, this goes to what I was just mentioning to Bruce independently- you > could have made the same argument about FDWs, but it just doesn't > actually hold any water. Sure, some of the FDWs aren't great, but > there's certainly no shortage of them, and the ones that are > particularly important (like postgres_fdw) are well written and in core. That's a fairly different use case. In the case of the FDW interface: - The number of interface method calls is very high, at least one per tuple and a bunch of extra ones for each query. - There is a significant amount of complex state that needs to be maintained across API calls. - The return values are often tuples, which are themselves an in-memory data structure. But here: - We're only talking about writing a handful of tar files, and that's in the context of a full-database backup, which is a much heavier-weight operation than a query. - There is not really any state that needs to be maintained across calls. - The expected result is that a file gets written someplace, which is not an in-memory data structure but something that gets written to a place outside of PostgreSQL. > The concerns about there being too many possibilities and new ones > coming up all the time could be applied equally to FDWs, but rather than > ending up with a dearth of options and external solutions there, what > we've actually seen is an explosion of options and externally written > libraries for a large variety of options. Sure, but a lot of those FDWs are relatively low-quality, and it's often hard to find one that does what you want. And even if you do, you don't really know how good it is. Unfortunately, in that case there's no real alternative, because implementing something based on shell commands couldn't ever have reasonable performance or a halfway decent feature set. That's not the case here. 
> How does this solution give them a good way to do the right thing > though? In a way that will work with large databases and complex > requirements? The answer seems to be "well, everyone will have to write > their own tool to do that" and that basically means that, at best, we're > only providing half of a solution and expecting all of our users to > provide the other half, and to always do it correctly and in a well > written way. Acknowledging that most users aren't going to actually do > that and instead they'll implement half measures that aren't reliable > shouldn't be seen as an endorsement of this approach. I don't acknowledge that. I think it's possible to use tools like the proposed option in a perfectly reliable way, and I've already given a bunch of examples of how it could be done. Writing a file is not such a complex operation that every bit of code that writes one reliably has to be written by someone associated with the PostgreSQL project. I strongly suspect that people who use a cloud provider's tools to upload their backup files will be quite happy with the results, and if they aren't, I hope they will blame the cloud provider's tool for eating the data rather than this option for making it easy to give the data to the thing that ate it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-04-10 12:20:01 -0400, Robert Haas wrote: > - We're only talking about writing a handful of tar files, and that's > in the context of a full-database backup, which is a much > heavier-weight operation than a query. > - There is not really any state that needs to be maintained across calls. > - The expected result is that a file gets written someplace, which is > not an in-memory data structure but something that gets written to a > place outside of PostgreSQL. Wouldn't there be state like an S3/ssh/https/... connection? And perhaps a 'backup_id' in the backup metadata DB that one would want to update at the end? Greetings, Andres Freund
On 10/4/20 15:49, Robert Haas wrote: > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote: >> Good point, but if there are multiple APIs, it makes shell script >> flexibility even more useful. > [snip] > > One thing I do think would be realistic would be to invent a set of > tools that perform certain local filesystem operations in a > "hardened" way. +10 > Maybe a single tool with subcommands and options. So > you could say, e.g. 'pgfile cp SOURCE TARGET' and it would create a > temporary file in the target directory, write the contents of the > source into that file, fsync the file, rename it into place, and do > more fsyncs to make sure it's all durable in case of a crash. You > could have a variant of this that instead of using the temporary file > and rename in place approach, does the thing where you open the target > file with O_CREAT|O_EXCL, writes the bytes, and then closes and fsyncs > it. Behaviour might be decided in the same way as the default for 'wal_sync_method' gets chosen, as the most appropriate for a particular system. > And you could have other things too, like 'pgfile mkdir DIR' to > create a directory and fsync it for durability. A toolset like this > would probably help people write better archive commands Definitely, "mkdir" and "create-exclusive" (along with cp) would be a great addition and would simplify doing these kinds of tasks properly (i.e. without risking data loss every time) > [excerpted] > > pg_basebackup -Ft --pipe-output 'bzip | pgfile create-exclusive - %f.bz2' > > [....] > > pg_basebackup -Ft --pipe-output 'bzip | gpg -e | ssh someuser@somehost > pgfile create-exclusive - /backups/tuesday/%f.bz2' Yep. Would also fit the case for non-synchronous NFS mounts for backups... > It is of course not impossible to teach pg_basebackup to do all of > that stuff internally, but I have a really difficult time imagining us > ever getting it done. There are just too many possibilities, and new > ones arise all the time. Indeed. 
The beauty of Unix-like OSs is precisely this. > A 'pgfile' utility wouldn't help at all for people who are storing to > S3 or whatever. They could use 'aws s3' as a target for --pipe-output, > [snip] > (Incidentally, pg_basebackup already has an option to output the > entire backup as a tarfile on standard output, and a user can already > pipe that into any tool they like. However, it doesn't work with > tablespaces. So you could think of this proposal as extending the > existing functionality to cover that case.) Been there already :S Having pg_basebackup output multiple tarballs (one per tablespace), ideally separated by some marker so that splitting can be done trivially on the receiving end, would help. ...but that's probably a matter for another thread. Thanks, / J.L.
On Fri, Apr 10, 2020 at 3:38 PM Andres Freund <andres@anarazel.de> wrote: > Wouldn't there be state like a S3/ssh/https/... connection? And perhaps > a 'backup_id' in the backup metadata DB that'd one would want to update > at the end? Good question. I don't know that there would be but, uh, maybe? It's not obvious to me why all of that would need to be done using the same connection, but if it is, the idea I proposed isn't going to work very nicely. More generally, can you think of any ideas for how to structure an API here that are easier to use than "write some C code"? Or do you think we should tell people to write some C code if they want to compress/encrypt/relocate their backup in some non-standard way? For the record, I'm not against eventually having more than one way to do this, maybe a shell-script interface for simpler things and some kind of API for more complex needs (e.g. NetBackup integration, perhaps). And I did wonder if there was some other way we could do this. For instance, we could add an option --tar-everything that sticks all the things that would have been returned by the backup inside another level of tar file and sends the result to stdout. Then you can pipe it into a single command that gets invoked only once for all the data, rather than once per tablespace. That might be better, but I'm not sure it's better. It's better if you want to do complicated things that involve steps that happen before and after and persistent connections and so on, but it seems worse for simple things like piping through a non-default compressor. Larry Wall somewhat famously commented that a good programming language should (and I paraphrase) make simple things simple and complex things possible. My hesitation in going straight to a C API is that it does not make simple things simple; and I'd like to be really sure that there is no way of achieving that valuable goal before we give up on it. However, there is no doubt that a C API is potentially more powerful. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10/4/20 21:38, Andres Freund wrote: > Hi, > > On 2020-04-10 12:20:01 -0400, Robert Haas wrote: >> - We're only talking about writing a handful of tar files, and that's >> in the context of a full-database backup, which is a much >> heavier-weight operation than a query. >> - There is not really any state that needs to be maintained across calls. >> - The expected result is that a file gets written someplace, which is >> not an in-memory data structure but something that gets written to a >> place outside of PostgreSQL. > Wouldn't there be state like a S3/ssh/https/... connection? ...to try and save opening a new connection in the context of a (potentially) multi-TB backup? :S > And perhaps > a 'backup_id' in the backup metadata DB that one would want to update > at the end? This is, indeed, material for external tools. Each caters to a particular set of end-user requirements. We have many examples already, with most even co-authored by this list's regulars... and IMHO none is suitable for ALL use-cases. BUT I agree that providing better tools with Postgres itself, ready to use --- that is, uncomment the default "archive_command" and get going for a very basic starting point --- is a huge advancement in the right direction. More importantly (IMO): including the call to "pgfile" or equivalent quite clearly signals to any unsuspecting user that there is more to safely archiving WAL segments than just doing "cp -a" blindly and hoping that the tool magically does all the required steps [needed to ensure data safety in this case, rather than the usual behaviour]. It's probably more effective than just amending the existing comments to point users to a (new?) section within the documentation. 
This comment is from experience: I've lost count of how many times I have had to "fix" the default command for WAL archiving --- precisely because it had been blindly copied from the default without thinking further about the implications should any deviation from expected behaviour happen during WAL archiving... only to be noticed at (attempted) recovery time :\ HTH. Thanks, J.L.
On Sat, Apr 11, 2020 at 10:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Apr 10, 2020 at 3:38 PM Andres Freund <andres@anarazel.de> wrote:
> Wouldn't there be state like a S3/ssh/https/... connection? And perhaps
> a 'backup_id' in the backup metadata DB that'd one would want to update
> at the end?
Good question. I don't know that there would be but, uh, maybe? It's
not obvious to me why all of that would need to be done using the same
connection, but if it is, the idea I proposed isn't going to work very
nicely.
There are certainly cases for it. It might not be that they have to be the same connection, but they may still need to be the same session: for example, you first perform some authentication step, get a token, and then use that token for all the files. You'd need somewhere to maintain that state, even if it doesn't happen to be a socket. But there are definitely plenty of cases where keeping an open socket can be a huge performance gain -- especially when it comes to not re-negotiating encryption etc.
More generally, can you think of any ideas for how to structure an API
here that are easier to use than "write some C code"? Or do you think
we should tell people to write some C code if they want to
compress/encrypt/relocate their backup in some non-standard way?
For compression and encryption, it could perhaps be as simple as "the command has to be a pipe on both input and output", basically sending the transformed data back to pg_basebackup.
But that won't help if the target is to relocate things...
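[Editorial note: the "pipe on both input and output" idea can be sketched as below; gzip stands in for an arbitrary filter command, and the wrapper function is purely illustrative.]

```python
import subprocess

def run_filter(argv, data):
    """Feed data to a filter command's stdin, return its stdout."""
    result = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout

# The backup tool would stream each file through the configured command;
# any program that reads stdin and writes stdout works as a filter.
compressed = run_filter(["gzip", "-c"], b"some file contents")
restored = run_filter(["gzip", "-dc"], compressed)
```

Because the filter only sees a byte stream, this covers compression and encryption, but, as noted, it gives the command no say in *where* the result ends up.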
For the record, I'm not against eventually having more than one way to
do this, maybe a shell-script interface for simpler things and some
kind of API for more complex needs (e.g. NetBackup integration,
perhaps). And I did wonder if there was some other way we could do
this. For instance, we could add an option --tar-everything that
sticks all the things that would have been returned by the backup
inside another level of tar file and sends the result to stdout. Then
you can pipe it into a single command that gets invoked only once for
all the data, rather than once per tablespace. That might be better,
but I'm not sure it's better. It's better if you want to do
complicated things that involve steps that happen before and after and
persistent connections and so on, but it seems worse for simple things
like piping through a non-default compressor.
That is one way to go for it -- and in a case like that, I'd suggest the shellscript interface would be an implementation of the other API. A number of times through the years I've bounced ideas around for what to do with archive_command with different people (never quite to the level of "it's time to write a patch"), and it's mostly come down to some sort of shlib api where in turn we'd ship a backwards compatible implementation that would behave like archive_command. I'd envision something similar here.
Larry Wall somewhat famously commented that a good programming
language should (and I paraphrase) make simple things simple and
complex things possible. My hesitation in going straight to a C API is
that it does not make simple things simple; and I'd like to be really
sure that there is no way of achieving that valuable goal before we
give up on it. However, there is no doubt that a C API is potentially
more powerful.
Is there another language that it would make sense to support in the form of "native plugins"? Assume we had some generic way to let people write such plugins in Python (we can then bikeshed about which language we should use). That would give them a much higher-level language, while also making a "better" API possible.
Note that I'm not suggesting supporting a python script running as a regular script -- that could easily be done by anybody making a shellscript implementation. It would be an actual API where the postgres tool would instantiate the python interpreter in-process and create an object there. This would allow things like keeping state across calls, and would also give access to the extensive library availability of the language (e.g. you could directly import an S3 compatible library to upload files etc).
Doing that for just pg_basebackup would probably be overkill, but it might be a generic choice that could extend to other things as well.
On Sun, Apr 12, 2020 at 10:09 AM Magnus Hagander <magnus@hagander.net> wrote: > There are certainly cases for it. It might not be they have to be the same connection, but still be the same session, meaning before the first time you perform some step of authentication, get a token, and then use that for all the files. You'd need somewhere to maintain that state, even if it doesn't happen to be a socket. But there are definitely plenty of cases where keeping an open socket can be a huge performance gain -- especially when it comes to not re-negotiating encryption etc. Hmm, OK. > For compression and encryption, it could perhaps be as simple as "the command has to be pipe on both input and output" and basically send the response back to pg_basebackup. > > But that won't help if the target is to relocate things... Right. And, also, it forces things to be sequential in a way I'm not too happy about. Like, if we have some kind of parallel backup, which I hope we will, then you can imagine (among other possibilities) getting files for each tablespace concurrently, and piping them through the output command concurrently. But if we emit the result in a tarfile, then it has to be sequential; there's just no other choice. I think we should try to come up with something that can work in a multi-threaded environment. > That is one way to go for it -- and in a case like that, I'd suggest the shellscript interface would be an implementation of the other API. A number of times through the years I've bounced ideas around for what to do with archive_command with different people (never quite to the level of "it's time to write a patch"), and it's mostly come down to some sort of shlib api where in turn we'd ship a backwards compatible implementation that would behave like archive_command. I'd envision something similar here. I agree. Let's imagine that there are a conceptually unlimited number of "targets" and "filters". 
Targets and filters accept data via the same API, but a target is expected to dispose of the data, whereas a filter is expected to pass it, via that same API, to a subsequent filter or target. So filters could include things like "gzip", "lz4", and "encrypt-with-rot13", whereas targets would include things like "file" (the thing we have today - write my data into some local files!), "shell" (which writes my data to a shell command, as originally proposed), and maybe eventually things like "netbackup" and "s3". Ideally this will all eventually be via a loadable module interface so that third-party filters and targets can be fully supported, but perhaps we could consider that an optional feature for v1. Note that there is quite a bit of work to do here just to reorganize the code. I would expect that we would want to provide a flexible way for a target or filter to be passed options from the pg_basebackup command line. So one might for example write this: pg_basebackup --filter='lz4 -9' --filter='encrypt-with-rot13 rotations=2' --target='shell ssh rhaas@depository pgfile create-exclusive - %f.lz4' The idea is that the first word of the filter or target identifies which one should be used, and the rest is just options text in whatever form the provider cares to accept them; but with some %<character> substitutions allowed, for things like the file name. (The aforementioned escaping problems for things like filenames with spaces in them still need to be sorted out, but this is just a sketch, so while I think it's quite solvable, I am going to refrain from proposing a precise solution here.) As to the underlying C API behind this, I propose approximately the following set of methods: 1. Begin a session. Returns a pointer to a session handle. Gets the options provided on the command line. 
In the case of a filter, also gets a pointer to the session handle for the next filter, or for the target (which means we set up the final target first, and then stack the filters on top of it). 2. Begin a file. Gets a session handle and a file name. Returns a pointer to a file handle. 3. Write data to a file. Gets a file handle, a byte count, and some bytes. 4. End a file. Gets a file handle. 5. End a session. Gets a session handle. If we get parallelism at some point, then there could be multiple files in progress at the same time. Maybe some targets, or even filters, won't be able to handle that, so we could have a flag someplace indicating that a particular target or filter isn't parallelism-capable. As an example, writing output to a bunch of files in a directory is fine to do in parallel, but if you want the entire backup in one giant tar file, you need each file sequentially. > Is there another language that it would make sense to support in the form of "native plugins". Assume we had some genericway to say let people write such plugins in python (we can then bikeshed about which language we should use). Thatwould give them a much higher level language, while also making it possible for a "better" API. The idea of using LUA has been floated before, and I imagine that an interface like the above could also be made to have language bindings for the scripting language of your choice - e.g. Python. However, I think we should start by trying to square away the C interface and then anybody who feels motivated can try to put language bindings on top of it. I tend to feel that's a bit of a fringe feature myself, since realistically shell commands are about as much as (and occasionally more than) typical users can manage. However, it would not surprise me very much if there are power users out there for whom C is too much but Python or LUA or something is just right, and if somebody builds something nifty that caters to that audience, I think that's great. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
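[Editorial note: the session/file API sketched in the message above (begin session, begin file, write, end file, end session, with filters stacked on a target) might look roughly like this. All class and method names are invented for illustration, and simple byte rotation stands in for a real transformation such as compression or encryption.]

```python
class FileTarget:
    """A 'target' disposes of the data -- here it just collects bytes in memory."""
    def begin_session(self, options):
        self.files = {}
    def begin_file(self, name):
        self.files[name] = bytearray()
        return name                      # the "file handle"
    def write(self, handle, data):
        self.files[handle].extend(data)
    def end_file(self, handle):
        pass
    def end_session(self):
        pass

class RotateFilter:
    """A 'filter' transforms the data and forwards it via the same API."""
    def __init__(self, next_stage):
        self.next = next_stage           # a subsequent filter, or the target
    def begin_session(self, options):
        self.shift = 13 * int(options.get("rotations", 1))
    def begin_file(self, name):
        return self.next.begin_file(name)
    def write(self, handle, data):
        rotated = bytes((b + self.shift) % 256 for b in data)
        self.next.write(handle, rotated)
    def end_file(self, handle):
        self.next.end_file(handle)
    def end_session(self):
        self.next.end_session()

# Set up the final target first, then stack the filter on top of it,
# mirroring the session-creation order described in the message.
target = FileTarget()
target.begin_session({})
pipeline = RotateFilter(target)
pipeline.begin_session({"rotations": "2"})
handle = pipeline.begin_file("base.tar")
pipeline.write(handle, b"backup bytes")
pipeline.end_file(handle)
pipeline.end_session()
```

Since every stage speaks the same five-method protocol, arbitrarily long filter chains compose naturally, and a parallelism-capable target is simply one that tolerates multiple files being in progress at once.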
Hi, On 2020-04-11 16:22:09 -0400, Robert Haas wrote: > On Fri, Apr 10, 2020 at 3:38 PM Andres Freund <andres@anarazel.de> wrote: > > Wouldn't there be state like a S3/ssh/https/... connection? And perhaps > > a 'backup_id' in the backup metadata DB that'd one would want to update > > at the end? > > Good question. I don't know that there would be but, uh, maybe? It's > not obvious to me why all of that would need to be done using the same > connection, but if it is, the idea I proposed isn't going to work very > nicely. Well, it depends on what you want to support. If you're only interested in supporting tarball mode ([1]), *maybe* you can get away without longer lived sessions (but I'm doubtful). But if you're interested in also supporting archiving plain files, then the cost of establishing sessions, and the latency penalty of having to wait for command completion would imo be prohibitive. A lot of solutions for storing backups can achieve pretty decent throughput, but have very significant latency. That's of course in addition to network latency itself. [1] I don't think we should restrict it that way. Would make it much more complicated to support incremental backup, pg_rewind, deduplication, etc. > More generally, can you think of any ideas for how to structure an API > here that are easier to use than "write some C code"? Or do you think > we should tell people to write some C code if they want to > compress/encrypt/relocate their backup in some non-standard way? > For the record, I'm not against eventually having more than one way to > do this, maybe a shell-script interface for simpler things and some > kind of API for more complex needs (e.g. NetBackup integration, > perhaps). And I did wonder if there was some other way we could do > this. I'm doubtful that an API based on string replacement is the way to go. It's hard for me to see how that's not either going to substantially restrict the way the "tasks" are done, or yield a very complicated interface. 
I wonder whether the best approach here could be that pg_basebackup (and perhaps other tools) opens pipes to/from a subcommand and over the pipe it communicates with the subtask using a textual ([2]) description of tasks. Like:

    backup mode=files base_directory=/path/to/data/directory
    backup_file name=base/14037/16396.14 size=1073741824
    backup_file name=pg_wal/XXXX size=16777216

or

    backup mode=tar base_directory /path/to/data/
    backup_tar name=dir.tar size=983498875687487

The obvious problem with that proposal is that we don't want to unnecessarily store the incoming data on the system pg_basebackup is running on, just for the subcommand to get access to them. More on that in a second. A huge advantage of a scheme like this would be that it wouldn't have to be specific to pg_basebackup. It could just as well work directly on the server, avoiding an unnecessary loop through the network. Which e.g. could integrate with filesystem snapshots etc. Without needing to build the 'archive target' once with server libraries, and once with client libraries. One reason I think something like this could be advantageous over a C API is that it's quite feasible to implement it from a number of different languages, including shell if really desired, without needing to provide a C API via a FFI. It'd also make it quite natural to split out compression from pg_basebackup's main process, which IME currently makes it not really feasible to use pg_basebackup's compression. There's various ways we could address the issue for how the subcommand can access the file data. The most flexible probably would be to rely on exchanging file descriptors between basebackup and the subprocess (these days all supported platforms have that, I think). Alternatively we could invoke the subcommand before really starting the backup, and ask how many files it'd like to receive in parallel, and restart the subcommand with that number of file descriptors open. 
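[Editorial note: a command stream in the key=value form shown above is trivial to parse. A minimal sketch follows; the verb-plus-pairs convention is taken from the example lines, and everything else is invented for illustration.]

```python
def parse_command(line):
    """Split a line-delimited protocol command into a verb and its key=value args."""
    verb, *pairs = line.split()
    return verb, dict(pair.split("=", 1) for pair in pairs)

# Each protocol line maps to one parsed command, easy to handle
# from shell, Python, or any other language driving the pipe.
verb, args = parse_command("backup_file name=base/14037/16396.14 size=1073741824")
```

The simplicity is the point Andres makes: a subcommand in any language can implement this without an FFI, which a C plugin API cannot offer.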
If we relied on FDs, here's an example of how a trace between pg_basebackup (BB) and a backup target command (TC) could look:

    BB: required_capabilities fd_send files
    BB: provided_capabilities fd_send file_size files tar
    TC: required_capabilities fd_send files file_size
    BB: backup mode=files base_directory=/path/to/data/directory
    BB: backup_file method=fd name=base/14037/16396.1 size=1073741824
    BB: backup_file method=fd name=base/14037/16396.2 size=1073741824
    BB: backup_file method=fd name=base/14037/16396.3 size=1073741824
    TC: fd name=base/14037/16396.1 (contains TC fd 4)
    TC: fd name=base/14037/16396.2 (contains TC fd 5)
    BB: backup_file method=fd name=base/14037/16396.4 size=1073741824
    TC: fd name=base/14037/16396.3 (contains TC fd 6)
    BB: backup_file method=fd name=base/14037/16396.5 size=1073741824
    TC: fd name=base/14037/16396.4 (contains TC fd 4)
    TC: fd name=base/14037/16396.5 (contains TC fd 5)
    BB: done
    TC: done

    backup_file type=fd mode=fd base/14037/16396.4 1073741824

or

    backup_features tar
    backup_mode tar
    base_directory /path/to/data/
    backup_tar dir.tar 983498875687487

[2] yes, I already hear json. A line-delimited format would have some advantages though. Greetings, Andres Freund
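[Editorial note: the file-descriptor exchange described above can be sketched with SCM_RIGHTS over a Unix socket pair. This assumes Python 3.9+ (for socket.send_fds/recv_fds) and a Unix-like OS, and it collapses both processes into one script purely for illustration; the message text mimics the trace above.]

```python
import os, socket, tempfile

# In reality BB and TC would be separate processes sharing this socket pair.
bb_sock, tc_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# pg_basebackup side (BB): open the output file and hand the descriptor over
path = tempfile.NamedTemporaryFile(delete=False).name
fd = os.open(path, os.O_WRONLY)
socket.send_fds(bb_sock, [b"backup_file method=fd name=base/14037/16396.1"], [fd])
os.close(fd)  # the receiving side keeps its own duplicate of the descriptor

# target-command side (TC): receive the message plus a usable copy of the fd
msg, fds, _flags, _addr = socket.recv_fds(tc_sock, 1024, 1)
os.write(fds[0], b"file contents")  # writes land in the file BB opened
os.close(fds[0])
```

This is what lets the subcommand consume file data without pg_basebackup first spooling it to local disk: only the descriptor crosses the pipe, not the bytes.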
Hi, On 2020-04-12 11:04:46 -0400, Robert Haas wrote: > I would expect that we would want to provide a flexible way for a > target or filter to be passed options from the pg_basebackup command > line. So one might for example write this: > > pg_basebackup --filter='lz4 -9' --filter='encrypt-with-rot13 > rotations=2' --target='shell ssh rhaas@depository pgfile > create-exclusive - %f.lz4' My gut feeling is that this would end up with too complicated pg_basebackup invocations, resulting in the complexity getting reimplemented in the target command. A lot of users don't want to figure out what compression, encryption, ... command makes sense for which archiving target. And e.g. an s3 target might want to integrate with an AWS HSM etc, making it unattractive to do the encryption outside the target. Greetings, Andres Freund
On 4/12/20 11:04 AM, Robert Haas wrote: > On Sun, Apr 12, 2020 at 10:09 AM Magnus Hagander <magnus@hagander.net> wrote: >> There are certainly cases for it. It might not be they have to be the same connection, but still be the same session, meaning before the first time you perform some step of authentication, get a token, and then use that for all the files. You'd need somewhere to maintain that state, even if it doesn't happen to be a socket. But there are definitely plenty of cases where keeping an open socket can be a huge performance gain -- especially when it comes to not re-negotiating encryption etc. > > Hmm, OK. When we implemented connection-sharing for S3 in pgBackRest it was a significant performance boost, even for large files since they must be uploaded in parts. The same goes for files transferred over SSH, though in this case the overhead is per-file and can be mitigated with control master. We originally (late 2013) implemented everything with command-line tools during the POC phase. The idea was to get something viable quickly and then improve as needed. At the time our config file had entries something like this:

    [global:command]
    compress=/usr/bin/gzip --stdout %file%
    decompress=/usr/bin/gzip -dc %file%
    checksum=/usr/bin/shasum %file% | awk '{print $1}'
    manifest=/opt/local/bin/gfind %path% -printf '%P\t%y\t%u\t%g\t%m\t%T@\t%i\t%s\t%l\n'
    psql=/Library/PostgreSQL/9.3/bin/psql -X %option%

    [db]
    psql_options=--cluster=9.3/main

    [db:command:option]
    psql=--port=6001

These appear to be for MacOS, but Linux would be similar. This *did* work, but it was really hard to debug when things went wrong, the per-file cost was high, and the slight differences between the command-line tools on different platforms were maddening. For example, lots of versions of 'find' would error if a file disappeared while building the manifest, which is a pretty common occurrence in PostgreSQL (most newer distros had an option to fix this). 
I know that doesn't apply here, but it's an example. Also, debugging was complicated with so many processes, with any degree of parallelism the process list got pretty crazy, fsync was not happening, etc. It's been a long time but I don't have any good memories of the solution that used all command-line tools. Once we had a POC that solved our basic problem, i.e. backing up about 50TB of data reasonably efficiently, we immediately started working on a version that did not rely on command-line tools and we never looked back. Currently the only command-line tool we use is ssh. I'm sure it would be possible to create a solution that worked better than ours, but I'm pretty certain it would still be hard for users to make it work correctly and to prove it worked correctly. >> For compression and encryption, it could perhaps be as simple as "the command has to be pipe on both input and output" and basically send the response back to pg_basebackup. >> >> But that won't help if the target is to relocate things... > > Right. And, also, it forces things to be sequential in a way I'm not > too happy about. Like, if we have some kind of parallel backup, which > I hope we will, then you can imagine (among other possibilities) > getting files for each tablespace concurrently, and piping them > through the output command concurrently. But if we emit the result in > a tarfile, then it has to be sequential; there's just no other choice. > I think we should try to come up with something that can work in a > multi-threaded environment. > >> That is one way to go for it -- and in a case like that, I'd suggest the shellscript interface would be an implementation of the other API. A number of times through the years I've bounced ideas around for what to do with archive_command with different people (never quite to the level of "it's time to write a patch"), and it's mostly come down to some sort of shlib api where in turn we'd ship a backwards compatible implementation that would behave like archive_command. I'd envision something similar here. > > I agree. Let's imagine that there are a conceptually unlimited number > of "targets" and "filters". Targets and filters accept data via the > same API, but a target is expected to dispose of the data, whereas a > filter is expected to pass it, via that same API, to a subsequent > filter or target. So filters could include things like "gzip", "lz4", > and "encrypt-with-rot13", whereas targets would include things like > "file" (the thing we have today - write my data into some local > files!), "shell" (which writes my data to a shell command, as > originally proposed), and maybe eventually things like "netbackup" and > "s3". Ideally this will all eventually be via a loadable module > interface so that third-party filters and targets can be fully > supported, but perhaps we could consider that an optional feature for > v1. Note that there is quite a bit of work to do here just to > reorganize the code. > > I would expect that we would want to provide a flexible way for a > target or filter to be passed options from the pg_basebackup command > line. So one might for example write this: > > pg_basebackup --filter='lz4 -9' --filter='encrypt-with-rot13 > rotations=2' --target='shell ssh rhaas@depository pgfile > create-exclusive - %f.lz4' > > The idea is that the first word of the filter or target identifies > which one should be used, and the rest is just options text in > whatever form the provider cares to accept them; but with some > %<character> substitutions allowed, for things like the file name.
> (The aforementioned escaping problems for things like filenames with > spaces in them still need to be sorted out, but this is just a sketch, > so while I think it's quite solvable, I am going to refrain from > proposing a precise solution here.) This is basically the solution we have landed on after many iterations. We implement two types of filters, In and InOut. The In filters process data and produce a result, e.g. SHA1, size, page checksum, etc. The InOut filters modify data, e.g. compression, encryption. Yeah, the names could probably be better... I have attached our filter interface (filter.intern.h) as a concrete example of how this works. We call 'targets' storage and have a standard interface for creating storage drivers. I have also attached our storage interface (storage.intern.h) as a concrete example of how this works. Note that for just performing backup this is overkill, but once you consider verify this is pretty much the minimum storage interface needed, according to our experience. Regards, -- -David david@pgmasters.net
Attachments
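David's In/InOut filter split can be sketched as follows. These are hypothetical class names illustrating the concept, not pgBackRest's actual filter.intern.h API:

```python
import gzip
import hashlib

# An "In" filter observes the stream and accumulates a result (checksum, size);
# an "InOut" filter transforms the data. Both see every chunk as it passes.

class Sha1Filter:
    """In filter: passes data through unchanged, produces a digest at the end."""
    def __init__(self):
        self._hash = hashlib.sha1()
    def process(self, chunk):
        self._hash.update(chunk)
        return chunk
    def result(self):
        return self._hash.hexdigest()

class GzipFilter:
    """InOut filter: modifies the data (here, each chunk becomes a gzip member)."""
    def process(self, chunk):
        return gzip.compress(chunk)

def run_chain(chunks, filters):
    # Push each chunk through every filter in order, as a backup copy would.
    out = []
    for chunk in chunks:
        for f in filters:
            chunk = f.process(chunk)
        out.append(chunk)
    return b"".join(out)

sha1 = Sha1Filter()
compressed = run_chain([b"block one ", b"block two"], [sha1, GzipFilter()])
```

Note the ordering matters: the SHA1 filter sits before compression so the checksum covers the original bytes, which is what you want for later verification.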
On 4/12/20 3:17 PM, Andres Freund wrote: > >> More generally, can you think of any ideas for how to structure an API >> here that are easier to use than "write some C code"? Or do you think >> we should tell people to write some C code if they want to >> compress/encrypt/relocate their backup in some non-standard way? > >> For the record, I'm not against eventually having more than one way to >> do this, maybe a shell-script interface for simpler things and some >> kind of API for more complex needs (e.g. NetBackup integration, >> perhaps). And I did wonder if there was some other way we could do >> this. > > I'm doubtful that an API based on string replacement is the way to > go. It's hard for me to see how that's not either going to substantially > restrict the way the "tasks" are done, or yield a very complicated > interface. > > I wonder whether the best approach here could be that pg_basebackup (and > perhaps other tools) opens pipes to/from a subcommand and over the pipe > it communicates with the subtask using a textual ([2]) description of > tasks. Like: > > backup mode=files base_directory=/path/to/data/directory > backup_file name=base/14037/16396.14 size=1073741824 > backup_file name=pg_wal/XXXX size=16777216 > or > backup mode=tar > base_directory /path/to/data/ > backup_tar name=dir.tar size=983498875687487 This is pretty much what pgBackRest does. We call them "local" processes and they do most of the work during backup/restore/archive-get/archive-push. > The obvious problem with that proposal is that we don't want to > unnecessarily store the incoming data on the system pg_basebackup is > running on, just for the subcommand to get access to them. More on that > in a second. We also implement "remote" processes so the local processes can get data that doesn't happen to be local, i.e. on a remote PostgreSQL cluster. > A huge advantage of a scheme like this would be that it wouldn't have to > be specific to pg_basebackup. 
be specific to pg_basebackup. It could just as well work directly on the > server, avoiding an unnecessary loop through the network. Which > e.g. could integrate with filesystem snapshots etc. Without needing to > build the 'archive target' once with server libraries, and once with > client libraries. Yes -- needing to store the data locally or stream it through one main process is a major bottleneck. Working on the server is key because it allows you to compress before transferring the data. With parallel processing it is trivial to flood a network. We have a recent example from a community user of backing up 25TB in 4 hours. Compression on the server makes this possible (and a fast network, in this case). For security reasons, it's also nice to be able to encrypt data before it leaves the database server. Calculating checksums/size at the source is also ideal. > One reason I think something like this could be advantageous over a C > API is that it's quite feasible to implement it from a number of > different languages, including shell if really desired, without needing > to provide a C API via a FFI. We migrated from Perl to C and kept our local/remote protocol the same, which really helped. So, we had times when the C code was using a Perl local/remote and vice versa. The idea is certainly workable in our experience. > It'd also make it quite natural to split out compression from > pg_basebackup's main process, which IME currently makes it not really > feasible to use pg_basebackup's compression. This is a major advantage. > There's various ways we could address the issue for how the subcommand > can access the file data. The most flexible probably would be to rely on > exchanging file descriptors between basebackup and the subprocess (these > days all supported platforms have that, I think).
Alternatively we > could invoke the subcommand before really starting the backup, and ask > how many files it'd like to receive in parallel, and restart the > subcommand with that number of file descriptors open. We don't exchange FDs. Each local is responsible for getting the data from PostgreSQL or the repo based on knowing the data source and a path. For pg_basebackup, however, I'd imagine each local would want a replication connection with the ability to request specific files that were passed to it by the main process. > [2] yes, I already hear json. A line delimited format would have some > advantages though. We use JSON, but each protocol request/response is linefeed-delimited. So for example here's what it looks like when the main process requests a local process to back up a specific file: {"cmd":"backupFile","param":["base/32768/33001",true,65536,null,true,0,"pg_data/base/32768/33001",false,0,3,"20200412-213313F",false,null]} And the local responds with: {"out":[1,65536,65536,"6bf316f11d28c28914ea9be92c00de9bea6d9a6b",{"align":true,"error":[0,[3,5],7],"valid":false}]} We use arrays for parameters but of course these could be done with objects for more readability. We are considering a move to HTTP since lots of services (e.g. S3, GCS, Azure, etc.) require it (so we implement it) and we're not sure it makes sense to maintain our own protocol format. That said, we'd still prefer to use JSON for our payloads (like GCS) rather than XML (as S3 does). Regards, -- -David david@pgmasters.net
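The linefeed-delimited JSON framing described above is simple to sketch. Field names here loosely mirror the example messages; this is an illustration of the framing, not pgBackRest's actual protocol code:

```python
import io
import json

# One JSON document per line: the reader frames messages with readline(),
# so neither side of the pipe needs a streaming JSON parser.

def send(stream, msg):
    stream.write(json.dumps(msg, separators=(",", ":")) + "\n")

def recv(stream):
    line = stream.readline()
    return json.loads(line) if line else None  # None signals EOF

pipe = io.StringIO()  # stands in for the pipe between main and local process
send(pipe, {"cmd": "backupFile", "param": ["base/32768/33001", 65536]})
send(pipe, {"out": [1, 65536, "6bf316f11d28c28914ea9be92c00de9bea6d9a6b"]})
pipe.seek(0)

request = recv(pipe)
response = recv(pipe)
```

The embedded-newline problem that makes line framing fragile for arbitrary text goes away here, because JSON string escaping guarantees a serialized document never contains a raw linefeed.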
Hi, On 2020-04-12 17:57:05 -0400, David Steele wrote: > On 4/12/20 3:17 PM, Andres Freund wrote: > > [proposal outline] > > This is pretty much what pgBackRest does. We call them "local" processes and > they do most of the work during backup/restore/archive-get/archive-push. Hah. I swear, I didn't look. > > The obvious problem with that proposal is that we don't want to > > unnecessarily store the incoming data on the system pg_basebackup is > > running on, just for the subcommand to get access to them. More on that > > in a second. > > We also implement "remote" processes so the local processes can get data > that doesn't happen to be local, i.e. on a remote PostgreSQL cluster. What is the interface between those? I.e. do the files have to be spooled as a whole locally? > > There's various ways we could address the issue for how the subcommand > > can access the file data. The most flexible probably would be to rely on > > exchanging file descriptors between basebackup and the subprocess (these > > days all supported platforms have that, I think). Alternatively we > > could invoke the subcommand before really starting the backup, and ask > > how many files it'd like to receive in parallel, and restart the > > subcommand with that number of file descriptors open. > > We don't exchange FDs. Each local is responsible for getting the data from > PostgreSQL or the repo based on knowing the data source and a path. For > pg_basebackup, however, I'd imagine each local would want a replication > connection with the ability to request specific files that were passed to it > by the main process. I don't like this much. It'll push more complexity into each of the "targets" and we can't easily share that complexity. And also, needing to request individual files will add a lot of back/forth, and thus latency issues. The server would always have to pre-send a list of files, we'd have to deal with those files vanishing, etc. > > [2] yes, I already hear json.
A line delimited format would have some > > advantages though. > > We use JSON, but each protocol request/response is linefeed-delimited. So > for example here's what it looks like when the main process requests a local > process to back up a specific file: > > {"cmd":"backupFile","param":["base/32768/33001",true,65536,null,true,0,"pg_data/base/32768/33001",false,0,3,"20200412-213313F",false,null]} > > And the local responds with: > > {"out":[1,65536,65536,"6bf316f11d28c28914ea9be92c00de9bea6d9a6b",{"align":true,"error":[0,[3,5],7],"valid":false}]} As long as it's line delimited, I don't really care :) > We are considering a move to HTTP since lots of services (e.g. S3, GCS, > Azure, etc.) require it (so we implement it) and we're not sure it makes > sense to maintain our own protocol format. That said, we'd still prefer to > use JSON for our payloads (like GCS) rather than XML (as S3 does). I'm not quite sure what you mean here? You mean actual requests for each of what currently are lines? If so, that sounds *terrible*. Greetings, Andres Freund
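The file-descriptor exchange Andres keeps coming back to relies on SCM_RIGHTS ancillary data over a Unix-domain socket. A minimal sketch, using Python 3.9's socket.send_fds/recv_fds wrappers to stand in for what would be C sendmsg/recvmsg code in pg_basebackup:

```python
import os
import socket
import tempfile

# The backup process opens the file and passes the open descriptor (plus the
# file name as the normal message payload) to the target subprocess. The
# kernel duplicates the descriptor at send time, so the receiver reads the
# same open file without ever touching a path.

parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

with tempfile.TemporaryFile() as f:
    f.write(b"tar stream contents")
    f.seek(0)
    socket.send_fds(parent_sock, [b"base.tar"], [f.fileno()])

# The in-flight descriptor stays valid even though the sender closed its copy.
name, fds, _, _ = socket.recv_fds(child_sock, 1024, maxfds=1)
with os.fdopen(fds[0], "rb") as received:
    data = received.read()
```

Note the received descriptor shares the file offset with the sender's, which is why the sketch seeks to 0 before sending.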
On 4/12/20 6:37 PM, Andres Freund wrote: > Hi, > > On 2020-04-12 17:57:05 -0400, David Steele wrote: >> On 4/12/20 3:17 PM, Andres Freund wrote: >>> [proposal outline] >> >> This is pretty much what pgBackRest does. We call them "local" processes and >> they do most of the work during backup/restore/archive-get/archive-push. > > Hah. I swear, I didn't look. I believe you. If you spend enough time thinking about this (and we've spent a lot) then I think this is where you arrive. >>> The obvious problem with that proposal is that we don't want to >>> unnecessarily store the incoming data on the system pg_basebackup is >>> running on, just for the subcommand to get access to them. More on that >>> in a second. >> >> We also implement "remote" processes so the local processes can get data >> that doesn't happen to be local, i.e. on a remote PostgreSQL cluster. > > What is the interface between those? I.e. do the files have to be > spooled as a whole locally? Currently we use SSH to talk to a remote, but we are planning on using our own TLS servers in the future. We don't spool anything -- the file is streamed from the PostgreSQL server (via remote protocol if needed) to the repo (which could also be remote, e.g. S3) without spooling to disk. We have buffers, of course, which are configurable with the buffer-size option. >>> There's various ways we could address the issue for how the subcommand >>> can access the file data. The most flexible probably would be to rely on >>> exchanging file descriptors between basebackup and the subprocess (these >>> days all supported platforms have that, I think). Alternatively we >>> could invoke the subcommand before really starting the backup, and ask >>> how many files it'd like to receive in parallel, and restart the >>> subcommand with that number of file descriptors open. >> >> We don't exchange FDs. Each local is responsible for getting the data from >> PostgreSQL or the repo based on knowing the data source and a path.
For >> pg_basebackup, however, I'd imagine each local would want a replication >> connection with the ability to request specific files that were passed to it >> by the main process. > > I don't like this much. It'll push more complexity into each of the > "targets" and we can't easily share that complexity. And also, needing > to request individual files will add a lot of back/forth, and thus > latency issues. The server would always have to pre-send a list of > files, we'd have to deal with those files vanishing, etc. Sure, unless we had a standard interface to "get a file from the PostgreSQL cluster", which is what pgBackRest has via the storage interface. Attached is our implementation for "backupFile". I think it's pretty concise considering what it does. Most of it is dedicated to checksum deltas and backup resume. The straight copy with filters starts at line 189. >>> [2] yes, I already hear json. A line delimited format would have some >>> advantages though. >> >> We use JSON, but each protocol request/response is linefeed-delimited. So >> for example here's what it looks like when the main process requests a local >> process to back up a specific file: >> >> {"cmd":"backupFile","param":["base/32768/33001",true,65536,null,true,0,"pg_data/base/32768/33001",false,0,3,"20200412-213313F",false,null]} >> >> And the local responds with: >> >> {"out":[1,65536,65536,"6bf316f11d28c28914ea9be92c00de9bea6d9a6b",{"align":true,"error":[0,[3,5],7],"valid":false}]} > > As long as it's line delimited, I don't really care :) Agreed. >> We are considering a move to HTTP since lots of services (e.g. S3, GCS, >> Azure, etc.) require it (so we implement it) and we're not sure it makes >> sense to maintain our own protocol format. That said, we'd still prefer to >> use JSON for our payloads (like GCS) rather than XML (as S3 does). > > I'm not quite sure what you mean here? You mean actual requests for each > of what currently are lines? If so, that sounds *terrible*.
I know it sounds like a lot, but in practice the local (currently) only performs four operations: backup file, restore file, push file to archive, get file from archive. In that context a little protocol overhead won't be noticed so if it means removing redundant code I'm all for it. That said, we have not done this yet -- it's just under consideration. Regards, -- -David david@pgmasters.net
Attachments
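With only four operations, the local process's protocol handling can stay a small dispatch table. A sketch with hypothetical handlers, loosely mirroring the "cmd"/"param" request shape shown earlier in the thread:

```python
# Requests carry a "cmd" name and positional "param" values; the local process
# looks the command up in a table and returns an "out" (or "err") document.

def backup_file(params):
    return {"out": ["copied", params[0]]}

def restore_file(params):
    return {"out": ["restored", params[0]]}

HANDLERS = {
    "backupFile": backup_file,
    "restoreFile": restore_file,
    # "archivePush" and "archiveGet" would register the same way.
}

def dispatch(request):
    handler = HANDLERS.get(request["cmd"])
    if handler is None:
        return {"err": "unknown command: " + request["cmd"]}
    return handler(request.get("param", []))

reply = dispatch({"cmd": "backupFile", "param": ["base/32768/33001"]})
```

Keeping the command surface this narrow is what makes the per-request protocol overhead negligible, as David argues above.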
On Sun, Apr 12, 2020 at 3:17 PM Andres Freund <andres@anarazel.de> wrote: > A huge advantage of a scheme like this would be that it wouldn't have to > be specific to pg_basebackup. It could just as well work directly on the > server, avoiding an unnecessary loop through the network. Which > e.g. could integrate with filesystem snapshots etc. Without needing to > build the 'archive target' once with server libraries, and once with > client libraries. That's quite appealing. One downside - IMHO significant - is that you have to have a separate process to do *anything*. If you want to add a filter that just logs everything it's asked to do, for example, you've gotta have a whole process for that, which likely adds a lot of overhead even if you can somehow avoid passing all the data through an extra set of pipes. The interface I proposed would allow you to inject very lightweight filters at very low cost. This design really doesn't. Note that you could build this on top of what I proposed, but not the other way around. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * David Steele (david@pgmasters.net) wrote: > On 4/12/20 6:37 PM, Andres Freund wrote: > >On 2020-04-12 17:57:05 -0400, David Steele wrote: > >>On 4/12/20 3:17 PM, Andres Freund wrote: > >>>There's various ways we could address the issue for how the subcommand > >>>can access the file data. The most flexible probably would be to rely on > >>>exchanging file descriptors between basebackup and the subprocess (these > >>>days all supported platforms have that, I think). Alternatively we > >>>could invoke the subcommand before really starting the backup, and ask > >>>how many files it'd like to receive in parallel, and restart the > >>>subcommand with that number of file descriptors open. > >> > >>We don't exchange FDs. Each local is responsible for getting the data from > >>PostgreSQL or the repo based on knowing the data source and a path. For > >>pg_basebackup, however, I'd imagine each local would want a replication > >>connection with the ability to request specific files that were passed to it > >>by the main process. > > > >I don't like this much. It'll push more complexity into each of the > >"targets" and we can't easily share that complexity. And also, needing > >to request individual files will add a lot of back/forth, and thus > >latency issues. The server would always have to pre-send a list of > >files, we'd have to deal with those files vanishing, etc. > > Sure, unless we had a standard interface to "get a file from the PostgreSQL > cluster", which is what pgBackRest has via the storage interface. There's a couple of other pieces here that I think bear mentioning. The first is that pgBackRest has an actual 'restore' command- and that works with the filters and works with the storage drivers, so what you're looking at when it comes to these interfaces isn't just "put a file" but it's also "get a file". That's actually quite important to have when you start thinking about these more complicated methods of doing backups. 
That then leads into the fact that, with a manifest, you can do things like excluding 0-byte files from going through any of this processing or from being stored (which costs actual money too, with certain cloud storage options..), or even for just storing *small* files, which we tend to have lots of in PG and which also end up costing more and you end up 'losing' money because you've got lots of 8K files around. We haven't fully optimized for it in pgBackRest, yet, but avoiding having lots of little files (again, because there's real $$ costs involved) is something we actively think about and consider and is made possible when you've got a 'restore' command. Having a manifest where a given file might actually be a reference to a *part* of a file (ie: pgbackrest_smallfiles, offset: 8192, length: 16384) could result in savings when using cloud storage. These are the kinds of things we're thinking about today. Maybe there's some way you could implement something like that using shell commands as an API, but it sure looks like it'd be pretty hard from here. Even just managing to get users to use the right shell commands for backup, and then the right ones for restore, seems awful daunting. I get that I'm probably going to get flak for playing up the 'worst case', but the reality is that far too many people don't fully test their restore processes and trying to figure out the right shell commands to pass into some 'restore' command, or even just to pull all of the data back down from $cloudstorage to perform a restore, when everything is down and your boss is breathing down your neck to get it all back online as fast as possible, isn't how I want this project to be remembered. David and I are constantly talking about how to make the restore process as smooth and as fast as possible, because that's where the rubber really meets the road- you've gotta make that part easy and fast because that's the high-pressure situation. 
Taking backups is rarely where the real pressure is at- sure, take it today, take it tomorrow, let it run for a few hours, it's all fine, but when you need something restored, you best make that as simple and as fast as absolutely possible because that's the time when your entire business is potentially going to be offline and waiting for you to get everything back up. Thanks, Stephen
Attachments
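Stephen's small-file idea above — a manifest entry that points at a slice of a shared bundle rather than a standalone object — can be sketched like this. The names (e.g. smallfiles.1) are hypothetical, echoing his pgbackrest_smallfiles example:

```python
import io

# Many small files are packed into one bundle object; the manifest records,
# per file, the bundle name plus offset/length. Restore can then seek straight
# to the slice instead of fetching thousands of tiny objects from cloud
# storage, each of which costs a request.

bundle = io.BytesIO()  # stands in for one stored bundle object
manifest = {}

def pack(name, data):
    manifest[name] = {"bundle": "smallfiles.1",
                      "offset": bundle.tell(),
                      "length": len(data)}
    bundle.write(data)

def unpack(name):
    entry = manifest[name]
    bundle.seek(entry["offset"])
    return bundle.read(entry["length"])

pack("base/1/112", b"\x00" * 16)
pack("base/1/113", b"small relation file")
```

This only works because restore consults the manifest rather than a directory listing, which is Stephen's point: a real 'restore' command is what makes such storage layouts possible at all.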
Hi, On 2020-04-12 20:02:50 -0400, Robert Haas wrote: > On Sun, Apr 12, 2020 at 3:17 PM Andres Freund <andres@anarazel.de> wrote: > > A huge advantage of a scheme like this would be that it wouldn't have to > > be specific to pg_basebackup. It could just as well work directly on the > > server, avoiding an unnecessary loop through the network. Which > > e.g. could integrate with filesystem snapshots etc. Without needing to > > build the 'archive target' once with server libraries, and once with > > client libraries. > > That's quite appealing. One downside - IMHO significant - is that you > have to have a separate process to do *anything*. If you want to add a > filter that just logs everything it's asked to do, for example, you've > gotta have a whole process for that, which likely adds a lot of > overhead even if you can somehow avoid passing all the data through an > extra set of pipes. The interface I proposed would allow you to inject > very lightweight filters at very low cost. This design really doesn't. Well, in what you described it'd still be all done inside pg_basebackup, or did I misunderstand? Once you fetched it from the server, I can't imagine the overhead of filtering it a bit differently would matter. But even if, the "target" could just reply with "skip" or such, instead of providing an fd. What kind of filtering are you thinking of where this is a problem? Besides just logging the filenames? I just can't imagine how that's a relevant overhead compared to having to do things like 'shell ssh rhaas@depository pgfile create-exclusive - %f.lz4' I really think we want the option to eventually do this server-side. And I don't quite see it as viable to go for an API that allows to specify shell fragments that are going to be executed server side. > Note that you could build this on top of what I proposed, but not the > other way around. Why should it not be possible the other way round? Greetings, Andres Freund
Greetings, Answering both in one since they're largely the same. * Bruce Momjian (bruce@momjian.us) wrote: > On Fri, Apr 10, 2020 at 10:54:10AM -0400, Stephen Frost wrote: > > * Robert Haas (robertmhaas@gmail.com) wrote: > > > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote: > > > > Good point, but if there are multiple APIs, it makes shell script > > > > flexibility even more useful. > > > > > > This is really the key point for me. There are so many existing tools > > > that store a file someplace that we really can't ever hope to support > > > them all in core, or even to have well-written extensions that support > > > them all available on PGXN or wherever. We need to integrate with the > > > tools that other people have created, not try to reinvent them all in > > > PostgreSQL. > > > > So, this goes to what I was just mentioning to Bruce independently- you > > could have made the same argument about FDWs, but it just doesn't > > actually hold any water. Sure, some of the FDWs aren't great, but > > there's certainly no shortage of them, and the ones that are > > particularly important (like postgres_fdw) are well written and in core. > > No, no one made that argument. It isn't clear how a shell script API > would map to relational database queries. The point is how well the > APIs match, and then if they are close, does it give us the flexibility > we need. You can't just look at flexibility without an API match. If what we're talking about is the file_fdw, which certainly isn't very complicated, it's not hard to see how you could use shell scripts for it. What happens is that it starts to get harder and require custom code when you want to do something more complex- which is very nearly what we're talking about here too. Sure, for a simple 'bzip2' filter, a shell script might be alright, but it's not going to cut it for the more complex use-cases that users, today, expect solutions to. 
* Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Apr 10, 2020 at 10:54 AM Stephen Frost <sfrost@snowman.net> wrote: > > So, this goes to what I was just mentioning to Bruce independently- you > > could have made the same argument about FDWs, but it just doesn't > > actually hold any water. Sure, some of the FDWs aren't great, but > > there's certainly no shortage of them, and the ones that are > > particularly important (like postgres_fdw) are well written and in core. > > That's a fairly different use case. In the case of the FDW interface: There's two different questions we're talking about here and I feel like they're being conflated. To try and clarify: - Could you implement FDWs with shell scripts, and custom programs? I'm pretty confident that the answer is yes, but the thrust of that argument is primarily to show that you *can* implement just about anything using a shell script "API", so just saying it's possible to do doesn't make it necessarily a good solution. The FDW system is complicated, and also good, because we made it so and because it's possible to do more sophisticated things with a C API, but it could have started out with shell scripts that just returned data in much the same way that COPY PROGRAM works today. What matters is that forward thinking to consider what you're going to want to do tomorrow, not just thinking about how you can solve for the simple cases today with a shell out to an existing command. - Does providing a C-library interface deter people from implementing solutions that use that interface? Perhaps it does, but it doesn't have nearly the dampening effect that is being portrayed here, and we can see that pretty clearly from the FDW situation. Sure, not all of those are good solutions, but lots and lots of archive command shell scripts are also pretty terrible, and there *are* a few good solutions out there, including the ones that we ourselves ship. 
At least when it comes to FDWs, there's an option there for us to ship a *good* answer ourselves for certain (and, in particular, the very very common) use-cases. > - We're only talking about writing a handful of tar files, and that's > in the context of a full-database backup, which is a much > heavier-weight operation than a query. This is true for -Ft, but not -Fp, and I don't think there's enough thought being put into this when it comes to parallelism and that you don't want to be limited to one process per tablespace. > - There is not really any state that needs to be maintained across calls. As mentioned elsewhere, this isn't really true. > > How does this solution give them a good way to do the right thing > > though? In a way that will work with large databases and complex > > requirements? The answer seems to be "well, everyone will have to write > > their own tool to do that" and that basically means that, at best, we're > > only providing half of a solution and expecting all of our users to > > provide the other half, and to always do it correctly and in a well > > written way. Acknowledging that most users aren't going to actually do > > that and instead they'll implement half measures that aren't reliable > > shouldn't be seen as an endorsement of this approach. > > I don't acknowledge that. I think it's possible to use tools like the > proposed option in a perfectly reliable way, and I've already given a > bunch of examples of how it could be done. Writing a file is not such > a complex operation that every bit of code that writes one reliably > has to be written by someone associated with the PostgreSQL project. I > strongly suspect that people who use a cloud provider's tools to > upload their backup files will be quite happy with the results, and if > they aren't, I hope they will blame the cloud provider's tool for > eating the data rather than this option for making it easy to give the > data to the thing that ate it. 
The examples you've given of how this could be done "right" involve someone writing custom code (or having code that's been written by the PG project) to be executed from this shell command interface, even just to perform a local backup. As for where the blame goes, I don't find that to be a particularly useful thing to argue about. In any of this, if we are ultimately saying "well, it's the user's fault, or the fault of the tools that the user chose to use with our interface" then it seems like we've lost. Maybe that's going too far and maybe we can't hold ourselves to that high of a standard but I like to think of this project, in particular, as being the one that's trying really hard to go as far in that direction as possible. To that end, if we contemplate adding support for some cloud vendor's storage, as an example, and discover that the command line tools for it suck or don't meet our expectations, I'd expect us to either refuse to support it, or to forgo using the command-line tools and instead implement support for talking to the cloud storage interface directly, if it works well. Thanks, Stephen
Attachments
On Sun, Apr 12, 2020 at 09:18:28PM -0400, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > On Fri, Apr 10, 2020 at 10:54:10AM -0400, Stephen Frost wrote: > > > * Robert Haas (robertmhaas@gmail.com) wrote: > > > > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote: > > > > > Good point, but if there are multiple APIs, it makes shell script > > > > > flexibility even more useful. > > > > > > > > This is really the key point for me. There are so many existing tools > > > > that store a file someplace that we really can't ever hope to support > > > > them all in core, or even to have well-written extensions that support > > > > them all available on PGXN or wherever. We need to integrate with the > > > > tools that other people have created, not try to reinvent them all in > > > > PostgreSQL. > > > > > > So, this goes to what I was just mentioning to Bruce independently- you > > > could have made the same argument about FDWs, but it just doesn't > > > actually hold any water. Sure, some of the FDWs aren't great, but > > > there's certainly no shortage of them, and the ones that are > > > particularly important (like postgres_fdw) are well written and in core. > > > > No, no one made that argument. It isn't clear how a shell script API > > would map to relational database queries. The point is how well the > > APIs match, and then if they are close, does it give us the flexibility > > we need. You can't just look at flexibility without an API match. > > If what we're talking about is the file_fdw, which certainly isn't very > complicated, it's not hard to see how you could use shell scripts for > it. What happens is that it starts to get harder and require custom > code when you want to do something more complex- which is very nearly > what we're talking about here too. 
Sure, for a simple 'bzip2' filter, a > shell script might be alright, but it's not going to cut it for the more > complex use-cases that users, today, expect solutions to. Well, file_fdw is the simplest FDW, and we might have been able to do that in a shell script, but almost all the other FDWs couldn't, so we might as well choose a C API for FDWs and use the same one for file_fdw. It seems like basic engineering that you choose the API that meets most of your deployment requirements, and all of the required ones. > To that end, if we contemplate adding support for some cloud vendor's > storage, as an example, and discover that the command line tools for it > suck or don't meet our expectations, I'd expect us to either refuse to > support it, or to forgo using the command-line tools and instead > implement support for talking to the cloud storage interface directly, > if it works well. Do we choose a more inflexible API on a hypothetical risk? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Sun, Apr 12, 2020 at 9:18 PM Stephen Frost <sfrost@snowman.net> wrote: > There's two different questions we're talking about here and I feel like > they're being conflated. To try and clarify: > > - Could you implement FDWs with shell scripts, and custom programs? I'm > pretty confident that the answer is yes, but the thrust of that > argument is primarily to show that you *can* implement just about > anything using a shell script "API", so just saying it's possible to > do doesn't make it necessarily a good solution. The FDW system is > complicated, and also good, because we made it so and because it's > possible to do more sophisticated things with a C API, but it could > have started out with shell scripts that just returned data in much > the same way that COPY PROGRAM works today. What matters is the > forward thinking that considers what you're going to want to do tomorrow, > not just how you can solve the simple cases today > with a shell out to an existing command. > > - Does providing a C-library interface deter people from implementing > solutions that use that interface? Perhaps it does, but it doesn't > have nearly the dampening effect that is being portrayed here, and we > can see that pretty clearly from the FDW situation. Sure, not all of > those are good solutions, but lots and lots of archive command shell > scripts are also pretty terrible, and there *are* a few good solutions > out there, including the ones that we ourselves ship. At least when > it comes to FDWs, there's an option there for us to ship a *good* > answer ourselves for certain (and, in particular, the very very > common) use-cases. > > > - We're only talking about writing a handful of tar files, and that's > > in the context of a full-database backup, which is a much > > heavier-weight operation than a query. 
> > This is true for -Ft, but not -Fp, and I don't think there's enough > thought being put into this when it comes to parallelism and that you > don't want to be limited to one process per tablespace. > > > - There is not really any state that needs to be maintained across calls. > > As mentioned elsewhere, this isn't really true. These are fair points, and my thinking has been somewhat refined by this discussion, so let me try to clarify my (current) position a bit. I believe that there are two subtly different questions here. Question #1 is "Would it be useful to people to be able to pipe the tar files that they get from pg_basebackup into some other command rather than writing them to the filesystem, and should we give them the option to do so?" Question #2 is "Is piping the tar files that pg_basebackup would produce into some other program the best possible way of providing more flexibility about where backups get written?" I'm prepared to concede that the answer to question #2 is no. I had earlier assumed that establishing connections was pretty fast and that, even if not, there were solutions to that problem, like setting up an SSH tunnel in advance. Several people have said, well, no, establishing connections is a problem. As I acknowledged from the beginning, plain format backups are a problem. So I think a convincing argument has been made that a shell command won't meet everyone's needs, and a more complex API is required for some cases. But I still think the answer to question #1 is yes. I disagree entirely with any argument to the effect that because some users might do unsafe things with the option, we ought not to provide it. Practically speaking, it would work fine for many people even with no other changes, and if we add something like pgfile, which I'm willing to do, it would work for more people in more situations. It is a useful thing to have, full stop. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
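The "pgfile" helper mentioned above is only sketched on this thread; as a rough illustration of the idea (the function name and exact behavior here are guesses at the intent, not a real tool), a create-exclusive mode need be little more than open() with O_CREAT | O_EXCL plus a copy loop, so that a backup destination can never be silently overwritten:

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical core of "pgfile create-exclusive <path>": copy everything
 * from srcfd into a newly created file at path.  O_EXCL makes creation
 * fail if the file already exists.  Returns 0 on success, -1 on error.
 * (A real tool would also fsync, and unlink on partial failure.) */
static int
create_exclusive(const char *path, int srcfd)
{
    char        buf[8192];
    ssize_t     nread;
    int         fd;

    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    while ((nread = read(srcfd, buf, sizeof(buf))) > 0)
    {
        if (write(fd, buf, nread) != nread)
        {
            close(fd);
            return -1;
        }
    }
    close(fd);
    return (nread == 0) ? 0 : -1;
}
```

With something along these lines installed on the far side, a pipe target like 'ssh depository pgfile create-exclusive - %f.lz4' refuses to clobber an existing backup rather than quietly corrupting it.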
On Sun, Apr 12, 2020 at 8:27 PM Andres Freund <andres@anarazel.de> wrote: > > That's quite appealing. One downside - IMHO significant - is that you > > have to have a separate process to do *anything*. If you want to add a > > filter that just logs everything it's asked to do, for example, you've > > gotta have a whole process for that, which likely adds a lot of > > overhead even if you can somehow avoid passing all the data through an > > extra set of pipes. The interface I proposed would allow you to inject > > very lightweight filters at very low cost. This design really doesn't. > > Well, in what you described it'd still be all done inside pg_basebackup, > or did I misunderstand? Once you fetched it from the server, I can't > imagine the overhead of filtering it a bit differently would matter. > > But even if, the "target" could just reply with "skip" or such, instead > of providing an fd. > > What kind of filtering are you thinking of where this is a problem? > Besides just logging the filenames? I just can't imagine how that's a > relevant overhead compared to having to do things like > 'shell ssh rhaas@depository pgfile create-exclusive - %f.lz4' Anything you want to do in the same process. I mean, right now we have basically one target (filesystem) and one filter (compression). Neither of those things spawn a process. It seems logical to imagine that there might be other things that are similar in the future. It seems to me that there are definitely things where you will want to spawn a process; that's why I like having shell commands as one option. But I don't think we should require that you can't have a filter or a target unless you also spawn a process for it. > I really think we want the option to eventually do this server-side. And > I don't quite see it as viable to go for an API that allows to specify > shell fragments that are going to be executed server side. 
The server-side thing is a good point, but I think it adds quite a bit of complexity, too. I'm worried that this is ballooning to an unworkable amount of complexity - and not just code complexity, but bikeshedding complexity, too. Like, I started with a command-line option that could probably have been implemented in a few hundred lines of code. Now, we're up to something where you have to build custom processes that speak a novel protocol and work on both the client and the server side. That's at least several thousand lines of code, maybe over ten thousand if the sample binaries that use the new protocol are more than just simple demonstrations of how to code to the interface. More importantly, it means agreeing on the nature of this custom protocol, which seems like something where I could put in a ton of effort to create something and then have somebody complain because it's not JSON, or because it is JSON, or because the capability negotiation system isn't right, or whatever. I'm not exactly saying that we shouldn't do it; I think it has some appeal. But I'd sure like to find some way of getting started that doesn't involve having to do everything in one patch, and then getting told to change it all again - possibly with different people wanting contradictory things. > > Note that you could build this on top of what I proposed, but not the > > other way around. > > Why should it not be possible the other way round? Because a C function call API lets you decide to spawn a process, but if the framework inherently spawns a process, you can't decide not to do so in a particular case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
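For concreteness, the kind of C function-call API being contrasted here with a process-per-filter design might look roughly like the following. All names are hypothetical; this is a sketch of the shape of the idea, not a proposed PostgreSQL interface. The point is that an in-process filter just transforms bytes in a callback, while a filter that wants a subprocess is free to fork/exec inside its callbacks; the framework forces neither choice:

```c
#include <stddef.h>

/* Hypothetical filter API: each filter sees the backup stream as a
 * sequence of chunks and forwards (possibly transformed) data to the
 * next filter in the chain. */
typedef struct BackupFilter BackupFilter;
struct BackupFilter
{
    void        (*begin) (BackupFilter *self);
    void        (*chunk) (BackupFilter *self, const char *data, size_t nbytes);
    void        (*end) (BackupFilter *self);
    BackupFilter *next;         /* downstream filter or final sink */
    void       *state;          /* filter-private data */
};

/* Example of a very lightweight in-process filter: count bytes and pass
 * them through unchanged, e.g. for logging or progress reporting.  No
 * extra process, no pipe, no protocol. */
static void
count_chunk(BackupFilter *self, const char *data, size_t nbytes)
{
    *(size_t *) self->state += nbytes;
    if (self->next)
        self->next->chunk(self->next, data, nbytes);
}
```

A compression filter that prefers to spawn an external binary would simply do its fork/exec in begin() and shuttle data through the pipe in chunk(), so the "spawn a process" option remains available without being mandatory.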
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Sun, Apr 12, 2020 at 8:27 PM Andres Freund <andres@anarazel.de> wrote: > > I really think we want the option to eventually do this server-side. And > > I don't quite see it as viable to go for an API that allows to specify > > shell fragments that are going to be executed server side. > > The server-side thing is a good point, but I think it adds quite a bit > of complexity, too. I'm worried that this is ballooning to an > unworkable amount of complexity - and not just code complexity, but > bikeshedding complexity, too. Like, I started with a command-line > option that could probably have been implemented in a few hundred > lines of code. Now, we're up to something where you have to build > custom processes that speak a novel protocol and work on both the > client and the server side. That's at least several thousand lines of > code, maybe over ten thousand if the sample binaries that use the new > protocol are more than just simple demonstrations of how to code to > the interface. More importantly, it means agreeing on the nature of > this custom protocol, which seems like something where I could put in > a ton of effort to create something and then have somebody complain > because it's not JSON, or because it is JSON, or because the > capability negotiation system isn't right, or whatever. I'm not > exactly saying that we shouldn't do it; I think it has some appeal. > But I'd sure like to find some way of getting started that doesn't > involve having to do everything in one patch, and then getting told to > change it all again - possibly with different people wanting > contradictory things. Doing things incrementally and not all in one patch absolutely makes a lot of sense and is a good idea. Wouldn't it make sense, given that we have some idea of what we want it to eventually look like, to make progress in that direction though? 
That is- I tend to agree with Andres that having this supported server-side eventually is what we should be thinking about as an end-goal (what is the point of pg_basebackup in all of this, after all, if the goal is to get a backup of PG from the PG server to s3..? why go through some other program or through the replication protocol?), and having the server exec out to run shell script fragments to make that happen looks like it would be really awkward and full of potential risks and issues, and I'd agree that it wouldn't be a good fit. If, instead, we worked on a C-based interface which includes filters and storage drivers, and was implemented through libpgcommon, we could start with that being all done through pg_basebackup and work to hammer out the complications and issues that we run into there and, once it seems reasonably stable and works well, we could potentially pull that into the backend to be run directly without having to have pg_basebackup involved in the process. There's been good progress in the direction of having more done by the backend already, and that's thanks to you and it's good work- specifically, the backend now has the ability to generate a manifest, with checksums computed as the backup is being run, which is definitely an important piece. Thanks, Stephen
On Tue, Apr 14, 2020 at 11:08 AM Stephen Frost <sfrost@snowman.net> wrote: > Wouldn't it make sense, given that we have some idea of what we want > it to eventually look like, to make progress in that direction though? Well, yes. :-) > That is- I tend to agree with Andres that having this supported > server-side eventually is what we should be thinking about as an > end-goal (what is the point of pg_basebackup in all of this, after all, > if the goal is to get a backup of PG from the PG server to s3..? why > go through some other program or through the replication protocol?) and > having the server exec'ing out to run shell script fragments to make > that happen looks like it would be really awkward and full of potential > risks and issues and agreement that it wouldn't be a good fit. I'm fairly deeply uncomfortable with what Andres is proposing. I see that it's very powerful, and can do a lot of things, and that if you're building something that does sophisticated things with storage, you probably want an API like that. It does a great job making complicated things possible. However, I feel that it does a lousy job making simple things simple. Suppose you want to compress using your favorite compression program. Well, you can't. Your favorite compression program doesn't speak the bespoke PostgreSQL protocol required for backup plugins. Neither does your favorite encryption program. Either would be perfectly happy to accept a tarfile on stdin and dump out a compressed or encrypted version, as the case may be, on stdout, but sorry, no such luck. You need a special program that speaks the magic PostgreSQL protocol but otherwise does pretty much the exact same thing as the standard one. It's possibly not the exact same thing. A special tool might, for example, use multiple threads for parallel compression rather than multiple processes, perhaps gaining a bit of efficiency. But it's doubtful whether all users care about such marginal improvements. 
All they're going to see is that they can use gzip and maybe lz4 because we provide the necessary special magic tools to integrate with those, but for some reason we don't have a special magic tool that they can use with their own favorite compressor, and so they can't use it. I think people are going to find that fairly unhelpful. Now, it's a problem we can work around. We could have a "shell gateway" program which acts as a plugin, speaks the backup plugin protocol, and internally does fork-and-exec stuff to spin up copies of any binary you want to act as a filter. I don't see any real problem with that. I do think it's very significantly more complicated than just what Andres called an FFI. It's gonna be way easier to just write something that spawns shell processes directly than it is to write something that spawns a process and talks to it using this protocol and passes around file descriptors using the various different mechanisms that different platforms use for that, and then that process turns around and spawns some other processes and passes along the file descriptors to them. Now you've added a whole bunch of platform-specific code and a whole bunch of code to generate and parse protocol messages to achieve exactly the same thing that you could've done far more simply with a C API. Even accepting as a given the need to make the C API work separately on both the client and server side, you've probably at least doubled, and I suspect more like quadrupled, the amount of infrastructure that has to be built. So... 
> If, instead, we worked on a C-based interface which includes filters and > storage drivers, and was implemented through libpgcommon, we could start > with that being all done through pg_basebackup and work to hammer out > the complications and issues that we run into there and, once it seems > reasonably stable and works well, we could potentially pull that into > the backend to be run directly without having to have pg_basebackup > involved in the process. ...let's do this. Actually, I don't really mind if we target something that can work on both the client and server side initially, but based on C, not a new wire protocol with file descriptor passing. That new wire protocol, and the file descriptor passing infrastructure that goes with it, are things that I *really* think should be pushed off to version 2, because I think they're going to generate a lot of additional work and complexity, and I don't want to deal with all of it at once. Also, I don't really see what's wrong with the server forking processes that exec("/usr/bin/lz4") or whatever. We do similar things in other places and, while it won't work for cases where you want to compress a shazillion files, that's not really a problem here anyway. At least at the moment, the server-side format is *always* tar, so the problem of needing a separate subprocess for every file in the data directory does not arise. > There's been good progress in the direction of having more done by the > backend already, and that's thanks to you and it's good work- > specifically that the backend now has the ability to generate a > manifest, with checksums included as the backup is being run, which is > definitely an important piece. Thanks. I'm actually pretty pleased about making some of that infrastructure available on the frontend side, and would like to go further in that direction over time. 
My only concern is that any given patch shouldn't be made to require too much collateral infrastructure work, and any infrastructure work that it will require should be agreed, so far as we can, early in the development process, so that there's time to do it at a suitably unhurried pace. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
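The "server forks processes that exec /usr/bin/lz4" shape mentioned above is a well-worn pattern; a stripped-down sketch might look like the following. This is illustrative only, with far less error and signal handling than a real implementation would need, and it uses /bin/cat as a portable stand-in for the compressor so it runs anywhere:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Spawn `prog` with our data on its stdin and its stdout redirected to
 * outfd, then feed it `data`.  In the real case `prog` would be
 * something like "/usr/bin/lz4" and outfd the client connection; here
 * any stdin-to-stdout filter works.  Returns 0 on success. */
static int
filter_through(const char *prog, int outfd, const char *data, size_t len)
{
    int         pfd[2];
    int         status;
    pid_t       pid;

    if (pipe(pfd) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        /* child: stdin from pipe, stdout to outfd, then exec the filter */
        dup2(pfd[0], STDIN_FILENO);
        dup2(outfd, STDOUT_FILENO);
        close(pfd[0]);
        close(pfd[1]);
        execl(prog, prog, (char *) NULL);
        _exit(127);             /* exec failed */
    }

    /* parent: write the payload, then wait for the filter to finish */
    close(pfd[0]);
    if (write(pfd[1], data, len) != (ssize_t) len)
        return -1;
    close(pfd[1]);
    waitpid(pid, &status, 0);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Since the server-side stream is always tar, one such subprocess per tarfile suffices, which is why the process-per-file concern doesn't arise in this configuration.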
Hi, On 2020-04-14 11:38:03 -0400, Robert Haas wrote: > I'm fairly deeply uncomfortable with what Andres is proposing. I see > that it's very powerful, and can do a lot of things, and that if > you're building something that does sophisticated things with storage, > you probably want an API like that. It does a great job making > complicated things possible. However, I feel that it does a lousy job > making simple things simple. I think it's pretty much exactly the opposite. Your approach seems to move all the complexity to the user, having to build entire combinations of commands themselves. Instead of having one or two default commands that do backups in common situations, everyone has to assemble them from pieces. Moved from later in your email, since it seems to make more sense to have it here: > All they're going to see is that they can use gzip and maybe lz4 > because we provide the necessary special magic tools to integrate with > those, but for some reason we don't have a special magic tool that > they can use with their own favorite compressor, and so they can't use > it. I think people are going to find that fairly unhelpful. I have no problem with providing people with the opportunity to use their personal favorite compressor, but forcing them to have to do that, and to ensure it's installed etc, strikes me as a spectacularly bad default situation. Most people don't have the time to research which compression algorithms work the best for which precise situation. What do you imagine a default scripted invocation of the new backup stuff would look like? Having to specify multiple commandline "fragments" for compression, storing files, ... can't be what we want the common case to look like. It'll just again lead to everyone copy & pasting examples that are all wrong in different ways. They'll not at all work across platforms (or often not across OS versions). 
In general, I think it's good to give expert users the ability to customize things like backups and archiving. But defaulting to every non-expert user having to do all that expert work (or copying it from bad examples) is one of the most user-hostile things in postgres. > Also, I don't really see what's wrong with the server forking > processes that exec("/usr/bin/lz4") or whatever. We do similar things > in other places and, while it won't work for cases where you want to > compress a shazillion files, that's not really a problem here anyway. > At least at the moment, the server-side format is *always* tar, so the > problem of needing a separate subprocess for every file in the data > directory does not arise. I really really don't understand this. Are you suggesting that for server side compression etc we're going to add the ability to specify shell commands as argument to the base backup command? That seems so obviously a non-starter? A good default for backup configurations should be that the PG user that the backup is done under is only allowed to do that, and not that it directly has arbitrary remote command execution. > Suppose you want to compress using your favorite compression > program. Well, you can't. Your favorite compression program doesn't > speak the bespoke PostgreSQL protocol required for backup > plugins. Neither does your favorite encryption program. Either would > be perfectly happy to accept a tarfile on stdin and dump out a > compressed or encrypted version, as the case may be, on stdout, but > sorry, no such luck. You need a special program that speaks the magic > PostgreSQL protocol but otherwise does pretty much the exact same > thing as the standard one. But the tool speaking the protocol can just allow piping through whatever tool? Given that there likely is benefits to either doing things on the client side or on the server side, it seems inevitable that there's multiple places that would make sense to have the capability for? 
> It's possibly not the exact same thing. A special tool might, for example, > use multiple threads for parallel compression rather than multiple > processes, perhaps gaining a bit of efficiency. But it's doubtful > whether all users care about such marginal improvements. Marginal improvements? Compression scales decently well with the number of cores. pg_basebackup's compression is useless because it's so slow (and because it's client-side, but that's IME the lesser issue). I feel I must be misunderstanding what you mean here. gzip - vs pigz -p $numcores on my machine: 180MB/s vs 2.5GB/s. The latter will still sometimes be a bottleneck (it's a bottleneck in pigz, not available compression cycles), but a lot less commonly than 180. Greetings, Andres Freund
On Tue, Apr 14, 2020 at 9:50 PM Andres Freund <andres@anarazel.de> wrote: > On 2020-04-14 11:38:03 -0400, Robert Haas wrote: > > I'm fairly deeply uncomfortable with what Andres is proposing. I see > > that it's very powerful, and can do a lot of things, and that if > > you're building something that does sophisticated things with storage, > > you probably want an API like that. It does a great job making > > complicated things possible. However, I feel that it does a lousy job > > making simple things simple. > > I think it's pretty much exactly the opposite. Your approach seems to > move all the complexity to the user, having to build entire combination > of commands themselves. Instead of having one or two default commands > that do backups in common situations, everyone has to assemble them from > pieces. I think we're mostly talking about different things. I was speaking mostly about the difficulty of developing it. I agree that a project which is easier to develop is likely to provide fewer benefits to the end user. On the other hand, it might be more likely to get done, and projects that don't get done provide few benefits to users. I strongly believe we need an incremental approach here. > In general, I think it's good to give expert users the ability to > customize things like backups and archiving. But defaulting to every > non-expert user having to all that expert work (or coyping it from bad > examples) is one of the most user hostile things in postgres. I'm not against adding more built-in compression algorithms, but I also believe (as I have several times now) that the world moves a lot faster than PostgreSQL, which has not added a single new compression algorithm to pg_basebackup ever. We had 1 compression algorithm in 2011, and we still have that same 1 algorithm today. So, either nobody cares, or adding new algorithms is sufficiently challenging - for either technical or political reasons - that nobody's managed to get it done. 
I think having a simple framework in pg_basebackup for plugging in new algorithms would make it noticeably simpler to add LZ4 or whatever your favorite compression algorithm is. And I think having that framework also be able to use shell commands, so that users don't have to wait a decade or more for new choices to show up, is also a good idea. I don't disagree that the situation around things like archive_command is awful, but a good part of that is that every time somebody shows up and says "hey, let's try to make a small improvement," between two and forty people show up and start explaining why it's still going to be terrible. Eventually the pile of requirements get so large, and/or there are enough contradictory opinions, that the person who made the proposal for how to improve things gives up and leaves. So then we still have the documentation suggesting "cp". When people - it happens to be me in this case, but the problem is much more general - show up and propose improvements to difficult areas, we can and should give them good advice on how to improve their proposals. But we should not insist that they have to build something incredibly complex and grandiose and solve every problem in that area. We should be happy if we get ANY improvement in a difficult area, not send dozens of angry emails complaining that their proposal is imperfect. > I really really don't understand this. Are you suggesting that for > server side compression etc we're going to add the ability to specify > shell commands as argument to the base backup command? That seems so > obviously a non-starter? A good default for backup configurations > should be that the PG user that the backup is done under is only allowed > to do that, and not that it directly has arbitrary remote command > execution. I hadn't really considered that aspect, and that's certainly a concern. But I also don't understand why you think it's somehow a big deal. 
My point is not that clients should have the ability to execute arbitrary commands on the server. It's that shelling out to an external binary provided by the operating system is a reasonable thing to do, versus having everything have to be done by binaries that we create. Which I think is what you are also saying right here: > But the tool speaking the protocol can just allow piping through > whatever tool? Given that there likely is benefits to either doing > things on the client side or on the server side, it seems inevitable > that there's multiple places that would make sense to have the > capability for? Unless I am misunderstanding you, this is exactly what I was proposing, and have been proposing since the first email on the thread. > > It's possibly not the exact same thing. A special tool might, for example, > > use multiple threads for parallel compression rather than multiple > > processes, perhaps gaining a bit of efficiency. But it's doubtful > > whether all users care about such marginal improvements. > > Marginal improvements? Compression scales decently well with the number > of cores. pg_basebackup's compression is useless because it's so slow > (and because its clientside, but that's IME the lesser issue). I feel I > must be misunderstanding what you mean here. > > gzip - vs pigz -p $numcores on my machine: 180MB/s vs 2.5GB/s. The > latter will still sometimes be a bottleneck (it's a bottlenck in pigz, > not available compression cycles), but a lot less commonly than 180. That's really, really, really not what I was talking about. I'm quite puzzled by your reading of this email. You seem to have missed my point entirely. I don't know whether that's because I did a poor job writing it or because you didn't read it carefully enough or what. What I'm saying is: I don't immediately wish to undertake the problem of building a new wire protocol that the client and server can use to talk to external binaries. 
I would prefer to start with a C API, because I think it will be far less work and still able to meet a number of important needs. The new wire protocol that can be used to talk to external binaries can be added later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-04-15 09:23:30 -0400, Robert Haas wrote: > On Tue, Apr 14, 2020 at 9:50 PM Andres Freund <andres@anarazel.de> wrote: > > On 2020-04-14 11:38:03 -0400, Robert Haas wrote: > > > I'm fairly deeply uncomfortable with what Andres is proposing. I see > > > that it's very powerful, and can do a lot of things, and that if > > > you're building something that does sophisticated things with storage, > > > you probably want an API like that. It does a great job making > > > complicated things possible. However, I feel that it does a lousy job > > > making simple things simple. > > > > I think it's pretty much exactly the opposite. Your approach seems to > > move all the complexity to the user, having to build entire combination > > of commands themselves. Instead of having one or two default commands > > that do backups in common situations, everyone has to assemble them from > > pieces. > > I think we're mostly talking about different things. That certainly would explain some misunderstandings ;) I mostly still am trying to define where we eventually want to be on a medium to high level. And I don't think we really have agreement on that. My original understanding of your eventual goal is that it's the example invocation of pg_basebackup upthread, namely a bunch of shell arguments to pg_basebackup. And I don't think that's good enough. IMO it only really makes sense to design incremental steps after we have a rough agreement on the eventual goal. Otherwise we'll just end up supporting the outcomes of missteps for a long time. > I was speaking mostly about the difficulty of developing it. I agree > that a project which is easier to develop is likely to provide fewer > benefits to the end user. On the other hand, it might be more likely > to get done, and projects that don't get done provide few benefits to > users. I strongly believe we need an incremental approach here. I agree. 
My concern is just that we should not expose things to the user that will make it much harder to evolve going forward. > I'm not against adding more built-in compression algorithms, but I > also believe (as I have several times now) that the world moves a lot > faster than PostgreSQL, which has not added a single new compression > algorithm to pg_basebackup ever. We had 1 compression algorithm in > 2011, and we still have that same 1 algorithm today. So, either nobody > cares, or adding new algorithms is sufficiently challenging - for > either technical or political reasons - that nobody's managed to get > it done. Imo most of the discussion has been around toast, and there the situation imo is much more complicated than just about adding the compression algorithm. I don't recall a discussion about adding an optional dependency to other compression algorithms to pg_basebackup that didn't go anywhere for either technical or political reasons. > I think having a simple framework in pg_basebackup for plugging in new > algorithms would make it noticeably simpler to add LZ4 or whatever > your favorite compression algorithm is. And I think having that > framework also be able to use shell commands, so that users don't have > to wait a decade or more for new choices to show up, is also a good > idea. As long as there are sensible defaults, so that the user doesn't have to specify paths to binaries for the common cases, I'm OK with that. I'm not ok with requiring the user to specify shell fragments for things that should be built in. If we think the appropriate way to implement extensible compression is by piping to commandline binaries ([1]), I'd imo e.g. be ok if we had a builtin list of [{fileending, shell-fragment-for-compression}] that is filled with appropriate values detected at build time for a few common cases. But then also allowed adding new methods via commandline options. 
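A builtin [{fileending, shell-fragment-for-compression}] list of the kind described here could be as simple as a static table plus a lookup function; a rough sketch follows, where the entries and shell fragments are purely illustrative (a real build would fill them in at configure time based on which binaries are available, and command-line options could append to the list):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative built-in table mapping a file ending to the shell
 * fragment that produces it; each fragment reads the tar stream on
 * stdin and writes the compressed result on stdout. */
typedef struct
{
    const char *suffix;         /* file ending, e.g. ".lz4" */
    const char *shell_fragment; /* filter command for this format */
} CompressionMethod;

static const CompressionMethod builtin_methods[] = {
    {".gz", "gzip -c"},
    {".lz4", "lz4 -c"},
    {".zst", "zstd -c"},
};

/* Return the shell fragment for a given suffix, or NULL if unknown. */
static const char *
lookup_compression(const char *suffix)
{
    size_t      i;

    for (i = 0; i < sizeof(builtin_methods) / sizeof(builtin_methods[0]); i++)
        if (strcmp(builtin_methods[i].suffix, suffix) == 0)
            return builtin_methods[i].shell_fragment;
    return NULL;
}
```

The user-visible interface could then be just a suffix or method name, with raw shell fragments reserved for power users who explicitly register a new entry.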
I guess what I perceived to be the fundamental difference, before this email, between our positions is that I (still) think that exposing detailed postprocessing shell fragment style arguments to pg_basebackup, especially as the only option to use the new capabilities, will nail us into a corner - but you don't necessarily think so? Whereas I had/have no problems with implementing features by *internally* piping through external binaries, as long as the user doesn't have to always specify them.

> > I really really don't understand this. Are you suggesting that for
> > server side compression etc we're going to add the ability to specify
> > shell commands as argument to the base backup command? That seems so
> > obviously a non-starter? A good default for backup configurations
> > should be that the PG user that the backup is done under is only allowed
> > to do that, and not that it directly has arbitrary remote command
> > execution.
>
> I hadn't really considered that aspect, and that's certainly a
> concern. But I also don't understand why you think it's somehow a big
> deal. My point is not that clients should have the ability to execute
> arbitrary commands on the server. It's that shelling out to an
> external binary provided by the operating system is a reasonable thing
> to do, versus having everything have to be done by binaries that we
> create. Which I think is what you are also saying right here:
>
> > But the tool speaking the protocol can just allow piping through
> > whatever tool? Given that there likely are benefits to either doing
> > things on the client side or on the server side, it seems inevitable
> > that there are multiple places that would make sense to have the
> > capability for?
>
> Unless I am misunderstanding you, this is exactly what I was
> proposing, and have been proposing since the first email on the
> thread.

Well, no and yes. As I said above, for me there's a difference between piping to commands as an internal implementation detail, and between that being the non-poweruser interface.

It may or may not be the right tradeoff to implement server side compression by piping the output to/from some binary. IMO it's clearly not the right way to implement server side compression by specifying shell fragments as arguments to BASE_BACKUP. Nor do I think it's the right thing, albeit a tad more debatable, that for decent client side compression one has to specify a binary whose path will differ on various platforms (on windows you can't rely on PATH). If we were to go for building all this via pipes, utilizing that to make compression etc extensible for powerusers makes sense to me.

But I don't think it makes sense to design a C API without a rough picture of how things should eventually look. If we were, e.g., eventually going to do all the work of compressing and transferring data in one external binary, then a C API exposing transformations in pg_basebackup doesn't necessarily make sense. If it turns out that pipes are too inefficient on windows to implement compression filters, that we need parallel awareness in the API, etc., it'll influence the API.

> > > It's possibly not the exact same thing. A special tool might, for example,
> > > use multiple threads for parallel compression rather than multiple
> > > processes, perhaps gaining a bit of efficiency. But it's doubtful
> > > whether all users care about such marginal improvements.
> >
> > Marginal improvements? Compression scales decently well with the number
> > of cores. pg_basebackup's compression is useless because it's so slow
> > (and because it's client-side, but that's IME the lesser issue). I feel I
> > must be misunderstanding what you mean here.
> >
> > gzip vs pigz -p $numcores on my machine: 180MB/s vs 2.5GB/s. The
> > latter will still sometimes be a bottleneck (it's a bottleneck in pigz,
> > not available compression cycles), but a lot less commonly than 180.
>
> That's really, really, really not what I was talking about.

What did you mean with the "marginal improvements" paragraph above?

[1] I am not sure that piping is a great idea medium term, nor am I sure of the opposite. One concern is that IIRC windows pipe performance is not great, and that there are some other portability problems as well. I think there are also valid concerns about per-file overhead, which might be a problem for some future uses.

Greetings,

Andres Freund
On Wed, Apr 15, 2020 at 6:13 PM Andres Freund <andres@anarazel.de> wrote:
> I guess what I perceived to be the fundamental difference, before this
> email, between our positions is that I (still) think that exposing
> detailed postprocessing shell fragment style arguments to pg_basebackup,
> especially as the only option to use the new capabilities, will nail us
> into a corner - but you don't necessarily think so? Whereas I had/have no
> problems with implementing features by *internally* piping through
> external binaries, as long as the user doesn't have to always specify
> them.

My principal concern is actually around having a C API and a flexible command-line interface. If we rearrange the code and the pg_basebackup command line syntax so that it's easy to add new "filters" and "targets", then I think that's a very good step forward. It's of less concern to me whether those "filters" and "targets" are (1) C code that we ship as part of pg_basebackup, (2) C code by extension authors that we dynamically load into pg_basebackup, (3) off-the-shelf external programs that we invoke, or (4) special external programs that we provide which do special magic. However, of those options, I like #4 least, because it seems like a pain in the tail to implement. It may turn out to be the most powerful and flexible, though I'm not completely sure about that yet.

As to exactly how far we can get with #3, I think it depends a good deal on the answer to this question you pose in a footnote:

> [1] I am not sure, nor the opposite, that piping is a great idea medium
> term. One concern is that IIRC windows pipe performance is not great,
> and that there's some other portability problems as well. I think
> there's also valid concerns about per-file overhead, which might be a
> problem for some future uses.

If piping stuff through shell commands performs well for use cases like compression, then I think we can get pretty far with piping things through shell commands. It means we can use any compression at all with no build-time dependency on that compressor. People can install anything they want, stick it in $PATH, and away they go. I see no particular reason to dislike that kind of thing; in fact, I think it offers many compelling advantages. On the other hand, if we really need to interact directly with the library to get decent performance, because, say, pipes are too slow, then the approach of piping things through arbitrary shell commands is a lot less exciting.

Even then, though, I wonder how many runtime dependencies we're seriously willing to add. I imagine we can add one or two more compression algorithms without giving everybody fits, even if it means adding optional build-time and run-time dependencies on some external libraries. Any more than that is likely to provoke a backlash. And I doubt whether we're willing to have the postgresql operating system package depend on something like libgcrypt at all. I would expect such a proposal to meet with vigorous objections. But without such a dependency, how would we realistically get encrypted backups except by piping through a shell command? I don't really see a way, and letting a user specify a shell fragment to define what happens there seems pretty reasonable to me.

I'm also not very sure, with either compression or encryption, that we can assume one size fits all. If there are six popular compression libraries and four popular encryption libraries, does anyone really believe that it's going to be OK for 'yum install postgresql-server' to suck in all of those things? Or, even if that were OK, or if we could somehow avoid it, what are the chances that we'd actually go to the trouble of building interfaces to all of those things? I'd rate them as slim to none; we suck at that sort of thing. Exhibit A: the work to make PostgreSQL support more than one SSL library.

I'm becoming fairly uncertain as to how far we can get with shell commands; some of the concerns raised about, for example, connection management when talking to stuff like S3 are very worrying. At the same time, I think we need to think pretty seriously about some of the upsides of shell commands. The average user cannot write a C library that implements an API. The average user cannot write a C binary that speaks a novel, PostgreSQL-specific protocol. Even the above-average user who is capable of doing those things probably won't have the time to actually do it. So if the thing you have to do to make PostgreSQL talk to the new sljgsjl compressor is either of those things, then we will not have sljgsjl compression support for probably a decade after it becomes the gold standard that everyone else in the industry is using. If what you have to do is 'yum install sljgsjl' and then pg_basebackup --client-filter='shell sljgsjl', people can start using it as soon as their favorite distro packages it, without anyone who reads this mailing list needing to do any work whatsoever. If what you have to do is create a 'sljgsjl.json' file in some PostgreSQL install directory that describes the salient properties of this compressor, and then after that you can say pg_basebackup --client-filter=sljgsjl, that's also accessible to a broad swath of users.

Now, it may be that there's no practical way to make things that easy. But, to the extent that we can, I think we should. The ability to integrate new technology without action by PostgreSQL core developers is not the only consideration here, but it's definitely a good thing to have insofar as we reasonably can.

> But I don't think it makes sense to design a C API without a rough
> picture of how things should eventually look. If we were, e.g.,
> eventually going to do all the work of compressing and transferring data
> in one external binary, then a C API exposing transformations in
> pg_basebackup doesn't necessarily make sense. If it turns out that
> pipes are too inefficient on windows to implement compression filters,
> that we need parallel awareness in the API, etc it'll influence the API.

Yeah. I think we really need to understand the performance characteristics of pipes better. If they're slow, then anything that needs to be fast has to work some other way (but we could still provide a pipe-based slow way for niche uses).

> > That's really, really, really not what I was talking about.
>
> What did you mean with the "marginal improvements" paragraph above?

I was talking about running one compressor process with multiple compression threads each reading from a separate pipe, vs. running multiple processes each with a single thread doing the same thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
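To make the descriptor-file idea above a bit more concrete — this is purely hypothetical, and both the file name and every key in it are invented for the example — such a file might not need to contain much more than:

```json
{
    "name": "sljgsjl",
    "extension": ".sl",
    "compress_command": "sljgsjl -c",
    "decompress_command": "sljgsjl -d"
}
```

That is, just enough for pg_basebackup to assemble the pipe command line; anything fancier (compression levels, parallelism flags) could be layered on later without anyone having to write C code.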
On Wed, Apr 15, 2020 at 7:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah. I think we really need to understand the performance
> characteristics of pipes better. If they're slow, then anything that
> needs to be fast has to work some other way (but we could still
> provide a pipe-based slow way for niche uses).

Hmm. Could we learn what we need to know about this by doing something as simple as taking a basebackup of a cluster with some data in it (say, created by pgbench -i -s 400 or something) and then comparing the speed of cat < base.tar | gzip > base.tgz to the speed of gzip < base.tar > base.tgz? It seems like there's no difference between those except that the first one relays through an extra process and an extra pipe. I don't know exactly how to do the equivalent of this on Windows, but I bet somebody does.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 16, 2020 at 10:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm. Could we learn what we need to know about this by doing something
> as simple as taking a basebackup of a cluster with some data in it (say,
> created by pgbench -i -s 400 or something) and then comparing the speed
> of cat < base.tar | gzip > base.tgz to the speed of gzip < base.tar >
> base.tgz? It seems like there's no difference between those except
> that the first one relays through an extra process and an extra pipe.

I decided to try this. First I experimented on my laptop using a backup of a pristine pgbench database, scale factor 100, ~1.5GB.

[rhaas pgbackup]$ for i in 1 2 3; do echo "= run number $i ="; sync; sync; time gzip < base.tar > base.tar.gz; rm -f base.tar.gz; sync; sync; time cat < base.tar | gzip > base.tar.gz; rm -f base.tar.gz; sync; sync; time cat < base.tar | cat | cat | gzip > base.tar.gz; rm -f base.tar.gz; done

= run number 1 =
real 0m24.011s  user 0m23.542s  sys 0m0.408s
real 0m23.623s  user 0m23.447s  sys 0m0.908s
real 0m23.688s  user 0m23.847s  sys 0m2.085s
= run number 2 =
real 0m23.704s  user 0m23.290s  sys 0m0.374s
real 0m23.389s  user 0m23.239s  sys 0m0.879s
real 0m23.762s  user 0m23.888s  sys 0m2.057s
= run number 3 =
real 0m23.567s  user 0m23.187s  sys 0m0.361s
real 0m23.573s  user 0m23.422s  sys 0m0.903s
real 0m23.749s  user 0m23.884s  sys 0m2.113s

It looks like piping everything through an extra copy of 'cat' may even be *faster* than having the process read it directly; two out of three runs with the extra "cat" finished very slightly quicker than the test where gzip read the file directly. The third set of numbers for each run is with three copies of "cat" interposed. That appears to be slower than with no extra pipes, but not by very much, and it might just be noise.

Next I tried it out on Linux. For this I used 'cthulhu', an older box with lots and lots of memory and cores. Here I took the scale factor up to 400, so it's about 5.9GB of data. The same command as above produced these results:

= run number 1 =
real 2m35.797s  user 2m30.990s  sys 0m4.760s
real 2m35.407s  user 2m32.730s  sys 0m16.714s
real 2m40.598s  user 2m39.054s  sys 0m37.596s
= run number 2 =
real 2m35.529s  user 2m30.971s  sys 0m4.510s
real 2m33.933s  user 2m31.685s  sys 0m16.003s
real 2m45.563s  user 2m44.042s  sys 0m40.357s
= run number 3 =
real 2m35.876s  user 2m31.437s  sys 0m4.391s
real 2m33.872s  user 2m31.629s  sys 0m16.266s
real 2m40.836s  user 2m39.359s  sys 0m38.960s

These results are pretty similar to the MacOS results. The overall performance was worse, but I think that is probably explained by the fact that the MacBook has a Haswell-class processor rather than Westmere, and significantly higher RAM speed. The pattern that one extra pipe seems to be perhaps slightly faster, and three extra pipes a tad slower, persists. So at least in this test, the overhead added by each pipe appears to be <1%, which I would classify as good enough not to worry too much about.

> I don't know exactly how to do the equivalent of this on Windows, but
> I bet somebody does.

However, I still don't know what the situation is on Windows. I did do some searching around on the Internet to try to find out whether pipes being slow on Windows is a generally-known phenomenon, and I didn't find anything very compelling, but I don't have an environment set up to test this myself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-04-17 12:19:32 -0400, Robert Haas wrote:
> On Thu, Apr 16, 2020 at 10:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Hmm. Could we learn what we need to know about this by doing something
> > as simple as taking a basebackup of a cluster with some data in it (say,
> > created by pgbench -i -s 400 or something) and then comparing the speed
> > of cat < base.tar | gzip > base.tgz to the speed of gzip < base.tar >
> > base.tgz? It seems like there's no difference between those except
> > that the first one relays through an extra process and an extra pipe.
>
> I decided to try this. First I experimented on my laptop using a
> backup of a pristine pgbench database, scale factor 100, ~1.5GB.
>
> [rhaas pgbackup]$ for i in 1 2 3; do echo "= run number $i ="; sync;
> sync; time gzip < base.tar > base.tar.gz; rm -f base.tar.gz; sync;
> sync; time cat < base.tar | gzip > base.tar.gz; rm -f base.tar.gz;
> sync; sync; time cat < base.tar | cat | cat | gzip > base.tar.gz; rm
> -f base.tar.gz; done

Given that gzip is too slow to be practically usable for anything where compression speed matters (like e.g. practical database backups), I'm not sure this measures something useful. The overhead of gzip will dominate to a degree that even the slowest possible pipe implementation would be fast enough.

andres@awork3:/tmp/pgbase$ ls -lh
total 7.7G
-rw------- 1 andres andres 137K Apr 17 13:09 backup_manifest
-rw------- 1 andres andres 7.7G Apr 17 13:09 base.tar
-rw------- 1 andres andres  17M Apr 17 13:09 pg_wal.tar

Measuring with pv base.tar | gzip > /dev/null I can see that the performance varies from somewhere around 20MB/s to about 90MB/s, averaging ~60MB/s.

andres@awork3:/tmp/pgbase$ pv base.tar | gzip > /dev/null
7.62GiB 0:02:09 [60.2MiB/s] [=====>] 100%

Whereas e.g. zstd takes a much, much shorter time, even in single threaded mode:

andres@awork3:/tmp/pgbase$ pv base.tar | zstd -T1 | wc -c
7.62GiB 0:00:14 [ 530MiB/s] [=====>] 100%
448956321

not to speak of using parallel compression (pigz is parallel gzip):

andres@awork3:/tmp/pgbase$ pv base.tar | pigz -p 20 | wc -c
7.62GiB 0:00:07 [1.03GiB/s] [=====>] 100%
571718276

andres@awork3:/tmp/pgbase$ pv base.tar | zstd -T20 | wc -c
7.62GiB 0:00:04 [1.78GiB/s] [=====>] 100%
448956321

Looking at raw pipe speed, I think it's not too hard to see some limitations:

andres@awork3:/tmp/pgbase$ time (cat base.tar | wc -c)
8184994304
real 0m3.217s  user 0m0.054s  sys 0m4.856s

andres@awork3:/tmp/pgbase$ time (cat base.tar | cat | wc -c)
8184994304
real 0m3.246s  user 0m0.113s  sys 0m7.086s

andres@awork3:/tmp/pgbase$ time (cat base.tar | cat | cat | cat | cat | cat | wc -c)
8184994304
real 0m4.262s  user 0m0.257s  sys 0m20.706s

but I'm not sure how deep a pipeline we're thinking would be common.

To make sure this is still relevant in the compression context:

andres@awork3:/tmp/pgbase$ pv base.tar | zstd -T20 > /dev/null
7.62GiB 0:00:04 [1.77GiB/s] [=====>] 100%

andres@awork3:/tmp/pgbase$ pv base.tar | cat | cat | zstd -T20 > /dev/null
7.62GiB 0:00:05 [1.38GiB/s] [=====>] 100%

It's much less noticeable if the cats are after the zstd; there's so much less data, as pgbench's data is so compressible.

This does seem to suggest that composing features through chains of pipes wouldn't be a good idea. But not that we shouldn't implement compression via pipes (nor the opposite).

> > I don't know exactly how to do the equivalent of this on Windows, but
> > I bet somebody does.
>
> However, I still don't know what the situation is on Windows. I did do
> some searching around on the Internet to try to find out whether pipes
> being slow on Windows is a generally-known phenomenon, and I didn't
> find anything very compelling, but I don't have an environment set up
> to test this myself.

I tried to measure something. But I'm not a windows person, and it's just a kvm VM; I don't know how well that translates into other environments. I downloaded gnuwin32 coreutils and zstd and performed some measurements.

The first results were *shockingly* bad:

zstd -T0 < onegbofrandom | wc -c
linux host: 0.467s
windows guest: 0.968s

zstd -T0 < onegbofrandom | cat | wc -c
linux host: 0.479s
windows guest: 6.058s

zstd -T0 < onegbofrandom | cat | cat | wc -c
linux host: 0.516s
windows guest: 7.830s

I think that's because cat reads or writes in too small increments for windows (but damn, that's slow). Replacing cat with dd:

zstd -T0 < onegbofrandom | dd bs=512 | wc -c
linux host: 3.091s
windows guest: 5.909s

zstd -T0 < onegbofrandom | dd bs=64k | wc -c
linux host: 0.540s
windows guest: 1.128s

zstd -T0 < onegbofrandom | dd bs=1M | wc -c
linux host: 0.516s
windows guest: 1.043s

zstd -T0 < onegbofrandom | dd bs=1 | wc -c
linux host: 1547s
windows guest: 2607s
(yes, really, it's this slow)

zstd -T0 < onegbofrandom > NUL
zstd -T0 < onegbofrandom > /dev/null
linux host: 0.361s
windows guest: 0.602s

zstd -T0 < onegbofrandom | dd bs=1M of=NUL
zstd -T0 < onegbofrandom | dd bs=1M of=/dev/null
linux host: 0.454s
windows guest: 0.802s

zstd -T0 < onegbofrandom | dd bs=64k | dd bs=64k | dd bs=64k | wc -c
linux host: 0.521s
windows guest: 1.376s

This suggests that pipes do have a considerably higher overhead on windows, but that it's not all that terrible if one takes care to use large buffers in each pipe element.

It's notable though that even the simplest use of a pipe does add a considerable overhead compared to using the files directly.

Greetings,

Andres Freund
On Fri, Apr 17, 2020 at 7:44 PM Andres Freund <andres@anarazel.de> wrote:
> This suggests that pipes do have a considerably higher overhead on
> windows, but that it's not all that terrible if one takes care to use
> large buffers in each pipe element.
>
> It's notable though that even the simplest use of a pipe does add a
> considerable overhead compared to using the files directly.

Thanks for these results. I think that this shows that it's probably not a great idea to force everything to go through pipes in every case, but on the other hand, there's no reason to be particularly scared of the performance implications of letting some things go through pipes. For instance, if we decide that LZ4 compression is going to be a good choice for most users, we might want to do that in-process rather than via pipes. However, if somebody wants to pipe through an external compressor that they prefer, that's going to be a little slower, but not necessarily to a degree that creates big problems. People with bigger databases will need to be more careful about which options they choose, but that's kind of inevitable. Do you agree?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Fwiw, it was a common trick in the Oracle world to create a named pipe to gzip and then write your backup to it. I really like that way of doing things, but I suppose it's probably too old-fashioned to expect it to survive. And in practice, while it worked as a manual process for a sysadmin, it's pretty awkward to automate reliably.
On Sat, Apr 18, 2020 at 8:35 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 17, 2020 at 7:44 PM Andres Freund <andres@anarazel.de> wrote:
> > This suggests that pipes do have a considerably higher overhead on
> > windows, but that it's not all that terrible if one takes care to use
> > large buffers in each pipe element.
> >
> > It's notable though that even the simplest use of a pipe does add a
> > considerable overhead compared to using the files directly.
>
> Thanks for these results. I think that this shows that it's probably
> not a great idea to force everything to go through pipes in every
> case, but on the other hand, there's no reason to be particularly
> scared of the performance implications of letting some things go
> through pipes. For instance, if we decide that LZ4 compression is
> going to be a good choice for most users, we might want to do that
> in-process rather than via pipes.

How will the user know how to use this compressed backup? I mean to say, if we use some compression algorithm to compress the data, then the user should know how to decompress and use the backup. IIUC, if the user currently uses the tar format for the backup, they can simply untar it and start the server, but will that be possible if we provide some in-built compression methods like LZ4?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Apr 18, 2020 at 5:14 AM Andres Freund <andres@anarazel.de> wrote:
> zstd -T0 < onegbofrandom > NUL
> zstd -T0 < onegbofrandom > /dev/null
> linux host: 0.361s
> windows guest: 0.602s
>
> zstd -T0 < onegbofrandom | dd bs=1M of=NUL
> zstd -T0 < onegbofrandom | dd bs=1M of=/dev/null
> linux host: 0.454s
> windows guest: 0.802s
>
> zstd -T0 < onegbofrandom | dd bs=64k | dd bs=64k | dd bs=64k | wc -c
> linux host: 0.521s
> windows guest: 1.376s
>
> This suggests that pipes do have a considerably higher overhead on
> windows, but that it's not all that terrible if one takes care to use
> large buffers in each pipe element.

I have also done some similar experiments on my Win-7 box, and the results are as follows:

zstd -T0 < 16396 > NUL
Execution time: 2.240 s

zstd -T0 < 16396 | dd bs=1M > NUL
Execution time: 4.240 s

zstd -T0 < 16396 | dd bs=64k | dd bs=64k | dd bs=64k | wc -c
Execution time: 5.959 s

In the above tests, 16396 is a 1GB file generated via pgbench. The above results indicate that adding more pipe stages with dd adds significant overhead, but how can we distinguish what the exact overhead due to the pipe is?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
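One way to separate the pipe cost from the compression cost is to take the compressor out of the picture entirely and time moving the same bytes with and without intermediate pipe stages. A sketch (this assumes dd plus /dev/zero and /dev/null, e.g. from gnuwin32/MSYS on the Windows side; substitute NUL for /dev/null with native tools):

```shell
# Baseline: copy 1GB with no pipe involved.
time dd if=/dev/zero of=/dev/null bs=1M count=1024

# Same bytes through one pipe stage, then through three.
# The wall-clock differences approximate the per-pipe overhead.
time dd if=/dev/zero bs=1M count=1024 | dd bs=1M of=/dev/null
time dd if=/dev/zero bs=1M count=1024 | dd bs=1M | dd bs=1M | dd bs=1M of=/dev/null
```

Since no compressor is involved, any slowdown between the runs is attributable to the pipes (and the extra dd processes) themselves.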
On Mon, Apr 20, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Apr 18, 2020 at 8:35 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Thanks for these results. I think that this shows that it's probably
> > not a great idea to force everything to go through pipes in every
> > case, but on the other hand, there's no reason to be particularly
> > scared of the performance implications of letting some things go
> > through pipes. For instance, if we decide that LZ4 compression is
> > going to be a good choice for most users, we might want to do that
> > in-process rather than via pipes.
>
> How will the user know how to use this compressed backup? I mean to
> say, if we use some compression algorithm to compress the data, then the
> user should know how to decompress and use the backup. IIUC, if
> the user currently uses the tar format for the backup, they can simply untar it
> and start the server, but will that be possible if we provide some
> in-built compression methods like LZ4?

One idea could be that we write something like BACKUP COMPRESSION: <LZ4 or whatever compression we have used> in the backup_label file, and then probably recovery can take care of decompressing it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
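If we did that, the new field would just sit alongside the existing backup_label lines, something like the following (the field name and the LSNs shown here are of course made up for illustration):

```
START WAL LOCATION: 0/2000028 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/2000060
BACKUP COMPRESSION: lz4
```

Tools (and recovery) could then consult that one line to decide whether and how to decompress before use.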
On Mon, Apr 13, 2020 at 5:57 AM Stephen Frost <sfrost@snowman.net> wrote:
> There's a couple of other pieces here that I think bear mentioning. The
> first is that pgBackRest has an actual 'restore' command - and that works
> with the filters and works with the storage drivers, so what you're
> looking at when it comes to these interfaces isn't just "put a file" but
> it's also "get a file". That's actually quite important to have when
> you start thinking about these more complicated methods of doing
> backups.

I also think it is important to provide a way or interface to restore the data a user has backed up using whatever new API we provide here.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 16, 2020 at 3:44 AM Andres Freund <andres@anarazel.de> wrote:
> > I think having a simple framework in pg_basebackup for plugging in new
> > algorithms would make it noticeably simpler to add LZ4 or whatever
> > your favorite compression algorithm is. And I think having that
> > framework also be able to use shell commands, so that users don't have
> > to wait a decade or more for new choices to show up, is also a good
> > idea.
>
> As long as there are sensible defaults, so that the user doesn't have
> to specify paths to binaries for the common cases, I'm OK with that. I'm
> not ok with requiring the user to specify shell fragments for things
> that should be built in.
>
> If we think the appropriate way to implement extensible compression is
> by piping to commandline binaries ([1]),

I can see how such a scheme could be useful for backups, but how do we restore such a backup?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com