Discussion: where should I stick that backup?
There are a couple of things that pg_basebackup can't do that might be an issue for some users. One of them is that you might want to do something like encrypt your backup. Another is that you might want to store it someplace other than in the filesystem, like maybe S3. We could certainly teach pg_basebackup how to do specifically those things, and maybe that is worthwhile. However, I wonder if it would be useful to provide a more general capability, either instead of or in addition to doing those more specific things.

What I'm thinking about is: suppose we add an option to pg_basebackup with a name like --pipe-output. This would be mutually exclusive with -D, but would work at least with -Ft and maybe also with -Fp. The argument to --pipe-output would be a shell command to be executed once per output file. Any instance of %f in the shell command would be replaced with the name of the file that would have been written (and %% would turn into a single %). The shell command itself would be executed via system(). So if you want to compress, but using some other compression program instead of gzip, you could do something like:

pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'

And if you want to encrypt, you could do something like:

pg_basebackup -Ft --pipe-output 'gpg -e -o %f.gpg'

And if you want to ship it off to be stored in a concrete bunker deep underground, you can just do something like:

pg_basebackup -Ft --pipe-output 'send-to-underground-storage.sh backup-2020-04-03 %f'

You still have to write send-to-underground-storage.sh, of course, and that may involve some work, and maybe also some expensive construction. But what you don't have to do is first copy the entire backup to your local filesystem and then, as a second step, figure out how to put it through whatever post-processing it needs. Instead, you can simply take your backup and stick it anywhere you like.

Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Apr 03, 2020 at 10:19:21AM -0400, Robert Haas wrote:
> What I'm thinking about is: suppose we add an option to pg_basebackup
> with a name like --pipe-output. This would be mutually exclusive with
> -D, but would work at least with -Ft and maybe also with -Fp. The
> argument to --pipe-output would be a shell command to be executed once
> per output file. Any instance of %f in the shell command would be
> replaced with the name of the file that would have been written (and
> %% would turn into a single %). The shell command itself would be
> executed via system(). So if you want to compress, but using some
> other compression program instead of gzip, you could do something
> like:
>
> pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'

Seems good to me. I agree -Fp is a "maybe" since the overhead will be high
for small files.
Greetings,

* Noah Misch (noah@leadboat.com) wrote:
> On Fri, Apr 03, 2020 at 10:19:21AM -0400, Robert Haas wrote:
> > What I'm thinking about is: suppose we add an option to pg_basebackup
> > with a name like --pipe-output. This would be mutually exclusive with
> > -D, but would work at least with -Ft and maybe also with -Fp. The
> > argument to --pipe-output would be a shell command to be executed once
> > per output file. Any instance of %f in the shell command would be
> > replaced with the name of the file that would have been written (and
> > %% would turn into a single %). The shell command itself would be
> > executed via system(). So if you want to compress, but using some
> > other compression program instead of gzip, you could do something
> > like:
> >
> > pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'
>
> Seems good to me. I agree -Fp is a "maybe" since the overhead will be high
> for small files.

For my 2c, at least, introducing more shell commands into critical parts of the system is absolutely the wrong direction to go in. archive_command continues to be a mess that we refuse to clean up or even properly document, and the project would be much better off trying to eliminate it rather than adding new ways for users to end up with bad or invalid backups.

Further, having a generic shell script approach like this would result in things like "well, we don't need to actually add support for X, Y or Z, because we have this wonderful generic shell script thing and you can write your own, and therefore we won't accept patches which do add those capabilities because then we'd have to actually maintain that support."

In short, -1 from me.

Thanks,

Stephen
On Mon, Apr 6, 2020 at 10:45 AM Stephen Frost <sfrost@snowman.net> wrote:
> For my 2c, at least, introducing more shell commands into critical parts
> of the system is absolutely the wrong direction to go in.
> archive_command continues to be a mess that we refuse to clean up or
> even properly document and the project would be much better off by
> trying to eliminate it rather than add in new ways for users to end up
> with bad or invalid backups.
>
> Further, having a generic shell script approach like this would result
> in things like "well, we don't need to actually add support for X, Y or
> Z, because we have this wonderful generic shell script thing and you can
> write your own, and therefore we won't accept patches which do add those
> capabilities because then we'd have to actually maintain that support."
>
> In short, -1 from me.

I'm not sure that there's any point in responding to this, because I believe that the wording of this email suggests that you've made up your mind that it's bad and that position is not subject to change no matter what anyone else may say. However, I'm going to try to reply anyway, on the theory that (1) I might be wrong and (2) even if I'm right, it might influence the opinions of others who have not spoken yet, and whose opinions may be less settled.

First of all, while I agree that archive_command has some problems, I don't think that means that every case where we use a shell command for anything is a hopeless mess. The only problem I really see in this case is that if you route to a local file via an intermediate program you wouldn't get an fsync() any more. But we could probably figure out some clever things to work around that problem, if that's the issue. If there's some other problem, what is it?

Second, PostgreSQL is not realistically going to link pg_basebackup against every compression, encryption, and remote storage library out there. One, yeah, we don't want to maintain that. Two, we don't want PostgreSQL to have build-time dependencies on a dozen or more libraries that people might want to use for stuff like this. We might well want to incorporate support for a few of the more popular things in this area, but people will always want support for newer things than what existing server releases feature, and for more of them.

Third, I am getting pretty tired of being told, every time I try to do something that is related in any way to backup, that it's wrong. If your experience with pgbackrest motivated you to propose ways of improving backup and restore functionality in the community, that would be great. But in my experience so far, it seems to mostly involve making a lot of negative comments that make it hard to get anything done. I would appreciate it if you would adopt a more constructive tone.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 6, 2020 at 4:45 PM Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Noah Misch (noah@leadboat.com) wrote:
> > On Fri, Apr 03, 2020 at 10:19:21AM -0400, Robert Haas wrote:
> > > What I'm thinking about is: suppose we add an option to pg_basebackup
> > > with a name like --pipe-output. This would be mutually exclusive with
> > > -D, but would work at least with -Ft and maybe also with -Fp. The
> > > argument to --pipe-output would be a shell command to be executed once
> > > per output file. Any instance of %f in the shell command would be
> > > replaced with the name of the file that would have been written (and
> > > %% would turn into a single %). The shell command itself would be
> > > executed via system(). So if you want to compress, but using some
> > > other compression program instead of gzip, you could do something
> > > like:
> > >
> > > pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'
> >
> > Seems good to me. I agree -Fp is a "maybe" since the overhead will be high
> > for small files.
>
> For my 2c, at least, introducing more shell commands into critical parts
> of the system is absolutely the wrong direction to go in.
> archive_command continues to be a mess that we refuse to clean up or
> even properly document and the project would be much better off by
> trying to eliminate it rather than add in new ways for users to end up
> with bad or invalid backups.

I think the bigger problem with archive_command comes more from how it's defined to work, tbh, which leaves a lot of things open.

This sounds to me like a much narrower use-case, which makes it a lot more OK. But I agree we have to be careful not to get back into that whole mess. One thing would be to clearly document such things *from the beginning*, and not try to retrofit it years later like we ended up doing with archive_command.

And as Robert mentions downthread, the fsync() issue is definitely a real one, but if that is documented clearly ahead of time, that's a reasonable foot-gun, I'd say.

> Further, having a generic shell script approach like this would result
> in things like "well, we don't need to actually add support for X, Y or
> Z, because we have this wonderful generic shell script thing and you can
> write your own, and therefore we won't accept patches which do add those
> capabilities because then we'd have to actually maintain that support."

In principle, I agree with "shellscripts suck".

Now, if we were just talking about compression, it would actually be interesting to implement some sort of "postgres compression API", if you will, implemented by a shared library. This library could then be used from pg_basebackup or from anything else that needs compression. And anybody who wants could then do a "<compression X> for PostgreSQL" module, removing the need for us to carry such code upstream.

There have been discussions of that for the backend before, IIRC, but I don't recall the conclusions. And in particular, I don't recall if they included the idea of being able to use it in situations like this as well, and with *run-time loading*.

And that said, then we'd limit ourselves to compression. We'd still need a way to deal with encryption...

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 6, 2020 at 10:45 AM Stephen Frost <sfrost@snowman.net> wrote:
> > For my 2c, at least, introducing more shell commands into critical parts
> > of the system is absolutely the wrong direction to go in.
> > archive_command continues to be a mess that we refuse to clean up or
> > even properly document and the project would be much better off by
> > trying to eliminate it rather than add in new ways for users to end up
> > with bad or invalid backups.
> >
> > Further, having a generic shell script approach like this would result
> > in things like "well, we don't need to actually add support for X, Y or
> > Z, because we have this wonderful generic shell script thing and you can
> > write your own, and therefore we won't accept patches which do add those
> > capabilities because then we'd have to actually maintain that support."
> >
> > In short, -1 from me.
>
> I'm not sure that there's any point in responding to this because I
> believe that the wording of this email suggests that you've made up
> your mind that it's bad and that position is not subject to change no
> matter what anyone else may say. However, I'm going to try to reply
> anyway, on the theory that (1) I might be wrong and (2) even if I'm
> right, it might influence the opinions of others who have not spoken
> yet, and whose opinions may be less settled.

Chances certainly aren't good that you'll convince me that putting more absolutely critical-to-get-perfect shell scripts into the backup path is a good idea.

> First of all, while I agree that archive_command has some problems, I
> don't think that means that every case where we use a shell command
> for anything is a hopeless mess. The only problem I really see in this
> case is that if you route to a local file via an intermediate program
> you wouldn't get an fsync() any more. But we could probably figure out
> some clever things to work around that problem, if that's the issue.
> If there's some other problem, what is it?

We certainly haven't solved the issues with archive_command (at least, not in core), so this "well, maybe we could fix all the issues" claim really doesn't hold any water. Having commands like this ends up just punting on the whole problem and saying "here, user, you deal with it." *Maybe* if we *also* wrote dedicated tools to be used with these commands (as has been proposed multiple times with archive_command, but hasn't actually happened, at least not in core), we could build something where this would work reasonably well and it'd be alright, but that wasn't what seemed to be suggested here; and if we're going to write all that code anyway, it doesn't really seem like a shell interface is the best one to go with.

There's also been something of an expectation that if we're going to provide an interface then we should have an example of something that uses it, but when it comes to something like archive_command, the example we came up with was terrible, and yet it's still in our documentation and is commonly used, much to the disservice of our users. Sure, we can point at our users and say "well, that's not how you should actually use that feature, you should do all this other stuff in that command," punting on this and pushing it back on our users, telling them that they're using the interface we provide wrong; but the only folks who can possibly like that answer are ourselves. Our users aren't happy with it, because they're left with a broken backup that they can't restore from when they need it. That your initial email had more-or-less the exact same kind of "example" certainly doesn't inspire confidence that this would end up being used sensibly by our users.

Yes, fsync() is part of the issue, but it's not the only one: retry logic, and making sure the results are correct, is pretty darn important too, especially with things like s3. (Even dedicated tools have issues in this area. I just saw a report about wal-g failing to archive a WAL file properly because there was an error which resulted in a 0-byte WAL file being stored; wal-g did properly retry, but then it saw the file was there, figured "all is well", and returned success even though the file was 0-byte in s3.) I don't doubt that David could point out a few other issues; he routinely does whenever I chat with him about various ideas I've got.

So, instead of talking about 'bzip2 > %f.bz2', and then writing into our documentation that that's how this feature can be used, what about proposing something that would actually work reliably with this interface? Something which properly fsync's everything, has good retry logic for when failures happen, is able to actually detect when a failure happened, and documents how to restore from a backup taken this way. It'd probably be good to show how pg_verifybackup could be used to make sure the backup is actually correct and valid too.

> Second, PostgreSQL is not realistically going to link pg_basebackup
> against every compression, encryption, and remote storage library out
> there. One, yeah, we don't want to maintain that. Two, we don't want
> PostgreSQL to have build-time dependencies on a dozen or more
> libraries that people might want to use for stuff like this. We might
> well want to incorporate support for a few of the more popular things
> in this area, but people will always want support for newer things
> than what existing server releases feature, and for more of them.

We don't need to link to 'every compression, encryption and remote storage library out there'. In some cases, yes, it makes sense to use an existing library (OpenSSL, zlib, lz4), but in many other cases it makes more sense to build support directly into the system (s3, GCS, probably others) because a good library doesn't exist. It'd also be good to build a nicely extensible system which people can add to, to support other storage or compression options, but I don't think that's reasonable to do with a shell-script based interface. Maybe with shared libraries, as Magnus suggests elsewhere, but even there I have some doubts.

> Third, I am getting pretty tired of being told every time I try to do
> something that is related in any way to backup that it's wrong. If
> your experience with pgbackrest motivated you to propose ways of
> improving backup and restore functionality in the community, that
> would be great. But in my experience so far, it seems to mostly
> involve making a lot of negative comments that make it hard to get
> anything done. I would appreciate it if you would adopt a more
> constructive tone.

pgbackrest is how we're working to improve backup and restore functionality in the community, and we've come a long way and gone through a great deal of fire getting there. I appreciate that it's not in core, and I'd love to discuss how we can change that, but it's absolutely a part of the PG community and ecosystem, with changes being made in core routinely which improve the in-core tools as well as pgbackrest, by the authors contributing back.

As far as my tone goes, I'm afraid that's simply coming from having dealt with and discussed many of these, well, shortcuts to trying to improve backup and recovery. Did David and I discuss using s3cmd? Of course. Did we research various s3 libraries? HTTP libraries? SSL libraries? Compression libraries? Absolutely, which is why we ended up using OpenSSL (PG links to it already, so if you're happy enough with PG's SSL then you'll probably accept pgbackrest using the same one; and yes, we've talked about supporting others as PG is moving in that direction too), and zlib (same reasons). We've now added lz4 (after researching it and deciding it was pretty reasonable to include), but when it came to dealing with s3, we wrote our own HTTP and s3 code; none of the existing libraries were a great answer, and trying to make it work with s3cmd was, well, about like saying that you should just use CSV files and forget about this whole database thing. We're very likely to write our own code for GCS too, but we already have the HTTP code, which means it's not actually all that heavy a lift to do.

I'm not against trying to improve the situation in core, and I've even talked about and tried to give feedback about what would make the most sense for that to look like, but I feel like every time I do, there's a bunch of push-back that I want it to look like pgbackrest, or that I'm being negative about things that don't look like pgbackrest. Guess what? Yes, I do think it should look like pgbackrest, but that's not because I have some not-invented-here-syndrome issue; it's because we've been through this, have learned a great deal, and have taken what we've learned and worked to build the best tool we can, much the way the PG community works to make the best database we can. Yes, we were able to argue and make it clear that a manifest really did make sense, and even that it should be in JSON format, and then argue that checking WAL was a pretty important part of verifying any backup, but each and every one of these ends up being a long and drawn-out argument, and it's draining. The thing is, this stuff isn't new to us.

Thanks,

Stephen
Greetings,

* Magnus Hagander (magnus@hagander.net) wrote:
> On Mon, Apr 6, 2020 at 4:45 PM Stephen Frost <sfrost@snowman.net> wrote:
> > * Noah Misch (noah@leadboat.com) wrote:
> > > On Fri, Apr 03, 2020 at 10:19:21AM -0400, Robert Haas wrote:
> > > > What I'm thinking about is: suppose we add an option to pg_basebackup
> > > > with a name like --pipe-output. This would be mutually exclusive with
> > > > -D, but would work at least with -Ft and maybe also with -Fp. The
> > > > argument to --pipe-output would be a shell command to be executed once
> > > > per output file. Any instance of %f in the shell command would be
> > > > replaced with the name of the file that would have been written (and
> > > > %% would turn into a single %). The shell command itself would be
> > > > executed via system(). So if you want to compress, but using some
> > > > other compression program instead of gzip, you could do something
> > > > like:
> > > >
> > > > pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'
> > >
> > > Seems good to me. I agree -Fp is a "maybe" since the overhead will be high
> > > for small files.
> >
> > For my 2c, at least, introducing more shell commands into critical parts
> > of the system is absolutely the wrong direction to go in.
> > archive_command continues to be a mess that we refuse to clean up or
> > even properly document and the project would be much better off by
> > trying to eliminate it rather than add in new ways for users to end up
> > with bad or invalid backups.
>
> I think the bigger problem with archive_command comes more from how
> it's defined to work, tbh, which leaves a lot of things open.
>
> This sounds to me like a much narrower use-case, which makes it a lot
> more OK. But I agree we have to be careful not to get back into that
> whole mess. One thing would be to clearly document such things *from
> the beginning*, and not try to retrofit it years later like we ended
> up doing with archive_command.

This sounds like a much broader use-case to me, not a narrower one. I agree that we don't want to try to retrofit things years later.

> And as Robert mentions downthread, the fsync() issue is definitely a
> real one, but if that is documented clearly ahead of time, that's a
> reasonable foot-gun, I'd say.

Documented how..?

> > Further, having a generic shell script approach like this would result
> > in things like "well, we don't need to actually add support for X, Y or
> > Z, because we have this wonderful generic shell script thing and you can
> > write your own, and therefore we won't accept patches which do add those
> > capabilities because then we'd have to actually maintain that support."
>
> In principle, I agree with "shellscripts suck".
>
> Now, if we were just talking about compression, it would actually be
> interesting to implement some sort of "postgres compression API" if
> you will, that is implemented by a shared library. This library could
> then be used from pg_basebackup or from anything else that needs
> compression. And anybody who wants could then do a "<compression X>
> for PostgreSQL" module, removing the need for us to carry such code
> upstream.

Getting a bit off-track here, but I actually think we should absolutely figure out a way to support custom compression options in PG. I had been thinking of something along the lines of per-datatype, actually, where each data type could define its own compression method, since we know that different data has different characteristics and therefore might benefit from different ways of compressing it. Though it's also true that, generically, there are tradeoffs between CPU time, memory size, resulting size on disk, etc., and having ways to pick between those could also be interesting.

> There's been discussions of that for the backend before IIRC, but I
> don't recall the conclusions. And in particular, I don't recall if it
> included the idea of being able to use it in situations like this as
> well, and with *run-time loading*.

Run-time loading brings in the fun that maybe we aren't able to load the library when we need to, and what then? :)

> And that said, then we'd limit ourselves to compression. We'd still
> need a way to deal with encryption...

And shipping stuff off to some remote server too, at least if we're going to tell users that they can use this approach to send their backups to s3... (And that reminds me: there are other things to think about there too, like maybe you don't want to ship 0-byte files to s3, or maybe you don't want to ship tiny files, because there are costs associated with these things...)

Thanks,

Stephen
On Mon, Apr 6, 2020 at 2:23 PM Stephen Frost <sfrost@snowman.net> wrote:
> So, instead of talking about 'bzip2 > %f.bz2', and then writing into our
> documentation that that's how this feature can be used, what about
> proposing something that would actually work reliably with this
> interface? Which properly fsync's everything, has good retry logic for
> when failures happen, is able to actually detect when a failure
> happened, how to restore from a backup taken this way, and it'd probably
> be good to show how pg_verifybackup could be used to make sure the
> backup is actually correct and valid too.

I don't really understand the problem here. Suppose I do:

mkdir ~/my-brand-new-empty-directory
cd ~/my-brand-new-empty-directory
pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'
initdb -S --dont-expect-that-this-is-a-data-directory . # because right now it would complain about pg_wal and pg_tblspc being missing

I think if all that works, my backup should be good and durably on disk. If it's not, then either pg_basebackup or bzip2 or initdb didn't report errors that they should have reported. If you're worried about that, say because you suspect those programs are buggy or because you think the kernel may not be reporting errors properly, you can use tar -jxvf + pg_validatebackup to check.

What *exactly* do you think can go wrong here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Apr 6, 2020 at 1:32 PM Magnus Hagander <magnus@hagander.net> wrote:
> Now, if we were just talking about compression, it would actually be
> interesting to implement some sort of "postgres compression API" if
> you will, that is implemented by a shared library. This library could
> then be used from pg_basebackup or from anything else that needs
> compression. And anybody who wants could then do a "<compression X>
> for PostgreSQL" module, removing the need for us to carry such code
> upstream.

I think it could be more general than a compression library. It could be a store-my-stuff-and-give-it-back-to-me library, which might do compression or encryption or cloud storage or any combination of the three, and probably other stuff too. Imagine that you first call an init function with a namespace that is basically a string provided by the user. Then you open a file, either for read or for write (but not both). Then you read or write a series of chunks (depending on the file mode). Then you close the file. Then you can do the same with more files. Finally, at the end, you close the namespace. You don't really need to care where or how the functions you are calling store the data. You just need them to return proper error indicators if by chance they fail.

As compared with my previous proposal, this would work much better for pg_basebackup -Fp, because you wouldn't launch a new bzip2 process for every file. You'd just bzopen(), which is presumably quite lightweight by comparison. The reasons I didn't propose it are:

1. Running bzip2 on every file in a plain-format backup seems a lot sillier than running it on every tar file in a tar-format backup.
2. I'm not confident that the command specified here actually needs to be anything very complicated (unlike archive_command).
3. The barrier to entry for a loadable module is a lot higher than for a shell command.
4. I think that all of our existing infrastructure for loadable modules is backend-only.

Now all of these are up for discussion. I am sure we can make the loadable module stuff work in frontend code; it would just take some work. A C interface for extensibility is very significantly harder to use than a shell interface, but it's still way better than no interface. The idea that this shell command can be something simple is my current belief, but it may turn out to be wrong. And I'm sure somebody can propose a good reason to do something with every file in a plain-format backup rather than using tar format.

All that being said, I still find it hard to believe that we will want to add dependencies on the libraries we'd need for encryption or S3 cloud storage to PostgreSQL itself. So if we go with this more integrated approach, we should consider the possibility that, when the dust settles, PostgreSQL will only have pg_basebackup --output-plugin=lz4 and Aurora will also have pg_basebackup --output-plugin=s3. From my point of view, that would be less than ideal.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 6, 2020 at 2:23 PM Stephen Frost <sfrost@snowman.net> wrote:
> > So, instead of talking about 'bzip2 > %f.bz2', and then writing into our
> > documentation that that's how this feature can be used, what about
> > proposing something that would actually work reliably with this
> > interface? Which properly fsync's everything, has good retry logic for
> > when failures happen, is able to actually detect when a failure
> > happened, how to restore from a backup taken this way, and it'd probably
> > be good to show how pg_verifybackup could be used to make sure the
> > backup is actually correct and valid too.
>
> I don't really understand the problem here. Suppose I do:
>
> mkdir ~/my-brand-new-empty-directory
> cd ~/my-brand-new-empty-directory
> pg_basebackup -Ft --pipe-output 'bzip2 > %f.bz2'
> initdb -S --dont-expect-that-this-is-a-data-directory . # because
> right now it would complain about pg_wal and pg_tblspc being missing
>
> I think if all that works, my backup should be good and durably on
> disk. If it's not, then either pg_basebackup or bzip2 or initdb didn't
> report errors that they should have reported. If you're worried about
> that, say because you suspect those programs are buggy or because you
> think the kernel may not be reporting errors properly, you can use tar
> -jxvf + pg_validatebackup to check.
>
> What *exactly* do you think can go wrong here?

What if %f.bz2 already exists? How about if %f has a space in it? What about if I'd like to verify that the backup looks reasonably valid without having to find space to store it entirely decompressed?

Also, this argument feels disingenuous to me. That isn't the only thing you're promoting this feature be used for, as you say below.

If the only thing this feature is *actually* intended for is to add bzip2 support, then we should just add bzip2 support directly and call it a day, but what you're really talking about here is a generic interface that you'll want to push users to for things like "how do I back up to s3" or "how do I back up to GCS", and so we should be thinking about those cases and not just a relatively simple use case.

This is the same kind of slippery slope that our archive_command is built on: sure, if everything "works" then it's "fine", even with our documented example, but we know that not everything works in the real world, and just throwing an 'initdb -S' in there isn't a general solution, because users want to do things like send WAL to s3 or GCS or such.

I don't think there's any doubt that there'll be no shortage of shell scripts and various other things used with this that, yes, will be provided by our users, and therefore we can blame them for doing it wrong, but then they'll complain on our lists and we'll spend time educating them as to how to write proper software for this, or pointing them to a solution that someone writes specifically for it. I don't view that as, ultimately, a good solution.

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Apr 6, 2020 at 1:32 PM Magnus Hagander <magnus@hagander.net> wrote:
> > Now, if we were just talking about compression, it would actually be
> > interesting to implement some sort of "postgres compression API" if
> > you will, that is implemented by a shared library. This library could
> > then be used from pg_basebackup or from anything else that needs
> > compression. And anybody who wants could then do a "<compression X>
> > for PostgreSQL" module, removing the need for us to carry such code
> > upstream.
>
> I think it could be more general than a compression library. It could
> be a store-my-stuff-and-give-it-back-to-me library, which might do
> compression or encryption or cloud storage or any combination of the
> three, and probably other stuff too. Imagine that you first call an
> init function with a namespace that is basically a string provided by
> the user. Then you open a file either for read or for write (but not
> both). Then you read or write a series of chunks (depending on the
> file mode). Then you close the file. Then you can do the same with
> more files. Finally at the end you close the namespace. You don't
> really need to care where or how the functions you are calling store
> the data. You just need them to return proper error indicators if by
> chance they fail.

Yes, having a storage layer makes a lot of sense here, with features that are understood by the core system and which each driver understands, and then having a filter system which is also pluggable and can support things like compression and hashing would also be great. I can point you to examples of all of the above, already implemented, in C, all OSS. Sure seems like a pretty good and reasonable approach to take when other folks are doing it.

> As compared with my previous proposal, this would work much better for
> pg_basebackup -Fp, because you wouldn't launch a new bzip2 process for
> every file. You'd just bzopen(), which is presumably quite lightweight
> by comparison. The reasons I didn't propose it are:
>
> 1. Running bzip2 on every file in a plain-format backup seems a lot
> sillier than running it on every tar file in a tar-format backup.

This is circular, isn't it? It's silly because you're launching new bzip2 processes for every file, but if you were using bzopen() then you wouldn't have that issue, and therefore compressing every file in a plain-format backup would be entirely reasonable.

> 2. I'm not confident that the command specified here actually needs to
> be anything very complicated (unlike archive_command).

This just doesn't make sense to me. The above talks about pushing things to cloud storage and such, which is definitely much more complicated than what had really been contemplated when archive_command was introduced.

> 3. The barrier to entry for a loadable module is a lot higher than for
> a shell command.

Sure.

> 4. I think that all of our existing infrastructure for loadable
> modules is backend-only.

That certainly doesn't seem a terrible hurdle, but I'm not convinced that we'd actually need or want this to be done through loadable modules. I'd argue that we should, instead, be thinking about building a system where we could accept patches that add new drivers and new filters to core, where they're reviewed and well written.

> Now all of these are up for discussion. I am sure we can make the
> loadable module stuff work in frontend code; it would just take some
> work. A C interface for extensibility is very significantly harder to
> use than a shell interface, but it's still way better than no
> interface. The idea that this shell command can be something simple is
> my current belief, but it may turn out to be wrong. And I'm sure
> somebody can propose a good reason to do something with every file in
> a plain-format backup rather than using tar format.

I've already tried to point out that the shell command you're talking about isn't going to be able to be just a simple command if the idea is that it'd be used to send things to s3 or gcs or anything like that. *Maybe* it could be simple if the only thing it's used for is a simple compression filter (though we'd have to deal with the whole fsync thing, as discussed), but it seems very likely that everyone would be a lot happier if we just built in support for bzip2, lz4, gzip, whatever, and that certainly doesn't strike me as a large ask in terms of code complexity or level of effort.
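The store-my-stuff-and-give-it-back-to-me interface quoted above (init a namespace, open a file for read or write but never both, move chunks, close, report errors) is concrete enough to sketch. Below is a rough illustration in Python rather than C, purely to show the shape of the driver contract; the class and method names are invented, and nothing like this exists in pg_basebackup today.

```python
import os
import tempfile

class LocalDirectoryDriver:
    """Hypothetical storage driver. A real system would hide several of
    these behind one interface: this one maps the user's namespace
    string to a local directory; an s3 driver might map it to a bucket."""

    def __init__(self, namespace: str):
        # the "init function with a namespace that is basically a string"
        self.root = namespace
        os.makedirs(self.root, exist_ok=True)

    def open_file(self, name: str, mode: str):
        # files are opened for read or for write, never both
        if mode not in ("rb", "wb"):
            raise ValueError("open for read or write, not both")
        return open(os.path.join(self.root, name), mode)

    def close(self):
        # a real driver would finalize uploads / fsync here and
        # return a proper error indicator on failure
        pass

# Write a file as a series of chunks, then read it back.
store = LocalDirectoryDriver(tempfile.mkdtemp())
with store.open_file("base.tar", "wb") as f:
    f.write(b"chunk-1")
    f.write(b"chunk-2")
with store.open_file("base.tar", "rb") as f:
    restored = f.read()
store.close()
```

The point of the exercise is that the caller would only ever use the generic methods; whether the bytes land on local disk, in object storage, or pass through a compression filter first is the driver's business.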
> All that being said, I still find it hard to believe that we will want
> to add dependencies for libraries that we'd need to do encryption or
> S3 cloud storage to PostgreSQL itself. So if we go with this more
> integrated approach we should consider the possibility that, when the
> dust settles, PostgreSQL will only have pg_basebackup
> --output-plugin=lz4 and Aurora will also have pg_basebackup
> --output-plugin=s3. From my point of view, that would be less than
> ideal.

We already have libraries for encryption, and we do not have to add libraries for s3 storage to support it as an option, as I mentioned up-thread. I don't find the argument that someone else might extend pg_basebackup (or whatever) to add on new features to be one that concerns me terribly much, provided we give people the opportunity to add those same features into core if they're willing to put in the effort to make it happen.

I'm quite concerned that this generic "you can just write a shell script to do it" approach will be used, over and over again, as an argument or at least a deterrent to having something proper in core, and will ultimately result in us not having any good solution in core for the very common use cases that our users have today. That certainly seems like what's happened with archive_command.

Thanks,

Stephen
On Wed, Apr 8, 2020 at 1:05 PM Stephen Frost <sfrost@snowman.net> wrote:
> What if %f.bz2 already exists?

That cannot occur in the scenario I described.

> How about if %f has a space in it?

For a tar-format backup I don't think that can happen, because the file names will be base.tar and ${tablespace_oid}.tar. For a plain-format backup it's a potential issue.

> What about if I'd like to verify that the backup looks reasonably valid
> without having to find space to store it entirely decompressed?

Then we need to make pg_validatebackup better.

> Also, this argument feels disingenuous to me.
> [ lots more stuff ]

This all just sounds like fearmongering to me. "archive_command doesn't work very well, so maybe your thing won't either." Maybe it won't, but the fact that archive_command doesn't isn't a reason.

> Yes, having a storage layer makes a lot of sense here, with features
> that are understood by the core system and which each driver
> understands, and then having a filter system which is also pluggable and
> can support things like compression and hashing for this would also be
> great.

It's good to know that you prefer a C interface to one based on shell scripting. I hope that we will also get some other opinions on that question, as my own feelings are somewhat divided (but with some bias toward trying to make the shell scripting thing work, because I believe it gives a lot more practical flexibility).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Apr 8, 2020 at 1:05 PM Stephen Frost <sfrost@snowman.net> wrote:
> > What if %f.bz2 already exists?
>
> That cannot occur in the scenario I described.

Of course it can.

> > How about if %f has a space in it?
>
> For a tar-format backup I don't think that can happen, because the
> file names will be base.tar and ${tablespace_oid}.tar. For a plain
> format backup it's a potential issue.

I agree that it might not be an issue for tar-format.

> > What about if I'd like to verify that the backup looks reasonably valid
> > without having to find space to store it entirely decompressed?
>
> Then we need to make pg_validatebackup better.

Sure- but shouldn't the design be contemplating how these various tools will work together?

> > Also, this argument feels disingenuous to me.
> > [ lots more stuff ]
>
> This all just sounds like fearmongering to me. "archive_command
> doesn't work very well, so maybe your thing won't either." Maybe it
> won't, but the fact that archive_command doesn't isn't a reason.

I was trying to explain that we have literally gone down exactly this path before and it's not been a good result, hence we should be really careful before going down it again. I don't consider that to be fearmongering, nor that we should be dismissing that concern out of hand.

> > Yes, having a storage layer makes a lot of sense here, with features
> > that are understood by the core system and which each driver
> > understands, and then having a filter system which is also pluggable and
> > can support things like compression and hashing for this would also be
> > great.
>
> It's good to know that you prefer a C interface to one based on shell
> scripting. I hope that we will also get some other opinions on that
> question, as my own feelings are somewhat divided (but with some bias
> toward trying to make the shell scripting thing work, because I
> believe it gives a lot more practical flexibility).
Yes, I do prefer a C interface. One might even say "been there, done that." Hopefully sharing such experience is still useful to do on these lists.

Thanks,

Stephen
On Wed, Apr 8, 2020 at 2:06 PM Stephen Frost <sfrost@snowman.net> wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > On Wed, Apr 8, 2020 at 1:05 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > What if %f.bz2 already exists?
> >
> > That cannot occur in the scenario I described.
>
> Of course it can.

Not really. The steps I described involved creating a new directory. Yeah, in theory, somebody could inject a file into that directory after we created it and before bzip writes any files into it, but pg_basebackup already has the exact same race condition.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Apr 8, 2020 at 2:06 PM Stephen Frost <sfrost@snowman.net> wrote:
> > * Robert Haas (robertmhaas@gmail.com) wrote:
> > > On Wed, Apr 8, 2020 at 1:05 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > > What if %f.bz2 already exists?
> > >
> > > That cannot occur in the scenario I described.
> >
> > Of course it can.
>
> Not really. The steps I described involved creating a new directory.
> Yeah, in theory, somebody could inject a file into that directory
> after we created it and before bzip writes any files into it, but
> pg_basebackup already has the exact same race condition.

With pg_basebackup, at least we could reasonably fix that race condition.

Thanks,

Stephen
On Mon, Apr 6, 2020 at 07:32:45PM +0200, Magnus Hagander wrote:
> On Mon, Apr 6, 2020 at 4:45 PM Stephen Frost <sfrost@snowman.net> wrote:
> > For my 2c, at least, introducing more shell commands into critical parts
> > of the system is absolutely the wrong direction to go in.
> > archive_command continues to be a mess that we refuse to clean up or
> > even properly document, and the project would be much better off by
> > trying to eliminate it rather than add in new ways for users to end up
> > with bad or invalid backups.
>
> I think the bigger problem with archive_command more comes from how
> it's defined to work tbh. Which leaves a lot of things open.
>
> This sounds to me like a much narrower use-case, which makes it a lot
> more OK. But I agree we have to be careful not to get back into that
> whole mess. One thing would be to clearly document such things *from
> the beginning*, and not try to retrofit it years later like we ended
> up doing with archive_command.
>
> And as Robert mentions downthread, the fsync() issue is definitely a
> real one, but if that is documented clearly ahead of time, that's a
> reasonable level foot-gun I'd say.

I think we need to step back and look at the larger issue. The real argument goes back to the Unix command-line API vs. the VMS/Windows API. The former has discrete parts that can be stitched together, while the VMS/Windows API presents a more duplicative but more holistic API for every piece. We have discussed using shell commands for archive_command, and even more recently, for the server pass phrase.

To get more specific, I think we have to understand how the _requirements_ of the job match the shell script API, with stdin, stdout, stderr, return code, and command-line arguments. Looking at archive_command, the command-line arguments allow specification of file names, but quoting can be complex. The error return code and stderr output seem to work fine. There is no clean API for fsync and for testing whether the file exists, so all of that has to be hand-done in one command line. This is why many users use pre-written archive_command shell scripts.

This brings up a few questions:

* Should we have split apart archive_command into file-exists, copy, and fsync-file? Should we add that now?

* How well does this backup requirement match the shell command API?

-- 
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EnterpriseDB                      https://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> I think we need to step back and look at the larger issue. The real
> argument goes back to the Unix command-line API vs the VMS/Windows API.
> The former has discrete parts that can be stitched together, while the
> VMS/Windows API presents a more duplicative but more holistic API for
> every piece. We have discussed using shell commands for
> archive_command, and even more recently, for the server pass phrase.

When it comes to something like the server pass phrase, it seems much more reasonable to consider using a shell script (though still perhaps not ideal) because it's not involved directly in ensuring that the data is reliably stored, and it's pretty clear that if it doesn't work the worst thing that happens is that the database doesn't start up; it won't corrupt any data or destroy it or do other bad things.

> To get more specific, I think we have to understand how the
> _requirements_ of the job match the shell script API, with stdin,
> stdout, stderr, return code, and command-line arguments. Looking at
> archive_command, the command-line arguments allow specification of file
> names, but quoting can be complex. The error return code and stderr
> output seem to work fine. There is no clean API for fsync and testing
> if the file exists, so that all that has to be hand done in one
> command-line. This is why many users use pre-written archive_command
> shell scripts.

We aren't considering all of the use-cases really though. In specific, things like pushing to s3 or gcs require, at least, good retry logic, and that's without starting to think about things like high-rate systems (spawning lots of new processes isn't free, particularly if they're written in shell script, but any interpreted language is expensive) and wanting to parallelize.

> This brings up a few questions:
>
> * Should we have split apart archive_command into file-exists, copy,
> fsync-file? Should we add that now?

No. The right approach to improving on archive_command is to add a way for an extension to take over that job, maybe with a complete background worker of its own, or perhaps a shared library that can be loaded by the archiver process, at least if we're talking about how to allow people to extend it.

Potentially a better answer is to just build this stuff into PG: things like "archive WAL to s3/GCS with these credentials" are what an awful lot of users want. There's then some who want "archive first to this other server, and then archive to s3/GCS", or more complex options.

I'll also point out that there's not one "s3": there are quite a few alternatives, including some which are open source, which talk the s3 protocol (sadly, they don't all do it perfectly, which is why we are talking about building a GCS-specific driver for gcs rather than using their s3 gateway, but still, s3 isn't just 'one thing').

> * How well does this backup requirement match with the shell command
> API?

For my part, it's not just a question of an API, but a question of who is going to implement a good and reliable solution: PG developers, or some admin who is just trying to get PG up and running in their environment? One aspect of that is being knowledgeable about where all the land mines are, like the whole fsync thing. Sure, if you're a PG developer or you've been around long enough, you're going to realize that 'cp' isn't going to fsync() the file and therefore it's a pretty high-risk choice for archive_command, and you'll understand just how important WAL is. But there's certainly an awful lot of folks out there who don't realize that, or at least don't think about it when they're standing up a new system; instead, they're just following our docs with the expectation that those docs are providing good advice.

Thanks,

Stephen
On Thu, Apr 9, 2020 at 04:15:07PM -0400, Stephen Frost wrote:
> Greetings,
>
> * Bruce Momjian (bruce@momjian.us) wrote:
> > I think we need to step back and look at the larger issue. The real
> > argument goes back to the Unix command-line API vs the VMS/Windows API.
> > The former has discrete parts that can be stitched together, while the
> > VMS/Windows API presents a more duplicative but more holistic API for
> > every piece. We have discussed using shell commands for
> > archive_command, and even more recently, for the server pass phrase.
>
> When it comes to something like the server pass phrase, it seems much
> more reasonable to consider using a shell script (though still perhaps
> not ideal) because it's not involved directly in ensuring that the data
> is reliably stored and it's pretty clear that if it doesn't work the
> worst thing that happens is that the database doesn't start up, but it
> won't corrupt any data or destroy it or do other bad things.

Well, the pass phrase relates to security, so it is important too. I don't think the _importance_ of the action is the most determining issue. Rather, I think it is how well the action fits the shell script API.

> > To get more specific, I think we have to understand how the
> > _requirements_ of the job match the shell script API, with stdin,
> > stdout, stderr, return code, and command-line arguments. Looking at
> > archive_command, the command-line arguments allow specification of file
> > names, but quoting can be complex. The error return code and stderr
> > output seem to work fine. There is no clean API for fsync and testing
> > if the file exists, so that all that has to be hand done in one
> > command-line. This is why many users use pre-written archive_command
> > shell scripts.
>
> We aren't considering all of the use-cases really though, in specific,
> things like pushing to s3 or gcs require, at least, good retry logic,
> and that's without starting to think about things like high-rate systems
> (spawning lots of new processes isn't free, particularly if they're
> written in shell script but any interpreted language is expensive) and
> wanting to parallelize.

Good point, but if there are multiple APIs, it makes shell script flexibility even more useful.

> > This brings up a few questions:
> >
> > * Should we have split apart archive_command into file-exists, copy,
> > fsync-file? Should we add that now?
>
> No. The right approach to improving on archive_command is to add a way
> for an extension to take over that job, maybe with a complete background
> worker of its own, or perhaps a shared library that can be loaded by the
> archiver process, at least if we're talking about how to allow people to
> extend it.

That seems quite vague, which is the issue we had years ago when considering doing archive_command as a link to a C library.

> Potentially a better answer is to just build this stuff into PG: things
> like "archive WAL to s3/GCS with these credentials" are what an awful
> lot of users want. There's then some who want "archive first to this
> other server, and then archive to s3/GCS", or more complex options.

Yes, we certainly know how to do a file system copy, but what about copying files to other things like S3? I don't know how we would do that and allow users to change things like file paths or URLs.

-- 
Bruce Momjian <bruce@momjian.us>  https://momjian.us
EnterpriseDB                      https://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote:
> Good point, but if there are multiple APIs, it makes shell script
> flexibility even more useful.

This is really the key point for me. There are so many existing tools that store a file someplace that we really can't ever hope to support them all in core, or even to have well-written extensions that support them all available on PGXN or wherever. We need to integrate with the tools that other people have created, not try to reinvent them all in PostgreSQL.

Now what I understand Stephen to be saying is that a lot of those tools actually suck, and I think that's a completely valid point. But I also think that it's unwise to decide that such problems are our problems rather than problems with those tools. That's a hole with no bottom.

One thing I do think would be realistic would be to invent a set of tools that perform certain local filesystem operations in a "hardened" way. Maybe a single tool with subcommands and options. So you could say, e.g. 'pgfile cp SOURCE TARGET', and it would create a temporary file in the target directory, write the contents of the source into that file, fsync the file, rename it into place, and do more fsyncs to make sure it's all durable in case of a crash. You could have a variant of this that, instead of using the temporary-file-and-rename-into-place approach, does the thing where you open the target file with O_CREAT|O_EXCL, write the bytes, and then close and fsync it. And you could have other things too, like 'pgfile mkdir DIR' to create a directory and fsync it for durability. A toolset like this would probably help people write better archive commands - it would certainly be an improvement over what we have now, anyway - and it could also be used with the feature that I proposed upthread.
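The hardened 'pgfile cp' behavior described above (temporary file, fsync, rename into place, fsync the directory) is the classic crash-safe write pattern. Here is a minimal sketch of it in Python, assuming a POSIX filesystem; 'pgfile' itself is only a proposal, and a real implementation would presumably be in C:

```python
import os
import tempfile

def durable_copy(src: str, dst: str) -> None:
    """Copy src to dst so that, after a crash, dst is either absent or
    complete: write a temp file in the target directory, fsync it,
    rename() it into place, then fsync the directory itself."""
    dst_dir = os.path.dirname(dst) or "."
    tmp = os.path.join(dst_dir, ".tmp." + os.path.basename(dst))
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        fout.write(fin.read())       # a real tool would copy in chunks
        fout.flush()
        os.fsync(fout.fileno())      # data must be durable before the rename
    os.rename(tmp, dst)              # atomic within one filesystem
    dirfd = os.open(dst_dir, os.O_RDONLY)   # POSIX-only directory handle
    try:
        os.fsync(dirfd)              # make the rename itself durable
    finally:
        os.close(dirfd)

# Demonstration: copy a small "backup" into place.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "src.tar")
dst = os.path.join(workdir, "base.tar")
with open(src, "wb") as f:
    f.write(b"backup bytes")
durable_copy(src, dst)
```

If the machine crashes before the rename, the destination never exists; if it crashes after, the destination is complete. That is the property a bare 'cp' does not give you.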
For example, if you're concerned that bzip might overwrite an existing file and that it might not fsync, then instead of saying:

pg_basebackup -Ft --pipe-output 'bzip > %f.bz2'

you could instead write:

pg_basebackup -Ft --pipe-output 'bzip | pgfile create-exclusive - %f.bz2'

or whatever we pick for actual syntax. And that provides a kind of hardening that can be used with any other command-line tool that can be used as a filter. If you want to compress with bzip, encrypt, and then copy the file to a remote system, you could do:

pg_basebackup -Ft --pipe-output 'bzip | gpg -e | ssh someuser@somehost pgfile create-exclusive - /backups/tuesday/%f.bz2'

It is of course not impossible to teach pg_basebackup to do all of that stuff internally, but I have a really difficult time imagining us ever getting it done. There are just too many possibilities, and new ones arise all the time.

A 'pgfile' utility wouldn't help at all for people who are storing to S3 or whatever. They could use 'aws s3' as a target for --pipe-output, but if it turns out that said tool is insufficiently robust in terms of overwriting files or doing fsyncs or whatever, then they might have problems. Now, Stephen or anyone else could choose to provide alternative tools with more robust behavior, and that would be great. But even if he didn't, people could take their chances with what's already out there. To me, that's a good thing. Yeah, maybe they'll do dumb things that don't work, but realistically, they can do dumb stuff without the proposed option too.

> Yes, we certainly know how to do a file system copy, but what about
> copying files to other things like S3? I don't know how we would do
> that and allow users to change things like file paths or URLs.

Right. I think it's key that we provide people with tools that are highly flexible and, ideally, also highly composable.
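A 'pgfile create-exclusive' subcommand, if it existed, would essentially be O_CREAT|O_EXCL plus an fsync: refuse to clobber an existing file, and don't report success until the bytes are durable. A sketch in Python, with the caveat that the name and behavior are hypothetical:

```python
import os
import tempfile

def create_exclusive(path: str, data: bytes) -> None:
    """Write data to path, failing if path already exists, and fsync
    before returning so that success implies durability."""
    # O_EXCL turns "silently overwrite last week's backup" into an error.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)

target = os.path.join(tempfile.mkdtemp(), "base.tar.bz2")
create_exclusive(target, b"compressed backup")

# A second write to the same name must fail rather than overwrite.
try:
    create_exclusive(target, b"oops")
    clobbered = True
except FileExistsError:
    clobbered = False
```

In the pipeline form, the real tool would read the bytes from stdin rather than taking them as an argument, which is what makes it usable as the last stage after any filter.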
(Incidentally, pg_basebackup already has an option to output the entire backup as a tarfile on standard output, and a user can already pipe that into any tool they like. However, it doesn't work with tablespaces. So you could think of this proposal as extending the existing functionality to cover that case.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,

* Bruce Momjian (bruce@momjian.us) wrote:
> On Thu, Apr 9, 2020 at 04:15:07PM -0400, Stephen Frost wrote:
> > * Bruce Momjian (bruce@momjian.us) wrote:
> > > I think we need to step back and look at the larger issue. The real
> > > argument goes back to the Unix command-line API vs the VMS/Windows API.
> > > The former has discrete parts that can be stitched together, while the
> > > VMS/Windows API presents a more duplicative but more holistic API for
> > > every piece. We have discussed using shell commands for
> > > archive_command, and even more recently, for the server pass phrase.
> >
> > When it comes to something like the server pass phrase, it seems much
> > more reasonable to consider using a shell script (though still perhaps
> > not ideal) because it's not involved directly in ensuring that the data
> > is reliably stored and it's pretty clear that if it doesn't work the
> > worst thing that happens is that the database doesn't start up, but it
> > won't corrupt any data or destroy it or do other bad things.
>
> Well, the pass phrase relates to security, so it is important too. I
> don't think the _importance_ of the action is the most determining
> issue. Rather, I think it is how well the action fits the shell script
> API.

There isn't a single 'shell script API' though, and it's possible to craft a 'shell script API' to fit nearly any use-case, but that doesn't make it a good solution. The amount we depend on the external code for the correct operation of the system is relevant, and important to consider.

> > > To get more specific, I think we have to understand how the
> > > _requirements_ of the job match the shell script API, with stdin,
> > > stdout, stderr, return code, and command-line arguments. Looking at
> > > archive_command, the command-line arguments allow specification of file
> > > names, but quoting can be complex. The error return code and stderr
> > > output seem to work fine. There is no clean API for fsync and testing
> > > if the file exists, so that all that has to be hand done in one
> > > command-line. This is why many users use pre-written archive_command
> > > shell scripts.
> >
> > We aren't considering all of the use-cases really though, in specific,
> > things like pushing to s3 or gcs require, at least, good retry logic,
> > and that's without starting to think about things like high-rate systems
> > (spawning lots of new processes isn't free, particularly if they're
> > written in shell script but any interpreted language is expensive) and
> > wanting to parallelize.
>
> Good point, but if there are multiple APIs, it makes shell script
> flexibility even more useful.

This doesn't seem to answer the concerns that I brought up. Trying to understand it did make me think of another relevant question that was brought up in this discussion: can we really expect users to actually implement a C library for this, if we provided a way for them to? For that, I'd point to FDWs, where we certainly don't have any shortage of external solutions written in C. Another would be logical decoding.

> > > This brings up a few questions:
> > >
> > > * Should we have split apart archive_command into file-exists, copy,
> > > fsync-file? Should we add that now?
> >
> > No. The right approach to improving on archive_command is to add a way
> > for an extension to take over that job, maybe with a complete background
> > worker of its own, or perhaps a shared library that can be loaded by the
> > archiver process, at least if we're talking about how to allow people to
> > extend it.
>
> That seems quite vague, which is the issue we had years ago when
> considering doing archive_command as a link to a C library.

That prior discussion isn't really relevant though, as it was before we had extensions, and before we had background workers that can run as part of an extension.
> > Potentially a better answer is to just build this stuff into PG: things
> > like "archive WAL to s3/GCS with these credentials" are what an awful
> > lot of users want. There's then some who want "archive first to this
> > other server, and then archive to s3/GCS", or more complex options.
>
> Yes, we certainly know how to do a file system copy, but what about
> copying files to other things like S3? I don't know how we would do
> that and allow users to change things like file paths or URLs.

There's a few different ways we could go about this. The simple answer would be to use GUCs, which would simplify things like dealing with the restore side too. Another option would be to have a concept of 'repository' objects in the system, not unlike tablespaces, but they'd have more options. To deal with that during recovery though, we'd need a way to get the relevant information from the catalogs (maybe we write the catalog out to a flat file on update, not unlike what we used to do with pg_shadow), perhaps even in a format that users could modify if they needed to. The nice thing about having actual objects in the system is that it'd be a bit cleaner to be able to define multiple ones and then have SQL-level functions/commands that work with them.

A good deal of this does involve the question about how to deal with recovery though, since you might want to, or need to, use different options when it comes to recovery. Back to the use-case that I was mentioning, you could certainly want something like "try to get the WAL from the local archive, and if that doesn't work, try to get it from the s3 repo". What that implies then is that you'd really like a way to configure multiple repos, which is where we start to see the fragility of our GUC system. Pushing that out to something external doesn't strike me as the right answer though; rather, we should think about how to resolve these issues with the GUC system, or come up with something better.

This isn't the only area where the GUC system isn't really helping us: synchronous_standby_names is getting to be a pretty complicated GUC, for example. Of course, we could start out with just supporting a single repo with just a few new GUCs to configure it; that wouldn't be hard, and there's good examples out there about what's needed to configure an s3 repo.

Thanks,

Stephen
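If the "single repo, a few new GUCs" route were taken, the configuration might end up looking something like the fragment below. To be clear, every parameter name here is invented for illustration; no such GUCs exist:

```
# hypothetical postgresql.conf fragment -- none of these GUCs exist today
archive_repo_type = 's3'                      # or 'posix', 'gcs', ...
archive_repo_path = 's3://my-bucket/wal'      # bucket and prefix for the repo
archive_repo_region = 'us-east-1'
archive_repo_credentials = '/etc/pg/s3.creds' # keep secrets out of the conf file
archive_repo_retries = 3                      # retry logic in core, not in shell
```

The fragility being pointed at shows up as soon as you want a second repo: flat GUCs don't compose into a list of structured objects, which is why the 'repository object' idea keeps coming back.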
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote: > > Good point, but if there are multiple APIs, it makes shell script > > flexibility even more useful. > > This is really the key point for me. There are so many existing tools > that store a file someplace that we really can't ever hope to support > them all in core, or even to have well-written extensions that support > them all available on PGXN or wherever. We need to integrate with the > tools that other people have created, not try to reinvent them all in > PostgreSQL. So, this goes to what I was just mentioning to Bruce independently- you could have made the same argument about FDWs, but it just doesn't actually hold any water. Sure, some of the FDWs aren't great, but there's certainly no shortage of them, and the ones that are particularly important (like postgres_fdw) are well written and in core. > Now what I understand Stephen to be saying is that a lot of those > tools actually suck, and I think that's a completely valid point. But > I also think that it's unwise to decide that such problems are our > problems rather than problems with those tools. That's a hole with no > bottom. I don't really think 'bzip2' sucks as a tool, or that bash does. They weren't designed or intended to meet the expectations that we have for data durability though, which is why relying on them for exactly that ends up being a bad recipe. > One thing I do think would be realistic would be to invent a set of > tools that are perform certain local filesystem operations in a > "hardened" way. Maybe a single tool with subcommands and options. So > you could say, e.g. 'pgfile cp SOURCE TARGET' and it would create a > temporary file in the target directory, write the contents of the > source into that file, fsync the file, rename it into place, and do > more fsyncs to make sure it's all durable in case of a crash. 
You > could have a variant of this that instead of using the temporary file > and rename in place approach, does the thing where you open the target > file with O_CREAT|O_EXCL, writes the bytes, and then closes and fsyncs > it. And you could have other things too, like 'pgfile mkdir DIR' to > create a directory and fsync it for durability. A toolset like this > would probably help people write better archive commands - it would > certainly be an improvement over what we have now, anyway, and it > could also be used with the feature that I proposed upthread. This argument leads in a direction to justify anything as being sensible to implement using shell scripts. If we're open to writing the shell-level tools that would be needed, we could reimplement all of our indexes that way, or FDWs, or TDE, or just about anything else. What we would end up with though is that we'd have more complications changing those interfaces because people will be using those tools, and maybe those tools don't get updated at the same time as PG does, and maybe there are critical changes that need to be made in back branches and we can't really do that with these interfaces. > It is of course not impossible to teach pg_basebackup to do all of > that stuff internally, but I have a really difficult time imagining us > ever getting it done. There are just too many possibilities, and new > ones arise all the time. I agree that it's certainly a fair bit of work, but it can be accomplished incrementally and, with a good design, allow for adding in new options in the future with relative ease. Now is the time to discuss what that design looks like, think about how we can implement it in a way that all of the tools we have are able to work together, and have them all support and be tested together with these different options. 
The concerns about there being too many possibilities and new ones coming up all the time could be applied equally to FDWs, but rather than ending up with a dearth of options and external solutions there, what we've actually seen is an explosion of options and externally written libraries for a large variety of options. > A 'pgfile' utility wouldn't help at all for people who are storing to > S3 or whatever. They could use 'aws s3' as a target for --pipe-output, > but if it turns out that said tool is insufficiently robust in terms > of overwriting files or doing fsyncs or whatever, then they might have > problems. Now, Stephen or anyone else could choose to provide > alternative tools with more robust behavior, and that would be great. > But even if he didn't, people could take their chances with what's > already out there. To me, that's a good thing. Yeah, maybe they'll do > dumb things that don't work, but realistically, they can do dumb stuff > without the proposed option too. How does this solution give them a good way to do the right thing though? In a way that will work with large databases and complex requirements? The answer seems to be "well, everyone will have to write their own tool to do that" and that basically means that, at best, we're only providing half of a solution and expecting all of our users to provide the other half, and to always do it correctly and in a well written way. Acknowledging that most users aren't going to actually do that and instead they'll implement half measures that aren't reliable shouldn't be seen as an endorsement of this approach. Thanks, Stephen
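[Editorial note: the hardened-copy behaviour described for the hypothetical 'pgfile cp SOURCE TARGET' subcommand (write to a temp file, fsync it, rename into place, fsync the directory) can be sketched in a few lines. 'pgfile' is only a proposal, so the function name and temp-file convention below are invented for illustration.]

```python
import os

def durable_copy(source, target):
    """Sketch of 'pgfile cp': crash-safe copy via temp file + rename."""
    directory = os.path.dirname(target) or "."
    tmp = os.path.join(directory, "." + os.path.basename(target) + ".tmp")
    with open(source, "rb") as src, open(tmp, "wb") as dst:
        dst.write(src.read())        # a real tool would copy in chunks
        dst.flush()
        os.fsync(dst.fileno())       # data reaches stable storage
    os.rename(tmp, target)           # atomic rename into place
    dirfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dirfd)              # make the rename itself durable
    finally:
        os.close(dirfd)
```

A crash before the rename leaves only the temp file behind; a crash after the directory fsync leaves the complete target in place, which is the property the 'pgfile' proposal is after.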
On Fri, Apr 10, 2020 at 10:54:10AM -0400, Stephen Frost wrote: > Greetings, > > * Robert Haas (robertmhaas@gmail.com) wrote: > > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote: > > > Good point, but if there are multiple APIs, it makes shell script > > > flexibility even more useful. > > > > This is really the key point for me. There are so many existing tools > > that store a file someplace that we really can't ever hope to support > > them all in core, or even to have well-written extensions that support > > them all available on PGXN or wherever. We need to integrate with the > > tools that other people have created, not try to reinvent them all in > > PostgreSQL. > > So, this goes to what I was just mentioning to Bruce independently- you > could have made the same argument about FDWs, but it just doesn't > actually hold any water. Sure, some of the FDWs aren't great, but > there's certainly no shortage of them, and the ones that are > particularly important (like postgres_fdw) are well written and in core. No, no one made that argument. It isn't clear how a shell script API would map to relational database queries. The point is how well the APIs match, and then if they are close, does it give us the flexibility we need. You can't just look at flexibility without an API match. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Fri, Apr 10, 2020 at 10:54 AM Stephen Frost <sfrost@snowman.net> wrote: > So, this goes to what I was just mentioning to Bruce independently- you > could have made the same argument about FDWs, but it just doesn't > actually hold any water. Sure, some of the FDWs aren't great, but > there's certainly no shortage of them, and the ones that are > particularly important (like postgres_fdw) are well written and in core. That's a fairly different use case. In the case of the FDW interface: - The number of interface method calls is very high, at least one per tuple and a bunch of extra ones for each query. - There is a significant amount of complex state that needs to be maintained across API calls. - The return values are often tuples, which are themselves an in-memory data structure. But here: - We're only talking about writing a handful of tar files, and that's in the context of a full-database backup, which is a much heavier-weight operation than a query. - There is not really any state that needs to be maintained across calls. - The expected result is that a file gets written someplace, which is not an in-memory data structure but something that gets written to a place outside of PostgreSQL. > The concerns about there being too many possibilities and new ones > coming up all the time could be applied equally to FDWs, but rather than > ending up with a dearth of options and external solutions there, what > we've actually seen is an explosion of options and externally written > libraries for a large variety of options. Sure, but a lot of those FDWs are relatively low-quality, and it's often hard to find one that does what you want. And even if you do, you don't really know how good it is. Unfortunately, in that case there's no real alternative, because implementing something based on shell commands couldn't ever have reasonable performance or a halfway decent feature set. That's not the case here. 
> How does this solution give them a good way to do the right thing > though? In a way that will work with large databases and complex > requirements? The answer seems to be "well, everyone will have to write > their own tool to do that" and that basically means that, at best, we're > only providing half of a solution and expecting all of our users to > provide the other half, and to always do it correctly and in a well > written way. Acknowledging that most users aren't going to actually do > that and instead they'll implement half measures that aren't reliable > shouldn't be seen as an endorsement of this approach. I don't acknowledge that. I think it's possible to use tools like the proposed option in a perfectly reliable way, and I've already given a bunch of examples of how it could be done. Writing a file is not such a complex operation that every bit of code that writes one reliably has to be written by someone associated with the PostgreSQL project. I strongly suspect that people who use a cloud provider's tools to upload their backup files will be quite happy with the results, and if they aren't, I hope they will blame the cloud provider's tool for eating the data rather than this option for making it easy to give the data to the thing that ate it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-04-10 12:20:01 -0400, Robert Haas wrote: > - We're only talking about writing a handful of tar files, and that's > in the context of a full-database backup, which is a much > heavier-weight operation than a query. > - There is not really any state that needs to be maintained across calls. > - The expected result is that a file gets written someplace, which is > not an in-memory data structure but something that gets written to a > place outside of PostgreSQL. Wouldn't there be state like an S3/ssh/https/... connection? And perhaps a 'backup_id' in the backup metadata DB that one would want to update at the end? Greetings, Andres Freund
On 10/4/20 15:49, Robert Haas wrote: > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote: >> Good point, but if there are multiple APIs, it makes shell script >> flexibility even more useful. > [snip] > > One thing I do think would be realistic would be to invent a set of > tools that perform certain local filesystem operations in a > "hardened" way. +10 > Maybe a single tool with subcommands and options. So > you could say, e.g. 'pgfile cp SOURCE TARGET' and it would create a > temporary file in the target directory, write the contents of the > source into that file, fsync the file, rename it into place, and do > more fsyncs to make sure it's all durable in case of a crash. You > could have a variant of this that instead of using the temporary file > and rename in place approach, does the thing where you open the target > file with O_CREAT|O_EXCL, writes the bytes, and then closes and fsyncs > it. Behaviour might be decided in the same way as the default for 'wal_sync_method' gets chosen, as the most appropriate for a particular system. > And you could have other things too, like 'pgfile mkdir DIR' to > create a directory and fsync it for durability. A toolset like this > would probably help people write better archive commands Definitely, "mkdir" and "create-exclusive" (along with cp) would be a great addition and would simplify doing these kinds of tasks properly (i.e. without risking data loss every time) > [excerpted] > > pg_basebackup -Ft --pipe-output 'bzip | pgfile create-exclusive - %f.bz2' > > [....] > > pg_basebackup -Ft --pipe-output 'bzip | gpg -e | ssh someuser@somehost > pgfile create-exclusive - /backups/tuesday/%f.bz2' Yep. Would also fit the case for non-synchronous NFS mounts for backups... > It is of course not impossible to teach pg_basebackup to do all of > that stuff internally, but I have a really difficult time imagining us > ever getting it done. There are just too many possibilities, and new > ones arise all the time. Indeed. 
The beauty of Unix-like OSs is precisely this. > A 'pgfile' utility wouldn't help at all for people who are storing to > S3 or whatever. They could use 'aws s3' as a target for --pipe-output, > [snip] > (Incidentally, pg_basebackup already has an option to output the > entire backup as a tarfile on standard output, and a user can already > pipe that into any tool they like. However, it doesn't work with > tablespaces. So you could think of this proposal as extending the > existing functionality to cover that case.) Been there already :S Having pg_basebackup output multiple tarballs (one per tablespace), ideally separated by some marker so that splitting can be done trivially on the receiving end, would help. ...but that's probably a matter for another thread. Thanks, / J.L.
On Fri, Apr 10, 2020 at 3:38 PM Andres Freund <andres@anarazel.de> wrote: > Wouldn't there be state like a S3/ssh/https/... connection? And perhaps > a 'backup_id' in the backup metadata DB that'd one would want to update > at the end? Good question. I don't know that there would be but, uh, maybe? It's not obvious to me why all of that would need to be done using the same connection, but if it is, the idea I proposed isn't going to work very nicely. More generally, can you think of any ideas for how to structure an API here that are easier to use than "write some C code"? Or do you think we should tell people to write some C code if they want to compress/encrypt/relocate their backup in some non-standard way? For the record, I'm not against eventually having more than one way to do this, maybe a shell-script interface for simpler things and some kind of API for more complex needs (e.g. NetBackup integration, perhaps). And I did wonder if there was some other way we could do this. For instance, we could add an option --tar-everything that sticks all the things that would have been returned by the backup inside another level of tar file and sends the result to stdout. Then you can pipe it into a single command that gets invoked only once for all the data, rather than once per tablespace. That might be better, but I'm not sure it's better. It's better if you want to do complicated things that involve steps that happen before and after and persistent connections and so on, but it seems worse for simple things like piping through a non-default compressor. Larry Wall somewhat famously commented that a good programming language should (and I paraphrase) make simple things simple and complex things possible. My hesitation in going straight to a C API is that it does not make simple things simple; and I'd like to be really sure that there is no way of achieving that valuable goal before we give up on it. However, there is no doubt that a C API is potentially more powerful. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 10/4/20 21:38, Andres Freund wrote: > Hi, > > On 2020-04-10 12:20:01 -0400, Robert Haas wrote: >> - We're only talking about writing a handful of tar files, and that's >> in the context of a full-database backup, which is a much >> heavier-weight operation than a query. >> - There is not really any state that needs to be maintained across calls. >> - The expected result is that a file gets written someplace, which is >> not an in-memory data structure but something that gets written to a >> place outside of PostgreSQL. > Wouldn't there be state like a S3/ssh/https/... connection? ...to try and save opening a new connection in the context of a (potentially) multi-TB backup? :S > And perhaps > a 'backup_id' in the backup metadata DB that one would want to update > at the end? This is, indeed, material for external tools. Each caters to a particular set of end-user requirements. We have many examples already, with most even co-authored by this list's regulars... and IMHO none is suitable for ALL use-cases. BUT I agree that providing better tools with Postgres itself, ready to use --- that is, uncomment the default "archive_command" and get going for a very basic starting point --- is a huge advancement in the right direction. More importantly (IMO): including the call to "pgfile" or equivalent quite clearly signals to any unsuspecting user that there is more to safely archiving WAL segments than just doing "cp -a" blindly and hoping that the tool magically does all the required steps [needed to ensure data safety in this case, rather than the usual behaviour]. It's probably more effective than just amending the existing comments to point users to a (new?) section within the documentation. 
This comment is from experience: I've lost count of how many times I have had to "fix" the default command for WAL archiving --- precisely because it had been blindly copied from the default without thinking further about the implications should any deviation from expected behaviour happen during WAL archiving... only to be noticed at (attempted) recovery time :\ HTH. Thanks, J.L.
On Sat, Apr 11, 2020 at 10:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Apr 10, 2020 at 3:38 PM Andres Freund <andres@anarazel.de> wrote:
> Wouldn't there be state like a S3/ssh/https/... connection? And perhaps
> a 'backup_id' in the backup metadata DB that'd one would want to update
> at the end?
Good question. I don't know that there would be but, uh, maybe? It's
not obvious to me why all of that would need to be done using the same
connection, but if it is, the idea I proposed isn't going to work very
nicely.
There are certainly cases for it. It might not be that they have to be the same connection, but they may still need to be the same session: for example, you first perform some authentication step, get a token, and then use that token for all the files. You'd need somewhere to maintain that state, even if it doesn't happen to be a socket. But there are definitely plenty of cases where keeping an open socket can be a huge performance gain -- especially when it comes to not re-negotiating encryption etc.
More generally, can you think of any ideas for how to structure an API
here that are easier to use than "write some C code"? Or do you think
we should tell people to write some C code if they want to
compress/encrypt/relocate their backup in some non-standard way?
For compression and encryption, it could perhaps be as simple as "the command has to be a pipe on both input and output", basically sending the transformed data back to pg_basebackup.
But that won't help if the target is to relocate things...
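[Editorial note: the "pipe on both input and output" idea can be sketched as below; gzip stands in for an arbitrary filter command, and the wrapper function is purely illustrative.]

```python
import subprocess

def run_filter(argv, data):
    """Feed data to a filter command's stdin, return its stdout."""
    result = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout

# The backup tool would stream each file through the configured command;
# any program that reads stdin and writes stdout works as a filter.
compressed = run_filter(["gzip", "-c"], b"some file contents")
restored = run_filter(["gzip", "-dc"], compressed)
```

Because the filter only sees a byte stream, this covers compression and encryption, but, as noted, it gives the command no say in *where* the result ends up.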
For the record, I'm not against eventually having more than one way to
do this, maybe a shell-script interface for simpler things and some
kind of API for more complex needs (e.g. NetBackup integration,
perhaps). And I did wonder if there was some other way we could do
this. For instance, we could add an option --tar-everything that
sticks all the things that would have been returned by the backup
inside another level of tar file and sends the result to stdout. Then
you can pipe it into a single command that gets invoked only once for
all the data, rather than once per tablespace. That might be better,
but I'm not sure it's better. It's better if you want to do
complicated things that involve steps that happen before and after and
persistent connections and so on, but it seems worse for simple things
like piping through a non-default compressor.
That is one way to go for it -- and in a case like that, I'd suggest the shellscript interface would be an implementation of the other API. A number of times through the years I've bounced ideas around for what to do with archive_command with different people (never quite to the level of "it's time to write a patch"), and it's mostly come down to some sort of shlib api where in turn we'd ship a backwards compatible implementation that would behave like archive_command. I'd envision something similar here.
Larry Wall somewhat famously commented that a good programming
language should (and I paraphrase) make simple things simple and
complex things possible. My hesitation in going straight to a C API is
that it does not make simple things simple; and I'd like to be really
sure that there is no way of achieving that valuable goal before we
give up on it. However, there is no doubt that a C API is potentially
more powerful.
Is there another language that it would make sense to support in the form of "native plugins"? Assume we had some generic way to let people write such plugins in Python (we can then bikeshed about which language we should use). That would give them a much higher-level language, while also making a "better" API possible.
Note that I'm not suggesting supporting a python script running as a regular script -- that could easily be done by anybody making a shellscript implementation. It would be an actual API where the postgres tool would instantiate the python interpreter in-process and create an object there. This would allow things like keeping state across calls, and would also give access to the extensive library availability of the language (e.g. you could directly import an S3 compatible library to upload files etc).
Doing that for just pg_basebackup would probably be overkill, but it might be a generic choice that could extend to other things as well.
On Sun, Apr 12, 2020 at 10:09 AM Magnus Hagander <magnus@hagander.net> wrote: > There are certainly cases for it. It might not be they have to be the same connection, but still be the same session, meaning before the first time you perform some step of authentication, get a token, and then use that for all the files. You'd need somewhere to maintain that state, even if it doesn't happen to be a socket. But there are definitely plenty of cases where keeping an open socket can be a huge performance gain -- especially when it comes to not re-negotiating encryption etc. Hmm, OK. > For compression and encryption, it could perhaps be as simple as "the command has to be pipe on both input and output" and basically send the response back to pg_basebackup. > > But that won't help if the target is to relocate things... Right. And, also, it forces things to be sequential in a way I'm not too happy about. Like, if we have some kind of parallel backup, which I hope we will, then you can imagine (among other possibilities) getting files for each tablespace concurrently, and piping them through the output command concurrently. But if we emit the result in a tarfile, then it has to be sequential; there's just no other choice. I think we should try to come up with something that can work in a multi-threaded environment. > That is one way to go for it -- and in a case like that, I'd suggest the shellscript interface would be an implementation of the other API. A number of times through the years I've bounced ideas around for what to do with archive_command with different people (never quite to the level of "it's time to write a patch"), and it's mostly come down to some sort of shlib api where in turn we'd ship a backwards compatible implementation that would behave like archive_command. I'd envision something similar here. I agree. Let's imagine that there are a conceptually unlimited number of "targets" and "filters". 
Targets and filters accept data via the same API, but a target is expected to dispose of the data, whereas a filter is expected to pass it, via that same API, to a subsequent filter or target. So filters could include things like "gzip", "lz4", and "encrypt-with-rot13", whereas targets would include things like "file" (the thing we have today - write my data into some local files!), "shell" (which writes my data to a shell command, as originally proposed), and maybe eventually things like "netbackup" and "s3". Ideally this will all eventually be via a loadable module interface so that third-party filters and targets can be fully supported, but perhaps we could consider that an optional feature for v1. Note that there is quite a bit of work to do here just to reorganize the code. I would expect that we would want to provide a flexible way for a target or filter to be passed options from the pg_basebackup command line. So one might for example write this: pg_basebackup --filter='lz4 -9' --filter='encrypt-with-rot13 rotations=2' --target='shell ssh rhaas@depository pgfile create-exclusive - %f.lz4' The idea is that the first word of the filter or target identifies which one should be used, and the rest is just options text in whatever form the provider cares to accept them; but with some %<character> substitutions allowed, for things like the file name. (The aforementioned escaping problems for things like filenames with spaces in them still need to be sorted out, but this is just a sketch, so while I think it's quite solvable, I am going to refrain from proposing a precise solution here.) As to the underlying C API behind this, I propose approximately the following set of methods: 1. Begin a session. Returns a pointer to a session handle. Gets the options provided on the command line. 
In the case of a filter, also gets a pointer to the session handle for the next filter, or for the target (which means we set up the final target first, and then stack the filters on top of it). 2. Begin a file. Gets a session handle and a file name. Returns a pointer to a file handle. 3. Write data to a file. Gets a file handle, a byte count, and some bytes. 4. End a file. Gets a file handle. 5. End a session. Gets a session handle. If we get parallelism at some point, then there could be multiple files in progress at the same time. Maybe some targets, or even filters, won't be able to handle that, so we could have a flag someplace indicating that a particular target or filter isn't parallelism-capable. As an example, writing output to a bunch of files in a directory is fine to do in parallel, but if you want the entire backup in one giant tar file, you need each file sequentially. > Is there another language that it would make sense to support in the form of "native plugins". Assume we had some genericway to say let people write such plugins in python (we can then bikeshed about which language we should use). Thatwould give them a much higher level language, while also making it possible for a "better" API. The idea of using LUA has been floated before, and I imagine that an interface like the above could also be made to have language bindings for the scripting language of your choice - e.g. Python. However, I think we should start by trying to square away the C interface and then anybody who feels motivated can try to put language bindings on top of it. I tend to feel that's a bit of a fringe feature myself, since realistically shell commands are about as much as (and occasionally more than) typical users can manage. However, it would not surprise me very much if there are power users out there for whom C is too much but Python or LUA or something is just right, and if somebody builds something nifty that caters to that audience, I think that's great. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
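[Editorial note: the session/file API sketched in the message above (begin session, begin file, write, end file, end session, with filters stacked on a target) might look roughly like this. All class and method names are invented for illustration, and simple byte rotation stands in for a real transformation such as compression or encryption.]

```python
class FileTarget:
    """A 'target' disposes of the data -- here it just collects bytes in memory."""
    def begin_session(self, options):
        self.files = {}
    def begin_file(self, name):
        self.files[name] = bytearray()
        return name                      # the "file handle"
    def write(self, handle, data):
        self.files[handle].extend(data)
    def end_file(self, handle):
        pass
    def end_session(self):
        pass

class RotateFilter:
    """A 'filter' transforms the data and forwards it via the same API."""
    def __init__(self, next_stage):
        self.next = next_stage           # a subsequent filter, or the target
    def begin_session(self, options):
        self.shift = 13 * int(options.get("rotations", 1))
    def begin_file(self, name):
        return self.next.begin_file(name)
    def write(self, handle, data):
        rotated = bytes((b + self.shift) % 256 for b in data)
        self.next.write(handle, rotated)
    def end_file(self, handle):
        self.next.end_file(handle)
    def end_session(self):
        self.next.end_session()

# Set up the final target first, then stack the filter on top of it,
# mirroring the session-creation order described in the message.
target = FileTarget()
target.begin_session({})
pipeline = RotateFilter(target)
pipeline.begin_session({"rotations": "2"})
handle = pipeline.begin_file("base.tar")
pipeline.write(handle, b"backup bytes")
pipeline.end_file(handle)
pipeline.end_session()
```

Since every stage speaks the same five-method protocol, arbitrarily long filter chains compose naturally, and a parallelism-capable target is simply one that tolerates multiple files being in progress at once.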
Hi, On 2020-04-11 16:22:09 -0400, Robert Haas wrote: > On Fri, Apr 10, 2020 at 3:38 PM Andres Freund <andres@anarazel.de> wrote: > > Wouldn't there be state like a S3/ssh/https/... connection? And perhaps > > a 'backup_id' in the backup metadata DB that'd one would want to update > > at the end? > > Good question. I don't know that there would be but, uh, maybe? It's > not obvious to me why all of that would need to be done using the same > connection, but if it is, the idea I proposed isn't going to work very > nicely. Well, it depends on what you want to support. If you're only interested in supporting tarball mode ([1]), *maybe* you can get away without longer lived sessions (but I'm doubtful). But if you're interested in also supporting archiving plain files, then the cost of establishing sessions, and the latency penalty of having to wait for command completion would imo be prohibitive. A lot of solutions for storing backups can achieve pretty decent throughput, but have very significant latency. That's of course in addition to network latency itself. [1] I don't think we should restrict it that way. Would make it much more complicated to support incremental backup, pg_rewind, deduplication, etc. > More generally, can you think of any ideas for how to structure an API > here that are easier to use than "write some C code"? Or do you think > we should tell people to write some C code if they want to > compress/encrypt/relocate their backup in some non-standard way? > For the record, I'm not against eventually having more than one way to > do this, maybe a shell-script interface for simpler things and some > kind of API for more complex needs (e.g. NetBackup integration, > perhaps). And I did wonder if there was some other way we could do > this. I'm doubtful that an API based on string replacement is the way to go. It's hard for me to see how that's not either going to substantially restrict the way the "tasks" are done, or yield a very complicated interface. 
I wonder whether the best approach here could be that pg_basebackup (and perhaps other tools) opens pipes to/from a subcommand and over the pipe it communicates with the subtask using a textual ([2]) description of tasks. Like:

    backup mode=files base_directory=/path/to/data/directory
    backup_file name=base/14037/16396.14 size=1073741824
    backup_file name=pg_wal/XXXX size=16777216

or

    backup mode=tar base_directory /path/to/data/
    backup_tar name=dir.tar size=983498875687487

The obvious problem with that proposal is that we don't want to unnecessarily store the incoming data on the system pg_basebackup is running on, just for the subcommand to get access to them. More on that in a second. A huge advantage of a scheme like this would be that it wouldn't have to be specific to pg_basebackup. It could just as well work directly on the server, avoiding an unnecessary loop through the network. Which e.g. could integrate with filesystem snapshots etc. Without needing to build the 'archive target' once with server libraries, and once with client libraries. One reason I think something like this could be advantageous over a C API is that it's quite feasible to implement it from a number of different languages, including shell if really desired, without needing to provide a C API via a FFI. It'd also make it quite natural to split out compression from pg_basebackup's main process, which IME currently makes it not really feasible to use pg_basebackup's compression. There's various ways we could address the issue for how the subcommand can access the file data. The most flexible probably would be to rely on exchanging file descriptors between basebackup and the subprocess (these days all supported platforms have that, I think). Alternatively we could invoke the subcommand before really starting the backup, and ask how many files it'd like to receive in parallel, and restart the subcommand with that number of file descriptors open. 
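[Editorial note: a command stream in the key=value form shown above is trivial to parse. A minimal sketch follows; the verb-plus-pairs convention is taken from the example lines, and everything else is invented for illustration.]

```python
def parse_command(line):
    """Split a line-delimited protocol command into a verb and its key=value args."""
    verb, *pairs = line.split()
    return verb, dict(pair.split("=", 1) for pair in pairs)

# Each protocol line maps to one parsed command, easy to handle
# from shell, Python, or any other language driving the pipe.
verb, args = parse_command("backup_file name=base/14037/16396.14 size=1073741824")
```

The simplicity is the point Andres makes: a subcommand in any language can implement this without an FFI, which a C plugin API cannot offer.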
If we relied on FDs, here's an example of how a trace between pg_basebackup (BB) and a backup target command (TC) could look:

    BB: required_capabilities fd_send files
    BB: provided_capabilities fd_send file_size files tar
    TC: required_capabilities fd_send files file_size
    BB: backup mode=files base_directory=/path/to/data/directory
    BB: backup_file method=fd name=base/14037/16396.1 size=1073741824
    BB: backup_file method=fd name=base/14037/16396.2 size=1073741824
    BB: backup_file method=fd name=base/14037/16396.3 size=1073741824
    TC: fd name=base/14037/16396.1 (contains TC fd 4)
    TC: fd name=base/14037/16396.2 (contains TC fd 5)
    BB: backup_file method=fd name=base/14037/16396.4 size=1073741824
    TC: fd name=base/14037/16396.3 (contains TC fd 6)
    BB: backup_file method=fd name=base/14037/16396.5 size=1073741824
    TC: fd name=base/14037/16396.4 (contains TC fd 4)
    TC: fd name=base/14037/16396.5 (contains TC fd 5)
    BB: done
    TC: done

    backup_file type=fd mode=fd base/14037/16396.4 1073741824

or

    backup_features tar
    backup_mode tar
    base_directory /path/to/data/
    backup_tar dir.tar 983498875687487

[2] yes, I already hear json. A line-delimited format would have some advantages though. Greetings, Andres Freund
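[Editorial note: the file-descriptor exchange described above can be sketched with SCM_RIGHTS over a Unix socket pair. This assumes Python 3.9+ (for socket.send_fds/recv_fds) and a Unix-like OS, and it collapses both processes into one script purely for illustration; the message text mimics the trace above.]

```python
import os, socket, tempfile

# In reality BB and TC would be separate processes sharing this socket pair.
bb_sock, tc_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# pg_basebackup side (BB): open the output file and hand the descriptor over
path = tempfile.NamedTemporaryFile(delete=False).name
fd = os.open(path, os.O_WRONLY)
socket.send_fds(bb_sock, [b"backup_file method=fd name=base/14037/16396.1"], [fd])
os.close(fd)  # the receiving side keeps its own duplicate of the descriptor

# target-command side (TC): receive the message plus a usable copy of the fd
msg, fds, _flags, _addr = socket.recv_fds(tc_sock, 1024, 1)
os.write(fds[0], b"file contents")  # writes land in the file BB opened
os.close(fds[0])
```

This is what lets the subcommand consume file data without pg_basebackup first spooling it to local disk: only the descriptor crosses the pipe, not the bytes.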
Hi, On 2020-04-12 11:04:46 -0400, Robert Haas wrote: > I would expect that we would want to provide a flexible way for a > target or filter to be passed options from the pg_basebackup command > line. So one might for example write this: > > pg_basebackup --filter='lz4 -9' --filter='encrypt-with-rot13 > rotations=2' --target='shell ssh rhaas@depository pgfile > create-exclusive - %f.lz4' My gut feeling is that this would end up with too complicated pg_basebackup invocations, resulting in the complexity getting reimplemented in the target command. A lot of users don't want to figure out what compression, encryption, ... command makes sense for which archiving target. And e.g. an s3 target might want to integrate with an AWS HSM etc, making it unattractive to do the encryption outside the target. Greetings, Andres Freund
On 4/12/20 11:04 AM, Robert Haas wrote: > On Sun, Apr 12, 2020 at 10:09 AM Magnus Hagander <magnus@hagander.net> wrote: >> There are certainly cases for it. It might not be they have to be the same connection, but still be the same session, meaning before the first time you perform some step of authentication, get a token, and then use that for all the files. You'd need somewhere to maintain that state, even if it doesn't happen to be a socket. But there are definitely plenty of cases where keeping an open socket can be a huge performance gain -- especially when it comes to not re-negotiating encryption etc. > > Hmm, OK. When we implemented connection-sharing for S3 in pgBackRest it was a significant performance boost, even for large files since they must be uploaded in parts. The same goes for files transferred over SSH, though in this case the overhead is per-file and can be mitigated with control master. We originally (late 2013) implemented everything with command-line tools during the POC phase. The idea was to get something viable quickly and then improve as needed. At the time our config file had entries something like this:

    [global:command]
    compress=/usr/bin/gzip --stdout %file%
    decompress=/usr/bin/gzip -dc %file%
    checksum=/usr/bin/shasum %file% | awk '{print $1}'
    manifest=/opt/local/bin/gfind %path% -printf '%P\t%y\t%u\t%g\t%m\t%T@\t%i\t%s\t%l\n'
    psql=/Library/PostgreSQL/9.3/bin/psql -X %option%

    [db]
    psql_options=--cluster=9.3/main

    [db:command:option]
    psql=--port=6001

These appear to be for MacOS, but Linux would be similar. This *did* work, but it was really hard to debug when things went wrong, the per-file cost was high, and the slight differences between the command-line tools on different platforms were maddening. For example, lots of versions of 'find' would error if a file disappeared while building the manifest, which is a pretty common occurrence in PostgreSQL (most newer distros had an option to fix this). 
I know that doesn't apply here, but it's an example. Also, debugging was complicated with so many processes, with any degree of parallelism the process list got pretty crazy, fsync was not happening, etc. It's been a long time but I don't have any good memories of the solution that used all command-line tools. Once we had a POC that solved our basic problem, i.e. backing up about 50TB of data reasonably efficiently, we immediately started working on a version that did not rely on command-line tools and we never looked back. Currently the only command-line tool we use is ssh. I'm sure it would be possible to create a solution that worked better than ours, but I'm pretty certain it would still be hard for users to make it work correctly and to prove it worked correctly. >> For compression and encryption, it could perhaps be as simple as "the command has to be pipe on both input and output" and basically send the response back to pg_basebackup. >> >> But that won't help if the target is to relocate things... > > Right. And, also, it forces things to be sequential in a way I'm not > too happy about. Like, if we have some kind of parallel backup, which > I hope we will, then you can imagine (among other possibilities) > getting files for each tablespace concurrently, and piping them > through the output command concurrently. But if we emit the result in > a tarfile, then it has to be sequential; there's just no other choice. > I think we should try to come up with something that can work in a > multi-threaded environment. > >> That is one way to go for it -- and in a case like that, I'd suggest the shellscript interface would be an implementation of the other API. A number of times through the years I've bounced ideas around for what to do with archive_command with different people (never quite to the level of "it's time to write a patch"), and it's mostly come down to some sort of shlib api where in turn we'd ship a backwards compatible implementation that would behave like archive_command. I'd envision something similar here. > > I agree. Let's imagine that there are a conceptually unlimited number > of "targets" and "filters". Targets and filters accept data via the > same API, but a target is expected to dispose of the data, whereas a > filter is expected to pass it, via that same API, to a subsequent > filter or target. So filters could include things like "gzip", "lz4", > and "encrypt-with-rot13", whereas targets would include things like > "file" (the thing we have today - write my data into some local > files!), "shell" (which writes my data to a shell command, as > originally proposed), and maybe eventually things like "netbackup" and > "s3". Ideally this will all eventually be via a loadable module > interface so that third-party filters and targets can be fully > supported, but perhaps we could consider that an optional feature for > v1. Note that there is quite a bit of work to do here just to > reorganize the code. > > I would expect that we would want to provide a flexible way for a > target or filter to be passed options from the pg_basebackup command > line. So one might for example write this: > > pg_basebackup --filter='lz4 -9' --filter='encrypt-with-rot13 > rotations=2' --target='shell ssh rhaas@depository pgfile > create-exclusive - %f.lz4' > > The idea is that the first word of the filter or target identifies > which one should be used, and the rest is just options text in > whatever form the provider cares to accept them; but with some > %<character> substitutions allowed, for things like the file name.
> (The aforementioned escaping problems for things like filenames with > spaces in them still need to be sorted out, but this is just a sketch, > so while I think it's quite solvable, I am going to refrain from > proposing a precise solution here.) This is basically the solution we have landed on after many iterations. We implement two types of filters, In and InOut. The In filters process data and produce a result, e.g. SHA1, size, page checksum, etc. The InOut filters modify data, e.g. compression, encryption. Yeah, the names could probably be better... I have attached our filter interface (filter.intern.h) as a concrete example of how this works. We call 'targets' storage and have a standard interface for creating storage drivers. I have also attached our storage interface (storage.intern.h) as a concrete example of how this works. Note that for just performing backup this is overkill, but once you consider verify this is pretty much the minimum storage interface needed, according to our experience. Regards, -- -David david@pgmasters.net
Attachments
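David's In/InOut filter split can be sketched as follows. These are hypothetical class names illustrating the concept, not pgBackRest's actual filter.intern.h API:

```python
import gzip
import hashlib

# An "In" filter observes the stream and accumulates a result (checksum, size);
# an "InOut" filter transforms the data. Both see every chunk as it passes.

class Sha1Filter:
    """In filter: passes data through unchanged, produces a digest at the end."""
    def __init__(self):
        self._hash = hashlib.sha1()
    def process(self, chunk):
        self._hash.update(chunk)
        return chunk
    def result(self):
        return self._hash.hexdigest()

class GzipFilter:
    """InOut filter: modifies the data (here, each chunk becomes a gzip member)."""
    def process(self, chunk):
        return gzip.compress(chunk)

def run_chain(chunks, filters):
    # Push each chunk through every filter in order, as a backup copy would.
    out = []
    for chunk in chunks:
        for f in filters:
            chunk = f.process(chunk)
        out.append(chunk)
    return b"".join(out)

sha1 = Sha1Filter()
compressed = run_chain([b"block one ", b"block two"], [sha1, GzipFilter()])
```

Note the ordering matters: the SHA1 filter sits before compression so the checksum covers the original bytes, which is what you want for later verification.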
On 4/12/20 3:17 PM, Andres Freund wrote: > >> More generally, can you think of any ideas for how to structure an API >> here that are easier to use than "write some C code"? Or do you think >> we should tell people to write some C code if they want to >> compress/encrypt/relocate their backup in some non-standard way? > >> For the record, I'm not against eventually having more than one way to >> do this, maybe a shell-script interface for simpler things and some >> kind of API for more complex needs (e.g. NetBackup integration, >> perhaps). And I did wonder if there was some other way we could do >> this. > > I'm doubtful that an API based on string replacement is the way to > go. It's hard for me to see how that's not either going to substantially > restrict the way the "tasks" are done, or yield a very complicated > interface. > > I wonder whether the best approach here could be that pg_basebackup (and > perhaps other tools) opens pipes to/from a subcommand and over the pipe > it communicates with the subtask using a textual ([2]) description of > tasks. Like: > > backup mode=files base_directory=/path/to/data/directory > backup_file name=base/14037/16396.14 size=1073741824 > backup_file name=pg_wal/XXXX size=16777216 > or > backup mode=tar > base_directory /path/to/data/ > backup_tar name=dir.tar size=983498875687487 This is pretty much what pgBackRest does. We call them "local" processes and they do most of the work during backup/restore/archive-get/archive-push. > The obvious problem with that proposal is that we don't want to > unnecessarily store the incoming data on the system pg_basebackup is > running on, just for the subcommand to get access to them. More on that > in a second. We also implement "remote" processes so the local processes can get data that doesn't happen to be local, i.e. on a remote PostgreSQL cluster. > A huge advantage of a scheme like this would be that it wouldn't have to > be specific to pg_basebackup. 
be specific to pg_basebackup. It could just as well work directly on the > server, avoiding an unnecessary loop through the network. Which > e.g. could integrate with filesystem snapshots etc. Without needing to > build the 'archive target' once with server libraries, and once with > client libraries. Yes -- needing to store the data locally or stream it through one main process is a major bottleneck. Working on the server is key because it allows you to compress before transferring the data. With parallel processing it is trivial to flood a network. We have a recent example from a community user of backing up 25TB in 4 hours. Compression on the server makes this possible (and a fast network, in this case). For security reasons, it's also nice to be able to encrypt data before it leaves the database server. Calculating checksums/size at the source is also ideal. > One reason I think something like this could be advantageous over a C > API is that it's quite feasible to implement it from a number of > different languages, including shell if really desired, without needing > to provide a C API via a FFI. We migrated from Perl to C and kept our local/remote protocol the same, which really helped. So, we had times when the C code was using a Perl local/remote and vice versa. The idea is certainly workable in our experience. > It'd also make it quite natural to split out compression from > pg_basebackup's main process, which IME currently makes it not really > feasible to use pg_basebackup's compression. This is a major advantage. > There's various ways we could address the issue for how the subcommand > can access the file data. The most flexible probably would be to rely on > exchanging file descriptors between basebackup and the subprocess (these > days all supported platforms have that, I think).
Alternatively we > could invoke the subcommand before really starting the backup, and ask > how many files it'd like to receive in parallel, and restart the > subcommand with that number of file descriptors open. We don't exchange FDs. Each local is responsible for getting the data from PostgreSQL or the repo based on knowing the data source and a path. For pg_basebackup, however, I'd imagine each local would want a replication connection with the ability to request specific files that were passed to it by the main process. > [2] yes, I already hear json. A line delimited format would have some > advantages though. We use JSON, but each protocol request/response is linefeed-delimited. So for example here's what it looks like when the main process requests a local process to back up a specific file: {"cmd":"backupFile","param":["base/32768/33001",true,65536,null,true,0,"pg_data/base/32768/33001",false,0,3,"20200412-213313F",false,null]} And the local responds with: {"out":[1,65536,65536,"6bf316f11d28c28914ea9be92c00de9bea6d9a6b",{"align":true,"error":[0,[3,5],7],"valid":false}]} We use arrays for parameters but of course these could be done with objects for more readability. We are considering a move to HTTP since lots of services (e.g. S3, GCS, Azure, etc.) require it (so we implement it) and we're not sure it makes sense to maintain our own protocol format. That said, we'd still prefer to use JSON for our payloads (like GCS) rather than XML (as S3 does). Regards, -- -David david@pgmasters.net
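The linefeed-delimited JSON framing described above is simple to sketch. Field names here loosely mirror the example messages; this is an illustration of the framing, not pgBackRest's actual protocol code:

```python
import io
import json

# One JSON document per line: the reader frames messages with readline(),
# so neither side of the pipe needs a streaming JSON parser.

def send(stream, msg):
    stream.write(json.dumps(msg, separators=(",", ":")) + "\n")

def recv(stream):
    line = stream.readline()
    return json.loads(line) if line else None  # None signals EOF

pipe = io.StringIO()  # stands in for the pipe between main and local process
send(pipe, {"cmd": "backupFile", "param": ["base/32768/33001", 65536]})
send(pipe, {"out": [1, 65536, "6bf316f11d28c28914ea9be92c00de9bea6d9a6b"]})
pipe.seek(0)

request = recv(pipe)
response = recv(pipe)
```

The embedded-newline problem that makes line framing fragile for arbitrary text goes away here, because JSON string escaping guarantees a serialized document never contains a raw linefeed.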
Hi, On 2020-04-12 17:57:05 -0400, David Steele wrote: > On 4/12/20 3:17 PM, Andres Freund wrote: > > [proposal outline] > > This is pretty much what pgBackRest does. We call them "local" processes and > they do most of the work during backup/restore/archive-get/archive-push. Hah. I swear, I didn't look. > > The obvious problem with that proposal is that we don't want to > > unnecessarily store the incoming data on the system pg_basebackup is > > running on, just for the subcommand to get access to them. More on that > > in a second. > > We also implement "remote" processes so the local processes can get data > that doesn't happen to be local, i.e. on a remote PostgreSQL cluster. What is the interface between those? I.e. do the files have to be spooled as a whole locally? > > There's various ways we could address the issue for how the subcommand > > can access the file data. The most flexible probably would be to rely on > > exchanging file descriptors between basebackup and the subprocess (these > > days all supported platforms have that, I think). Alternatively we > > could invoke the subcommand before really starting the backup, and ask > > how many files it'd like to receive in parallel, and restart the > > subcommand with that number of file descriptors open. > > We don't exchange FDs. Each local is responsible for getting the data from > PostgreSQL or the repo based on knowing the data source and a path. For > pg_basebackup, however, I'd imagine each local would want a replication > connection with the ability to request specific files that were passed to it > by the main process. I don't like this much. It'll push more complexity into each of the "targets" and we can't easily share that complexity. And also, needing to request individual files will add a lot of back/forth, and thus latency issues. The server would always have to pre-send a list of files, we'd have to deal with those files vanishing, etc. > > [2] yes, I already hear json.
A line delimited format would have some > > advantages though. > > We use JSON, but each protocol request/response is linefeed-delimited. So > for example here's what it looks like when the main process requests a local > process to back up a specific file: > > {"cmd":"backupFile","param":["base/32768/33001",true,65536,null,true,0,"pg_data/base/32768/33001",false,0,3,"20200412-213313F",false,null]} > > And the local responds with: > > {"out":[1,65536,65536,"6bf316f11d28c28914ea9be92c00de9bea6d9a6b",{"align":true,"error":[0,[3,5],7],"valid":false}]} As long as it's line delimited, I don't really care :) > We are considering a move to HTTP since lots of services (e.g. S3, GCS, > Azure, etc.) require it (so we implement it) and we're not sure it makes > sense to maintain our own protocol format. That said, we'd still prefer to > use JSON for our payloads (like GCS) rather than XML (as S3 does). I'm not quite sure what you mean here? You mean actual requests for each of what currently are lines? If so, that sounds *terrible*. Greetings, Andres Freund
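The file-descriptor exchange Andres keeps coming back to relies on SCM_RIGHTS ancillary data over a Unix-domain socket. A minimal sketch, using Python 3.9's socket.send_fds/recv_fds wrappers to stand in for what would be C sendmsg/recvmsg code in pg_basebackup:

```python
import os
import socket
import tempfile

# The backup process opens the file and passes the open descriptor (plus the
# file name as the normal message payload) to the target subprocess. The
# kernel duplicates the descriptor at send time, so the receiver reads the
# same open file without ever touching a path.

parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

with tempfile.TemporaryFile() as f:
    f.write(b"tar stream contents")
    f.seek(0)
    socket.send_fds(parent_sock, [b"base.tar"], [f.fileno()])

# The in-flight descriptor stays valid even though the sender closed its copy.
name, fds, _, _ = socket.recv_fds(child_sock, 1024, maxfds=1)
with os.fdopen(fds[0], "rb") as received:
    data = received.read()
```

Note the received descriptor shares the file offset with the sender's, which is why the sketch seeks to 0 before sending.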
On 4/12/20 6:37 PM, Andres Freund wrote: > Hi, > > On 2020-04-12 17:57:05 -0400, David Steele wrote: >> On 4/12/20 3:17 PM, Andres Freund wrote: >>> [proposal outline] >> >> This is pretty much what pgBackRest does. We call them "local" processes and >> they do most of the work during backup/restore/archive-get/archive-push. > > Hah. I swear, I didn't look. I believe you. If you spend enough time thinking about this (and we've spent a lot) then I think this is where you arrive. >>> The obvious problem with that proposal is that we don't want to >>> unnecessarily store the incoming data on the system pg_basebackup is >>> running on, just for the subcommand to get access to them. More on that >>> in a second. >> >> We also implement "remote" processes so the local processes can get data >> that doesn't happen to be local, i.e. on a remote PostgreSQL cluster. > > What is the interface between those? I.e. do the files have to be > spooled as a whole locally? Currently we use SSH to talk to a remote, but we are planning on using our own TLS servers in the future. We don't spool anything -- the file is streamed from the PostgreSQL server (via remote protocol if needed) to the repo (which could also be remote, e.g. S3) without spooling to disk. We have buffers, of course, which are configurable with the buffer-size option. >>> There's various ways we could address the issue for how the subcommand >>> can access the file data. The most flexible probably would be to rely on >>> exchanging file descriptors between basebackup and the subprocess (these >>> days all supported platforms have that, I think). Alternatively we >>> could invoke the subcommand before really starting the backup, and ask >>> how many files it'd like to receive in parallel, and restart the >>> subcommand with that number of file descriptors open. >> >> We don't exchange FDs. Each local is responsible for getting the data from >> PostgreSQL or the repo based on knowing the data source and a path.
For >> pg_basebackup, however, I'd imagine each local would want a replication >> connection with the ability to request specific files that were passed to it >> by the main process. > > I don't like this much. It'll push more complexity into each of the > "targets" and we can't easily share that complexity. And also, needing > to request individual files will add a lot of back/forth, and thus > latency issues. The server would always have to pre-send a list of > files, we'd have to deal with those files vanishing, etc. Sure, unless we had a standard interface to "get a file from the PostgreSQL cluster", which is what pgBackRest has via the storage interface. Attached is our implementation for "backupFile". I think it's pretty concise considering what it does. Most of it is dedicated to checksum deltas and backup resume. The straight copy with filters starts at line 189. >>> [2] yes, I already hear json. A line delimited format would have some >>> advantages though. >> >> We use JSON, but each protocol request/response is linefeed-delimited. So >> for example here's what it looks like when the main process requests a local >> process to back up a specific file: >> >> {"cmd":"backupFile","param":["base/32768/33001",true,65536,null,true,0,"pg_data/base/32768/33001",false,0,3,"20200412-213313F",false,null]} >> >> And the local responds with: >> >> {"out":[1,65536,65536,"6bf316f11d28c28914ea9be92c00de9bea6d9a6b",{"align":true,"error":[0,[3,5],7],"valid":false}]} > > As long as it's line delimited, I don't really care :) Agreed. >> We are considering a move to HTTP since lots of services (e.g. S3, GCS, >> Azure, etc.) require it (so we implement it) and we're not sure it makes >> sense to maintain our own protocol format. That said, we'd still prefer to >> use JSON for our payloads (like GCS) rather than XML (as S3 does). > > I'm not quite sure what you mean here? You mean actual requests for each > of what currently are lines? If so, that sounds *terrible*.
I know it sounds like a lot, but in practice the local (currently) only performs four operations: backup file, restore file, push file to archive, get file from archive. In that context a little protocol overhead won't be noticed so if it means removing redundant code I'm all for it. That said, we have not done this yet -- it's just under consideration. Regards, -- -David david@pgmasters.net
Attachments
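With only four operations, the local process's protocol handling can stay a small dispatch table. A sketch with hypothetical handlers, loosely mirroring the "cmd"/"param" request shape shown earlier in the thread:

```python
# Requests carry a "cmd" name and positional "param" values; the local process
# looks the command up in a table and returns an "out" (or "err") document.

def backup_file(params):
    return {"out": ["copied", params[0]]}

def restore_file(params):
    return {"out": ["restored", params[0]]}

HANDLERS = {
    "backupFile": backup_file,
    "restoreFile": restore_file,
    # "archivePush" and "archiveGet" would register the same way.
}

def dispatch(request):
    handler = HANDLERS.get(request["cmd"])
    if handler is None:
        return {"err": "unknown command: " + request["cmd"]}
    return handler(request.get("param", []))

reply = dispatch({"cmd": "backupFile", "param": ["base/32768/33001"]})
```

Keeping the command surface this narrow is what makes the per-request protocol overhead negligible, as David argues above.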
On Sun, Apr 12, 2020 at 3:17 PM Andres Freund <andres@anarazel.de> wrote: > A huge advantage of a scheme like this would be that it wouldn't have to > be specific to pg_basebackup. It could just as well work directly on the > server, avoiding an unnecessary loop through the network. Which > e.g. could integrate with filesystem snapshots etc. Without needing to > build the 'archive target' once with server libraries, and once with > client libraries. That's quite appealing. One downside - IMHO significant - is that you have to have a separate process to do *anything*. If you want to add a filter that just logs everything it's asked to do, for example, you've gotta have a whole process for that, which likely adds a lot of overhead even if you can somehow avoid passing all the data through an extra set of pipes. The interface I proposed would allow you to inject very lightweight filters at very low cost. This design really doesn't. Note that you could build this on top of what I proposed, but not the other way around. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greetings, * David Steele (david@pgmasters.net) wrote: > On 4/12/20 6:37 PM, Andres Freund wrote: > >On 2020-04-12 17:57:05 -0400, David Steele wrote: > >>On 4/12/20 3:17 PM, Andres Freund wrote: > >>>There's various ways we could address the issue for how the subcommand > >>>can access the file data. The most flexible probably would be to rely on > >>>exchanging file descriptors between basebackup and the subprocess (these > >>>days all supported platforms have that, I think). Alternatively we > >>>could invoke the subcommand before really starting the backup, and ask > >>>how many files it'd like to receive in parallel, and restart the > >>>subcommand with that number of file descriptors open. > >> > >>We don't exchange FDs. Each local is responsible for getting the data from > >>PostgreSQL or the repo based on knowing the data source and a path. For > >>pg_basebackup, however, I'd imagine each local would want a replication > >>connection with the ability to request specific files that were passed to it > >>by the main process. > > > >I don't like this much. It'll push more complexity into each of the > >"targets" and we can't easily share that complexity. And also, needing > >to request individual files will add a lot of back/forth, and thus > >latency issues. The server would always have to pre-send a list of > >files, we'd have to deal with those files vanishing, etc. > > Sure, unless we had a standard interface to "get a file from the PostgreSQL > cluster", which is what pgBackRest has via the storage interface. There's a couple of other pieces here that I think bear mentioning. The first is that pgBackRest has an actual 'restore' command- and that works with the filters and works with the storage drivers, so what you're looking at when it comes to these interfaces isn't just "put a file" but it's also "get a file". That's actually quite important to have when you start thinking about these more complicated methods of doing backups. 
That then leads into the fact that, with a manifest, you can do things like excluding 0-byte files from going through any of this processing or from being stored (which costs actual money too, with certain cloud storage options..), or even for just storing *small* files, which we tend to have lots of in PG and which also end up costing more and you end up 'losing' money because you've got lots of 8K files around. We haven't fully optimized for it in pgBackRest, yet, but avoiding having lots of little files (again, because there's real $$ costs involved) is something we actively think about and consider and is made possible when you've got a 'restore' command. Having a manifest where a given file might actually be a reference to a *part* of a file (ie: pgbackrest_smallfiles, offset: 8192, length: 16384) could result in savings when using cloud storage. These are the kinds of things we're thinking about today. Maybe there's some way you could implement something like that using shell commands as an API, but it sure looks like it'd be pretty hard from here. Even just managing to get users to use the right shell commands for backup, and then the right ones for restore, seems awful daunting. I get that I'm probably going to get flak for playing up the 'worst case', but the reality is that far too many people don't fully test their restore processes and trying to figure out the right shell commands to pass into some 'restore' command, or even just to pull all of the data back down from $cloudstorage to perform a restore, when everything is down and your boss is breathing down your neck to get it all back online as fast as possible, isn't how I want this project to be remembered. David and I are constantly talking about how to make the restore process as smooth and as fast as possible, because that's where the rubber really meets the road- you've gotta make that part easy and fast because that's the high-pressure situation. 
Taking backups is rarely where the real pressure is at- sure, take it today, take it tomorrow, let it run for a few hours, it's all fine, but when you need something restored, you best make that as simple and as fast as absolutely possible because that's the time when your entire business is potentially going to be offline and waiting for you to get everything back up. Thanks, Stephen
Attachments
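Stephen's small-file idea above — a manifest entry that points at a slice of a shared bundle rather than a standalone object — can be sketched like this. The names (e.g. smallfiles.1) are hypothetical, echoing his pgbackrest_smallfiles example:

```python
import io

# Many small files are packed into one bundle object; the manifest records,
# per file, the bundle name plus offset/length. Restore can then seek straight
# to the slice instead of fetching thousands of tiny objects from cloud
# storage, each of which costs a request.

bundle = io.BytesIO()  # stands in for one stored bundle object
manifest = {}

def pack(name, data):
    manifest[name] = {"bundle": "smallfiles.1",
                      "offset": bundle.tell(),
                      "length": len(data)}
    bundle.write(data)

def unpack(name):
    entry = manifest[name]
    bundle.seek(entry["offset"])
    return bundle.read(entry["length"])

pack("base/1/112", b"\x00" * 16)
pack("base/1/113", b"small relation file")
```

This only works because restore consults the manifest rather than a directory listing, which is Stephen's point: a real 'restore' command is what makes such storage layouts possible at all.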
Hi, On 2020-04-12 20:02:50 -0400, Robert Haas wrote: > On Sun, Apr 12, 2020 at 3:17 PM Andres Freund <andres@anarazel.de> wrote: > > A huge advantage of a scheme like this would be that it wouldn't have to > > be specific to pg_basebackup. It could just as well work directly on the > > server, avoiding an unnecessary loop through the network. Which > > e.g. could integrate with filesystem snapshots etc. Without needing to > > build the 'archive target' once with server libraries, and once with > > client libraries. > > That's quite appealing. One downside - IMHO significant - is that you > have to have a separate process to do *anything*. If you want to add a > filter that just logs everything it's asked to do, for example, you've > gotta have a whole process for that, which likely adds a lot of > overhead even if you can somehow avoid passing all the data through an > extra set of pipes. The interface I proposed would allow you to inject > very lightweight filters at very low cost. This design really doesn't. Well, in what you described it'd still be all done inside pg_basebackup, or did I misunderstand? Once you fetched it from the server, I can't imagine the overhead of filtering it a bit differently would matter. But even if, the "target" could just reply with "skip" or such, instead of providing an fd. What kind of filtering are you thinking of where this is a problem? Besides just logging the filenames? I just can't imagine how that's a relevant overhead compared to having to do things like 'shell ssh rhaas@depository pgfile create-exclusive - %f.lz4' I really think we want the option to eventually do this server-side. And I don't quite see it as viable to go for an API that allows to specify shell fragments that are going to be executed server side. > Note that you could build this on top of what I proposed, but not the > other way around. Why should it not be possible the other way round? Greetings, Andres Freund
Greetings, Answering both in one since they're largely the same. * Bruce Momjian (bruce@momjian.us) wrote: > On Fri, Apr 10, 2020 at 10:54:10AM -0400, Stephen Frost wrote: > > * Robert Haas (robertmhaas@gmail.com) wrote: > > > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote: > > > > Good point, but if there are multiple APIs, it makes shell script > > > > flexibility even more useful. > > > > > > This is really the key point for me. There are so many existing tools > > > that store a file someplace that we really can't ever hope to support > > > them all in core, or even to have well-written extensions that support > > > them all available on PGXN or wherever. We need to integrate with the > > > tools that other people have created, not try to reinvent them all in > > > PostgreSQL. > > > > So, this goes to what I was just mentioning to Bruce independently- you > > could have made the same argument about FDWs, but it just doesn't > > actually hold any water. Sure, some of the FDWs aren't great, but > > there's certainly no shortage of them, and the ones that are > > particularly important (like postgres_fdw) are well written and in core. > > No, no one made that argument. It isn't clear how a shell script API > would map to relational database queries. The point is how well the > APIs match, and then if they are close, does it give us the flexibility > we need. You can't just look at flexibility without an API match. If what we're talking about is the file_fdw, which certainly isn't very complicated, it's not hard to see how you could use shell scripts for it. What happens is that it starts to get harder and require custom code when you want to do something more complex- which is very nearly what we're talking about here too. Sure, for a simple 'bzip2' filter, a shell script might be alright, but it's not going to cut it for the more complex use-cases that users, today, expect solutions to. 
* Robert Haas (robertmhaas@gmail.com) wrote: > On Fri, Apr 10, 2020 at 10:54 AM Stephen Frost <sfrost@snowman.net> wrote: > > So, this goes to what I was just mentioning to Bruce independently- you > > could have made the same argument about FDWs, but it just doesn't > > actually hold any water. Sure, some of the FDWs aren't great, but > > there's certainly no shortage of them, and the ones that are > > particularly important (like postgres_fdw) are well written and in core. > > That's a fairly different use case. In the case of the FDW interface: There's two different questions we're talking about here and I feel like they're being conflated. To try and clarify: - Could you implement FDWs with shell scripts, and custom programs? I'm pretty confident that the answer is yes, but the thrust of that argument is primarily to show that you *can* implement just about anything using a shell script "API", so just saying it's possible to do doesn't make it necessarily a good solution. The FDW system is complicated, and also good, because we made it so and because it's possible to do more sophisticated things with a C API, but it could have started out with shell scripts that just returned data in much the same way that COPY PROGRAM works today. What matters is that forward thinking to consider what you're going to want to do tomorrow, not just thinking about how you can solve for the simple cases today with a shell out to an existing command. - Does providing a C-library interface deter people from implementing solutions that use that interface? Perhaps it does, but it doesn't have nearly the dampening effect that is being portrayed here, and we can see that pretty clearly from the FDW situation. Sure, not all of those are good solutions, but lots and lots of archive command shell scripts are also pretty terrible, and there *are* a few good solutions out there, including the ones that we ourselves ship. 
At least when it comes to FDWs, there's an option there for us to ship a *good* answer ourselves for certain (and, in particular, the very very common) use-cases. > - We're only talking about writing a handful of tar files, and that's > in the context of a full-database backup, which is a much > heavier-weight operation than a query. This is true for -Ft, but not -Fp, and I don't think there's enough thought being put into this when it comes to parallelism and that you don't want to be limited to one process per tablespace. > - There is not really any state that needs to be maintained across calls. As mentioned elsewhere, this isn't really true. > > How does this solution give them a good way to do the right thing > > though? In a way that will work with large databases and complex > > requirements? The answer seems to be "well, everyone will have to write > > their own tool to do that" and that basically means that, at best, we're > > only providing half of a solution and expecting all of our users to > > provide the other half, and to always do it correctly and in a well > > written way. Acknowledging that most users aren't going to actually do > > that and instead they'll implement half measures that aren't reliable > > shouldn't be seen as an endorsement of this approach. > > I don't acknowledge that. I think it's possible to use tools like the > proposed option in a perfectly reliable way, and I've already given a > bunch of examples of how it could be done. Writing a file is not such > a complex operation that every bit of code that writes one reliably > has to be written by someone associated with the PostgreSQL project. I > strongly suspect that people who use a cloud provider's tools to > upload their backup files will be quite happy with the results, and if > they aren't, I hope they will blame the cloud provider's tool for > eating the data rather than this option for making it easy to give the > data to the thing that ate it. 
The examples you've given of how this could be done "right" involve someone writing custom code (or having code that's been written by the PG project) to be executed from this shell command interface, even just to perform a local backup. As for where the blame goes, I don't find that to be a particularly useful thing to argue about. In any of this, if we are ultimately saying "well, it's the user's fault, or the fault of the tools that the user chose to use with our interface" then it seems like we've lost. Maybe that's going too far and maybe we can't hold ourselves to that high of a standard but I like to think of this project, in particular, as being the one that's trying really hard to go as far in that direction as possible. To that end, if we contemplate adding support for some cloud vendor's storage, as an example, and discover that the command line tools for it suck or don't meet our expectations, I'd expect us to either refuse to support it, or to forgo using the command-line tools and instead implement support for talking to the cloud storage interface directly, if it works well. Thanks, Stephen
Attachments
On Sun, Apr 12, 2020 at 09:18:28PM -0400, Stephen Frost wrote: > * Bruce Momjian (bruce@momjian.us) wrote: > > On Fri, Apr 10, 2020 at 10:54:10AM -0400, Stephen Frost wrote: > > > * Robert Haas (robertmhaas@gmail.com) wrote: > > > > On Thu, Apr 9, 2020 at 6:44 PM Bruce Momjian <bruce@momjian.us> wrote: > > > > > Good point, but if there are multiple APIs, it makes shell script > > > > > flexibility even more useful. > > > > > > > > This is really the key point for me. There are so many existing tools > > > > that store a file someplace that we really can't ever hope to support > > > > them all in core, or even to have well-written extensions that support > > > > them all available on PGXN or wherever. We need to integrate with the > > > > tools that other people have created, not try to reinvent them all in > > > > PostgreSQL. > > > > > > So, this goes to what I was just mentioning to Bruce independently- you > > > could have made the same argument about FDWs, but it just doesn't > > > actually hold any water. Sure, some of the FDWs aren't great, but > > > there's certainly no shortage of them, and the ones that are > > > particularly important (like postgres_fdw) are well written and in core. > > > > No, no one made that argument. It isn't clear how a shell script API > > would map to relational database queries. The point is how well the > > APIs match, and then if they are close, does it give us the flexibility > > we need. You can't just look at flexibility without an API match. > > If what we're talking about is the file_fdw, which certainly isn't very > complicated, it's not hard to see how you could use shell scripts for > it. What happens is that it starts to get harder and require custom > code when you want to do something more complex- which is very nearly > what we're talking about here too. 
Sure, for a simple 'bzip2' filter, a > shell script might be alright, but it's not going to cut it for the more > complex use-cases that users, today, expect solutions to. Well, file_fdw is the simplest FDW, and we might have been able to do that in a shell script, but almost all the other FDWs couldn't, so we might as well choose a C API for FDWs and use the same one for file_fdw. It seems like basic engineering that you choose the API that meets most of your deployment requirements, and all of the required ones. > To that end, if we contemplate adding support for some cloud vendor's > storage, as an example, and discover that the command line tools for it > suck or don't meet our expectations, I'd expect us to either refuse to > support it, or to forgo using the command-line tools and instead > implement support for talking to the cloud storage interface directly, > if it works well. Do we choose a more inflexible API on a hypothetical risk? -- Bruce Momjian <bruce@momjian.us> https://momjian.us EnterpriseDB https://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Sun, Apr 12, 2020 at 9:18 PM Stephen Frost <sfrost@snowman.net> wrote: > There's two different questions we're talking about here and I feel like > they're being conflated. To try and clarify: > > - Could you implement FDWs with shell scripts, and custom programs? I'm > pretty confident that the answer is yes, but the thrust of that > argument is primarily to show that you *can* implement just about > anything using a shell script "API", so just saying it's possible to > do doesn't make it necessarily a good solution. The FDW system is > complicated, and also good, because we made it so and because it's > possible to do more sophisticated things with a C API, but it could > have started out with shell scripts that just returned data in much > the same way that COPY PROGRAM works today. What matters is the > forward thinking that considers what you're going to want to do tomorrow, > not just how you can solve the simple cases today > with a shell out to an existing command. > > - Does providing a C-library interface deter people from implementing > solutions that use that interface? Perhaps it does, but it doesn't > have nearly the dampening effect that is being portrayed here, and we > can see that pretty clearly from the FDW situation. Sure, not all of > those are good solutions, but lots and lots of archive command shell > scripts are also pretty terrible, and there *are* a few good solutions > out there, including the ones that we ourselves ship. At least when > it comes to FDWs, there's an option there for us to ship a *good* > answer ourselves for certain (and, in particular, the very very > common) use-cases. > > > - We're only talking about writing a handful of tar files, and that's > > in the context of a full-database backup, which is a much > > heavier-weight operation than a query. 
> > This is true for -Ft, but not -Fp, and I don't think there's enough > thought being put into this when it comes to parallelism and that you > don't want to be limited to one process per tablespace. > > > - There is not really any state that needs to be maintained across calls. > > As mentioned elsewhere, this isn't really true. These are fair points, and my thinking has been somewhat refined by this discussion, so let me try to clarify my (current) position a bit. I believe that there are two subtly different questions here. Question #1 is "Would it be useful to people to be able to pipe the tar files that they get from pg_basebackup into some other command rather than writing them to the filesystem, and should we give them the option to do so?" Question #2 is "Is piping the tar files that pg_basebackup would produce into some other program the best possible way of providing more flexibility about where backups get written?" I'm prepared to concede that the answer to question #2 is no. I had earlier assumed that establishing connections was pretty fast and that, even if not, there were solutions to that problem, like setting up an SSH tunnel in advance. Several people have said, well, no, establishing connections is a problem. As I acknowledged from the beginning, plain format backups are a problem. So I think a convincing argument has been made that a shell command won't meet everyone's needs, and a more complex API is required for some cases. But I still think the answer to question #1 is yes. I disagree entirely with any argument to the effect that because some users might do unsafe things with the option, we ought not to provide it. Practically speaking, it would work fine for many people even with no other changes, and if we add something like pgfile, which I'm willing to do, it would work for more people in more situations. It is a useful thing to have, full stop. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
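The "pgfile" helper mentioned above is only sketched on this thread; as a rough illustration of the idea (the function name and exact behavior here are guesses at the intent, not a real tool), a create-exclusive mode need be little more than open() with O_CREAT | O_EXCL plus a copy loop, so that a backup destination can never be silently overwritten:

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical core of "pgfile create-exclusive <path>": copy everything
 * from srcfd into a newly created file at path.  O_EXCL makes creation
 * fail if the file already exists.  Returns 0 on success, -1 on error.
 * (A real tool would also fsync, and unlink on partial failure.) */
static int
create_exclusive(const char *path, int srcfd)
{
    char        buf[8192];
    ssize_t     nread;
    int         fd;

    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    while ((nread = read(srcfd, buf, sizeof(buf))) > 0)
    {
        if (write(fd, buf, nread) != nread)
        {
            close(fd);
            return -1;
        }
    }
    close(fd);
    return (nread == 0) ? 0 : -1;
}
```

With something along these lines installed on the far side, a pipe target like 'ssh depository pgfile create-exclusive - %f.lz4' refuses to clobber an existing backup rather than quietly corrupting it.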
On Sun, Apr 12, 2020 at 8:27 PM Andres Freund <andres@anarazel.de> wrote: > > That's quite appealing. One downside - IMHO significant - is that you > > have to have a separate process to do *anything*. If you want to add a > > filter that just logs everything it's asked to do, for example, you've > > gotta have a whole process for that, which likely adds a lot of > > overhead even if you can somehow avoid passing all the data through an > > extra set of pipes. The interface I proposed would allow you to inject > > very lightweight filters at very low cost. This design really doesn't. > > Well, in what you described it'd still be all done inside pg_basebackup, > or did I misunderstand? Once you fetched it from the server, I can't > imagine the overhead of filtering it a bit differently would matter. > > But even if, the "target" could just reply with "skip" or such, instead > of providing an fd. > > What kind of filtering are you thinking of where this is a problem? > Besides just logging the filenames? I just can't imagine how that's a > relevant overhead compared to having to do things like > 'shell ssh rhaas@depository pgfile create-exclusive - %f.lz4' Anything you want to do in the same process. I mean, right now we have basically one target (filesystem) and one filter (compression). Neither of those things spawn a process. It seems logical to imagine that there might be other things that are similar in the future. It seems to me that there are definitely things where you will want to spawn a process; that's why I like having shell commands as one option. But I don't think we should require that you can't have a filter or a target unless you also spawn a process for it. > I really think we want the option to eventually do this server-side. And > I don't quite see it as viable to go for an API that allows to specify > shell fragments that are going to be executed server side. 
The server-side thing is a good point, but I think it adds quite a bit of complexity, too. I'm worried that this is ballooning to an unworkable amount of complexity - and not just code complexity, but bikeshedding complexity, too. Like, I started with a command-line option that could probably have been implemented in a few hundred lines of code. Now, we're up to something where you have to build custom processes that speak a novel protocol and work on both the client and the server side. That's at least several thousand lines of code, maybe over ten thousand if the sample binaries that use the new protocol are more than just simple demonstrations of how to code to the interface. More importantly, it means agreeing on the nature of this custom protocol, which seems like something where I could put in a ton of effort to create something and then have somebody complain because it's not JSON, or because it is JSON, or because the capability negotiation system isn't right, or whatever. I'm not exactly saying that we shouldn't do it; I think it has some appeal. But I'd sure like to find some way of getting started that doesn't involve having to do everything in one patch, and then getting told to change it all again - possibly with different people wanting contradictory things. > > Note that you could build this on top of what I proposed, but not the > > other way around. > > Why should it not be possible the other way round? Because a C function call API lets you decide to spawn a process, but if the framework inherently spawns a process, you can't decide not to do so in a particular case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
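For concreteness, the kind of C function-call API being contrasted here with a process-per-filter design might look roughly like the following. All names are hypothetical; this is a sketch of the shape of the idea, not a proposed PostgreSQL interface. The point is that an in-process filter just transforms bytes in a callback, while a filter that wants a subprocess is free to fork/exec inside its callbacks; the framework forces neither choice:

```c
#include <stddef.h>

/* Hypothetical filter API: each filter sees the backup stream as a
 * sequence of chunks and forwards (possibly transformed) data to the
 * next filter in the chain. */
typedef struct BackupFilter BackupFilter;
struct BackupFilter
{
    void        (*begin) (BackupFilter *self);
    void        (*chunk) (BackupFilter *self, const char *data, size_t nbytes);
    void        (*end) (BackupFilter *self);
    BackupFilter *next;         /* downstream filter or final sink */
    void       *state;          /* filter-private data */
};

/* Example of a very lightweight in-process filter: count bytes and pass
 * them through unchanged, e.g. for logging or progress reporting.  No
 * extra process, no pipe, no protocol. */
static void
count_chunk(BackupFilter *self, const char *data, size_t nbytes)
{
    *(size_t *) self->state += nbytes;
    if (self->next)
        self->next->chunk(self->next, data, nbytes);
}
```

A compression filter that prefers to spawn an external binary would simply do its fork/exec in begin() and shuttle data through the pipe in chunk(), so the "spawn a process" option remains available without being mandatory.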
Greetings, * Robert Haas (robertmhaas@gmail.com) wrote: > On Sun, Apr 12, 2020 at 8:27 PM Andres Freund <andres@anarazel.de> wrote: > > I really think we want the option to eventually do this server-side. And > > I don't quite see it as viable to go for an API that allows to specify > > shell fragments that are going to be executed server side. > > The server-side thing is a good point, but I think it adds quite a bit > of complexity, too. I'm worried that this is ballooning to an > unworkable amount of complexity - and not just code complexity, but > bikeshedding complexity, too. Like, I started with a command-line > option that could probably have been implemented in a few hundred > lines of code. Now, we're up to something where you have to build > custom processes that speak a novel protocol and work on both the > client and the server side. That's at least several thousand lines of > code, maybe over ten thousand if the sample binaries that use the new > protocol are more than just simple demonstrations of how to code to > the interface. More importantly, it means agreeing on the nature of > this custom protocol, which seems like something where I could put in > a ton of effort to create something and then have somebody complain > because it's not JSON, or because it is JSON, or because the > capability negotiation system isn't right, or whatever. I'm not > exactly saying that we shouldn't do it; I think it has some appeal. > But I'd sure like to find some way of getting started that doesn't > involve having to do everything in one patch, and then getting told to > change it all again - possibly with different people wanting > contradictory things. Doing things incrementally and not all in one patch absolutely makes a lot of sense and is a good idea. Wouldn't it make sense, given that we have some idea of what we want it to eventually look like, to make progress in that direction though? 
That is- I tend to agree with Andres that having this supported server-side eventually is what we should be thinking about as an end-goal (what is the point of pg_basebackup in all of this, after all, if the goal is to get a backup of PG from the PG server to s3..? why go through some other program or through the replication protocol?), and having the server exec out to run shell script fragments to make that happen looks like it would be really awkward and full of potential risks and issues, and I'd agree that it wouldn't be a good fit. If, instead, we worked on a C-based interface which includes filters and storage drivers, and was implemented through libpgcommon, we could start with that being all done through pg_basebackup and work to hammer out the complications and issues that we run into there and, once it seems reasonably stable and works well, we could potentially pull that into the backend to be run directly without having to have pg_basebackup involved in the process. There's been good progress in the direction of having more done by the backend already, and that's thanks to you and it's good work- specifically, the backend now has the ability to generate a manifest, with checksums computed as the backup is being run, which is definitely an important piece. Thanks, Stephen
On Tue, Apr 14, 2020 at 11:08 AM Stephen Frost <sfrost@snowman.net> wrote: > Wouldn't it make sense, given that we have some idea of what we want > it to eventually look like, to make progress in that direction though? Well, yes. :-) > That is- I tend to agree with Andres that having this supported > server-side eventually is what we should be thinking about as an > end-goal (what is the point of pg_basebackup in all of this, after all, > if the goal is to get a backup of PG from the PG server to s3..? why > go through some other program or through the replication protocol?) and > having the server exec'ing out to run shell script fragments to make > that happen looks like it would be really awkward and full of potential > risks and issues and agreement that it wouldn't be a good fit. I'm fairly deeply uncomfortable with what Andres is proposing. I see that it's very powerful, and can do a lot of things, and that if you're building something that does sophisticated things with storage, you probably want an API like that. It does a great job making complicated things possible. However, I feel that it does a lousy job making simple things simple. Suppose you want to compress using your favorite compression program. Well, you can't. Your favorite compression program doesn't speak the bespoke PostgreSQL protocol required for backup plugins. Neither does your favorite encryption program. Either would be perfectly happy to accept a tarfile on stdin and dump out a compressed or encrypted version, as the case may be, on stdout, but sorry, no such luck. You need a special program that speaks the magic PostgreSQL protocol but otherwise does pretty much the exact same thing as the standard one. It's possibly not the exact same thing. A special tool might, for example, use multiple threads for parallel compression rather than multiple processes, perhaps gaining a bit of efficiency. But it's doubtful whether all users care about such marginal improvements. 
All they're going to see is that they can use gzip and maybe lz4 because we provide the necessary special magic tools to integrate with those, but for some reason we don't have a special magic tool that they can use with their own favorite compressor, and so they can't use it. I think people are going to find that fairly unhelpful. Now, it's a problem we can work around. We could have a "shell gateway" program which acts as a plugin, speaks the backup plugin protocol, and internally does fork-and-exec stuff to spin up copies of any binary you want to act as a filter. I don't see any real problem with that. I do think it's very significantly more complicated than just what Andres called an FFI. It's gonna be way easier to just write something that spawns shell processes directly than it is to write something that spawns a process and talks to it using this protocol and passes around file descriptors using the various different mechanisms that different platforms use for that, and then that process turns around and spawns some other processes and passes along the file descriptors to them. Now you've added a whole bunch of platform-specific code and a whole bunch of code to generate and parse protocol messages to achieve exactly the same thing that you could've done far more simply with a C API. Even accepting as a given the need to make the C API work separately on both the client and server side, you've probably at least doubled, and I suspect more like quadrupled, the amount of infrastructure that has to be built. So... 
> If, instead, we worked on a C-based interface which includes filters and > storage drivers, and was implemented through libpgcommon, we could start > with that being all done through pg_basebackup and work to hammer out > the complications and issues that we run into there and, once it seems > reasonably stable and works well, we could potentially pull that into > the backend to be run directly without having to have pg_basebackup > involved in the process. ...let's do this. Actually, I don't really mind if we target something that can work on both the client and server side initially, but based on C, not a new wire protocol with file descriptor passing. That new wire protocol, and the file descriptor passing infrastructure that goes with it, are things that I *really* think should be pushed off to version 2, because I think they're going to generate a lot of additional work and complexity, and I don't want to deal with all of it at once. Also, I don't really see what's wrong with the server forking processes that exec("/usr/bin/lz4") or whatever. We do similar things in other places and, while it won't work for cases where you want to compress a shazillion files, that's not really a problem here anyway. At least at the moment, the server-side format is *always* tar, so the problem of needing a separate subprocess for every file in the data directory does not arise. > There's been good progress in the direction of having more done by the > backend already, and that's thanks to you and it's good work- > specifically that the backend now has the ability to generate a > manifest, with checksums included as the backup is being run, which is > definitely an important piece. Thanks. I'm actually pretty pleased about making some of that infrastructure available on the frontend side, and would like to go further in that direction over time. 
My only concern is that any given patch shouldn't be made to require too much collateral infrastructure work, and any infrastructure work that it will require should be agreed, so far as we can, early in the development process, so that there's time to do it at a suitably unhurried pace. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
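The "server forks processes that exec /usr/bin/lz4" shape mentioned above is a well-worn pattern; a stripped-down sketch might look like the following. This is illustrative only, with far less error and signal handling than a real implementation would need, and it uses /bin/cat as a portable stand-in for the compressor so it runs anywhere:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Spawn `prog` with our data on its stdin and its stdout redirected to
 * outfd, then feed it `data`.  In the real case `prog` would be
 * something like "/usr/bin/lz4" and outfd the client connection; here
 * any stdin-to-stdout filter works.  Returns 0 on success. */
static int
filter_through(const char *prog, int outfd, const char *data, size_t len)
{
    int         pfd[2];
    int         status;
    pid_t       pid;

    if (pipe(pfd) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
    {
        /* child: stdin from pipe, stdout to outfd, then exec the filter */
        dup2(pfd[0], STDIN_FILENO);
        dup2(outfd, STDOUT_FILENO);
        close(pfd[0]);
        close(pfd[1]);
        execl(prog, prog, (char *) NULL);
        _exit(127);             /* exec failed */
    }

    /* parent: write the payload, then wait for the filter to finish */
    close(pfd[0]);
    if (write(pfd[1], data, len) != (ssize_t) len)
        return -1;
    close(pfd[1]);
    waitpid(pid, &status, 0);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Since the server-side stream is always tar, one such subprocess per tarfile suffices, which is why the process-per-file concern doesn't arise in this configuration.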
Hi, On 2020-04-14 11:38:03 -0400, Robert Haas wrote: > I'm fairly deeply uncomfortable with what Andres is proposing. I see > that it's very powerful, and can do a lot of things, and that if > you're building something that does sophisticated things with storage, > you probably want an API like that. It does a great job making > complicated things possible. However, I feel that it does a lousy job > making simple things simple. I think it's pretty much exactly the opposite. Your approach seems to move all the complexity to the user, having to build entire combinations of commands themselves. Instead of having one or two default commands that do backups in common situations, everyone has to assemble them from pieces. Moved from later in your email, since it seems to make more sense to have it here: > All they're going to see is that they can use gzip and maybe lz4 > because we provide the necessary special magic tools to integrate with > those, but for some reason we don't have a special magic tool that > they can use with their own favorite compressor, and so they can't use > it. I think people are going to find that fairly unhelpful. I have no problem with providing people with the opportunity to use their personal favorite compressor, but forcing them to have to do that, and to ensure it's installed etc, strikes me as a spectacularly bad default situation. Most people don't have the time to research which compression algorithms work the best for which precise situation. What do you imagine a default scripted invocation of the new backup stuff would look like? Having to specify multiple commandline "fragments" for compression, storing files, ... can't be what we want the common case to look like. It'll just again lead to everyone copy & pasting examples that are all wrong in different ways. They'll not at all work across platforms (or often not across OS versions). 
In general, I think it's good to give expert users the ability to customize things like backups and archiving. But defaulting to every non-expert user having to do all that expert work (or copying it from bad examples) is one of the most user-hostile things in postgres. > Also, I don't really see what's wrong with the server forking > processes that exec("/usr/bin/lz4") or whatever. We do similar things > in other places and, while it won't work for cases where you want to > compress a shazillion files, that's not really a problem here anyway. > At least at the moment, the server-side format is *always* tar, so the > problem of needing a separate subprocess for every file in the data > directory does not arise. I really really don't understand this. Are you suggesting that for server side compression etc we're going to add the ability to specify shell commands as argument to the base backup command? That seems so obviously a non-starter? A good default for backup configurations should be that the PG user that the backup is done under is only allowed to do that, and not that it directly has arbitrary remote command execution. > Suppose you want to compress using your favorite compression > program. Well, you can't. Your favorite compression program doesn't > speak the bespoke PostgreSQL protocol required for backup > plugins. Neither does your favorite encryption program. Either would > be perfectly happy to accept a tarfile on stdin and dump out a > compressed or encrypted version, as the case may be, on stdout, but > sorry, no such luck. You need a special program that speaks the magic > PostgreSQL protocol but otherwise does pretty much the exact same > thing as the standard one. But the tool speaking the protocol can just allow piping through whatever tool? Given that there likely is benefits to either doing things on the client side or on the server side, it seems inevitable that there's multiple places that would make sense to have the capability for? 
> It's possibly not the exact same thing. A special tool might, for example, > use multiple threads for parallel compression rather than multiple > processes, perhaps gaining a bit of efficiency. But it's doubtful > whether all users care about such marginal improvements. Marginal improvements? Compression scales decently well with the number of cores. pg_basebackup's compression is useless because it's so slow (and because it's client-side, but that's IME the lesser issue). I feel I must be misunderstanding what you mean here. gzip - vs pigz -p $numcores on my machine: 180MB/s vs 2.5GB/s. The latter will still sometimes be a bottleneck (it's a bottleneck in pigz, not available compression cycles), but a lot less commonly than 180. Greetings, Andres Freund
On Tue, Apr 14, 2020 at 9:50 PM Andres Freund <andres@anarazel.de> wrote: > On 2020-04-14 11:38:03 -0400, Robert Haas wrote: > > I'm fairly deeply uncomfortable with what Andres is proposing. I see > > that it's very powerful, and can do a lot of things, and that if > > you're building something that does sophisticated things with storage, > > you probably want an API like that. It does a great job making > > complicated things possible. However, I feel that it does a lousy job > > making simple things simple. > > I think it's pretty much exactly the opposite. Your approach seems to > move all the complexity to the user, having to build entire combination > of commands themselves. Instead of having one or two default commands > that do backups in common situations, everyone has to assemble them from > pieces. I think we're mostly talking about different things. I was speaking mostly about the difficulty of developing it. I agree that a project which is easier to develop is likely to provide fewer benefits to the end user. On the other hand, it might be more likely to get done, and projects that don't get done provide few benefits to users. I strongly believe we need an incremental approach here. > In general, I think it's good to give expert users the ability to > customize things like backups and archiving. But defaulting to every > non-expert user having to all that expert work (or coyping it from bad > examples) is one of the most user hostile things in postgres. I'm not against adding more built-in compression algorithms, but I also believe (as I have several times now) that the world moves a lot faster than PostgreSQL, which has not added a single new compression algorithm to pg_basebackup ever. We had 1 compression algorithm in 2011, and we still have that same 1 algorithm today. So, either nobody cares, or adding new algorithms is sufficiently challenging - for either technical or political reasons - that nobody's managed to get it done. 
I think having a simple framework in pg_basebackup for plugging in new algorithms would make it noticeably simpler to add LZ4 or whatever your favorite compression algorithm is. And I think having that framework also be able to use shell commands, so that users don't have to wait a decade or more for new choices to show up, is also a good idea. I don't disagree that the situation around things like archive_command is awful, but a good part of that is that every time somebody shows up and says "hey, let's try to make a small improvement," between two and forty people show up and start explaining why it's still going to be terrible. Eventually the pile of requirements get so large, and/or there are enough contradictory opinions, that the person who made the proposal for how to improve things gives up and leaves. So then we still have the documentation suggesting "cp". When people - it happens to be me in this case, but the problem is much more general - show up and propose improvements to difficult areas, we can and should give them good advice on how to improve their proposals. But we should not insist that they have to build something incredibly complex and grandiose and solve every problem in that area. We should be happy if we get ANY improvement in a difficult area, not send dozens of angry emails complaining that their proposal is imperfect. > I really really don't understand this. Are you suggesting that for > server side compression etc we're going to add the ability to specify > shell commands as argument to the base backup command? That seems so > obviously a non-starter? A good default for backup configurations > should be that the PG user that the backup is done under is only allowed > to do that, and not that it directly has arbitrary remote command > execution. I hadn't really considered that aspect, and that's certainly a concern. But I also don't understand why you think it's somehow a big deal. 
My point is not that clients should have the ability to execute arbitrary commands on the server. It's that shelling out to an external binary provided by the operating system is a reasonable thing to do, versus having everything have to be done by binaries that we create. Which I think is what you are also saying right here: > But the tool speaking the protocol can just allow piping through > whatever tool? Given that there likely is benefits to either doing > things on the client side or on the server side, it seems inevitable > that there's multiple places that would make sense to have the > capability for? Unless I am misunderstanding you, this is exactly what I was proposing, and have been proposing since the first email on the thread. > > It's possibly not the exact same thing. A special tool might, for example, > > use multiple threads for parallel compression rather than multiple > > processes, perhaps gaining a bit of efficiency. But it's doubtful > > whether all users care about such marginal improvements. > > Marginal improvements? Compression scales decently well with the number > of cores. pg_basebackup's compression is useless because it's so slow > (and because its clientside, but that's IME the lesser issue). I feel I > must be misunderstanding what you mean here. > > gzip - vs pigz -p $numcores on my machine: 180MB/s vs 2.5GB/s. The > latter will still sometimes be a bottleneck (it's a bottlenck in pigz, > not available compression cycles), but a lot less commonly than 180. That's really, really, really not what I was talking about. I'm quite puzzled by your reading of this email. You seem to have missed my point entirely. I don't know whether that's because I did a poor job writing it or because you didn't read it carefully enough or what. What I'm saying is: I don't immediately wish to undertake the problem of building a new wire protocol that the client and server can use to talk to external binaries. 
I would prefer to start with a C API, because I think it will be far less work and still able to meet a number of important needs. The new wire protocol that can be used to talk to external binaries can be added later. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-04-15 09:23:30 -0400, Robert Haas wrote: > On Tue, Apr 14, 2020 at 9:50 PM Andres Freund <andres@anarazel.de> wrote: > > On 2020-04-14 11:38:03 -0400, Robert Haas wrote: > > > I'm fairly deeply uncomfortable with what Andres is proposing. I see > > > that it's very powerful, and can do a lot of things, and that if > > > you're building something that does sophisticated things with storage, > > > you probably want an API like that. It does a great job making > > > complicated things possible. However, I feel that it does a lousy job > > > making simple things simple. > > > > I think it's pretty much exactly the opposite. Your approach seems to > > move all the complexity to the user, having to build entire combination > > of commands themselves. Instead of having one or two default commands > > that do backups in common situations, everyone has to assemble them from > > pieces. > > I think we're mostly talking about different things. That certainly would explain some misunderstandings ;) I mostly still am trying to define where we eventually want to be on a medium to high level. And I don't think we really have agreement on that. My original understanding of your eventual goal is that it's the example invocation of pg_basebackup upthread, namely a bunch of shell arguments to pg_basebackup. And I don't think that's good enough. IMO it only really makes sense to design incremental steps after we have a rough agreement on the eventual goal. Otherwise we'll just end up supporting the outcomes of missteps for a long time. > I was speaking mostly about the difficulty of developing it. I agree > that a project which is easier to develop is likely to provide fewer > benefits to the end user. On the other hand, it might be more likely > to get done, and projects that don't get done provide few benefits to > users. I strongly believe we need an incremental approach here. I agree. 
My concern is just that we should not expose things to the user that will make it much harder to evolve going forward. > I'm not against adding more built-in compression algorithms, but I > also believe (as I have several times now) that the world moves a lot > faster than PostgreSQL, which has not added a single new compression > algorithm to pg_basebackup ever. We had 1 compression algorithm in > 2011, and we still have that same 1 algorithm today. So, either nobody > cares, or adding new algorithms is sufficiently challenging - for > either technical or political reasons - that nobody's managed to get > it done. Imo most of the discussion has been around toast, and there the situation imo is much more complicated than just about adding the compression algorithm. I don't recall a discussion about adding an optional dependency to other compression algorithms to pg_basebackup that didn't go anywhere for either technical or political reasons. > I think having a simple framework in pg_basebackup for plugging in new > algorithms would make it noticeably simpler to add LZ4 or whatever > your favorite compression algorithm is. And I think having that > framework also be able to use shell commands, so that users don't have > to wait a decade or more for new choices to show up, is also a good > idea. As long as there are sensible defaults, so that the user doesn't have to specify paths to binaries for the common cases, I'm OK with that. I'm not ok with requiring the user to specify shell fragments for things that should be built in. If we think the appropriate way to implement extensible compression is by piping to commandline binaries ([1]), I'd imo e.g. be ok if we had a builtin list of [{fileending, shell-fragment-for-compression}] that is filled with appropriate values detected at build time for a few common cases. But then also allowed adding new methods via commandline options. 
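A builtin [{fileending, shell-fragment-for-compression}] list of the kind described here could be as simple as a static table plus a lookup function; a rough sketch follows, where the entries and shell fragments are purely illustrative (a real build would fill them in at configure time based on which binaries are available, and command-line options could append to the list):

```c
#include <stddef.h>
#include <string.h>

/* Illustrative built-in table mapping a file ending to the shell
 * fragment that produces it; each fragment reads the tar stream on
 * stdin and writes the compressed result on stdout. */
typedef struct
{
    const char *suffix;         /* file ending, e.g. ".lz4" */
    const char *shell_fragment; /* filter command for this format */
} CompressionMethod;

static const CompressionMethod builtin_methods[] = {
    {".gz", "gzip -c"},
    {".lz4", "lz4 -c"},
    {".zst", "zstd -c"},
};

/* Return the shell fragment for a given suffix, or NULL if unknown. */
static const char *
lookup_compression(const char *suffix)
{
    size_t      i;

    for (i = 0; i < sizeof(builtin_methods) / sizeof(builtin_methods[0]); i++)
        if (strcmp(builtin_methods[i].suffix, suffix) == 0)
            return builtin_methods[i].shell_fragment;
    return NULL;
}
```

The user-visible interface could then be just a suffix or method name, with raw shell fragments reserved for power users who explicitly register a new entry.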
I guess what I perceived to be the fundamental difference, before this email, between our positions is that I (still) think that exposing detailed postprocessing shell fragment style arguments to pg_basebackup, especially as the only option to use the new capabilities, will nail us into a corner - but you don't necessarily think so? Whereas I had/have no problems with implementing features by *internally* piping through external binaries, as long as the user doesn't have to always specify them.

> > I really really don't understand this. Are you suggesting that for
> > server side compression etc we're going to add the ability to specify
> > shell commands as argument to the base backup command? That seems so
> > obviously a non-starter? A good default for backup configurations
> > should be that the PG user that the backup is done under is only allowed
> > to do that, and not that it directly has arbitrary remote command
> > execution.
>
> I hadn't really considered that aspect, and that's certainly a
> concern. But I also don't understand why you think it's somehow a big
> deal. My point is not that clients should have the ability to execute
> arbitrary commands on the server. It's that shelling out to an
> external binary provided by the operating system is a reasonable thing
> to do, versus having everything have to be done by binaries that we
> create. Which I think is what you are also saying right here:
>
> > But the tool speaking the protocol can just allow piping through
> > whatever tool? Given that there likely are benefits to either doing
> > things on the client side or on the server side, it seems inevitable
> > that there are multiple places that would make sense to have the
> > capability for?
>
> Unless I am misunderstanding you, this is exactly what I was
> proposing, and have been proposing since the first email on the
> thread.

Well, no and yes. As I said above, for me there's a difference between piping to commands as an internal implementation detail, and between that being the non-poweruser interface.

It may or may not be the right tradeoff to implement server side compression by piping the output to/from some binary. IMO it's clearly not the right way to implement server side compression by specifying shell fragments as arguments to BASE_BACKUP. Nor do I think it's the right thing, albeit a tad more debatable, that for decent client side compression one has to specify a binary whose path will differ on various platforms (on windows you can't rely on PATH). If we were to go for building all this via pipes, utilizing that to make compression etc extensible for powerusers makes sense to me.

But I don't think it makes sense to design a C API without a rough picture of how things should eventually look. If we were, e.g., eventually going to do all the work of compressing and transferring data in one external binary, then a C API exposing transformations in pg_basebackup doesn't necessarily make sense. If it turns out that pipes are too inefficient on windows to implement compression filters, that we need parallel awareness in the API, etc., it'll influence the API.

> > > It's possibly not the exact same thing. A special tool might, for example,
> > > use multiple threads for parallel compression rather than multiple
> > > processes, perhaps gaining a bit of efficiency. But it's doubtful
> > > whether all users care about such marginal improvements.
> >
> > Marginal improvements? Compression scales decently well with the number
> > of cores. pg_basebackup's compression is useless because it's so slow
> > (and because it's client-side, but that's IME the lesser issue). I feel I
> > must be misunderstanding what you mean here.
> >
> > gzip vs pigz -p $numcores on my machine: 180MB/s vs 2.5GB/s. The
> > latter will still sometimes be a bottleneck (it's a bottleneck in pigz,
> > not available compression cycles), but a lot less commonly than 180.
>
> That's really, really, really not what I was talking about.

What did you mean with the "marginal improvements" paragraph above?

[1] I am not sure that piping is a great idea medium term, nor am I sure of the opposite. One concern is that IIRC windows pipe performance is not great, and that there are some other portability problems as well. I think there are also valid concerns about per-file overhead, which might be a problem for some future uses.

Greetings,

Andres Freund
On Wed, Apr 15, 2020 at 6:13 PM Andres Freund <andres@anarazel.de> wrote:
> I guess what I perceived to be the fundamental difference, before this
> email, between our positions is that I (still) think that exposing
> detailed postprocessing shell fragment style arguments to pg_basebackup,
> especially as the only option to use the new capabilities, will nail us
> into a corner - but you don't necessarily think so? Whereas I had/have no
> problems with implementing features by *internally* piping through
> external binaries, as long as the user doesn't have to always specify
> them.

My principal concern is actually around having a C API and a flexible command-line interface. If we rearrange the code and the pg_basebackup command line syntax so that it's easy to add new "filters" and "targets", then I think that's a very good step forward. It's of less concern to me whether those "filters" and "targets" are (1) C code that we ship as part of pg_basebackup, (2) C code by extension authors that we dynamically load into pg_basebackup, (3) off-the-shelf external programs that we invoke, or (4) special external programs that we provide which do special magic. However, of those options, I like #4 least, because it seems like a pain in the tail to implement. It may turn out to be the most powerful and flexible, though I'm not completely sure about that yet.

As to exactly how far we can get with #3, I think it depends a good deal on the answer to this question you pose in a footnote:

> [1] I am not sure, nor the opposite, that piping is a great idea medium
> term. One concern is that IIRC windows pipe performance is not great,
> and that there's some other portability problems as well. I think
> there's also valid concerns about per-file overhead, which might be a
> problem for some future uses.

If piping stuff through shell commands performs well for use cases like compression, then I think we can get pretty far with piping things through shell commands. It means we can use any compression at all with no build-time dependency on that compressor. People can install anything they want, stick it in $PATH, and away they go. I see no particular reason to dislike that kind of thing; in fact, I think it offers many compelling advantages. On the other hand, if we really need to interact directly with the library to get decent performance, because, say, pipes are too slow, then the approach of piping things through arbitrary shell commands is a lot less exciting.

Even then, though, I wonder how many runtime dependencies we're seriously willing to add. I imagine we can add one or two more compression algorithms without giving everybody fits, even if it means adding optional build-time and run-time dependencies on some external libraries. Any more than that is likely to provoke a backlash. And I doubt whether we're willing to have the postgresql operating system package depend on something like libgcrypt at all. I would expect such a proposal to meet with vigorous objections. But without such a dependency, how would we realistically get encrypted backups except by piping through a shell command? I don't really see a way, and letting a user specify a shell fragment to define what happens there seems pretty reasonable to me.

I'm also not very sure, with either compression or encryption, that we can assume one size fits all. If there are six popular compression libraries and four popular encryption libraries, does anyone really believe that it's going to be OK for 'yum install postgresql-server' to suck in all of those things? Or, even if that were OK, or if we could somehow avoid it, what are the chances that we'd actually go to the trouble of building interfaces to all of those things? I'd rate them as slim to none; we suck at that sort of thing. Exhibit A: the work to make PostgreSQL support more than one SSL library.

I'm becoming fairly uncertain as to how far we can get with shell commands; some of the concerns raised about, for example, connection management when talking to stuff like S3 are very worrying. At the same time, I think we need to think pretty seriously about some of the upsides of shell commands. The average user cannot write a C library that implements an API. The average user cannot write a C binary that speaks a novel, PostgreSQL-specific protocol. Even the above-average user who is capable of doing those things probably won't have the time to actually do it. So if the thing you have to do to make PostgreSQL talk to the new sljgsjl compressor is either of those things, then we will not have sljgsjl compression support for probably a decade after it becomes the gold standard that everyone else in the industry is using. If what you have to do is 'yum install sljgsjl' and then pg_basebackup --client-filter='shell sljgsjl', people can start using it as soon as their favorite distro packages it, without anyone who reads this mailing list needing to do any work whatsoever. If what you have to do is create a 'sljgsjl.json' file in some PostgreSQL install directory that describes the salient properties of this compressor, and then after that you can say pg_basebackup --client-filter=sljgsjl, that's also accessible to a broad swath of users.

Now, it may be that there's no practical way to make things that easy. But, to the extent that we can, I think we should. The ability to integrate new technology without action by PostgreSQL core developers is not the only consideration here, but it's definitely a good thing to have insofar as we reasonably can.

> But I don't think it makes sense to design a C API without a rough
> picture of how things should eventually look. If we were, e.g.,
> eventually going to do all the work of compressing and transferring data
> in one external binary, then a C API exposing transformations in
> pg_basebackup doesn't necessarily make sense. If it turns out that
> pipes are too inefficient on windows to implement compression filters,
> that we need parallel awareness in the API, etc it'll influence the API.

Yeah. I think we really need to understand the performance characteristics of pipes better. If they're slow, then anything that needs to be fast has to work some other way (but we could still provide a pipe-based slow way for niche uses).

> > That's really, really, really not what I was talking about.
>
> What did you mean with the "marginal improvements" paragraph above?

I was talking about running one compressor process with multiple compression threads each reading from a separate pipe, vs. running multiple processes each with a single thread doing the same thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
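To make the descriptor-file idea above a bit more concrete — this is purely hypothetical, and both the file name and every key in it are invented for the example — such a file might not need to contain much more than:

```json
{
    "name": "sljgsjl",
    "extension": ".sl",
    "compress_command": "sljgsjl -c",
    "decompress_command": "sljgsjl -d"
}
```

That is, just enough for pg_basebackup to assemble the pipe command line; anything fancier (compression levels, parallelism flags) could be layered on later without anyone having to write C code.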
On Wed, Apr 15, 2020 at 7:55 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Yeah. I think we really need to understand the performance
> characteristics of pipes better. If they're slow, then anything that
> needs to be fast has to work some other way (but we could still
> provide a pipe-based slow way for niche uses).

Hmm. Could we learn what we need to know about this by doing something as simple as taking a basebackup of a cluster with some data in it (say, created by pgbench -i -s 400 or something) and then comparing the speed of cat < base.tar | gzip > base.tgz to the speed of gzip < base.tar > base.tgz? It seems like there's no difference between those except that the first one relays through an extra process and an extra pipe. I don't know exactly how to do the equivalent of this on Windows, but I bet somebody does.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 16, 2020 at 10:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm. Could we learn what we need to know about this by doing something
> as simple as taking a basebackup of a cluster with some data in it (say,
> created by pgbench -i -s 400 or something) and then comparing the speed
> of cat < base.tar | gzip > base.tgz to the speed of gzip < base.tar >
> base.tgz? It seems like there's no difference between those except
> that the first one relays through an extra process and an extra pipe.

I decided to try this. First I experimented on my laptop using a backup of a pristine pgbench database, scale factor 100, ~1.5GB.

[rhaas pgbackup]$ for i in 1 2 3; do echo "= run number $i ="; sync; sync; time gzip < base.tar > base.tar.gz; rm -f base.tar.gz; sync; sync; time cat < base.tar | gzip > base.tar.gz; rm -f base.tar.gz; sync; sync; time cat < base.tar | cat | cat | gzip > base.tar.gz; rm -f base.tar.gz; done

= run number 1 =
real 0m24.011s  user 0m23.542s  sys 0m0.408s
real 0m23.623s  user 0m23.447s  sys 0m0.908s
real 0m23.688s  user 0m23.847s  sys 0m2.085s
= run number 2 =
real 0m23.704s  user 0m23.290s  sys 0m0.374s
real 0m23.389s  user 0m23.239s  sys 0m0.879s
real 0m23.762s  user 0m23.888s  sys 0m2.057s
= run number 3 =
real 0m23.567s  user 0m23.187s  sys 0m0.361s
real 0m23.573s  user 0m23.422s  sys 0m0.903s
real 0m23.749s  user 0m23.884s  sys 0m2.113s

It looks like piping everything through an extra copy of 'cat' may even be *faster* than having the process read it directly; two out of three runs with the extra "cat" finished very slightly quicker than the test where gzip read the file directly. The third set of numbers for each run is with three copies of "cat" interposed. That appears to be slower than with no extra pipes, but not by very much, and it might just be noise.

Next I tried it out on Linux. For this I used 'cthulhu', an older box with lots and lots of memory and cores. Here I took the scale factor up to 400, so it's about 5.9GB of data. The same command as above produced these results:

= run number 1 =
real 2m35.797s  user 2m30.990s  sys 0m4.760s
real 2m35.407s  user 2m32.730s  sys 0m16.714s
real 2m40.598s  user 2m39.054s  sys 0m37.596s
= run number 2 =
real 2m35.529s  user 2m30.971s  sys 0m4.510s
real 2m33.933s  user 2m31.685s  sys 0m16.003s
real 2m45.563s  user 2m44.042s  sys 0m40.357s
= run number 3 =
real 2m35.876s  user 2m31.437s  sys 0m4.391s
real 2m33.872s  user 2m31.629s  sys 0m16.266s
real 2m40.836s  user 2m39.359s  sys 0m38.960s

These results are pretty similar to the MacOS results. The overall performance was worse, but I think that is probably explained by the fact that the MacBook has a Haswell-class processor rather than Westmere, and significantly higher RAM speed. The pattern that one extra pipe seems to be perhaps slightly faster, and three extra pipes a tad slower, persists. So at least in this test, the overhead added by each pipe appears to be <1%, which I would classify as good enough not to worry too much about.

> I don't know exactly how to do the equivalent of this on Windows, but
> I bet somebody does.

However, I still don't know what the situation is on Windows. I did do some searching around on the Internet to try to find out whether pipes being slow on Windows is a generally-known phenomenon, and I didn't find anything very compelling, but I don't have an environment set up to test this myself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-04-17 12:19:32 -0400, Robert Haas wrote:
> On Thu, Apr 16, 2020 at 10:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Hmm. Could we learn what we need to know about this by doing something
> > as simple as taking a basebackup of a cluster with some data in it (say,
> > created by pgbench -i -s 400 or something) and then comparing the speed
> > of cat < base.tar | gzip > base.tgz to the speed of gzip < base.tar >
> > base.tgz? It seems like there's no difference between those except
> > that the first one relays through an extra process and an extra pipe.
>
> I decided to try this. First I experimented on my laptop using a
> backup of a pristine pgbench database, scale factor 100, ~1.5GB.
>
> [rhaas pgbackup]$ for i in 1 2 3; do echo "= run number $i ="; sync;
> sync; time gzip < base.tar > base.tar.gz; rm -f base.tar.gz; sync;
> sync; time cat < base.tar | gzip > base.tar.gz; rm -f base.tar.gz;
> sync; sync; time cat < base.tar | cat | cat | gzip > base.tar.gz; rm
> -f base.tar.gz; done

Given that gzip is too slow to be practically usable for anything where compression speed matters (like e.g. practical database backups), I'm not sure this measures something useful. The overhead of gzip will dominate to a degree that even the slowest possible pipe implementation would be fast enough.

andres@awork3:/tmp/pgbase$ ls -lh
total 7.7G
-rw------- 1 andres andres 137K Apr 17 13:09 backup_manifest
-rw------- 1 andres andres 7.7G Apr 17 13:09 base.tar
-rw------- 1 andres andres  17M Apr 17 13:09 pg_wal.tar

Measuring with pv base.tar | gzip > /dev/null I can see that the performance varies from somewhere around 20MB/s to about 90MB/s, averaging ~60MB/s.

andres@awork3:/tmp/pgbase$ pv base.tar | gzip > /dev/null
7.62GiB 0:02:09 [60.2MiB/s] [=====>] 100%

Whereas e.g. zstd takes a much, much shorter time, even in single threaded mode:

andres@awork3:/tmp/pgbase$ pv base.tar | zstd -T1 | wc -c
7.62GiB 0:00:14 [ 530MiB/s] [=====>] 100%
448956321

not to speak of using parallel compression (pigz is parallel gzip):

andres@awork3:/tmp/pgbase$ pv base.tar | pigz -p 20 | wc -c
7.62GiB 0:00:07 [1.03GiB/s] [=====>] 100%
571718276

andres@awork3:/tmp/pgbase$ pv base.tar | zstd -T20 | wc -c
7.62GiB 0:00:04 [1.78GiB/s] [=====>] 100%
448956321

Looking at raw pipe speed, I think it's not too hard to see some limitations:

andres@awork3:/tmp/pgbase$ time (cat base.tar | wc -c)
8184994304
real 0m3.217s  user 0m0.054s  sys 0m4.856s

andres@awork3:/tmp/pgbase$ time (cat base.tar | cat | wc -c)
8184994304
real 0m3.246s  user 0m0.113s  sys 0m7.086s

andres@awork3:/tmp/pgbase$ time (cat base.tar | cat | cat | cat | cat | cat | wc -c)
8184994304
real 0m4.262s  user 0m0.257s  sys 0m20.706s

but I'm not sure how deep a pipeline we're thinking would be common.

To make sure this is still relevant in the compression context:

andres@awork3:/tmp/pgbase$ pv base.tar | zstd -T20 > /dev/null
7.62GiB 0:00:04 [1.77GiB/s] [=====>] 100%

andres@awork3:/tmp/pgbase$ pv base.tar | cat | cat | zstd -T20 > /dev/null
7.62GiB 0:00:05 [1.38GiB/s] [=====>] 100%

It's much less noticeable if the cats are after the zstd; there's so much less data, as pgbench's data is so compressible.

This does seem to suggest that composing features through chains of pipes wouldn't be a good idea. But not that we shouldn't implement compression via pipes (nor the opposite).

> > I don't know exactly how to do the equivalent of this on Windows, but
> > I bet somebody does.
>
> However, I still don't know what the situation is on Windows. I did do
> some searching around on the Internet to try to find out whether pipes
> being slow on Windows is a generally-known phenomenon, and I didn't
> find anything very compelling, but I don't have an environment set up
> to test this myself.

I tried to measure something. But I'm not a windows person, and it's just a kvm VM; I don't know how well that translates into other environments. I downloaded gnuwin32 coreutils and zstd and performed some measurements.

The first results were *shockingly* bad:

zstd -T0 < onegbofrandom | wc -c
linux host: 0.467s
windows guest: 0.968s

zstd -T0 < onegbofrandom | cat | wc -c
linux host: 0.479s
windows guest: 6.058s

zstd -T0 < onegbofrandom | cat | cat | wc -c
linux host: 0.516s
windows guest: 7.830s

I think that's because cat reads or writes in too small increments for windows (but damn, that's slow). Replacing cat with dd:

zstd -T0 < onegbofrandom | dd bs=512 | wc -c
linux host: 3.091s
windows guest: 5.909s

zstd -T0 < onegbofrandom | dd bs=64k | wc -c
linux host: 0.540s
windows guest: 1.128s

zstd -T0 < onegbofrandom | dd bs=1M | wc -c
linux host: 0.516s
windows guest: 1.043s

zstd -T0 < onegbofrandom | dd bs=1 | wc -c
linux host: 1547s
windows guest: 2607s
(yes, really, it's this slow)

zstd -T0 < onegbofrandom > NUL
zstd -T0 < onegbofrandom > /dev/null
linux host: 0.361s
windows guest: 0.602s

zstd -T0 < onegbofrandom | dd bs=1M of=NUL
zstd -T0 < onegbofrandom | dd bs=1M of=/dev/null
linux host: 0.454s
windows guest: 0.802s

zstd -T0 < onegbofrandom | dd bs=64k | dd bs=64k | dd bs=64k | wc -c
linux host: 0.521s
windows guest: 1.376s

This suggests that pipes do have a considerably higher overhead on windows, but that it's not all that terrible if one takes care to use large buffers in each pipe element.

It's notable though that even the simplest use of a pipe does add a considerable overhead compared to using the files directly.

Greetings,

Andres Freund
On Fri, Apr 17, 2020 at 7:44 PM Andres Freund <andres@anarazel.de> wrote:
> This suggests that pipes do have a considerably higher overhead on
> windows, but that it's not all that terrible if one takes care to use
> large buffers in each pipe element.
>
> It's notable though that even the simplest use of a pipe does add a
> considerable overhead compared to using the files directly.

Thanks for these results. I think that this shows that it's probably not a great idea to force everything to go through pipes in every case, but on the other hand, there's no reason to be particularly scared of the performance implications of letting some things go through pipes. For instance, if we decide that LZ4 compression is going to be a good choice for most users, we might want to do that in-process rather than via pipes. However, if somebody wants to pipe through an external compressor that they prefer, that's going to be a little slower, but not necessarily to a degree that creates big problems. People with bigger databases will need to be more careful about which options they choose, but that's kind of inevitable. Do you agree?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Fwiw, it was a common trick in the Oracle world to create a named pipe to gzip and then write your backup to it. I really like that way of doing things, but I suppose it's probably too old-fashioned to expect it to survive. And in practice, while it worked as a manual process for a sysadmin, it's pretty awkward to automate reliably.
On Sat, Apr 18, 2020 at 8:35 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 17, 2020 at 7:44 PM Andres Freund <andres@anarazel.de> wrote:
> > This suggests that pipes do have a considerably higher overhead on
> > windows, but that it's not all that terrible if one takes care to use
> > large buffers in each pipe element.
> >
> > It's notable though that even the simplest use of a pipe does add a
> > considerable overhead compared to using the files directly.
>
> Thanks for these results. I think that this shows that it's probably
> not a great idea to force everything to go through pipes in every
> case, but on the other hand, there's no reason to be particularly
> scared of the performance implications of letting some things go
> through pipes. For instance, if we decide that LZ4 compression is
> going to be a good choice for most users, we might want to do that
> in-process rather than via pipes.

How will the user know how to use this compressed backup? I mean to say, if we use some compression algorithm to compress the data, then the user should know how to decompress and use the backup. IIUC, if the user currently uses the tar format for the backup, they can simply untar it and start the server, but will that be possible if we provide some in-built compression methods like LZ4?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Apr 18, 2020 at 5:14 AM Andres Freund <andres@anarazel.de> wrote:
> zstd -T0 < onegbofrandom > NUL
> zstd -T0 < onegbofrandom > /dev/null
> linux host: 0.361s
> windows guest: 0.602s
>
> zstd -T0 < onegbofrandom | dd bs=1M of=NUL
> zstd -T0 < onegbofrandom | dd bs=1M of=/dev/null
> linux host: 0.454s
> windows guest: 0.802s
>
> zstd -T0 < onegbofrandom | dd bs=64k | dd bs=64k | dd bs=64k | wc -c
> linux host: 0.521s
> windows guest: 1.376s
>
> This suggests that pipes do have a considerably higher overhead on
> windows, but that it's not all that terrible if one takes care to use
> large buffers in each pipe element.

I have also done some similar experiments on my Win-7 box, and the results are as follows:

zstd -T0 < 16396 > NUL
Execution time: 2.240 s

zstd -T0 < 16396 | dd bs=1M > NUL
Execution time: 4.240 s

zstd -T0 < 16396 | dd bs=64k | dd bs=64k | dd bs=64k | wc -c
Execution time: 5.959 s

In the above tests, 16396 is a 1GB file generated via pgbench. The above results indicate that adding more pipe stages with dd adds significant overhead, but how can we distinguish what the exact overhead due to the pipe is?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
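One way to separate the pipe cost from the compression cost is to take the compressor out of the picture entirely and time moving the same bytes with and without intermediate pipe stages. A sketch (this assumes dd plus /dev/zero and /dev/null, e.g. from gnuwin32/MSYS on the Windows side; substitute NUL for /dev/null with native tools):

```shell
# Baseline: copy 1GB with no pipe involved.
time dd if=/dev/zero of=/dev/null bs=1M count=1024

# Same bytes through one pipe stage, then through three.
# The wall-clock differences approximate the per-pipe overhead.
time dd if=/dev/zero bs=1M count=1024 | dd bs=1M of=/dev/null
time dd if=/dev/zero bs=1M count=1024 | dd bs=1M | dd bs=1M | dd bs=1M of=/dev/null
```

Since no compressor is involved, any slowdown between the runs is attributable to the pipes (and the extra dd processes) themselves.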
On Mon, Apr 20, 2020 at 8:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, Apr 18, 2020 at 8:35 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Thanks for these results. I think that this shows that it's probably
> > not a great idea to force everything to go through pipes in every
> > case, but on the other hand, there's no reason to be particularly
> > scared of the performance implications of letting some things go
> > through pipes. For instance, if we decide that LZ4 compression is
> > going to be a good choice for most users, we might want to do that
> > in-process rather than via pipes.
>
> How will the user know how to use this compressed backup? I mean to
> say, if we use some compression algorithm to compress the data, then the
> user should know how to decompress and use the backup. IIUC, if
> the user currently uses the tar format for the backup, they can simply untar it
> and start the server, but will that be possible if we provide some
> in-built compression methods like LZ4?

One idea could be that we write something like BACKUP COMPRESSION: <LZ4 or whatever compression we have used> in the backup_label file, and then probably recovery can take care of decompressing it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
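If we did that, the new field would just sit alongside the existing backup_label lines, something like the following (the field name and the LSNs shown here are of course made up for illustration):

```
START WAL LOCATION: 0/2000028 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/2000060
BACKUP COMPRESSION: lz4
```

Tools (and recovery) could then consult that one line to decide whether and how to decompress before use.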
On Mon, Apr 13, 2020 at 5:57 AM Stephen Frost <sfrost@snowman.net> wrote:
> There's a couple of other pieces here that I think bear mentioning. The
> first is that pgBackRest has an actual 'restore' command - and that works
> with the filters and works with the storage drivers, so what you're
> looking at when it comes to these interfaces isn't just "put a file" but
> it's also "get a file". That's actually quite important to have when
> you start thinking about these more complicated methods of doing
> backups.

I also think it is important to provide a way or interface to restore the data a user has backed up using whatever new API we provide here.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 16, 2020 at 3:44 AM Andres Freund <andres@anarazel.de> wrote:
> > I think having a simple framework in pg_basebackup for plugging in new
> > algorithms would make it noticeably simpler to add LZ4 or whatever
> > your favorite compression algorithm is. And I think having that
> > framework also be able to use shell commands, so that users don't have
> > to wait a decade or more for new choices to show up, is also a good
> > idea.
>
> As long as there are sensible defaults, so that the user doesn't have
> to specify paths to binaries for the common cases, I'm OK with that. I'm
> not ok with requiring the user to specify shell fragments for things
> that should be built in.
>
> If we think the appropriate way to implement extensible compression is
> by piping to commandline binaries ([1]),

I can see how such a scheme could be useful for backups, but how do we restore such a backup?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com