Discussion: Parallel pg_dump for 9.1
People have been talking about a parallel version of pg_dump a few times already. I have been working on some proof-of-concept code for this feature every now and then and I am planning to contribute this for 9.1.

There are two main issues with a parallel version of pg_dump: the first is that it requires a consistent snapshot among multiple pg_dump clients, and the second is that currently the output goes to a single file and it is unclear what to do about multiple processes writing into a single file.

- There are ideas on how to solve the issue with the consistent snapshot, but in the end you can always solve it by stopping your application(s). I actually assume that whenever people are interested in a very fast dump, it is because they are doing some maintenance task (like migrating to a different server) that involves pg_dump. In these cases, they would stop their system anyway. Even if we had consistent snapshots in a future version, would we forbid people to run parallel dumps against old server versions? What I suggest is to just display a big warning if run against a server without consistent snapshot support (which currently is every version).

- Regarding the output of pg_dump, I am proposing two solutions. The first one is to introduce a new archive type "directory" where each table and each blob is a file in a directory, similar to the experimental "files" archive type. The idea has also come up that you should be able to specify multiple directories in order to make use of several physical disk drives. Thinking this further, in order to manage all the mess that you can create with this, every file of the same backup needs to have a unique identifier, and pg_restore should have a check parameter that tells you whether your backup directory is in a sane and complete state (think about moving a file from one backup directory to another one, or trying to restore from two directories which are from different backup sets...).
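[Editor's sketch] The "every file carries a backup-set identifier plus a sanity check" idea can be illustrated in a few lines. The manifest file, the id-in-the-first-line layout, and all names below are invented for illustration only; this is not pg_dump's actual on-disk format:

```python
# Hypothetical sketch: each data file in a "directory" archive is tagged
# with one backup-set id, and a check pass verifies the directory holds a
# single, complete backup set. Layout and names are invented.
import json
import os
import tempfile
import uuid


def write_backup_dir(path, tables):
    """Write one data file per table, all tagged with one backup-set id."""
    backup_id = str(uuid.uuid4())
    for name, rows in tables.items():
        with open(os.path.join(path, name + ".dat"), "w") as f:
            f.write(backup_id + "\n")            # first line: backup-set id
            f.writelines(row + "\n" for row in rows)
    with open(os.path.join(path, "manifest.json"), "w") as f:
        json.dump({"backup_id": backup_id, "files": sorted(tables)}, f)
    return backup_id


def check_backup_dir(path):
    """Return True iff every listed file exists and carries the manifest's id."""
    with open(os.path.join(path, "manifest.json")) as f:
        manifest = json.load(f)
    for name in manifest["files"]:
        fpath = os.path.join(path, name + ".dat")
        if not os.path.exists(fpath):
            return False                         # incomplete backup set
        with open(fpath) as f:
            if f.readline().strip() != manifest["backup_id"]:
                return False                     # file from a different set
    return True


# quick self-check in a scratch directory
with tempfile.TemporaryDirectory() as tmp:
    write_backup_dir(tmp, {"t1": ["row1"], "t2": ["row2"]})
    assert check_backup_dir(tmp)
```

The proposed check parameter for pg_restore would amount to running something like check_backup_dir before starting the restore, catching both missing files and files moved in from a different backup set.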
The second solution to the single-file problem is to generate no output at all, i.e. whatever you export from your source database you import directly into your target database, which in the end turns out to be a parallel form of "pg_dump | psql". In fact, technically this is rather a parallel pg_restore than a pg_dump, as you need to respect the dependencies between objects. The good news is that with the parallel pg_restore of the custom archive format, we already have everything in place for this dependency checking. The addition is a new archive type that dumps (just-in-time) whatever the dependency algorithm decides to restore next. This is probably the fastest way that we can copy or upgrade a database when pg_migrator cannot be used (for example, when you migrate to a different hardware architecture).

As said, I have some working code for the features described (Unix only). If anybody would like to give it a try already now, just let me know; I'd be happy to get some early test reports, and you could check the speedup to expect. But before I continue, I'd like to have a discussion about what it is people actually want and what is the best way to go forward here. I am currently not planning to make parallel dumps work with the custom format, even though this would be possible if we changed the format to a certain degree.

Comments?

Joachim
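[Editor's sketch] The dependency-respecting scheduling that parallel pg_restore already does, and that a just-in-time dump would piggyback on, boils down to processing the archive's objects in a topological order. A toy sketch, assuming an invented dependency graph (real dependencies come from the archive's table of contents, not from a dict like this):

```python
# Sketch (not actual pg_restore code): hand out an object only once
# everything it depends on is done. A dump-on-demand archive type would
# produce each object's data at exactly that moment instead of reading
# it from an archive file.
from collections import deque


def restore_order(deps):
    """Kahn's algorithm: return objects in an order that respects deps.

    deps maps each object to the set of objects it depends on.
    """
    pending = {obj: set(d) for obj, d in deps.items()}
    ready = deque(obj for obj, d in pending.items() if not d)
    done = []
    while ready:
        obj = ready.popleft()
        done.append(obj)              # "dump just-in-time, then restore"
        for other, d in pending.items():
            if obj in d:
                d.remove(obj)
                if not d:
                    ready.append(other)
    if len(done) != len(pending):
        raise ValueError("circular dependency in archive")
    return done
```

In a real parallel run, everything in `ready` at the same time could be dispatched to workers concurrently; the toy version just serializes them.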
On Mon, Mar 29, 2010 at 04:46:48PM +0200, Joachim Wieland wrote:
> People have been talking about a parallel version of pg_dump a few
> times already. I have been working on some proof-of-concept code for
> this feature every now and then and I am planning to contribute this
> for 9.1.
>
> There are two main issues with a parallel version of pg_dump:
>
> The first one is that it requires a consistent snapshot among
> multiple pg_dump clients

Cloning snapshots seems like the way to fix this. I don't know how far this project has drifted from the PostgreSQL code, but you might want to look here: http://pgfoundry.org/projects/snapclone/

> and the second is that currently the output goes to a single file
> and it is unclear what to do about multiple processes writing into a
> single file.

I don't think that's a good idea. Coming up with a directory structure for the new parallel pg_dump seems like a much better idea. Andrew, do you have some notes on this?

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics
Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <joe@mcknight.de> wrote:
> - There are ideas on how to solve the issue with the consistent
> snapshot but in the end you can always solve it by stopping your
> application(s). I actually assume that whenever people are interested
> in a very fast dump, it is because they are doing some maintenance
> task (like migrating to a different server) that involves pg_dump. In
> these cases, they would stop their system anyway.
> Even if we had consistent snapshots in a future version, would we
> forbid people to run parallel dumps against old server versions? What
> I suggest is to just display a big warning if run against a server
> without consistent snapshot support (which currently is every
> version).

Seems reasonable.

> - Regarding the output of pg_dump I am proposing two solutions. The
> first one is to introduce a new archive type "directory" where each
> table and each blob is a file in a directory, similar to the
> experimental "files" archive type. Also the idea has come up that you
> should be able to specify multiple directories in order to make use of
> several physical disk drives. Thinking this further, in order to
> manage all the mess that you can create with this, every file of the
> same backup needs to have a unique identifier and pg_restore should
> have a check parameter that tells you if your backup directory is in a
> sane and complete state (think about moving a file from one backup
> directory to another one or trying to restore from two directories
> which are from different backup sets...).

I think that specifying several directories is a piece of complexity that would be best left alone for a first version of this. But a single directory with multiple files sounds pretty reasonable. Of course we'll also need to support that format in non-parallel mode, and in pg_restore.

> The second solution to the single-file-problem is to generate no
> output at all, i.e. whatever you export from your source database you
> import directly into your target database, which in the end turns out
> to be a parallel form of "pg_dump | psql".

This is a very interesting idea but you might want to get the other thing merged first, as it's going to present a different set of issues.

> I am currently not planning to make parallel dumps work with the
> custom format even though this would be possible if we changed the
> format to a certain degree.

I'm thinking we probably don't want to change the existing formats.

...Robert
Robert Haas wrote:
> On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <joe@mcknight.de> wrote:
[...]
>> - Regarding the output of pg_dump I am proposing two solutions. The
>> first one is to introduce a new archive type "directory" where each
>> table and each blob is a file in a directory, similar to the
>> experimental "files" archive type. Also the idea has come up that you
>> should be able to specify multiple directories in order to make use of
>> several physical disk drives. Thinking this further, in order to
>> manage all the mess that you can create with this, every file of the
>> same backup needs to have a unique identifier and pg_restore should
>> have a check parameter that tells you if your backup directory is in a
>> sane and complete state (think about moving a file from one backup
>> directory to another one or trying to restore from two directories
>> which are from different backup sets...).
>
> I think that specifying several directories is a piece of complexity
> that would be best left alone for a first version of this. But a
> single directory with multiple files sounds pretty reasonable. Of
> course we'll also need to support that format in non-parallel mode,
> and in pg_restore.
>
>> The second solution to the single-file-problem is to generate no
>> output at all, i.e. whatever you export from your source database you
>> import directly into your target database, which in the end turns out
>> to be a parallel form of "pg_dump | psql".
>
> This is a very interesting idea but you might want to get the other
> thing merged first, as it's going to present a different set of
> issues.

I had some prior discussion with Joachim (and I suspect I had some influence in him trying to implement that) on that. The reason why this is really needed is that the current pg_restore -j is actually a net loss (vs "pg_dump | psql") in a LOT of scenarios that are basically "duplicate this database to that location" (or any migration really).

The example at hand is a 240GB production database with around 850 tables. It takes ~145min to dump that database single-threaded (completely CPU bound); simply loading the SQL using psql can restore it in ~150min (again CPU bound, both for COPY and index creation), and -j8 brings that down to ~55min. So if you do the math (and a bit of handwaving):

* using pg_dump | psql you get greatest(145,150) -> 150min
* using pg_dump -Z0 -Fc && pg_restore -j8 you get 145+55 -> 200min
* using a theoretical parallel pg_dump and the existing parallel restore you would get: 50 (just a guess for how fast it might be) + 55 -> 105min
* a parallel dump & restore that can pipeline would end up at greatest(50,55) -> 55min

So a parallel dump alone would only give you a 50% speedup in total time for doing a migration/upgrade/dump-to-devbox, despite the fact that it uses 8x the resources. A pipelined solution would result in a ~3x speedup in total time, and you don't even have to think about problems like having enough disk space on the source/destination to hold a full temporary dump (if you don't, you might even have to add some transfer time as well).

Stefan
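[Editor's note] Stefan's back-of-the-envelope arithmetic, written out. The 145/150/55 figures are his measurements; the 50min parallel-dump figure is explicitly his guess, not a measurement:

```python
# Times in minutes, from the 240GB / 850-table example above.
dump_serial, restore_psql = 145, 150
dump_parallel, restore_j8 = 50, 55   # dump_parallel is a guess

scenarios = {
    # pipelined: dump and restore overlap, so the slower side dominates
    "pg_dump | psql (pipelined)":      max(dump_serial, restore_psql),
    # sequential: dump to a temp archive, then restore it
    "dump -Fc -Z0, then restore -j8":  dump_serial + restore_j8,
    "parallel dump, then restore -j8": dump_parallel + restore_j8,
    # hypothetical pipelined parallel dump/restore
    "parallel dump piped to restore":  max(dump_parallel, restore_j8),
}
```

This makes the argument concrete: parallel dump alone only cuts 150min to 105min, while pipelining the two parallel halves gets to 55min and needs no temporary disk space.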
On Mon, Mar 29, 2010 at 1:16 PM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
> Robert Haas wrote:
>> On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <joe@mcknight.de> wrote:
> [...]
>>> - Regarding the output of pg_dump I am proposing two solutions. The
>>> first one is to introduce a new archive type "directory" where each
>>> table and each blob is a file in a directory, similar to the
>>> experimental "files" archive type. Also the idea has come up that you
>>> should be able to specify multiple directories in order to make use of
>>> several physical disk drives. Thinking this further, in order to
>>> manage all the mess that you can create with this, every file of the
>>> same backup needs to have a unique identifier and pg_restore should
>>> have a check parameter that tells you if your backup directory is in a
>>> sane and complete state (think about moving a file from one backup
>>> directory to another one or trying to restore from two directories
>>> which are from different backup sets...).
>>
>> I think that specifying several directories is a piece of complexity
>> that would be best left alone for a first version of this. But a
>> single directory with multiple files sounds pretty reasonable. Of
>> course we'll also need to support that format in non-parallel mode,
>> and in pg_restore.
>>
>>> The second solution to the single-file-problem is to generate no
>>> output at all, i.e. whatever you export from your source database you
>>> import directly into your target database, which in the end turns out
>>> to be a parallel form of "pg_dump | psql".
>>
>> This is a very interesting idea but you might want to get the other
>> thing merged first, as it's going to present a different set of
>> issues.
>
> I had some prior discussion with Joachim (and I suspect I had some
> influence in him trying to implement that) on that.
> The reason why this is really needed is that the current pg_restore -j
> is actually a net loss (vs "pg_dump | psql") in a LOT of scenarios that
> are basically "duplicate this database to that location" (or any
> migration really).
> The example at hand is a 240GB production database with around 850
> tables. It takes ~145min to dump that database single-threaded
> (completely CPU bound); simply loading the SQL using psql can restore it
> in ~150min (again CPU bound, both for COPY and index creation), and -j8
> brings that down to ~55min. So if you do the math (and a bit of
> handwaving):
>
> * using pg_dump | psql you get greatest(145,150) -> 150min
> * using pg_dump -Z0 -Fc && pg_restore -j8 you get 145+55 -> 200min
> * using a theoretical parallel pg_dump and the existing parallel restore
> you would get: 50 (just a guess for how fast it might be) + 55 -> 105min
> * a parallel dump & restore that can pipeline would end up at
> greatest(50,55) -> 55min
>
> So a parallel dump alone would only give you a 50% speedup in total time
> for doing a migration/upgrade/dump-to-devbox, despite the fact that it
> uses 8x the resources. A pipelined solution would result in a ~3x
> speedup in total time, and you don't even have to think about problems
> like having enough disk space on the source/destination to hold a full
> temporary dump (if you don't, you might even have to add some transfer
> time as well).

It's a great idea - but there are two features here. I've seen many patches implementing two features during my relatively short time with the project, and if the rejection rate hasn't been 100% it's certainly been close. If Joachim thinks he's got it all working, by all means submit both patches. One can apply over the other if they are interdependent. But I STRONGLY suggest separating this into two pieces - it is MUCH easier to get things applied that way, for good and valid reasons.

...Robert
On 3/29/10 7:46 AM, Joachim Wieland wrote:
> I actually assume that whenever people are interested
> in a very fast dump, it is because they are doing some maintenance
> task (like migrating to a different server) that involves pg_dump. In
> these cases, they would stop their system anyway.

Actually, I'd say that there's a broad set of cases of people who want to do a parallel pg_dump while their system is active. Parallel pg_dump on a stopped system will help some people (for migration, particularly), but parallel pg_dump with snapshot cloning will help a lot more people.

For example, imagine a user who has a 16-core machine on a 14-drive RAID 10, and a 100-table 1TB database. At 2am, this person might reasonably want to allocate a large portion of the machine resources to the dump by giving it 4 threads, without cutting access to the application.

So: if parallel dump in single-user mode is what you can get done, then do it. We can always improve it later, and we have to start somewhere. But we will eventually need parallel pg_dump on active systems, and that should remain on the TODO list.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> On 3/29/10 7:46 AM, Joachim Wieland wrote:
>> I actually assume that whenever people are interested
>> in a very fast dump, it is because they are doing some maintenance
>> task (like migrating to a different server) that involves pg_dump. In
>> these cases, they would stop their system anyway.
>
> Actually, I'd say that there's a broad set of cases of people who want
> to do a parallel pg_dump while their system is active. Parallel pg_dump
> on a stopped system will help some people (for migration, particularly)
> but parallel pg_dump with snapshot cloning will help a lot more people.

I doubt that. My thought about it is that parallel dump will suck enough resources from the source server, both disk and CPU, that you would never want to use it on a live production machine. Not even at 2am. And your proposed use case is hardly a "broad set" in any case. Thus, Joachim's approach seems perfectly sane from here. I certainly don't see that there's an argument for spending 10x more development effort to pick up such use cases.

Another question that's worth asking is exactly what the use case would be for parallel pg_dump against a live server, whether the snapshots are synchronized or not. You will not be able to use that dump as a basis for PITR, so there is no practical way of incorporating any changes that occur after the dump begins. So what are you making it for? If it's a routine backup for disaster recovery, fine, but it's not apparent why you want max speed and to heck with live performance for that purpose. I think migration to a new server version (that's too incompatible for PITR or pg_migrate migration) is really the only likely use case.

regards, tom lane
On Mon, Mar 29, 2010 at 4:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> On 3/29/10 7:46 AM, Joachim Wieland wrote:
>>> I actually assume that whenever people are interested
>>> in a very fast dump, it is because they are doing some maintenance
>>> task (like migrating to a different server) that involves pg_dump. In
>>> these cases, they would stop their system anyway.
>
>> Actually, I'd say that there's a broad set of cases of people who want
>> to do a parallel pg_dump while their system is active. Parallel pg_dump
>> on a stopped system will help some people (for migration, particularly)
>> but parallel pg_dump with snapshot cloning will help a lot more people.
>
> I doubt that. My thought about it is that parallel dump will suck
> enough resources from the source server, both disk and CPU, that you
> would never want to use it on a live production machine. Not even at
> 2am. And your proposed use case is hardly a "broad set" in any case.
> Thus, Joachim's approach seems perfectly sane from here. I certainly
> don't see that there's an argument for spending 10x more development
> effort to pick up such use cases.
>
> Another question that's worth asking is exactly what the use case would
> be for parallel pg_dump against a live server, whether the snapshots are
> synchronized or not. You will not be able to use that dump as a basis
> for PITR, so there is no practical way of incorporating any changes that
> occur after the dump begins. So what are you making it for? If it's a
> routine backup for disaster recovery, fine, but it's not apparent why
> you want max speed and to heck with live performance for that purpose.
> I think migration to a new server version (that's too incompatible for
> PITR or pg_migrate migration) is really the only likely use case.

It's completely possible that you could want to clone a server for dev and have more CPU and I/O bandwidth available than can be efficiently used by a non-parallel pg_dump. But certainly what Joachim is talking about will be a good start. I think there is merit to the synchronized snapshot stuff for pg_dump and perhaps other applications as well, but I think Joachim's (well-taken) point is that we don't have to treat it as a hard prerequisite.

...Robert
Robert Haas wrote:
> It's completely possible that you could want to clone a server for dev
> and have more CPU and I/O bandwidth available than can be efficiently
> used by a non-parallel pg_dump. But certainly what Joachim is talking
> about will be a good start. I think there is merit to the
> synchronized snapshot stuff for pg_dump and perhaps other applications
> as well, but I think Joachim's (well-taken) point is that we don't
> have to treat it as a hard prerequisite.

Possibly. I think the most useful thing that could be done right now is probably the least controversial, namely creating a directory type of archive, with support for pg_restore, including parallel pg_restore. Personally I think that's worth doing in its own right anyway.

cheers

andrew
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> On 3/29/10 7:46 AM, Joachim Wieland wrote:
>>> I actually assume that whenever people are interested
>>> in a very fast dump, it is because they are doing some maintenance
>>> task (like migrating to a different server) that involves pg_dump. In
>>> these cases, they would stop their system anyway.
>
>> Actually, I'd say that there's a broad set of cases of people who want
>> to do a parallel pg_dump while their system is active. Parallel pg_dump
>> on a stopped system will help some people (for migration, particularly)
>> but parallel pg_dump with snapshot cloning will help a lot more people.
>
> I doubt that. My thought about it is that parallel dump will suck
> enough resources from the source server, both disk and CPU, that you
> would never want to use it on a live production machine. Not even at
> 2am. And your proposed use case is hardly a "broad set" in any case.
> Thus, Joachim's approach seems perfectly sane from here. I certainly
> don't see that there's an argument for spending 10x more development
> effort to pick up such use cases.
>
> Another question that's worth asking is exactly what the use case would
> be for parallel pg_dump against a live server, whether the snapshots are
> synchronized or not. You will not be able to use that dump as a basis
> for PITR, so there is no practical way of incorporating any changes that
> occur after the dump begins. So what are you making it for? If it's a
> routine backup for disaster recovery, fine, but it's not apparent why
> you want max speed and to heck with live performance for that purpose.
> I think migration to a new server version (that's too incompatible for
> PITR or pg_migrate migration) is really the only likely use case.

I really doubt that - on fast systems pg_dump is completely CPU bottlenecked, and the typical 1-2U hardware you get these days has 8-16 cores, so simply dedicating a few cores to dumping the database during quieter times is very realistic. Databases are growing larger and larger, and the single-threaded nature of pg_dump makes it very hard to even stay within reasonable time limits for doing the backup.

Stefan
On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
> on fast systems pg_dump is completely CPU bottlenecked

Might be useful to profile why that is. I don't think pg_dump has historically been developed with CPU efficiency in mind.
On Tue, 30 Mar 2010 13:01:54 +0200, Peter Eisentraut <peter_e@gmx.net> wrote:
> On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
>> on fast systems pg_dump is completely CPU bottlenecked
>
> Might be useful to profile why that is. I don't think pg_dump has
> historically been developed with CPU efficiency in mind.

Already done that - I had posted some WIP patches to speed up COPY; hopefully I'll have time to finish those one day ;) Most of the time spent in the postmaster process during "COPY TO" is in the datum -> string functions.
Peter Eisentraut wrote:
> On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
>> on fast systems pg_dump is completely CPU bottlenecked
>
> Might be useful to profile why that is. I don't think pg_dump has
> historically been developed with CPU efficiency in mind.

It's not pg_dump that is the problem - it is COPY that is the limit. In my specific case, also the fact that a lot of the columns are bytea adds to the horrible CPU overhead (fixed in 9.0). Still, our bulk load & unload performance is way slower on a per-core comparison than a lot of other databases :(

Stefan
On Mar 30, 2010, at 8:15 AM, Stefan Kaltenbrunner wrote:
> Peter Eisentraut wrote:
>> On tis, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
>>> on fast systems pg_dump is completely CPU bottlenecked
>> Might be useful to profile why that is. I don't think pg_dump has
>> historically been developed with CPU efficiency in mind.
>
> It's not pg_dump that is the problem - it is COPY that is the limit.
> In my specific case, also the fact that a lot of the columns are bytea
> adds to the horrible CPU overhead (fixed in 9.0). Still, our bulk
> load & unload performance is way slower on a per-core comparison
> than a lot of other databases :(

Don't forget the zlib compression used in -Fc (unless you use -Z0) takes a fair amount of CPU too. I did some tests and it turned out that -Z0 actually took longer than -Z1, simply because there was a lot more data to write out; thus I became I/O bound, not CPU bound.

There's a thing called pigz around that is a parallel gzip implementation - I wonder how much of it could be adapted to pg_dump's use, as compression does use a considerable amount of time (even at -Z1). The biggest problem I can immediately see is that it uses threads.

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/
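[Editor's note] Jeff's -Z0 vs -Z1 observation is easy to reproduce with zlib directly: level 0 merely wraps the data in stored deflate blocks (slightly larger than the input), while even level 1 shrinks repetitive dump data dramatically, so the extra bytes written at -Z0 can cost more in I/O than level 1 costs in CPU:

```python
# Compare zlib level 0 (no compression) with level 1 (fastest) on
# dump-like repetitive input. Sample data is invented for illustration.
import zlib

data = b"INSERT INTO t VALUES (1, 'abcabcabc');\n" * 10_000

z0 = zlib.compress(data, 0)   # stored blocks only: input size plus framing
z1 = zlib.compress(data, 1)   # fastest real compression

assert len(z0) >= len(data)        # level 0 adds a few bytes of overhead
assert len(z1) < len(data) // 10   # repetitive data shrinks by far more than 10x
```

Whether -Z1 wins in wall-clock time of course depends on the disk/CPU balance of the machine, which is exactly the trade-off being discussed here.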
Jeff wrote:
> On Mar 30, 2010, at 8:15 AM, Stefan Kaltenbrunner wrote:
>> It's not pg_dump that is the problem - it is COPY that is the limit.
>> In my specific case, also the fact that a lot of the columns are bytea
>> adds to the horrible CPU overhead (fixed in 9.0). Still, our bulk load
>> & unload performance is way slower on a per-core comparison
>> than a lot of other databases :(
>
> Don't forget the zlib compression used in -Fc (unless you use -Z0) takes
> a fair amount of CPU too.
> I did some tests and it turned out that -Z0 actually took longer than
> -Z1, simply because there was a lot more data to write out; thus I became
> I/O bound, not CPU bound.
>
> There's a thing called pigz around that is a parallel gzip
> implementation - I wonder how much of it could be adapted to pg_dump's
> use, as compression does use a considerable amount of time (even at
> -Z1). The biggest problem I can immediately see is that it uses threads.

All my numbers are with -Z0, and it is the backend (COPY and/or index creation) that is the limit. If you start using compression you are shifting the load to pg_dump.

Stefan