Discussion: Parallel pg_dump for 9.1

Parallel pg_dump for 9.1

From: Joachim Wieland
Date:
People have been talking about a parallel version of pg_dump a few
times already. I have been working on some proof-of-concept code for
this feature every now and then and I am planning to contribute this
for 9.1.

There are two main issues with a parallel version of pg_dump:

The first one is that it requires a consistent snapshot among multiple
pg_dump clients and the second is that currently the output goes to a
single file and it is unclear what to do about multiple processes
writing into a single file.

- There are ideas on how to solve the issue with the consistent
snapshot but in the end you can always solve it by stopping your
application(s). I actually assume that whenever people are interested
in a very fast dump, it is because they are doing some maintenance
task (like migrating to a different server) that involves pg_dump. In
these cases, they would stop their system anyway.
Even if we had consistent snapshots in a future version, would we
forbid people to run parallel dumps against old server versions? What
I suggest is to just display a big warning if run against a server
without consistent snapshot support (which currently is every
version).
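
Just to illustrate what "consistent snapshot support" would have to
provide, here is a rough sketch of the SQL a coordinating connection and
its workers could exchange. The pg_export_snapshot() helper, the snapshot
id and the database/table names are all assumptions for illustration; no
released server offers anything like this today:

# Coordinator: open a transaction and publish its snapshot id. This
# connection has to stay open while the workers run (separate psql calls
# are shown here only to spell out the SQL involved).
psql mydb <<'SQL'
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- assumed helper, returns an id like '00000003-0000001B-1'
SQL

# Each worker adopts that id before reading any data, so that all workers
# see exactly the same database state:
psql mydb <<'SQL'
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';   -- id handed over by the coordinator
COPY some_table TO STDOUT;
COMMIT;
SQL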

- Regarding the output of pg_dump I am proposing two solutions. The
first one is to introduce a new archive type "directory" where each
table and each blob is a file in a directory, similar to the
experimental "files" archive type. Also the idea has come up that you
should be able to specify multiple directories in order to make use of
several physical disk drives. Thinking this further, in order to
manage all the mess that you can create with this, every file of the
same backup needs to have a unique identifier and pg_restore should
have a check parameter that tells you if your backup directory is in a
sane and complete state (think about moving a file from one backup
directory to another one or trying to restore from two directories
which are from different backup sets...).
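
To make this more concrete, an invocation could look roughly like the
following. All option spellings are made up for illustration and none of
them exist in any released version:

# Dump every table and blob into its own file inside a directory,
# using four worker processes:
pg_dump -Fd -f /backups/mydb.dir -j 4 mydb

# Proposed sanity check: verify that the directory holds one complete,
# matching backup set before restoring from it:
pg_restore --check /backups/mydb.dir

# Restore from the directory, again in parallel:
pg_restore -j 4 -d mydb_copy /backups/mydb.dir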

The second solution to the single-file-problem is to generate no
output at all, i.e. whatever you export from your source database you
import directly into your target database, which in the end turns out
to be a parallel form of "pg_dump | psql".

In fact, technically this is rather a parallel pg_restore than a
pg_dump as you need to respect the dependencies between objects. The
good news is that with the parallel pg_restore of the custom archive
format we have everything in place already for this dependency
checking. The addition is a new archive type that dumps (just-in-time)
whatever the dependency-algorithm decides to restore next.

This is probably the fastest way that we can copy or upgrade a
database when pg_migrator cannot be used (for example when you migrate
to a different hardware architecture).
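
Put in command-line terms, this would be a parallel generalization of the
classic one-liner; the --target spelling below is purely hypothetical and
only meant to illustrate the idea:

# The non-parallel form that this would generalize:
pg_dump -h sourcehost mydb | psql -h targethost -d mydb

# A parallel direct transfer might then be invoked along these lines,
# with each worker streaming its COPY data straight into the target:
pg_dump -j 8 --target="host=targethost dbname=mydb" mydb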

As said, I have some working code for the features described (Unix only).
If anybody would like to give it a try already now, just let me know; I'd
be happy to get some early test reports, and you could check what speedup
to expect. But before I continue, I'd like to have a discussion about what
it is that people actually want and what the best way forward is.

I am currently not planning to make parallel dumps work with the
custom format even though this would be possible if we changed the
format to a certain degree.

Comments?


Joachim


Re: Parallel pg_dump for 9.1

From: David Fetter
Date:
On Mon, Mar 29, 2010 at 04:46:48PM +0200, Joachim Wieland wrote:
> People have been talking about a parallel version of pg_dump a few
> times already. I have been working on some proof-of-concept code for
> this feature every now and then and I am planning to contribute this
> for 9.1.
> 
> There are two main issues with a parallel version of pg_dump:
> 
> The first one is that it requires a consistent snapshot among
> multiple pg_dump clients

Cloning snapshots seems like the way to fix this.  I don't know how
far this project has drifted from the PostgreSQL code, but you might
want to look here:

http://pgfoundry.org/projects/snapclone/

> and the second is that currently the output goes to a single file
> and it is unclear what to do about multiple processes writing into a
> single file.

I don't think that's a good idea.  Coming up with a directory
structure for the new parallel pg_dump seems like a much better idea.

Andrew, do you have some notes on this?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Parallel pg_dump for 9.1

From: Robert Haas
Date:
On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <joe@mcknight.de> wrote:
> - There are ideas on how to solve the issue with the consistent
> snapshot but in the end you can always solve it by stopping your
> application(s). I actually assume that whenever people are interested
> in a very fast dump, it is because they are doing some maintenance
> task (like migrating to a different server) that involves pg_dump. In
> these cases, they would stop their system anyway.
> Even if we had consistent snapshots in a future version, would we
> forbid people to run parallel dumps against old server versions? What
> I suggest is to just display a big warning if run against a server
> without consistent snapshot support (which currently is every
> version).

Seems reasonable.

> - Regarding the output of pg_dump I am proposing two solutions. The
> first one is to introduce a new archive type "directory" where each
> table and each blob is a file in a directory, similar to the
> experimental "files" archive type. Also the idea has come up that you
> should be able to specify multiple directories in order to make use of
> several physical disk drives. Thinking this further, in order to
> manage all the mess that you can create with this, every file of the
> same backup needs to have a unique identifier and pg_restore should
> have a check parameter that tells you if your backup directory is in a
> sane and complete state (think about moving a file from one backup
> directory to another one or trying to restore from two directories
> which are from different backup sets...).

I think that specifying several directories is a piece of complexity
that would be best left alone for a first version of this.  But a
single directory with multiple files sounds pretty reasonable.  Of
course we'll also need to support that format in non-parallel mode,
and in pg_restore.

> The second solution to the single-file-problem is to generate no
> output at all, i.e. whatever you export from your source database you
> import directly into your target database, which in the end turns out
> to be a parallel form of "pg_dump | psql".

This is a very interesting idea but you might want to get the other
thing merged first, as it's going to present a different set of
issues.

> I am currently not planning to make parallel dumps work with the
> custom format even though this would be possible if we changed the
> format to a certain degree.

I'm thinking we probably don't want to change the existing formats.

...Robert


Re: Parallel pg_dump for 9.1

From: Stefan Kaltenbrunner
Date:
Robert Haas wrote:
> On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <joe@mcknight.de> wrote:
[...]
>> - Regarding the output of pg_dump I am proposing two solutions. The
>> first one is to introduce a new archive type "directory" where each
>> table and each blob is a file in a directory, similar to the
>> experimental "files" archive type. Also the idea has come up that you
>> should be able to specify multiple directories in order to make use of
>> several physical disk drives. Thinking this further, in order to
>> manage all the mess that you can create with this, every file of the
>> same backup needs to have a unique identifier and pg_restore should
>> have a check parameter that tells you if your backup directory is in a
>> sane and complete state (think about moving a file from one backup
>> directory to another one or trying to restore from two directories
>> which are from different backup sets...).
> 
> I think that specifying several directories is a piece of complexity
> that would be best left alone for a first version of this.  But a
> single directory with multiple files sounds pretty reasonable.  Of
> course we'll also need to support that format in non-parallel mode,
> and in pg_restore.
> 
>> The second solution to the single-file-problem is to generate no
>> output at all, i.e. whatever you export from your source database you
>> import directly into your target database, which in the end turns out
>> to be a parallel form of "pg_dump | psql".
> 
> This is a very interesting idea but you might want to get the other
> thing merged first, as it's going to present a different set of
> issues.

I had some prior discussion with Joachim (and I suspect I had some
influence in him trying to implement that) on that.
The reason why this is really needed is that the current pg_restore -j
is actually a net loss (vs "pg_dump | psql") in a LOT of scenarios that
are basically "duplicate this database to that location" (or any
migration really).
The example at hand is a 240GB production database with around 850
tables; it takes ~145min to dump that database single-threaded
(completely CPU bound), simply loading the SQL using psql can restore it
in ~150min (again CPU bound, both for COPY and index creation), and -j8
brings that down to ~55min.
So if you do the math (and a bit of handwaving):

* using pg_dump | psql you get greatest(140,150) -> 150min.
* using pg_dump -Z0 -Fc && pg_restore -j8 you get 145+55 -> 200min
* using a theoretical parallel pg_dump and the existing parallel restore
you would get: 50 (just a guess for how fast it might be) + 55 -> 105min
* a parallel dump & restore that can pipeline would end up at
greatest(50,55) -> 55min


So a parallel dump alone would only give you a ~50% speedup in total time
for doing a migration/upgrade/dump-to-devbox, despite the fact that it
uses 8x the resources. A pipelined solution would result in a ~3x speedup
in total time, and you don't even have to think about problems like
having enough disk space available on the source/destination to hold a
full temporary dump (and if you don't, you might have to add some
transfer time as well).



Stefan


Re: Parallel pg_dump for 9.1

From: Robert Haas
Date:
On Mon, Mar 29, 2010 at 1:16 PM, Stefan Kaltenbrunner
<stefan@kaltenbrunner.cc> wrote:
> Robert Haas wrote:
>>
>> On Mon, Mar 29, 2010 at 10:46 AM, Joachim Wieland <joe@mcknight.de> wrote:
>
> [...]
>>>
>>> - Regarding the output of pg_dump I am proposing two solutions. The
>>> first one is to introduce a new archive type "directory" where each
>>> table and each blob is a file in a directory, similar to the
>>> experimental "files" archive type. Also the idea has come up that you
>>> should be able to specify multiple directories in order to make use of
>>> several physical disk drives. Thinking this further, in order to
>>> manage all the mess that you can create with this, every file of the
>>> same backup needs to have a unique identifier and pg_restore should
>>> have a check parameter that tells you if your backup directory is in a
>>> sane and complete state (think about moving a file from one backup
>>> directory to another one or trying to restore from two directories
>>> which are from different backup sets...).
>>
>> I think that specifying several directories is a piece of complexity
>> that would be best left alone for a first version of this.  But a
>> single directory with multiple files sounds pretty reasonable.  Of
>> course we'll also need to support that format in non-parallel mode,
>> and in pg_restore.
>>
>>> The second solution to the single-file-problem is to generate no
>>> output at all, i.e. whatever you export from your source database you
>>> import directly into your target database, which in the end turns out
>>> to be a parallel form of "pg_dump | psql".
>>
>> This is a very interesting idea but you might want to get the other
>> thing merged first, as it's going to present a different set of
>> issues.
>
> I had some prior discussion with Joachim (and I suspect I had some influence
> in him trying to implement that) on that.
> The reason why this is really needed is that the current pg_restore -j is
> actually a net loss (vs "pg_dump | psql") in a LOT of scenarios that are
> basically "duplicate this database to that location" (or any migration
> really).
> The example at hand is a 240GB production database with around 850 tables;
> it takes ~145min to dump that database single-threaded (completely CPU
> bound), simply loading the SQL using psql can restore it in ~150min (again
> CPU bound, both for COPY and index creation), and -j8 brings that down to
> ~55min.
> So if you do the math (and a bit of handwaving):
>
> * using pg_dump | psql you get greatest(140,150) -> 150min.
> * using pg_dump -Z0 -Fc && pg_restore -j8 you get 145+55 -> 200min
> * using a theoretical parallel pg_dump and the existing parallel restore
> you would get: 50 (just a guess for how fast it might be) + 55 -> 105min
> * a parallel dump & restore that can pipeline would end up at
> greatest(50,55) -> 55min
>
>
> So a parallel dump alone would only give you a ~50% speedup in total time
> for doing a migration/upgrade/dump-to-devbox, despite the fact that it uses
> 8x the resources. A pipelined solution would result in a ~3x speedup in
> total time, and you don't even have to think about problems like having
> enough disk space available on the source/destination to hold a full
> temporary dump (and if you don't, you might have to add some transfer time
> as well).

It's a great idea - but there are two features here.  I've seen many
patches implementing two features during my relatively short time with
the project and if the rejection rate hasn't been 100% it's certainly
been close.  If Joachim thinks he's got it all working, by all means
submit both patches.  One can apply over the other if they are
interdependent.  But I STRONGLY suggest separating this into two
pieces - it is MUCH easier to get things applied that way, for good
and valid reasons.

...Robert


Re: Parallel pg_dump for 9.1

From: Josh Berkus
Date:
On 3/29/10 7:46 AM, Joachim Wieland wrote:
> I actually assume that whenever people are interested
> in a very fast dump, it is because they are doing some maintenance
> task (like migrating to a different server) that involves pg_dump. In
> these cases, they would stop their system anyway.

Actually, I'd say that there's a broad set of cases of people who want
to do a parallel pg_dump while their system is active.  Parallel pg_dump
on a stopped system will help some people (for migration, particularly)
but parallel pg_dump with snapshot cloning will help a lot more people.

For example, imagine a user who has a 16-core machine on a 14-drive RAID
10, and a 100-table 1TB database.  At 2am, this person might reasonably
want to allocate a large portion of the machine resources to the dump by
giving it 4 threads, without cutting access to the application.

So: if parallel dump in single-user mode is what you can get done, then
do it.  We can always improve it later, and we have to start somewhere.
But we will eventually need parallel pg_dump on active systems, and that
should remain on the TODO list.


--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


Re: Parallel pg_dump for 9.1

From: Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> On 3/29/10 7:46 AM, Joachim Wieland wrote:
>> I actually assume that whenever people are interested
>> in a very fast dump, it is because they are doing some maintenance
>> task (like migrating to a different server) that involves pg_dump. In
>> these cases, they would stop their system anyway.

> Actually, I'd say that there's a broad set of cases of people who want
> to do a parallel pg_dump while their system is active.  Parallel pg_dump
> on a stopped system will help some people (for migration, particularly)
> but parallel pg_dump with snapshot cloning will help a lot more people.

I doubt that.  My thought about it is that parallel dump will suck
enough resources from the source server, both disk and CPU, that you
would never want to use it on a live production machine.  Not even at
2am.  And your proposed use case is hardly a "broad set" in any case.
Thus, Joachim's approach seems perfectly sane from here.  I certainly
don't see that there's an argument for spending 10x more development
effort to pick up such use cases.

Another question that's worth asking is exactly what the use case would
be for parallel pg_dump against a live server, whether the snapshots are
synchronized or not.  You will not be able to use that dump as a basis
for PITR, so there is no practical way of incorporating any changes that
occur after the dump begins.  So what are you making it for?  If it's a
routine backup for disaster recovery, fine, but it's not apparent why
you want max speed and to heck with live performance for that purpose.
I think migration to a new server version (that's too incompatible for
PITR or pg_migrate migration) is really the only likely use case.
        regards, tom lane


Re: Parallel pg_dump for 9.1

From: Robert Haas
Date:
On Mon, Mar 29, 2010 at 4:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> On 3/29/10 7:46 AM, Joachim Wieland wrote:
>>> I actually assume that whenever people are interested
>>> in a very fast dump, it is because they are doing some maintenance
>>> task (like migrating to a different server) that involves pg_dump. In
>>> these cases, they would stop their system anyway.
>
>> Actually, I'd say that there's a broad set of cases of people who want
>> to do a parallel pg_dump while their system is active.  Parallel pg_dump
>> on a stopped system will help some people (for migration, particularly)
>> but parallel pg_dump with snapshot cloning will help a lot more people.
>
> I doubt that.  My thought about it is that parallel dump will suck
> enough resources from the source server, both disk and CPU, that you
> would never want to use it on a live production machine.  Not even at
> 2am.  And your proposed use case is hardly a "broad set" in any case.
> Thus, Joachim's approach seems perfectly sane from here.  I certainly
> don't see that there's an argument for spending 10x more development
> effort to pick up such use cases.
>
> Another question that's worth asking is exactly what the use case would
> be for parallel pg_dump against a live server, whether the snapshots are
> synchronized or not.  You will not be able to use that dump as a basis
> for PITR, so there is no practical way of incorporating any changes that
> occur after the dump begins.  So what are you making it for?  If it's a
> routine backup for disaster recovery, fine, but it's not apparent why
> you want max speed and to heck with live performance for that purpose.
> I think migration to a new server version (that's too incompatible for
> PITR or pg_migrate migration) is really the only likely use case.

It's completely possible that you could want to clone a server for dev
and have more CPU and I/O bandwidth available than can be efficiently
used by a non-parallel pg_dump.  But certainly what Joachim is talking
about will be a good start.  I think there is merit to the
synchronized snapshot stuff for pg_dump and perhaps other applications
as well, but I think Joachim's (well-taken) point is that we don't
have to treat it as a hard prerequisite.

...Robert


Re: Parallel pg_dump for 9.1

From: Andrew Dunstan
Date:

Robert Haas wrote:
>
> It's completely possible that you could want to clone a server for dev
> and have more CPU and I/O bandwidth available than can be efficiently
> used by a non-parallel pg_dump.  But certainly what Joachim is talking
> about will be a good start.  I think there is merit to the
> synchronized snapshot stuff for pg_dump and perhaps other applications
> as well, but I think Joachim's (well-taken) point is that we don't
> have to treat it as a hard prerequisite.
>
>
>   

Possibly. I think the most useful thing that could be done right now is 
probably the least controversial, namely creating a directory type of 
archive, with support for pg_restore, including parallel pg_restore.

Personally I think that's worth doing in its own right anyway.

cheers

andrew


Re: Parallel pg_dump for 9.1

From: Stefan Kaltenbrunner
Date:
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> On 3/29/10 7:46 AM, Joachim Wieland wrote:
>>> I actually assume that whenever people are interested
>>> in a very fast dump, it is because they are doing some maintenance
>>> task (like migrating to a different server) that involves pg_dump. In
>>> these cases, they would stop their system anyway.
> 
>> Actually, I'd say that there's a broad set of cases of people who want
>> to do a parallel pg_dump while their system is active.  Parallel pg_dump
>> on a stopped system will help some people (for migration, particularly)
>> but parallel pg_dump with snapshot cloning will help a lot more people.
> 
> I doubt that.  My thought about it is that parallel dump will suck
> enough resources from the source server, both disk and CPU, that you
> would never want to use it on a live production machine.  Not even at
> 2am.  And your proposed use case is hardly a "broad set" in any case.
> Thus, Joachim's approach seems perfectly sane from here.  I certainly
> don't see that there's an argument for spending 10x more development
> effort to pick up such use cases.
> 
> Another question that's worth asking is exactly what the use case would
> be for parallel pg_dump against a live server, whether the snapshots are
> synchronized or not.  You will not be able to use that dump as a basis
> for PITR, so there is no practical way of incorporating any changes that
> occur after the dump begins.  So what are you making it for?  If it's a
> routine backup for disaster recovery, fine, but it's not apparent why
> you want max speed and to heck with live performance for that purpose.
> I think migration to a new server version (that's too incompatible for
> PITR or pg_migrate migration) is really the only likely use case.

I really doubt that - on fast systems pg_dump is completely CPU
bottlenecked, and the typical 1-2U hardware you get these days has 8-16
cores, so simply dedicating a few cores to dumping the database during
quieter times is very realistic.
Databases are growing larger and larger, and the single-threaded nature
of pg_dump makes it very hard to even stay within reasonable time limits
for doing the backup.


Stefan


Re: Parallel pg_dump for 9.1

From: Peter Eisentraut
Date:
On Tue, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
> on fast systems pg_dump is completely CPU bottlenecked 

Might be useful to profile why that is.  I don't think pg_dump has
historically been developed with CPU efficiency in mind.



Re: Parallel pg_dump for 9.1

From: "Pierre C"
Date:
On Tue, 30 Mar 2010 13:01:54 +0200, Peter Eisentraut <peter_e@gmx.net>  
wrote:

> On Tue, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
>> on fast systems pg_dump is completely CPU bottlenecked
>
> Might be useful to profile why that is.  I don't think pg_dump has
> historically been developed with CPU efficiency in mind.

Already done that (I had posted some WIP patches to speed up COPY;
hopefully I'll have time to finish those one day ;)
Most of the time spent in the backend process during "COPY TO" is in
the datum -> string functions.
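
For anyone who wants to reproduce that kind of profile, a minimal recipe
would be something like the following (assuming Linux perf is available
and there is a suitably large test table; the table name and the pid are
placeholders):

# In one psql session, unload a big table server-side (as superuser), e.g.:
#   COPY big_table TO '/dev/null';
# Meanwhile, profile the backend serving that session for 30 seconds
# (look up its pid via pg_stat_activity first):
perf record -g -p <backend_pid> -- sleep 30
perf report   # per the observation above, the datum -> string output
              # functions should show up near the top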




Re: Parallel pg_dump for 9.1

From: Stefan Kaltenbrunner
Date:
Peter Eisentraut wrote:
> On Tue, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
>> on fast systems pg_dump is completely CPU bottlenecked 
> 
> Might be useful to profile why that is.  I don't think pg_dump has
> historically been developed with CPU efficiency in mind.

It's not pg_dump that is the problem - it is COPY that is the limit. In
my specific case, the fact that a lot of the columns are bytea also adds
to the horrible CPU overhead (fixed in 9.0). Still, our bulk load &
unload performance is way slower on a per-core comparison than a lot of
other databases :(


Stefan


Re: Parallel pg_dump for 9.1

From: Jeff
Date:
On Mar 30, 2010, at 8:15 AM, Stefan Kaltenbrunner wrote:

> Peter Eisentraut wrote:
>> On Tue, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
>>> on fast systems pg_dump is completely CPU bottlenecked
>> Might be useful to profile why that is.  I don't think pg_dump has
>> historically been developed with CPU efficiency in mind.
>
> It's not pg_dump that is the problem - it is COPY that is the limit.
> In my specific case, the fact that a lot of the columns are bytea also
> adds to the horrible CPU overhead (fixed in 9.0). Still, our bulk load
> & unload performance is way slower on a per-core comparison than a lot
> of other databases :(
>

Don't forget that the zlib compression used in -Fc (unless you use -Z0)
takes a fair amount of CPU too.
I did some tests and it turned out that -Z0 actually took longer than
-Z1, simply because there was a lot more data to write out, so I became
I/O bound rather than CPU bound.

There's a thing called pigz around that is a parallel gzip
implementation - I wonder how much of that could be adapted to pg_dump's
use, as compression does use a considerable amount of time (even at
-Z1).  The biggest problem I can immediately see is that it uses threads.
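
Until something like that exists inside pg_dump itself, one way to get
parallel compression today is to do it externally on a plain-format dump
(database names below are placeholders):

# Plain-format dump, compressed by pigz with 8 threads:
pg_dump -Fp mydb | pigz -p 8 > mydb.sql.gz

# ...and loading it back:
pigz -dc mydb.sql.gz | psql -d mydb_copy

The obvious tradeoff is that a plain-format dump gives up pg_restore's
selective and parallel restore, so this only helps on the dump side.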


--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/





Re: Parallel pg_dump for 9.1

From: Stefan Kaltenbrunner
Date:
Jeff wrote:
> 
> On Mar 30, 2010, at 8:15 AM, Stefan Kaltenbrunner wrote:
> 
>> Peter Eisentraut wrote:
>>> On Tue, 2010-03-30 at 08:39 +0200, Stefan Kaltenbrunner wrote:
>>>> on fast systems pg_dump is completely CPU bottlenecked
>>> Might be useful to profile why that is.  I don't think pg_dump has
>>> historically been developed with CPU efficiency in mind.
>>
>> It's not pg_dump that is the problem - it is COPY that is the limit.
>> In my specific case, the fact that a lot of the columns are bytea also
>> adds to the horrible CPU overhead (fixed in 9.0). Still, our bulk load
>> & unload performance is way slower on a per-core comparison than a lot
>> of other databases :(
>>
> 
> Don't forget that the zlib compression used in -Fc (unless you use -Z0)
> takes a fair amount of CPU too.
> I did some tests and it turned out that -Z0 actually took longer than -Z1,
> simply because there was a lot more data to write out, so I became I/O
> bound rather than CPU bound.
>
> There's a thing called pigz around that is a parallel gzip implementation -
> I wonder how much of that could be adapted to pg_dump's use, as compression
> does use a considerable amount of time (even at -Z1).  The biggest problem
> I can immediately see is that it uses threads.

all my numbers are with -Z0 and it is the backend (COPY and/or index 
creation) that is the limit. If you start using compression you are 
shifting the load to pg_dump.


Stefan