Discussion: Determining size of a database before dumping
Hi all,

I am (re)writing a backup program and I want to add a section for backing up PostgreSQL DBs. In the planning steps (making sure a given destination has enough space) I try to calculate how much space will be needed by a 'pg_dump' run *before* actually dumping it.

Is there a relatively easy way to do that? More to the point, is it possible to do this from an unprivileged account? If not, is there a way to grant a specific pg user the permissions to allow that user to perform this?

Thanks in advance!

Madison
Madison Kelly wrote:
> Hi all,
>
> I am (re)writing a backup program and I want to add a section for
> backing up PostgreSQL DBs. In the planning steps (making sure a given
> destination has enough space) I try to calculate how much space will be
> needed by a 'pg_dump' run *before* actually dumping it.

I suppose:

    pg_dump $PGD_OPTIONS | wc -c

isn't efficient enough, right? Without knowing the options you plan to use with pg_dump (compression? dump only certain tables? etc.) this is going to be hard to get a decent estimate for...

--
Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.
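Spelled out as a quick shell sketch (the database name 'mydb' and the plain-text format are placeholders, not anything from Steve's mail):

    # Stream the dump straight into a byte counter; nothing lands on disk,
    # but it still costs a full dump's worth of read I/O and CPU.
    dump_bytes=$(pg_dump --format=plain mydb | wc -c)
    echo "pg_dump would produce ${dump_bytes} bytes"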
On Oct 2, 2006, at 22:17, Madison Kelly wrote:
> I am (re)writing a backup program and I want to add a section for
> backing up PostgreSQL DBs. In the planning steps (making sure a given
> destination has enough space) I try to calculate how much space
> will be needed by a 'pg_dump' run *before* actually dumping it.
>
> Is there a relatively easy way to do that? More to the point, is it
> possible to do this from an unprivileged account? If not, is there a
> way to grant a specific pg user the permissions to allow that user to
> perform this?

You could dump the database to /dev/null, piping it through wc to capture the size, but that would of course be wasteful.

You could count the disk space usage of the actual stored tuples, though this will necessarily be inexact:

http://www.postgresql.org/docs/8.1/static/diskusage.html

Or you could count the size of the physical database files (/var/lib/postgresql or wherever). While these would be estimates, you could at least guarantee that the dump would not *exceed* the estimate.

Keep in mind that pg_dump can compress the dump and (IIRC) will do so by default when you use the custom format (-Fc or --format=c).

Alexander.
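As a sketch of the approach the linked page describes, the built-in size functions (if I recall correctly, both arrived in 8.1) give the on-disk footprint in one query; 'mydb' is a placeholder:

    # Total on-disk size of the database, including indexes and TOAST.
    psql -At -c "SELECT pg_database_size('mydb')" mydb

    # The same figure in human-readable form.
    psql -At -c "SELECT pg_size_pretty(pg_database_size('mydb'))" mydb

This measures the files on disk, not what pg_dump will emit, so treat it as a rough estimate of data volume rather than a prediction of dump size.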
Steve Wampler wrote:
> Madison Kelly wrote:
>> Hi all,
>>
>> I am (re)writing a backup program and I want to add a section for
>> backing up PostgreSQL DBs. In the planning steps (making sure a given
>> destination has enough space) I try to calculate how much space will be
>> needed by a 'pg_dump' run *before* actually dumping it.
>
> I suppose:
>
>     pg_dump $PGD_OPTIONS | wc -c
>
> isn't efficient enough, right? Without knowing the options you plan
> to use with pg_dump (compression? dump only certain tables? etc.) this
> is going to be hard to get a decent estimate for...

For now, let's assume I am doing a raw dump (no compression) and no fancy switches. I would probably err on the side of caution and try dumping OIDs and the full schema (plus whatever else is needed to ensure a full restore to a clean DB).

I could try piping the dump into something like 'wc', but with very large DBs I'd be worried about the (tremendous) disk I/O that would cause. This is also why I am hoping Pg keeps this info somewhere.

Madison
Alexander Staubo wrote:
> On Oct 2, 2006, at 22:17, Madison Kelly wrote:
>
>> I am (re)writing a backup program and I want to add a section for
>> backing up PostgreSQL DBs. In the planning steps (making sure a given
>> destination has enough space) I try to calculate how much space will
>> be needed by a 'pg_dump' run *before* actually dumping it.
>>
>> Is there a relatively easy way to do that? More to the point, is it
>> possible to do this from an unprivileged account? If not, is there a
>> way to grant a specific pg user the permissions to allow that user to
>> perform this?
>
> You could dump the database to /dev/null, piping it through wc to capture
> the size, but that would of course be wasteful.
>
> You could count the disk space usage of the actual stored tuples, though
> this will necessarily be inexact:
>
> http://www.postgresql.org/docs/8.1/static/diskusage.html
>
> Or you could count the size of the physical database files
> (/var/lib/postgresql or wherever). While these would be estimates, you
> could at least guarantee that the dump would not *exceed* the estimate.
>
> Keep in mind that pg_dump can compress the dump and (IIRC) will do so by
> default when you use the custom format (-Fc or --format=c).
>
> Alexander.

Heh, that looks like just the article I would have wanted if I had properly RTFM. :D Many thanks!!

Madi
Alexander Staubo <alex@purefiction.net> writes:
> You could count the disk space usage of the actual stored tuples,
> though this will necessarily be inexact:
> http://www.postgresql.org/docs/8.1/static/diskusage.html
> Or you could count the size of the physical database files
> (/var/lib/postgresql or wherever). While these would be estimates, you
> could at least guarantee that the dump would not *exceed* the estimate.

You could guarantee no such thing; consider compression of TOAST values. Even for uncompressed data, datatypes such as int and float can easily print as more bytes than they occupy on-disk.

Given all the non-data overhead involved (e.g. for indexes), it's probably unlikely that a text dump would exceed the "du" size of the database, but it's far from "guaranteed".

regards, tom lane
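Tom's expansion point is easy to demonstrate with pg_column_size() (another function that, if memory serves, appeared in 8.1): the worst-case int4 occupies 4 bytes on disk but 11 characters in a plain-text dump:

    # 4 bytes on disk vs. the 11 characters of '-2147483648' as text.
    psql -At -c "SELECT pg_column_size((-2147483648)::int4),
                        length(((-2147483648)::int4)::text)"
    # prints: 4|11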
On Oct 2, 2006, at 23:19, Tom Lane wrote:
> Alexander Staubo <alex@purefiction.net> writes:
>> You could count the disk space usage of the actual stored tuples,
>> though this will necessarily be inexact:
>> http://www.postgresql.org/docs/8.1/static/diskusage.html
>> Or you could count the size of the physical database files
>> (/var/lib/postgresql or wherever). While these would be estimates, you
>> could at least guarantee that the dump would not *exceed* the estimate.
>
> You could guarantee no such thing; consider compression of TOAST values.
> Even for uncompressed data, datatypes such as int and float can easily
> print as more bytes than they occupy on-disk.

Why does pg_dump serialize data less efficiently than PostgreSQL does when using the "custom" format? (Pg_dump arguably has greater freedom in being able to apply space-saving optimizations to the output format. For example, one could use table statistics to selectively apply something like Rice coding for numeric data, or vertically decompose the tuples and emit sorted vectors using delta compression.) As for TOAST, shouldn't pg_dump's compression compress just as well, or better?

Alexander.
On 10/02/06 16:19, Tom Lane wrote:
> Alexander Staubo <alex@purefiction.net> writes:
>> You could count the disk space usage of the actual stored tuples,
>> though this will necessarily be inexact:
>> http://www.postgresql.org/docs/8.1/static/diskusage.html
>> Or you could count the size of the physical database files
>> (/var/lib/postgresql or wherever). While these would be estimates, you
>> could at least guarantee that the dump would not *exceed* the estimate.
>
> You could guarantee no such thing; consider compression of TOAST values.
> Even for uncompressed data, datatypes such as int and float can easily
> print as more bytes than they occupy on-disk.
>
> Given all the non-data overhead involved (e.g. for indexes), it's probably
> unlikely that a text dump would exceed the "du" size of the database,
> but it's far from "guaranteed".

It's my experience that when there are lots of numeric fields, fixed-width text records are approximately 2.5x larger than the original binary records.

--
Ron Johnson, Jr.
Jefferson LA USA
On Tue, 2006-10-03 at 00:42 +0200, Alexander Staubo wrote:
> Why does pg_dump serialize data less efficiently than PostgreSQL does
> when using the "custom" format? (Pg_dump arguably has greater freedom in
> being able to apply space-saving optimizations to the output format.
> For example, one could use table statistics to selectively apply
> something like Rice coding for numeric data, or vertically decompose
> the tuples and emit sorted vectors using delta compression.) As for
> TOAST, shouldn't pg_dump's compression compress just as well, or
> better?

It would be a strange set of data that had a larger representation as a compressed pg_dump than the data directory itself. However, one could imagine a contrived case where that might happen. Let's say you had a single table with 10,000 columns of type INT4 and 100M records, all with random numbers in the columns. I don't think standard gzip compression will compress random INT4s down to 32 bits; see the sketch below.

Another example is NULLs. What if only a few of those records had non-NULL values? If I understand correctly, PostgreSQL represents those NULLs with just one bit each.

What you're saying is more theoretical. If pg_dump used specialized compression based on the data types of the columns, and everything was optimal, you're correct: there's no situation in which the dump *must* be bigger. But since there is no practical demand for such compression, and it would be a lot of work, there is no *guarantee* that the dump will be smaller than the data directory. In practice, however, it almost always is.

Regards,
Jeff Davis
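Jeff's gzip intuition can be spot-checked with a throwaway shell experiment (a sketch only; the row count and file name are arbitrary):

    # A million random int4s: 4,000,000 bytes of raw binary payload.
    psql -At -c "SELECT (random() * 4294967295 - 2147483648)::int4
                 FROM generate_series(1, 1000000)" > ints.txt

    # Expect the gzipped text form to come out above 4,000,000 bytes:
    # random decimal digits leave gzip little redundancy to exploit.
    gzip -9 < ints.txt | wc -c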
Jeff Davis <pgsql@j-davis.com> writes:
> On Tue, 2006-10-03 at 00:42 +0200, Alexander Staubo wrote:
>> Why does pg_dump serialize data less efficiently than PostgreSQL does
>> when using the "custom" format?
>
> What you're saying is more theoretical. If pg_dump used specialized
> compression based on the data types of the columns, and everything was
> optimal, you're correct: there's no situation in which the dump *must*
> be bigger. But since there is no practical demand for such compression,
> and it would be a lot of work ...

There are several reasons for not being overly tense about the pg_dump format:

* We don't have infinite manpower.

* Cross-version and cross-platform portability of the dump files is critical.

* The more complicated it is, the more chance for bugs, which you'd possibly not notice until you *really needed* that dump.

In practice, pushing the data through gzip gets most of the potential win, for a very small fraction of the effort it would take to build a smart custom compression mechanism.

regards, tom lane