Discussion: Determining size of a database before dumping

Determining size of a database before dumping

From: Madison Kelly
Date:
Hi all,

   I am (re)writing a backup program and I want to add a section for
backing up PostgreSQL DBs. In the planning steps (making sure a given
destination has enough space) I try to calculate how much space will be
needed by a 'pg_dump' run *before* actually dumping it.

   Is there a relatively easy way to do that? More importantly, is it
possible to do this from an unprivileged account? If not, is there a way
to grant a specific pg user the permissions needed to do this?

   Thanks in advance!

Madison

Re: Determining size of a database before dumping

From: Steve Wampler
Date:
Madison Kelly wrote:
> Hi all,
>
>   I am (re)writing a backup program and I want to add a section for
> backing up PostgreSQL DBs. In the planning steps (making sure a given
> destination has enough space) I try to calculate how much space will be
> needed by a 'pg_dump' run *before* actually dumping it.

I suppose:

   pg_dump $PGD_OPTIONS | wc -c

isn't efficient enough, right?  Without knowing the options you plan
to use with pg_dump (compression? dumping just some tables? etc.), it's
going to be hard to come up with a decent estimate...
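
If that route is acceptable, a minimal sketch would be something like
this (untested; "mydb" and $PGD_OPTIONS are just placeholders):

   DUMP_BYTES=$(pg_dump $PGD_OPTIONS mydb | wc -c)
   echo "a plain-text dump of mydb would need about $DUMP_BYTES bytes"

though it does mean reading the entire database once just to get the
number.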

--
Steve Wampler -- swampler@noao.edu
The gods that smiled on your birth are now laughing out loud.

Re: Determining size of a database before dumping

From: Alexander Staubo
Date:
On Oct 2, 2006, at 22:17 , Madison Kelly wrote:

>   I am (re)writing a backup program and I want to add a section for
> backing up PostgreSQL DBs. In the planning steps (making sure a given
> destination has enough space) I try to calculate how much space
> will be needed by a 'pg_dump' run *before* actually dumping it.
>
>   Is there a relatively easy way to do that? More importantly, is it
> possible to do this from an unprivileged account? If not, is there a
> way to grant a specific pg user the permissions needed to do this?

You could dump the database to /dev/null, piping it through wc to
catch the size, but that would of course be wasteful.

You could count the disk space usage of the actual stored tuples,
though this will necessarily be inexact:

   http://www.postgresql.org/docs/8.1/static/diskusage.html

Or you could count the size of the physical database files (/var/lib/
postgresql or wherever). While these would be estimates, you could at
least guarantee that the dump would not *exceed* the estimate.
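
For instance (just a sketch; "mydb" is a placeholder and paths will
vary), the built-in dbsize functions available as of 8.1 give you that
kind of number from SQL, and du gives the file-level view if you can
read the data directory:

   # size of the database as PostgreSQL accounts for it (8.1+)
   psql -At -c "SELECT pg_database_size('mydb')" mydb
   psql -At -c "SELECT pg_size_pretty(pg_database_size('mydb'))" mydb

   # size of the files on disk, if you have filesystem access
   du -sh /var/lib/postgresql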

Keep in mind that pg_dump can compress the dump and (iirc) will do so
by default when you use the custom format (-Fc or --format=c).
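
For example (illustrative only; "mydb" is again a placeholder):

   pg_dump -Fc mydb > mydb.dump       # custom format, compressed by default
   pg_dump -Fc -Z9 mydb > mydb.dump   # same, forcing maximum compression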

Alexander.


Re: Determining size of a database before dumping

From: Madison Kelly
Date:
Steve Wampler wrote:
> Madison Kelly wrote:
>> Hi all,
>>
>>   I am (re)writing a backup program and I want to add a section for
>> backing up PostgreSQL DBs. In the planning steps (making sure a given
>> destination has enough space) I try to calculate how much space will be
>> needed by a 'pg_dump' run *before* actually dumping it.
>
> I suppose:
>
>    pg_dump $PGD_OPTIONS | wc -c
>
> isn't efficient enough, right?  Without knowing the options you plan
> to use with pg_dump (compression? dumping just some tables? etc.), it's
> going to be hard to come up with a decent estimate...
>

For now, let's assume I am doing a raw dump (no compression) and no fancy
switches. I would probably err on the side of caution and try dumping
OIDs and the full schema (plus whatever else is needed to ensure a full
restore to a clean DB).

I could try piping the dump into something like 'wc', but with very large
DBs I'd be worried about the (tremendous) disk I/O it would cause.
This is also why I am hoping Pg keeps this info somewhere.

Madison

Re: Determining size of a database before dumping

From: Madison Kelly
Date:
Alexander Staubo wrote:
> On Oct 2, 2006, at 22:17 , Madison Kelly wrote:
>
>>   I am (re)writing a backup program and I want to add a section for
>> backing up PostgreSQL DBs. In the planning steps (making sure a given
>> destination has enough space) I try to calculate how much space will
>> be needed by a 'pg_dump' run *before* actually dumping it.
>>
>>   Is there a relatively easy way to do that? More importantly, is it
>> possible to do this from an unprivileged account? If not, is there a
>> way to grant a specific pg user the permissions needed to do this?
>
> You could dump the database to /dev/null, piping it through wc to catch
> the size, but that would of course be wasteful.
>
> You could count the disk space usage of the actual stored tuples, though
> this will necessarily be inexact:
>
>   http://www.postgresql.org/docs/8.1/static/diskusage.html
>
> Or you could count the size of the physical database files
> (/var/lib/postgresql or wherever). While these would be estimates, you
> could at least guarantee that the dump would not *exceed* the estimate.
>
> Keep in mind that pg_dump can compress the dump and (iirc) will do so by
> default when you use the custom format (-Fc or --format=c).
>
> Alexander.

Heh, that looks like just the article I would have wanted if I had
properly RTFM'd. :D

Many thanks!!

Madi

Re: Determining size of a database before dumping

From: Tom Lane
Date:
Alexander Staubo <alex@purefiction.net> writes:
> You could count the disk space usage of the actual stored tuples,
> though this will necessarily be inexact:
>    http://www.postgresql.org/docs/8.1/static/diskusage.html
> Or you could count the size of the physical database files (/var/lib/
> postgresql or wherever). While these would be estimates, you could at
> least guarantee that the dump would not *exceed* the estimate.

You could guarantee no such thing; consider compression of TOAST values.
Even for uncompressed data, datatypes such as int and float can easily
print as more bytes than they occupy on-disk.

Given all the non-data overhead involved (e.g., for indexes), it's probably
unlikely that a text dump would exceed the "du" size of the database,
but it's far from "guaranteed".

            regards, tom lane

Re: Determining size of a database before dumping

From: Alexander Staubo
Date:
On Oct 2, 2006, at 23:19 , Tom Lane wrote:

> Alexander Staubo <alex@purefiction.net> writes:
>> You could count the disk space usage of the actual stored tuples,
>> though this will necessarily be inexact:
>>    http://www.postgresql.org/docs/8.1/static/diskusage.html
>> Or you could count the size of the physical database files (/var/lib/
>> postgresql or wherever). While these would be estimates, you could at
>> least guarantee that the dump would not *exceed* the estimate.
>
> You could guarantee no such thing; consider compression of TOAST
> values.
> Even for uncompressed data, datatypes such as int and float can easily
> print as more bytes than they occupy on-disk.

Why does pg_dump serialize data less efficiently than PostgreSQL when
using the "custom" format? (Pg_dump arguably has greater freedom in
being able to apply space-saving optimizations to the output format.
For example, one could use table statistics to selectively apply
something like Rice coding for numeric data, or vertically decompose
the tuples and emit sorted vectors using delta compression.) As for
TOAST, should not pg_dump's compression compress just as well, or
better?

Alexander.

Re: Determining size of a database before dumping

From: Ron Johnson
Date:

On 10/02/06 16:19, Tom Lane wrote:
> Alexander Staubo <alex@purefiction.net> writes:
>> You could count the disk space usage of the actual stored tuples,
>> though this will necessarily be inexact:
>>    http://www.postgresql.org/docs/8.1/static/diskusage.html
>> Or you could count the size of the physical database files (/var/lib/
>> postgresql or wherever). While these would be estimates, you could at
>> least guarantee that the dump would not *exceed* the estimate.
>
> You could guarantee no such thing; consider compression of TOAST values.
> Even for uncompressed data, datatypes such as int and float can easily
> print as more bytes than they occupy on-disk.
>
> Given all the non-data overhead involved (e.g., for indexes), it's probably
> unlikely that a text dump would exceed the "du" size of the database,
> but it's far from "guaranteed".

It's my experience that when there are lots of numeric fields,
fixed-width text records are approx 2.5x larger than the original
binary records.

--
Ron Johnson, Jr.
Jefferson LA  USA

Is "common sense" really valid?
For example, it is "common sense" to white-power racists that
whites are superior to blacks, and that those with brown skins
are mud people.
However, that "common sense" is obviously wrong.

Re: Determining size of a database before dumping

From: Jeff Davis
Date:
On Tue, 2006-10-03 at 00:42 +0200, Alexander Staubo wrote:
> Why does pg_dump serialize data less efficiently than PostgreSQL when
> using the "custom" format? (Pg_dump arguably has greater freedom in
> being able to apply space-saving optimizations to the output format.
> For example, one could use table statistics to selectively apply
> something like Rice coding for numeric data, or vertically decompose
> the tuples and emit sorted vectors using delta compression.) As for
> TOAST, should not pg_dump's compression compress just as well, or
> better?

It would be a strange set of data that had a larger representation as a
compressed pg_dump than the data directory itself. However, one could
imagine a contrived case where that might happen.

Let's say you had a single table with 10,000 columns of type INT4, 100M
records, all with random numbers in the columns. I don't think standard
gzip compression will compress random INT4s down to 32 bits.
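
A quick, purely illustrative way to convince yourself of that (random
bytes standing in for a worst-case table, not real pg_dump output):

   # 100,000 random INT4s occupy 400,000 bytes in binary form;
   # printed as decimal text and gzipped they typically end up larger
   head -c 400000 /dev/urandom | od -An -t d4 | gzip -c | wc -c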

Another example is NULLs. What if only a few of those records had non-
NULL values? If I understand correctly, PostgreSQL will represent those
NULLs with just one bit each.

What you're saying is more theoretical. If pg_dump used specialized
compression based on the data types of the columns, and everything was
optimal, you're correct: there's no situation in which the dump *must*
be bigger. But since there is no practical demand for such compression,
and it would be a lot of work, there is no *guarantee* that the data
directory will be bigger; in practice, though, it usually is.

Regards,
    Jeff Davis


Re: Determining size of a database before dumping

From: Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Tue, 2006-10-03 at 00:42 +0200, Alexander Staubo wrote:
>> Why does pg_dump serialize data less efficiently than PostgreSQL when
>> using the "custom" format?

> What you're saying is more theoretical. If pg_dump used specialized
> compression based on the data type of the columns, and everything was
> optimal, you're correct. There's no situation in which the dump *must*
> be bigger. However, since there is no practical demand for such
> compression, and it would be a lot of work ...

There are several reasons for not being overly tense about the pg_dump
format:

* We don't have infinite manpower

* Cross-version and cross-platform portability of the dump files is
  critical

* The more complicated it is, the more chance for bugs, which you'd
  possibly not notice until you *really needed* that dump.

In practice, pushing the data through gzip gets most of the potential
win, for a very small fraction of the effort it would take to have a
smart custom compression mechanism.
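
That is, something as simple as

   pg_dump mydb | gzip > mydb.sql.gz

(with "mydb" standing in for the real database) already captures most
of that win.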

            regards, tom lane