Discussion: diskspace


diskspace

From: Geoff Winkless
Date:
Hi

I'm trying to migrate a database from MySQL to postgres and I'm struggling with the amount of diskspace the resulting db takes.

I may be missing a setting somewhere but I can't see one anywhere obvious (apologies if I'm being stupid...)

Even postgres' own idea of the size of the columns doesn't match up to its own reported size of the data.

eg I have a table "result":

CREATE TABLE result (
    st_id integer NOT NULL,
    log smallint NOT NULL,
    "time" integer NOT NULL,
    token character(4) NOT NULL,
    data character varying(500) DEFAULT NULL::character varying
);

# SELECT pg_size_pretty(sum(pg_column_size(data) + pg_column_size(st_id) + pg_column_size(log) + pg_column_size(token) + pg_column_size(time))) FROM result;
 pg_size_pretty
----------------
 178 MB
(1 row)

# SELECT pg_size_pretty(pg_relation_size('result'));
 pg_size_pretty
----------------
 613 MB
(1 row)

I'd naively expected these two figures to be similar.

I've run vacuum analyze and it made no difference (not a major surprise because all I've done so far is create the database and sequentially insert the data into the tables).

I expected a little overhead from what I'd read before the migration but that's a fairly huge difference.

As I said, sorry if I've missed the obvious "use loads of extra space" setting but I'd appreciate any suggestion as to what that setting might be called :)

Cheers

Geoff

Re: diskspace

From: Albe Laurenz
Date:
Geoff Winkless wrote:
> I'm trying to migrate a database from MySQL to postgres and I'm struggling with the amount of
> diskspace the resulting db takes.
[...]
> # SELECT pg_size_pretty(sum(pg_column_size(data) + pg_column_size(st_id) + pg_column_size(log) +
> pg_column_size(token) + pg_column_size(time))) FROM result;
>  178 MB
>
> # SELECT pg_size_pretty(pg_relation_size('result'));
>  613 MB
[...]
> As I said, sorry if I've missed the obvious "use loads of extra space" setting but I'd appreciate any
> suggestion as to what that setting might be called :)

I don't think that pg_column_size() is a good tool to
measure table size.

I'd suggest that you use pg_table_size for the table itself
and pg_indexes_size for the size of its indexes.
That should come close to the amount of disk space taken.
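
For example, a query along these lines (illustrative; pg_total_relation_size also exists and rolls the heap, TOAST and indexes into one figure):

SELECT pg_size_pretty(pg_table_size('result'))          AS table_size,
       pg_size_pretty(pg_indexes_size('result'))        AS index_size,
       pg_size_pretty(pg_total_relation_size('result')) AS total_size;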

Yours,
Laurenz Albe


Re: diskspace

From: Scott Mead
Date:
On Tue, Feb 5, 2013 at 9:51 AM, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
Geoff Winkless wrote:
> [...]

> I don't think that pg_column_size() is a good tool to
> measure table size.
>
> I'd suggest that you use pg_table_size for the table itself
> and pg_indexes_size for the size of its indexes.
> That should come close to the amount of disk space taken.


Agreed. Don't forget you have indexes, free space, vacuum-able stuff, etc., all lying in your datafiles. Your measurements are telling you what you have in purely raw form.

--Scott Mead
 




Re: diskspace

From: Rosser Schwarz
Date:
Additionally, postgres has a number of "metadata" columns (e.g., xmin, xmax, cmin, cmax, &c).  Those can add up, particularly when their net size is greater than the user data size of a row.
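
They are hidden from SELECT *, but you can ask for them by name; for example (illustrative, against the "result" table from upthread):

SELECT ctid, xmin, xmax, cmin, cmax, st_id
FROM result
LIMIT 3;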

rls


On Tue, Feb 5, 2013 at 11:00 AM, Scott Mead <scottm@openscg.com> wrote:
> Agreed. Don't forget you have indexes, free space, vacuum-able stuff, etc., all lying in your datafiles. Your measurements are telling you what you have in purely raw form.
[...]




--
:wq

Re: diskspace

From: Geoff Winkless
Date:
Thanks for all the responses....

On 5 February 2013 19:36, Rosser Schwarz <rosser.schwarz@gmail.com> wrote:
> On Tue, Feb 5, 2013 at 11:00 AM, Scott Mead <scottm@openscg.com> wrote:
> > Agreed. Don't forget you have indexes, free space, vacuum-able stuff, etc., all lying in your datafiles. Your measurements are telling you what you have in purely raw form.

Indexes are stored separately. There should be no free space (why would there be? FILLFACTOR is 100!) and I've vacuumed already.
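
In case anyone suspects fillfactor: a non-default setting would show up in pg_class.reloptions (NULL means the default of 100):

SELECT relname, reloptions FROM pg_class WHERE relname = 'result';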

> Additionally, postgres has a number of "metadata" columns (e.g., xmin, xmax, cmin, cmax, &c).  Those can add up, particularly when their net size is greater than the user data size of a row.
That's a fair point.

611MB (actual disk size) - 138MB (summed data size) = 483MB
11M rows (ish).

That's nearly 48 bytes extra per row.


Let's say around 249MB (23 bytes per row, according to that page) for the columns you mention, so that leaves 234MB unexplained.

I can see 44 bytes per page header (given 5 columns, so 20 bytes ItemIdData, with 24 bytes PageHeaderData). Given a page size of 8kB that would imply page headers of about 3.3MB over a 611MB table.

Even with some bytes for alignment I'm still not seeing where 230MB (more than the space taken by the data itself) of my disk has gone.
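
For what it's worth, if the contrib pageinspect extension is installed, the per-page numbers can be checked directly (a sketch, not something I've run against this table):

CREATE EXTENSION IF NOT EXISTS pageinspect;
-- "upper" - "lower" is the unused hole between the line-pointer array
-- and the tuple data on a heap page:
SELECT lower, upper, upper - lower AS free_bytes
FROM page_header(get_raw_page('result', 0));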

Thanks!

Geoff

Re: diskspace

From: Scott Mead
Date:

On Tue, Feb 5, 2013 at 3:33 PM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> Thanks for all the responses....
[...]
> Indexes are stored separately. There should be no free space (why would there be? FILLFACTOR is 100!) and I've vacuumed already.


Have you done any updates / deletes on this dataset? Or is this just pure insert / bulkload?  If you haven't done any update / delete, the following doesn't help at all, but:

Vacuuming doesn't reclaim space; it only marks it as available for re-use by later inserts and updates.  You can compact your datafiles with either a (WARNING) VACUUM FULL or a CLUSTER.  Just remember, both require a full table lock while they run.
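
For example (illustrative only; CLUSTER needs an existing index to order by, and "result_st_id_idx" here is a made-up name):

VACUUM FULL result;                      -- rewrites the whole table; exclusive lock while it runs
CLUSTER result USING result_st_id_idx;   -- also rewrites the table, ordered by the index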

--Scott Mead


 

Re: diskspace

From: Rosser Schwarz
Date:
On Tue, Feb 5, 2013 at 12:33 PM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> Indexes are stored separately. There should be no free space (why would there be? FILLFACTOR is 100!) and I've vacuumed already.

Did you just VACUUM, or did you VACUUM FULL/CLUSTER?  Vanilla vacuuming merely marks space consumed by "dead" tuples as re-usable; it doesn't reclaim the space and reduce the disk footprint of your tables.
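
A quick, illustrative sanity check (standard stats view; an n_dead_tup near zero means plain VACUUM had nothing to reclaim anyway):

SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'result';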

rls

--
:wq

Re: diskspace

From: Geoff Winkless
Date:
On 5 February 2013 20:33, Geoff Winkless <pgsqladmin@geoff.dj> wrote:

> Let's say around 249MB (23 bytes per row, according to that page) for the columns you mention, so that leaves 234MB unexplained.
>
> I can see 44 bytes per page header (given 5 columns, so 20 bytes ItemIdData, with 24 bytes PageHeaderData). Given a page size of 8kB that would imply page headers of about 3.3MB over a 611MB table.

Of course I got that slightly wrong: ItemIdData is for each row, not for each column; an extra 4 bytes for each row makes the per-row space 290MB, leaving 167MB unexplained.

In answer to other questions: the data was inserted in bulk, so there have been no updates; even so I have run a VACUUM FULL just in case and it makes no difference.

I'm assuming the remaining 167MB is related to the alignment requirements, although that does seem quite a lot at 15 bytes per row (perhaps I'm just unfortunate with the data sizes resulting in poor alignment).
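
For the record, here is my rough per-row arithmetic as a query. The constants are my assumptions (23-byte tuple header padded to 24 under MAXALIGN 8, 4-byte line pointer, 1-byte short varlena headers), not measured values:

SELECT 24                           -- tuple header: 23 bytes, aligned up to 24
     + 4                            -- st_id integer
     + 2                            -- log smallint
     + 2                            -- padding so "time" starts on a 4-byte boundary
     + 4                            -- "time" integer
     + 5                            -- token char(4): 1-byte varlena header + 4 bytes
     + 4                            -- line pointer in the page header
     + avg(pg_column_size(data))    -- data, measured per row
       AS approx_bytes_per_row
FROM result;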

I guess the answer to my question is that there is no answer: pg simply uses a massive proportion of diskspace (especially in relation to thin-but-tall tables) for its own purposes.

Thanks again for all the responses.

Geoff

Re: diskspace

From: Albe Laurenz
Date:
Geoff Winkless wrote:
[trying to account for the disk space used]
> Of course I got that slightly wrong: ItemIdData is for each row, not for each column; an extra 4 bytes
> for each row makes the per-row space 290MB, leaving 167MB unexplained.

> I'm assuming the remaining 167MB is related to the alignment requirements, although that does seem
> quite a lot at 15 bytes per row (perhaps I'm just unfortunate with the data sizes resulting in poor
> alignment).
>
> I guess the answer to my question is that there is no answer to my question; pg just does use a
> massive (especially in relation to thin but tall tables) proportion of diskspace for its own purposes.

I doubt that PostgreSQL has substantially more disk overhead
than other DBMS with comparable capabilities (comparison with
flat files or MyISAM would be unfair).

Have you tried using pg_filedump
(http://pgfoundry.org/frs/?group_id=1000541)
to dump a page or two of your table and figure
out what is where and where the space went?

Yours,
Laurenz Albe


Re: diskspace

From: Geoff Winkless
Date:
On 6 February 2013 11:04, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
> I doubt that PostgreSQL has substantially more disk overhead
> than other DBMS with comparable capabilities (comparison with
> flat files or MyISAM would be unfair).

You're right, of course; the same data on InnoDB works out if anything slightly larger, as far as I can tell. 

I wasn't (and I'm not) trying to do down pgsql, just trying to figure out whether there's a way of cutting back on the extra space used. In this instance it would be nice to be able to mark a table as WORM, for example, and remove the need for any of this stuff. At least in MySQL I can specify MyISAM for the table, since it rarely if ever needs updates and so there's no requirement for MVCC.

> Have you tried using pg_filedump
> (http://pgfoundry.org/frs/?group_id=1000541)
> to dump a page or two of your table and figure
> out what is where and where the space went?

I haven't; I will do for interest's sake, thanks for the suggestion.

Geoff

Re: diskspace

From: Geoff Winkless
Date:
On 6 February 2013 11:12, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
> On 6 February 2013 11:04, Albe Laurenz <laurenz.albe@wien.gv.at> wrote:
> > Have you tried using pg_filedump
> > (http://pgfoundry.org/frs/?group_id=1000541)
> > to dump a page or two of your table and figure
> > out what is where and where the space went?
>
> I haven't; I will do for interest's sake, thanks for the suggestion.
 
A simple hexdump shows each row taking an extra 28 to 31 bytes inline (depending on the number of padding bytes needed to align after the varchar), which is about what we thought. There's also a variable amount of wasted space at the end of each page where the next row won't fit, varying between 0 and (I guess) the largest row size + 30.

pg_filedump -a 22212 | grep -i 'Free space' | cut -c46- | perl -nle '$sum += $_; END { print $sum }'

gives a total for the "free space" values across all blocks of table file 22212: 1875964 bytes.

Given that there are 11367253 rows, and accepting (being overly generous) an extra 35 bytes per row, the accounting comes out as:

  11367253 rows x 35 bytes overhead      = ~379MB
  real column data                       =  178MB
  total page free space                  =   ~2MB
  78494 block headers x 60 bytes each    =   ~4MB

(ignoring the 4-bytes-per-row item pointers in the block headers, since those are already counted in the 35 bytes per row). That totals 563MB, which still leaves me 55MB short of the 618MB space taken.
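
Or, doing the same sum in SQL (all figures from this thread, binary megabytes):

SELECT pg_size_pretty(
         11367253::bigint * 35     -- ~35 bytes/row of overhead      (~379MB)
       + 178 * 1024 * 1024         -- summed column data              (178MB)
       + 1875964                   -- total per-page free space        (~2MB)
       + 78494 * 60                -- block headers, 60 bytes each     (~4MB)
       );                          -- => roughly 563MB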

Not that that's going to make any significant difference but I am now intrigued as to where it's gone. :)

Geoff