On 10/13/2017 09:49 AM, Seamus Abshere wrote:
> hey,
>
> In the spreadsheet world, there is this concept of "shared strings," a
> simple way of compressing spreadsheets when the data is duplicated in
> many cells.
>
> In my database, I have a table with >200 million rows and >300 columns
> (all the households in the United States). For clarity of development
> and debugging, I have not made any effort to normalize its contents, so
> millions of rows have, for example, "SINGLE FAMILY RESIDENCE /
> TOWNHOUSE" (yes, that whole string!) instead of some code representing
> it.
>
> Theoretically / blue sky, could there be a table or column type that
> transparently handles "shared strings" like this, reducing size on disk
> at the cost of lookup overhead for all queries?
>
> (I guess maybe it's like TOAST, but content-hashed and de-duped and not
> only for large objects?)
>
> Thanks,
> Seamus
>
> --
> Seamus Abshere, SCEA
> https://www.faraday.io
> https://github.com/seamusabshere
> https://linkedin.com/in/seamusabshere
>
>
What data type are these columns now? I would be tempted to map the
full strings to an abbreviation just so I didn't have to alter all the
columns to an "id"; Optional to place any RI on the columns to the
abbreviation dictionary table. Just use the translation as a last step
in user facing reports. If you can map/abbreviate to 4 characters,
you've approximated the disk size of an integer.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general