Обсуждение: Varchar vs foreign key vs enumerator - table and index size

Поиск
Список
Период
Сортировка

Varchar vs foreign key vs enumerator - table and index size

От
Łukasz Walkowski
Дата:
Hi,
This is my first post on this group so welcome everyone! Currently I'm working on optimizing a quite simple database
usedto store events from one website. Every event is a set of data describing user behaviour. The main table that
storesall events is built using schema:  

     Column      |            Type             | Modifiers
-----------------+-----------------------------+-----------
 id              | bigint                      | not null
 browser         | character varying(255)      |
 created         | timestamp without time zone |
 eventsource     | character varying(255)      |
 eventtype       | character varying(255)      |
 ipaddress       | character varying(255)      |
 objectid        | bigint                      |
 sessionid       | character varying(255)      |
 shopids         | integer[]                   |
 source          | character varying(255)      |
 sourceid        | bigint                      |
 supplierid      | bigint                      |
 cookieuuid      | uuid                        |
 serializeddata  | bytea                       |
 devicetype      | character varying(255)      |
 operatingsystem | character varying(255)      |

 It was a quick project to play with EclipseLink, Hibernate and some Jersey Rest services, so isn't perfect. However
thedatabase became quite usefull and we decided to optimize this table as it grew quite large (128GB right now without
indexes,about 630M records). There is only primary key index on this table. Here is the list of changes that I'd like
tomake to the table (some of them should be done from the scratch):  

1. Changing ipaddress from varchar to inet - this should save some space and lower the size of potential index.

2. Changing id for some composite id with created contained in it.

3. And this part is most interesting for me. Columns browser, eventsource, eventtype, devicetype, operatingsystem
containa small pool of strings - for example for devicetype this is set to Computer, Mobile, Tablet or Unknown. Browser
isset to normalized browser name. In every case I can store those data using one of 3 different methods:  

- store as varchar as it is now - nice and easy, but index on those columns is quite big and I think storing many of
similarstrings is waste of space.  

- store only id's and join external tables as needed, for example for browsers I only need smallint key, as there is a
limitednumber of browsers. The column browser becomes smallint and we have additional table with two columns (id,
browservarchar). This should save some space on event table, but if I want name of the browser in some report I need to
jointables. Second thing - on every insert there is constraint that is checked for this field and this can affect
performance.I was thinking about the same strategy for the remaining fields - this would give me 5 additional tables
and5 additional constraints on event table. Browser table will have about ~100 records, eventtype and eventsource will
haveabout 8-12 records each, devicetype - 4 records, operatingsystem - didn't really check this one, but I think
somethingaround 100 like browser.  

- introduce enumerator type for each of the column and store those values as enumerator. This one should be the most
spaceefficient, but it will be problematic in case of changing column values like browser or operatingsystem as
alteringenumerator isn't that simple.  

For browser average text length is 19 characters, for eventsource and eventtype eventsource average text lenght is 24
characters.Database encoding is set to UTF8.  

My question is - what is estimated difference in table size between those 3 variants of storing columns? In theory
thirdone should give me the smallest database and index size but is the most problematic from all of the above.  

Lukasz Walkowski

Re: Varchar vs foreign key vs enumerator - table and index size

От
Tom Lane
Дата:
=?utf-8?Q?=C5=81ukasz_Walkowski?= <lukasz.walkowski@homplex.pl> writes:
> 3. And this part is most interesting for me. Columns browser, eventsource, eventtype, devicetype, operatingsystem
containa small pool of strings - for example for devicetype this is set to Computer, Mobile, Tablet or Unknown. Browser
isset to normalized browser name. In every case I can store those data using one of 3 different methods:  

> - store as varchar as it is now - nice and easy, but index on those columns is quite big and I think storing many of
similarstrings is waste of space.  

If you're starting to be concerned about space, it's definitely time to
get away from this choice.  Depending on what locale you're using,
comparing varchar values can be quite an expensive operation, too.

> - store only id's and join external tables as needed, for example for browsers I only need smallint key, as there is
alimited number of browsers. 

I think the main "pro" of this approach is that it doesn't use any
nonstandard SQL features, so you preserve your options to move to some
other database in the future.  The main "con" is that you'd be buying into
fairly significant rewriting of your application code, since just about
every query involving these columns would have to become a join.

FWIW, I'd be inclined to just use integer not smallint.  The space savings
from smallint is frequently illusory because of alignment considerations
--- for instance, an index on a single smallint column will *not* be any
smaller than one on a single int column.  And smallint has some minor
usage annoyances because it's a second-class citizen in the type promotion
hierarchy --- you may find yourself needing explicit casts to smallint
here and there.

> - introduce enumerator type for each of the column and store those values as enumerator. This one should be the most
spaceefficient, but it will be problematic in case of changing column values like browser or operatingsystem as
alteringenumerator isn't that simple.  

Space-wise this is going to be equivalent to the integer-foreign-key
solution.  It's much nicer from a notational standpoint, though, because
you don't need joins --- it's likely that you'd need few if any
application code changes to go this route.  (But I'd advise doing some
testing to verify that before you take it as a given.)

You're right though that enums are not a good option if you expect
frequent changes in the pool of allowed values.  I guess the question
is how often does that happen, in your application?  Adding a new value
from time to time isn't much of a problem unless you want to get picky
about how it sorts relative to existing values.  But you can't ever delete
an individual enum value, and we don't support renaming them either.
(Though if you're desperate, I believe a manual UPDATE on the pg_enum
catalog would work for that.)

Another thing to think about is whether you have auxiliary data about each
value that might usefully be stored as additional columns in the small
tables.  The enum approach doesn't directly handle that, though I suppose
you could still create small separate tables that use an enum column as
primary key.

            regards, tom lane


Re: Varchar vs foreign key vs enumerator - table and index size

От
Łukasz Walkowski
Дата:
Tom,

> If you're starting to be concerned about space, it's definitely time to
> get away from this choice.  Depending on what locale you're using,
> comparing varchar values can be quite an expensive operation, too.

I don't like wasting space and processing power even if more work is required to achieve this. We use pl_PL.UTF-8 as
ourlocale. 

> I think the main "pro" of this approach is that it doesn't use any
> nonstandard SQL features, so you preserve your options to move to some
> other database in the future.  The main "con" is that you'd be buying into
> fairly significant rewriting of your application code, since just about
> every query involving these columns would have to become a join.

Well, I don't really think I will move from Postgresql anytime soon. It's just the best database for me. Rewriting code
isone of the things I'm doing right now but before I touch database, I want to be sure that the choices I made are
good.

> FWIW, I'd be inclined to just use integer not smallint.  The space savings
> from smallint is frequently illusory because of alignment considerations
> --- for instance, an index on a single smallint column will *not* be any
> smaller than one on a single int column.  And smallint has some minor
> usage annoyances because it's a second-class citizen in the type promotion
> hierarchy --- you may find yourself needing explicit casts to smallint
> here and there.

Ok, thats important information. Thank you.

>
> Space-wise this is going to be equivalent to the integer-foreign-key
> solution.  It's much nicer from a notational standpoint, though, because
> you don't need joins --- it's likely that you'd need few if any
> application code changes to go this route.  (But I'd advise doing some
> testing to verify that before you take it as a given.)
>
> You're right though that enums are not a good option if you expect
> frequent changes in the pool of allowed values.  I guess the question
> is how often does that happen, in your application?  Adding a new value
> from time to time isn't much of a problem unless you want to get picky
> about how it sorts relative to existing values.  But you can't ever delete
> an individual enum value, and we don't support renaming them either.
> (Though if you're desperate, I believe a manual UPDATE on the pg_enum
> catalog would work for that.)
>
> Another thing to think about is whether you have auxiliary data about each
> value that might usefully be stored as additional columns in the small
> tables.  The enum approach doesn't directly handle that, though I suppose
> you could still create small separate tables that use an enum column as
> primary key.
>
>             regards, tom lane

So, I'll go for enumerators for device type, eventtype and eventsource as those columns are quite stable. For browser
andoperating system I'll do external tables. 

Thank you - any additional tips are welcome.

Reagards,
Lukasz Walkowski

Re: Varchar vs foreign key vs enumerator - table and index size

От
Craig James
Дата:
On Sat, Aug 31, 2013 at 10:06 AM, Łukasz Walkowski <lukasz.walkowski@homplex.pl> wrote:
> I think the main "pro" of this approach is that it doesn't use any
> nonstandard SQL features, so you preserve your options to move to some
> other database in the future.  The main "con" is that you'd be buying into
> fairly significant rewriting of your application code, since just about
> every query involving these columns would have to become a join.

Well, I don't really think I will move from Postgresql anytime soon. It's just the best database for me. Rewriting code is one of the things I'm doing right now but before I touch database, I want to be sure that the choices I made are good.

If your applications are read-heavy and only have a small-ish amount of code that inserts/updates the table, it may not be that much of a rewrite. You can create a integer/varchar table of key/values, use its key to replace the current varchar column, rename the original table, and create a view with the original table's name.  Code that only reads the data won't know the difference. And it's a portable solution.

I did this and it worked out well. If the key/value pairs table is relatively small, the planner does an excellent job of generating efficient queries against the big table.

Craig

Re: Varchar vs foreign key vs enumerator - table and index size

От
Vitalii Tymchyshyn
Дата:



2013/8/31 Łukasz Walkowski <lukasz.walkowski@homplex.pl>

3. And this part is most interesting for me. Columns browser, eventsource, eventtype, devicetype, operatingsystem contain a small pool of strings - for example for devicetype this is set to Computer, Mobile, Tablet or Unknown. Browser is set to normalized browser name. In every case I can store those data using one of 3 different methods:


Well, there are some more options:
a) Store int keys and do mapping in the application (e.g. with java enums). This can save you a join, that is especially useful if you are going to do paged output with limit/offset scenario. Optimizer sometimes produce suboptimal plans for join in offset/limit queries.
b) Store small varchar values as keys (up to "char" type if you really want to save space) and do user display mapping in application. It's different from (a) since it's harder to mess with the mapping and values are still more or less readable with simple select. But it can be less efficient than (a).
c) Do mixed approach with mapping table, loaded on start into application memory. This would be an optimization in case you get into optimizer troubles.

Best regards, Vitalii Tymchyshyn

Re: Varchar vs foreign key vs enumerator - table and index size

От
Łukasz Walkowski
Дата:
On 1 wrz 2013, at 03:31, Craig James <cjames@emolecules.com> wrote:

> If your applications are read-heavy and only have a small-ish amount of code that inserts/updates the table, it may
notbe that much of a rewrite. You can create a integer/varchar table of key/values, use its key to replace the current
varcharcolumn, rename the original table, and create a view with the original table's name.  Code that only reads the
datawon't know the difference. And it's a portable solution. 
>
> I did this and it worked out well. If the key/value pairs table is relatively small, the planner does an excellent
jobof generating efficient queries against the big table. 
>
> Craig

Actually this (event) table is write heavy. But the concept is really cool and worth trying. Thanks.


Lukasz

Re: Varchar vs foreign key vs enumerator - table and index size

От
Łukasz Walkowski
Дата:
On 1 wrz 2013, at 05:10, Vitalii Tymchyshyn <tivv00@gmail.com> wrote:
>
>
> Well, there are some more options:
> a) Store int keys and do mapping in the application (e.g. with java enums). This can save you a join, that is
especiallyuseful if you are going to do paged output with limit/offset scenario. Optimizer sometimes produce suboptimal
plansfor join in offset/limit queries. 
> b) Store small varchar values as keys (up to "char" type if you really want to save space) and do user display
mappingin application. It's different from (a) since it's harder to mess with the mapping and values are still more or
lessreadable with simple select. But it can be less efficient than (a). 
> c) Do mixed approach with mapping table, loaded on start into application memory. This would be an optimization in
caseyou get into optimizer troubles. 
>
> Best regards, Vitalii Tymchyshyn

I'd like to leave database in readable form because before I add some new queries and rest endpoints to the
application,I test them as ad-hoc queries using command line. So variant a) isn't good for me. Variant b) is worth
tryingand c) is easy to code, but I still prefer having all this data in database independent of application logic. 

Thanks for suggestion,
Lukasz

Re: Varchar vs foreign key vs enumerator - table and index size

От
Andrew Dunstan
Дата:
On 09/02/2013 05:53 AM, Łukasz Walkowski wrote:
> On 1 wrz 2013, at 05:10, Vitalii Tymchyshyn <tivv00@gmail.com> wrote:
>>
>> Well, there are some more options:
>> a) Store int keys and do mapping in the application (e.g. with java enums). This can save you a join, that is
especiallyuseful if you are going to do paged output with limit/offset scenario. Optimizer sometimes produce suboptimal
plansfor join in offset/limit queries. 
>> b) Store small varchar values as keys (up to "char" type if you really want to save space) and do user display
mappingin application. It's different from (a) since it's harder to mess with the mapping and values are still more or
lessreadable with simple select. But it can be less efficient than (a). 
>> c) Do mixed approach with mapping table, loaded on start into application memory. This would be an optimization in
caseyou get into optimizer troubles. 
>>
>> Best regards, Vitalii Tymchyshyn
> I'd like to leave database in readable form because before I add some new queries and rest endpoints to the
application,I test them as ad-hoc queries using command line. So variant a) isn't good for me. Variant b) is worth
tryingand c) is easy to code, but I still prefer having all this data in database independent of application logic. 
>


I think the possible use of Postgres enums has been too easily written
off in this thread. Looking at the original problem description they
look like quite a good fit, despite the OP's skepticism. What exactly is
wanted that can't be done with database enums? You can add new values to
the type very simply.  You can change the values of existing labels in
the type slightly less simply, but still without any great difficulty.
Things that are hard to do include removing labels in the set and
changing the sort order, because those things would require processing
tables where the type is used, unlike the simple things. But neither of
these is required for typical use cases. For most uses of this kind they
are very efficient both in storage and processing.

cheers

andrew


Re: Varchar vs foreign key vs enumerator - table and index size

От
Vitalii Tymchyshyn
Дата:
Well, in older version of Hibernate it was a little tricky to handle Postgresql Enums. Dunno if it's out of the box now.
Also adding new value is an explicit operation (much like with lookup table). I've had quite a complex code with second connection opening to support lookup table filling without flooding original transaction with additional locks that could lead to deadlocks.
BTW: Does adding new value to enum adds some locks? Can a check if value exists and adding new value be done in atomic fashion without grabbing some global lock?
P.S. As  I see, it can be a topic for good article for, say, dzone. The problem can be quite tricky in MVCC database and choice must be done wisely.

Best regards, Vitalii Tymchyshyn


2013/9/2 Andrew Dunstan <andrew@dunslane.net>

On 09/02/2013 05:53 AM, Łukasz Walkowski wrote:
On 1 wrz 2013, at 05:10, Vitalii Tymchyshyn <tivv00@gmail.com> wrote:

Well, there are some more options:
a) Store int keys and do mapping in the application (e.g. with java enums). This can save you a join, that is especially useful if you are going to do paged output with limit/offset scenario. Optimizer sometimes produce suboptimal plans for join in offset/limit queries.
b) Store small varchar values as keys (up to "char" type if you really want to save space) and do user display mapping in application. It's different from (a) since it's harder to mess with the mapping and values are still more or less readable with simple select. But it can be less efficient than (a).
c) Do mixed approach with mapping table, loaded on start into application memory. This would be an optimization in case you get into optimizer troubles.

Best regards, Vitalii Tymchyshyn
I'd like to leave database in readable form because before I add some new queries and rest endpoints to the application, I test them as ad-hoc queries using command line. So variant a) isn't good for me. Variant b) is worth trying and c) is easy to code, but I still prefer having all this data in database independent of application logic.



I think the possible use of Postgres enums has been too easily written off in this thread. Looking at the original problem description they look like quite a good fit, despite the OP's skepticism. What exactly is wanted that can't be done with database enums? You can add new values to the type very simply.  You can change the values of existing labels in the type slightly less simply, but still without any great difficulty. Things that are hard to do include removing labels in the set and changing the sort order, because those things would require processing tables where the type is used, unlike the simple things. But neither of these is required for typical use cases. For most uses of this kind they are very efficient both in storage and processing.

cheers

andrew



--
Best regards,
 Vitalii Tymchyshyn

Re: Varchar vs foreign key vs enumerator - table and index size

От
Jim Nasby
Дата:
On 8/31/13 8:35 AM, Łukasz Walkowski wrote:
> 3. And this part is most interesting for me. Columns browser, eventsource, eventtype, devicetype, operatingsystem
containa small pool of strings - for example for devicetype this is set to Computer, Mobile, Tablet or Unknown. Browser
isset to normalized browser name. In every case I can store those data using one of 3 different methods: 

Sorry for the late reply... the Enova Tools project in pgFoundry has code that lets you create a "dynamic lookup table"
thatallows for easily normalizing slow-changing data. You could then put a writable view on top of that so that the app
wouldn'tknow the difference. It also has a backfill framework that would help you move data from the old table to the
newtable. 
--
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net