Обсуждение: DB design advice: lots of small tables?

Поиск
Список
Период
Сортировка

DB design advice: lots of small tables?

От
lender
Дата:
Hello.

We are currently redesigning a medium/large office management web
application. There are 75 tables in our existing PostgreSQL database,
but that number is artificially low, due to some unfortunate design choices.

The main culprits are two tables named "catalog" and "catalog_entries".
They contain all those data sets that the previous designer deemed too
small for a separate table, so now they are all stored together. The
values in catalog_entries are typically used to populate dropdown select
fields.

The catalog table (simplified):

  id          SERIAL     -- artificial record ID
  code        VARCHAR    -- unique text identifier
  description VARCHAR    -- (info only, otherwise unused)

The catalog_entries table (simplified):

  id          SERIAL     -- artificial record ID
  catalog_id  INTEGER    -- parent catalog ID
  code        VARCHAR    -- unique (per catalog) text identifier
  rank        INTEGER    -- used for sorting the values
  text_short  VARCHAR    -- text for display (short version)
  text_long   TEXT       -- text for display (long version)

Here are some examples of what the tables contain:

Catalog: department
Entries: it, sales, accounting, cases, ...

Catalog: printers
Entries: ma_color, pa_color, pa_black, pdf, ...

Catalog: invoice_status
Entries: open, locked, entered, booked, cancelled, ...

Catalog: coverage
Entries: national, regional, international, obsolete

Records in other tables reference the values in catalog_entries by id.
For example, the "invoices" table has a "status_id" column pointing to a
record in catalog_entries. Of course, this leads to possible integrity
issues (there is nothing to prevent an invoice record referencing the
"ma_color" value instead of "open" in its status_id field).

There are 64 "catalogs" (data sets), in addition to the 75 real tables.

Now we have finally got the go-ahead to refactor this mess. Most of the
old "catalogs" will probably end up as separate tables. Others might be
replaced with ENUMs or booleans, especially the ones with only 2-3 values.

The reason why I'm hesitating and asking for advice now, is that after
refactoring, we'll end up with ~60 new tables, all of them rather small
and with practically identical columns. (Only five catalogs have more
than 20 entries; about half have five entries or less; five catalogs
have only two entries)

So, my first main question would be: is it "normal" or desirable to have
that many tiny tables? And is it a problem that many of the tables have
the same (or a similar) column definitions?


The second point is that we have redundant unique identifiers in
catalog_entries (id and code). The code value is used by the application
whenever we need to find to one of the values. For example, for a query
like "show all open invoices", we would either -

  1) select the id from catalog_entries where catalog_id refers to the
     "invoice_status" catalog and the code is "open"
  2) use that id to filter select * from invoices

- or do the same in one query using joins. This pattern occurs hundreds
of times in the application code. From a programming viewpoint, having
all-text ids would make things a lot simpler and cleaner (i.e., keep
only the "code" column).

The "id" column was used (AFAIK) to reduce the storage size. Most of the
data tables have less than 100k records, so the overhead wouldn't be too
dramatic, but a few tables (~10) have more; one of them has 1.2m
records. These tables can also refer to the old catalog_entries table
from more than one column. Changing all these references from INT to
VARCHAR would increase the DB size, and probably make scans less
performant. I'm not sure know how indexes on these columns would be
affected.

To summarize, the second question is whether we should ditch the
artificial numeric IDs and just use the "code" column as primary key in
the new tiny tables.


Thanks in advance for your advice.
crl


Re: DB design advice: lots of small tables?

От
Kevin Grittner
Дата:
lender <crlender@gmail.com> wrote:

> So, my first main question would be: is it "normal" or desirable
> to have that many tiny tables?

Yes.

> And is it a problem that many of the tables have the same (or a
> similar) column definitions?

No.

> To summarize, the second question is whether we should ditch the
> artificial numeric IDs and just use the "code" column as primary
> key in the new tiny tables.

This one becomes more of a judgment call, but I generally lean
toward using the visible unique identifier (your "code") as the
primary key and ditching a numeric "id".  The only two contra-
indications would be if the "code" values have a significant chance
of being changed or if you will have extreme numbers (billions) of
narrow rows which must reference the table.

<soapbox-rant>
I occasionally hear someone maintaining that having a meaningless
sequential ID column as the primary key of each table is required
by the relational model.  At those moments I swear I can actually
hear E.F. Codd turning in his grave.  It was a requirement of old
pre-relational databases from the 60's and 70's, and some equally
primitive ORMs still like to have one, but a big point of
relational databases is that you don't need to navigate artificial
linkages between tables -- the relationship can generally be
determined by the fact that they contain common data elements.  If
these are natural, meaningful values which are visible to the user
it often allows complex queries to be much better optimized, since
they aren't forced through a single navigational linkage.
</soapbox-rant>

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: DB design advice: lots of small tables?

От
Thomas Kellerer
Дата:
Kevin Grittner, 15.03.2013 14:36:
> <soapbox-rant>
> I occasionally hear someone maintaining that having a meaningless
> sequential ID column as the primary key of each table is required
> by the relational model.  At those moments I swear I can actually
> hear E.F. Codd turning in his grave.  It was a requirement of old
> pre-relational databases from the 60's and 70's, and some equally
> primitive ORMs still like to have one, but a big point of
> relational databases is that you don't need to navigate artificial
> linkages between tables -- the relationship can generally be
> determined by the fact that they contain common data elements.  If
> these are natural, meaningful values which are visible to the user
> it often allows complex queries to be much better optimized, since
> they aren't forced through a single navigational linkage.
> </soapbox-rant>

You might be interested in a discussion regarding this topic on comp.databases.theory:

https://groups.google.com/forum/?fromgroups=#!topic/comp.databases.theory/mqZZw3ojnjA



Re: DB design advice: lots of small tables?

От
François Beausoleil
Дата:
Le 2013-03-15 à 09:58, Thomas Kellerer a écrit :

> Kevin Grittner, 15.03.2013 14:36:
>> <soapbox-rant>
>> I occasionally hear someone maintaining that having a meaningless
>> sequential ID column as the primary key of each table is required
>> by the relational model.  At those moments I swear I can actually
>> hear E.F. Codd turning in his grave.  It was a requirement of old
>> pre-relational databases from the 60's and 70's, and some equally
>> primitive ORMs still like to have one, but a big point of
>> relational databases is that you don't need to navigate artificial
>> linkages between tables -- the relationship can generally be
>> determined by the fact that they contain common data elements.  If
>> these are natural, meaningful values which are visible to the user
>> it often allows complex queries to be much better optimized, since
>> they aren't forced through a single navigational linkage.
>> </soapbox-rant>
>
> You might be interested in a discussion regarding this topic on comp.databases.theory:
>
> https://groups.google.com/forum/?fromgroups=#!topic/comp.databases.theory/mqZZw3ojnjA

Along those lines, I love what Kenneth Downs says on his blog, The Database Programmer. Start at
http://database-programmer.blogspot.ca/2010/11/database-skills.htmland look for "Understanding Primary Keys, Foreign
Keysand Constraints". 

Ken suggests having a data dictionary and generating the schema from the dictionary. He has a PHP tool, being rewritten
butwith very slow progress. 

Keeping a meaningless ID is not a problem in and of itself. It makes it easier to edit records from the UI, since you
canreference the ID in the UPDATE and DELETE statements, without fear of colliding with anything else. It's not so much
aproblem on small lookup tables, but on larger entities (people, companies, etc), referencing through the ID is much,
mucheasier. 

Hope that helps!
François Beausoleil
Вложения

Re: DB design advice: lots of small tables?

От
lender
Дата:
Thanks everybody for your comments.

On 2013-03-15 16:03, François Beausoleil wrote:
> Keeping a meaningless ID is not a problem in and of itself. It makes
> it easier to edit records from the UI, since you can reference the ID in
> the UPDATE and DELETE statements, without fear of colliding with
> anything else. It's not so much a problem on small lookup tables, but on
> larger entities (people, companies, etc), referencing through the ID is
> much, much easier.

I'm not so much concerned with giving records in a large data set
surrogate IDs. There is often no good candidate for a natural key, or
the candidates aren't static enough.

The small tables I mentioned earlier all have clear and very static keys
(the "code" column). I'm only concerned about the practical consequences
of introducing strings where there used to be integers.


To give a practical example, this is how it looks with artificial IDs:

Table documents:
  id    | name        | type_id
 -------+-------------+----------
  62307 | Example.odt | 413

Table document_types:
  id  | code                             | text_short
 -----+----------------------------------+---------------------------
  413 | information_disclosure_statement | Information Disclosure [...]


Using the natural key, it would look like this:

Table documents:
  id | name         | type
 ----+--------------+----------------------------------
  23 | Example.odt  | information_disclosure_statement

Table document_types:
  code                             | text_short
 ----------------------------------+---------------------------
  information_disclosure_statement | Information Disclosure [...]


(admittedly, "information_disclosure_statement" is one of the longer
codes we use. The average is about 14 characters, the longest is 38
characters)


Now, what if we have hundreds of thousands of records in the "documents"
table? Apart from the increased storage requirements, will scanning the
table take noticably longer? Will the indexes suffer? Will creating,
updating, importing, dumping, restoring etc take (much) longer?
Comparing two integers is computationally less expensive than comparing
two variable-length strings, of course, but I have no empirical notion
of how much of a performance hit to expect.

I know that these questions cannot be answered with any accuracy without
knowing all the details, but that's the type of thing that has me a
little worried right now. I'm fine with getting a little less
performance; that should be mostly offset by the other changes and
improvements we're making. I just don't want to introduce a fundamental
mistake at this stage.

Thanks again,
crl


Re: DB design advice: lots of small tables?

От
Shaun Thomas
Дата:
On 03/15/2013 08:36 AM, Kevin Grittner wrote:

> I occasionally hear someone maintaining that having a meaningless
> sequential ID column as the primary key of each table is required
> by the relational model.

You know, I've heard you mention this a couple times, and I still don't
understand why you maintain such a position. Artificial the linkages may
be, but unless I'm missing something, the level of abstraction is often
necessary to avoid excessive storage requirements and bulky foreign keys.

For me, it comes down to two things:

1. Copy a "natural" arbitrary-length text string possibly millions of
times, or use a 32/64-bit integer lookup value.
2. Use multiple natural columns in a primary key, necessitating copying
all columns to child tables for foreign keys, or use one surrogate key
for both.

Yes, it complicates the schema. Possibly needlessly so. But until
someone comes up with a database storage method that automatically
deduplicates stored data, I can never advocate using arbitrary strings
as natural keys, no matter how unique and meaningful they are.

Maybe I just say that because I was burned by just such a column in a
previous engagement. It was 64-characters of arbitrary text, and was
used as a lookup value for dozens of tables. Had it been mapped to a
"meaningless" surrogate key, several tables would have been halved (or
more) in size. Don't even get me started on indexing that horrible
monstrosity in every table it lived in.

Small tables? Yeah, whatever. I don't care. But anything that has the
potential to be duplicated millions of times? You better bet that's
going to be a serial identifier.

Now, you've been doing this longer than I have, in a wider array of
contexts, so you clearly have some perspective I don't. From where I'm
sitting though, I don't get the barely suppressed rage. ;)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: DB design advice: lots of small tables?

От
Jasen Betts
Дата:
On 2013-03-15, lender <crlender@gmail.com> wrote:
> Hello.
>
> We are currently redesigning a medium/large office management web
> application. There are 75 tables in our existing PostgreSQL database,
> but that number is artificially low, due to some unfortunate design choices.
>
> The main culprits are two tables named "catalog" and "catalog_entries".
> They contain all those data sets that the previous designer deemed too
> small for a separate table, so now they are all stored together. The
> values in catalog_entries are typically used to populate dropdown select
> fields.


> So, my first main question would be: is it "normal" or desirable to have
> that many tiny tables? And is it a problem that many of the tables have
> the same (or a similar) column definitions?

Dunno about "normal", but certainly "Normal" (as in "-form").
No problem.

> The second point is that we have redundant unique identifiers in
> catalog_entries (id and code). The code value is used by the application
> whenever we need to find to one of the values. For example, for a query
> like "show all open invoices", we would either -
>
>   1) select the id from catalog_entries where catalog_id refers to the
>      "invoice_status" catalog and the code is "open"
>   2) use that id to filter select * from invoices
>
> - or do the same in one query using joins. This pattern occurs hundreds
> of times in the application code. From a programming viewpoint, having
> all-text ids would make things a lot simpler and cleaner (i.e., keep
> only the "code" column).
>
> The "id" column was used (AFAIK) to reduce the storage size. Most of the
> data tables have less than 100k records, so the overhead wouldn't be too
> dramatic, but a few tables (~10) have more; one of them has 1.2m
> records. These tables can also refer to the old catalog_entries table
> from more than one column. Changing all these references from INT to
> VARCHAR would increase the DB size, and probably make scans less
> performant. I'm not sure know how indexes on these columns would be
> affected.
>
> To summarize, the second question is whether we should ditch the
> artificial numeric IDs and just use the "code" column as primary key in
> the new tiny tables.

I if they aren't hurting you keep them.

> Thanks in advance for your advice.

If you're worried about clutter It may make sense to put all the small tables
in a separate schema.


--
⚂⚃ 100% natural

Re: DB design advice: lots of small tables?

От
Kevin Grittner
Дата:
Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 03/15/2013 08:36 AM, Kevin Grittner wrote:
>
>> I occasionally hear someone maintaining that having a meaningless
>> sequential ID column as the primary key of each table is required
>> by the relational model.
>
> You know, I've heard you mention this a couple times, and I still don't
> understand why you maintain such a position. Artificial the linkages may be, but
> unless I'm missing something, the level of abstraction is often necessary to
> avoid excessive storage requirements and bulky foreign keys.
>
> For me, it comes down to two things:
>
> 1. Copy a "natural" arbitrary-length text string possibly millions of
> times, or use a 32/64-bit integer lookup value.
> 2. Use multiple natural columns in a primary key, necessitating copying all
> columns to child tables for foreign keys, or use one surrogate key for both.

When I was working for the Wisconsin Courts the Circuit Court
databases had some tables which were maintained by a central Court
Operations group and some tables holding data generated by each
county.  The primary key of each county maintained table included a
county number (smallint).  Most county-maintained tables were
related to court cases, and included a court case number as part of
the primary key (varchar(14)).  Beyond that, other fields were
added to each table, and where there was a natural heirarchy to the
data the "child" normally had the PK of its parent and something to
make it unique within that set.  Often this was a unique number
that started at 1 for each parent.  Some of these tables, when
replicated to the central state-wide site, had hundreds of millions
of rows.

You would probably be surprised how many queries were able to
optimize much better than they could have with a single-column ID
in each table.  The plans often surprised me, and when I would
force the planner to take the plan which seemed obvious to me, 90%
of the time the planner had me beat.  The other times often
suggested new indexes or adjustments to cost factors which
benefited the whole workload.

> Yes, it complicates the schema.

That's not the point.

> But until someone comes up with a database storage method that automatically deduplicates stored data, I
> can never advocate using arbitrary strings as natural keys, no matter how unique
> and meaningful they are.

If micro-managing disk space usage it the primary concern,
single-column synthetic keys are likely to win in many
circumstances.  I have no doubt that there are cases where it can
benefit performance.  I can guarantee you that many of the queries
we ran at Wisconsin Courts would have been much slower with such
synthetic keys because they limit the available plans and force
some unnatural "navigation".  I know because I saw cases where
people forced the type of navigation the planner would need to do
if synthetic keys were used, and the result was performance orders
of magnitude worse.

> Maybe I just say that because I was burned by just such a column in a previous
> engagement. It was 64-characters of arbitrary text, and was used as a lookup
> value for dozens of tables. Had it been mapped to a "meaningless"
> surrogate key, several tables would have been halved (or more) in size.
> Don't even get me started on indexing that horrible monstrosity in every
> table it lived in.

That is getting pretty extreme.

> Small tables? Yeah, whatever. I don't care. But anything that has the
> potential to be duplicated millions of times? You better bet that's going to
> be a serial identifier.

Like I said, Wis Cts. has a table that has hundreds of millions of
rows and the primary key is a smallint, a varchar(14), and another
smallint.  You can poke around in it at this site by drilling down
on the "Court Record Events" button for a case:

http://wcca.wicourts.gov/

> Now, you've been doing this longer than I have, in a wider array of
> contexts, so you clearly have some perspective I don't. From where I'm
> sitting though, I don't get the barely suppressed rage. ;)

Oh, every time a programmer who had been working in Access or MySQL
was hired, I had to have this discussion all over again.  You're
ripping the scabs off the old wounds from those battles.  :-)  I usually
had to pull out plans from complex (and fast!) queries to make my
point.

The other thing is that it brings back memories from the early '80s
of working in a database product called TOTAL which required
explicit navigation over links.  Relational technology was such a
breath of fresh air compared to that, I dread a pendulum swing back
to that.

That's been my experience, anyway.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company