Re: FW: Surrogate keys (Was: enums)

Поиск
Список
Период
Сортировка
От	Dann Corbit
Тема	Re: FW: Surrogate keys (Was: enums)
Дата	18 января 2006 г. 21:34:21
Msg-id	D425483C2C5C9F49B5B7A41F8944154757D456@postal.corporate.connx.com обсуждение исходный текст
Ответ на	FW: Surrogate keys (Was: enums) ("Jim Nasby" <jnasby@pervasive.com>)
Ответы	Re: Surrogate keys (Was: enums) Re: FW: Surrogate keys (Was: enums)
Список	pgsql-hackers
Дерево обсуждения
Maybe it goes better into Advocacy or something, but I have found a quote by database big-wigs that I strongly disagree
with:

From:
http://www.db.ucsd.edu/cse132B/Thirdmanifesto.pdf

We have this.
"PROPOSITION 1.4: Unique Identifiers (UIDs) for records should be assigned by the DBMS only if a user-defined primary
keyis not available. 
Second generation systems support the notion of a primary key, which is a user-assigned unique identifier.  If a
primarykey exists for a collection that is known never to change, for example social security number, student
registrationnumber, or employee number, then no additional system-assigned UID is required. An immutable primary key
hasan extra advantage over a system-assigned unique identifier because it has a natural, human readable meaning.
Consequently,in data interchange or debugging this may be an advantage.  If no primary key is available for a
collection,then it is imperative that a system-assigned UID be provided. Because SQL supports update through a cursor,
secondgeneration systems must be able to update the last record retrieved, and this is only possible if it can be
uniquelyidentified. If no primary key serves this purpose, the system must include an extra UID. Therefore, several
secondgeneration systems already obey this proposition. Moreover, as will be noted in Proposition 2.3, some
collections,e.g. views, do not necessarily have system assigned UIDs, so building a system that requires them is likely
tobe proven undesirable. We close our discussion on Tenet 1 with a final proposition that deals with the notion of
rules."

This is a bad idea.

Let's take the example of a Social Security Number.

Not everyone has one:
http://www.ssa.gov/pubs/10002.html#how2

If people do have one, they can definitely change it.  If someone has stolen a SSN, then the wronged party is able to
gettheir SSN changed: 
http://101-identitytheft.com/ssn.htm
The odds of this happening are low, but if you cannot handle it, then the damage caused is considerable.
Now what happens if you want to have customers outside of the USA? {Don't worry, we'll never go global...}  I hope that
myobjections are very plain and obvious. 

The primary key should be immutable, meaning that its value should not be changed during the course of normal
operationsof the database.  What natural key is immutable?  The answer is that such an attribute does not exist.  To
usethem for such a purpose is begging for trouble. 

I saw the argument that there is a great volume of space wasted by adding a column that does not naturally occur in the
data. That argument is simply absurd.  Consider a database with 10 billion rows of data in it.  Each of those tables
getsan 8 byte primary key added for every row, resulting in 80 GB consumed.  The cost of 80 GB is perhaps $200.  With a
databasethat large (where the extra space consumed by an artificial key column has a cost that can easily be measured)
theodds of a problem arising due to a natural column changing its value are huge.  The cost of such a tragedy is
certainlymore than the $200 pittance! 

If there is an argument that we also have the parent key values propagated into the child tables as foreign keys, that
argumenthas no merit.  The other attribute that would have been chosen would also be propagated.  And so (for instance)
thereis no savings to propagating a SSN field into child tables verses propagating an 8 byte integer. 

I also saw an argument that the propagated ID values are confusing to end-users.  That is the fault of the database
designerwho game them a stupid name.  If they were things like InvoiceID and LineItemID then there will not be the same
sortof confusion.  The meaning and purpose of the column is immediately apparent.  As an alternative, the ubiquitous
OIDname for a column on a table is also very transparent.  Of course, when it is used in a foreign key, it must be
givena role name to avoid confusion in that case. 

At any rate, the use of natural keys is a mistake made by people who have never had to deal with very large database
systems.

IMO-YMMV.


> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Leandro Guimarães Faria Corcete Dutra
> Sent: Wednesday, January 18, 2006 4:31 PM
> To: Jim C. Nasby
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: FW: [HACKERS] Surrogate keys (Was: enums)
>
> Em Qua, 2006-01-18 às 17:22 -0600, Jim C. Nasby escreveu:
> > >
> > >    Forgive me my ignorance, but are ints inherently faster to compare
> than
> > >strings, or is it just an implementation detail?  Ideally, if this is
> so
> > >a fully data-independent system would create a hash behind the back of
> > >user in order to get performance.
> >
> > The CPU can do an integer comparison with one instruction; it can't do
> > that with a text string.
>
>     OK.  Again, data independence should be the goal here.
>
>
> > >    OK, hardly a typical example.  As I think I left clear, my problem
> is
> > >not using surrogate keys, but using them by default, or even
> > >exclusively.
> >
> > No? It's certainly not uncommon to have tables with 100M+ rows.
>
>     No, but neither are they *that* common.
>
>     Certainly, lots of database have a few of them.  But then, they have
> dozens, hundreds, thousands of much smaller tables.
>
>
> > And keep
> > in mind that this applies to every row of every table that has foreign
> > keys. I'd bet it's actually common to save 1G or more with surrogate
> > keys in moderately sized databases.
>
>     Only if you have quite some children, because otherwise, in the main
> tables, the surrogate keys add a field, an index and a sequence to an
> otherwise smaller table and index.
>
>
> > Of course, you do have to be intelligent here, too. The only key defined
> > on the table in my example is participant_id, project_id, date; there is
> > no surrogate key because there's no real reason to have one.
>
>     Quite.
>
>
> > >> (In actuality, there isn't participant_name... participants are
> > >> identified by email address (not a great idea, but I wasn't around
> > >> when that was chosen). As you can imagine, email addresses are
> > >> substantially longer than 4 bytes. When we normalized email out of
> > >> that main table things got substantially faster. That was a number of
> > >> years ago, so the table was probably 15-25% of it's current size, but
> > >> it still made a huge difference.)
> > >
> > >    This isn't normalisation at all, as far as I understand it.  It is
> just
> >
> > I don't have the rules of normalization memorized enough to know what
> > form this breaks, but I'm 99% certain it breaks at least one of them.
>
>     No, never.
>
>     Normalisation is about eliminating redundancy and, therefore, update
> anomalies.  Making all the table dependent on only the keys and the
> whole keys, by projecting relations to eliminate entity mixups.
>
>     What you mention is actually exposing an implementation detail,
> namely
> an integer that serves as a hash of the key.
>
>
> > Look at it this way: if someone wants to change their email address,
> > best case scenario is that you have cascading RI setup and it updates
> > thousands of rows in that table. Worst case scenario, you just de-linked
> > a whole bunch of data. But with a surrogate key, all you have to do is
> > update one row in one table and you're done.
>
>     OK, if you have lots of linked data.  But most tables are really
> dead
> ends.
>
>
> > >that we don't have data independence... so you had to expose an
> > >implementation detail?
> >
> > Expose to what? The application? First, this is a pretty minor thing to
> > expose; second, if it's that big a concern you can completely hide it by
> > using a view.
>
>     As someone said, you end up with ids everywhere, and no
> user-understandable data at all...
>
>
> > But the reality is, dealing with a numeric ID can be a heck of a lot
> > easier than an email address. Look at URLs that embbed one versus the
> > other for a good example.
>
>     Again, implementation details... levels mixup.
>
>
> > >> Why should it? It's trivial to create views that abstract surrogate
> > >> keys out, and if you really want to you can even make the views
> > >> updatable. But here's two other things to consider:
> > >
> > >    These views, in heavy querying environments, can be prohibitive.
> >
> > "Normalize 'til it hurts; denormalize 'til it works."
>
>     Lack of data implementation biting us again.
>
>
> > Yes, the added overhead of rules for updates/inserts/deletes could start
> > to add up in performance-critical code. But if performance is that
> > critical you're far more likely to run into other bottlenecks first. And
> > worst-case, you abstract behind a stored procedure that just has the
> > right queries hard-coded.
> >
> > As for select-only views you'll have a hard time showing any meaningful
> > performance penalty.
>
>     Yet real user-defined data types could make it all much simpler.
>
> --
> +55 (11) 5685 2219               xmpp:leandrod@jabber.org
> +55 (11) 9406 7191                       Yahoo!: lgcdutra
> +55 (11) 5686 9607         MSN: leandro@dutra.fastmail.fm
> +55 (11) 4390 5383                      ICQ/AIM: 61287803
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly
В списке pgsql-hackers по дате отправления:
Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: FW: Surrogate keys (Was: enums)