Discussion: pervasiveness of surrogate (also called synthetic) keys

A colleague of mine insists that using surrogate keys is the
common practice by an overwhelming margin in relational databases and
that they are used in 99 percent of large installations. I agree that many
situations benefit from them, but are they really as pervasive
as he claims?

Thanks,

- Jim
On 4/28/2011 12:29 PM, Jim Irrer wrote:
> A colleague of mine insists that using surrogate keys is the
> common practice by an overwhelming margin in relational databases and
> that they are used in 99 percent of large installations. I agree that many
> situations benefit from them, but are they really as pervasive
> as he claims?

I don't see how you could know unless you went to all the "large
installations" and asked. But since it's a good idea, and you "should" do it
that way, and because I'm pessimistic, I'd say only 5% of RDB users do it
that way.

Oh! Joke: Why do DB admins make better lovers? They use surrogates!

Anyway, I'm not a large install, but I use 'em. That's gotta count for
something.

Really, how could you count? Was there a poll someplace? Ask for some data.
Otherwise it seems like BS to me.

-Andy
On 04/28/2011 11:44 AM, Andy Colson wrote:
> Really, how could you count? Was there a poll someplace? Ask for some
> data. Otherwise seems like BS to me.

Hm, I get the feeling that only the good folks at Hibernate seem to think
using a "natural key" is the _only_ way to go.
On Thu, Apr 28, 2011 at 01:29:31PM -0400, Jim Irrer wrote:
> common practice by an overwhelming margin in relational databases and
> that they are used in 99 percent of large installations.

94.68536% of all the claims I ever hear are obviously pulled out of thin
air. What conclusion does your colleague want to draw from this
overwhelming (if perhaps statistically dubious) penetration? Surely the
argument doesn't conclude, "Therefore we should do that too?" I seem to
recall my mother making some remark about others jumping off cliffs.

A

--
Andrew Sullivan
ajs@crankycanuck.ca
On 04/28/2011 10:29 AM, Jim Irrer wrote:
> A colleague of mine insists that using surrogate keys is the
> common practice by an overwhelming margin in relational databases and
> that they are used in 99 percent of large installations.

Well, there is no fact to back that up, but I will say that most toolkits
require the use of a synthetic key: Rails, Django, etc.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
Organizers of the PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579
On Apr 28, 2011, at 11:53 AM, Rob Sargent wrote:
> Hm, I get the feeling that only the good folks at Hibernate seem to think
> using a "natural key" is the _only_ way to go.

Well, natural keys are quite obviously the way to go, when they exist. The
problem is, they usually don't really exist. What's usually proposed as a
natural key will, upon further investigation, either not be guaranteed
unique, or not be guaranteed to be unchanging, or both.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
On Thu, Apr 28, 2011 at 7:26 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> Well there is no fact to back that up but, I will say that most toolkits
> require the use of a synthetic key, rails, django etc....

Usually such tools are born with surrogate keys only, because it's easier,
and either grow up developing natural keys (e.g. the Django ORM, SQLAlchemy)
or fade into uselessness (e.g. SQLObject). But this speaks more about the
tools than about the merits of natural keys: if a toolkit doesn't support
them, it's seriously getting in the way in this and probably in other
matters too.

--
Daniele
On Thu, Apr 28, 2011 at 12:29 PM, Jim Irrer <irrer@umich.edu> wrote:
> A colleague of mine insists that using surrogate keys is the
> common practice by an overwhelming margin in relational databases and
> that they are used in 99 percent of large installations.

They are fairly pervasive, and increasingly so, which I find to be really
unfortunate. Personally I think rote use of surrogate keys is terrible and
leads to bad table designs, especially if you don't identify the true
natural key with, say, a unique constraint. This pushes duplicate
enforcement out of the database and into application code, or (even worse)
onto the user. What everyone misses is that natural keys force good
database design: if you can't discover one, you probably have a design
problem.

There are of course exceptions, but they are limited. The main/best case
for surrogates is defense against cascading updates -- obviously, if a
single record change can turn into thousands of update statements, you
absolutely want to avoid that if you can. But in a typical database this
only applies to particular tables; 'one off' updates are of course no
problem. Even in an 'infinite performance' world, this would still be an
update, because of locks.

The performance case either way is complex. Generally, I find natural keys
to be faster all things considered, but this is going to be situationally
dependent. Although integer indexes are tighter and faster, natural keys
give back by skipping entire joins, sorts, etc. They also tend to have
better tuple/page efficiency, so that when searching for a range of records
you touch fewer pages. Essentially, you get more traffic routed through a
smaller number of (albeit bigger) indexes. Natural-key tables also tend to
be 'self documenting', which is a nice property.

A lot of bad arguments are made against natural keys, for example:
*) natural keys can change (so what? unless you are worried about cascades)
*) SSNs are reused! (SSN is obviously not a complete key if you want to
identify a person)
*) most tables don't have unique natural keys (let's see 'em)
etc.

merlin
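Merlin's suggestion of still identifying the true natural key with a unique constraint even when a surrogate key is present can be sketched as follows. This is a minimal illustration using SQLite via Python's standard library so it is self-contained (the thread itself concerns PostgreSQL, where the DDL would be essentially the same); the table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Surrogate primary key, but the true natural key (code) is
# still declared UNIQUE so the database itself enforces it.
conn.execute("""
    CREATE TABLE country (
        id   INTEGER PRIMARY KEY,   -- surrogate key
        code TEXT NOT NULL UNIQUE,  -- natural key, still enforced
        name TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO country (code, name) VALUES ('DE', 'Germany')")

# A duplicate natural key is rejected by the database,
# not by application code.
try:
    conn.execute("INSERT INTO country (code, name) VALUES ('DE', 'Deutschland')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```

With this shape, duplicate enforcement stays in the database even though foreign keys reference the compact integer `id`.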
Any system that generates transactional data has to use some kind of
synthetic key. I guess you could rely upon some form of timestamp, but from
a usability standpoint that is not really a good decision. Inventory also
requires synthetic keys - whether you decide what they are or someone else
does. A serial field is the common form of a synthetic key, but any
assigned identifier can be defined in the same way. You can even argue that
a field that concatenates the parts of the natural key qualifies as
synthetic. To actually use the data in the database, it is often helpful to
be able to uniquely identify a row using a single field.

My personal take is that normalization is important. Once you have
normalized, you will find tables with compound natural primary keys. If
these tables are going to be queried in such a way as to return a single
record (WHERE clause), it is likely that a synthetic key would be
beneficial. If they are normally going to be used only within the context
of other core tables (basically the JOIN ON clause), then it is less
important to have a single representative value.

Justifying anything simply based upon what other people do is superficial.
As for the colleague: the basic response is "what is your source?". If they
are trying to convince you to include synthetic keys on specific tables,
ask them to explain what specific benefit will be gained by doing so, and
let them know that "you should always have a synthetic key" is not a
helpful response.

David J.
On Thu, Apr 28, 2011 at 03:39:19PM -0500, Merlin Moncure wrote:
> They are fairly pervasive, and increasingly so, which I find to be
> really unfortunate. Personally I think rote use of surrogate keys is
> terrible and leads to bad table designs, especially if you don't
> identify the true natural key with, say, a unique constraint.

I was recently asked on this list why GNUmed uses all those surrogate keys.

I should have added to my answer that we DO make *extensive* use of all
sorts of built-in constraints and custom triggers to enforce "natural"
keys. I must agree with a recent poster that what appears to identify as a
natural key often really isn't, or else becomes not so later on. It's
vastly easier to then deal with that by re-defining constraints without
having to touch primary keys.

> This pushes duplicate enforcement out of the database and into
> application code, or (even worse) the user. What everyone misses in the
> database is that natural keys force good database design...if you can't
> discover one, you probably have a design problem.

Sure, but that doesn't mean you need to actually *use* natural keys as
primary keys - enforce them with all sorts of constraints, sure - but don't
assume you properly figured out the perfect schema the first time around.

I've so far found it *good* to have duplicate enforcement:

- the database enforces what we agree the final data *should* look like

- the UI tries to lure the user into entering "valid" data

Now, those ("duplicative") database constraints have saved our butt more
than once by preventing faulty patient data from being inserted into
medical records.

> A lot of bad arguments made against natural keys are made, for example:
> *) natural keys can change (so what? unless you are worried about cascades)

I find it is not so much that they can change: sure, it doesn't matter
whether a certain code reads "C03EB21" or "C03EB22". It is rather that they
tend to go non-unique whenever the whims of meatspace decide it's now more
convenient to allow dupes: suddenly there must be two records with code
"C03EB21".

> *) SSN are reused! (SSN is obviously not a complete key if you want to
> identify a person)
> *) most tables don't have unique natural keys (let's see em)

Now, those two arguments are bogus, I agree.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
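Karsten's point, that it is vastly easier to re-define a uniqueness rule than to touch a primary key other tables reference, can be illustrated like this. A sketch using SQLite from Python's standard library rather than PostgreSQL; the drug-code tables and all names here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE drug_code (
        pk   INTEGER PRIMARY KEY,  -- surrogate key, referenced elsewhere
        code TEXT NOT NULL         -- "natural" key, enforced separately
    )
""")
# Natural-key uniqueness lives in a separate, droppable index,
# not in the primary key.
conn.execute("CREATE UNIQUE INDEX udx_drug_code ON drug_code(code)")
conn.execute("""
    CREATE TABLE prescription (
        pk      INTEGER PRIMARY KEY,
        drug_pk INTEGER NOT NULL REFERENCES drug_code(pk)
    )
""")
conn.execute("INSERT INTO drug_code (code) VALUES ('C03EB21')")
conn.execute("INSERT INTO prescription (drug_pk) VALUES (1)")

# Meatspace changes its mind: duplicates of 'C03EB21' must now be
# allowed. Only the uniqueness rule changes; the primary key and every
# foreign key referencing it are untouched.
conn.execute("DROP INDEX udx_drug_code")
conn.execute("INSERT INTO drug_code (code) VALUES ('C03EB21')")
n = conn.execute(
    "SELECT count(*) FROM drug_code WHERE code = 'C03EB21'"
).fetchone()[0]
print(n)  # 2
```

Had `code` been the primary key, allowing the duplicate would have meant rekeying `prescription` as well.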
On Thu, Apr 28, 2011 at 4:07 PM, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> I should have added to my answer that we DO make *extensive*
> use of all sorts of built-in constraints and custom triggers
> to enforce "natural" keys.

Pretty much agree on all points. I don't really think primary keys tend to
change very much in terms of schema, but when they do it can be a real
headache.

I took a quick look at the GNUmed schema and found it to be generally very
thorough and excellent. If you're going to use surrogate keys, that's the
way to do it. That's a neat trick, btw, to use inheritance for the auditing
feature...how is it working out for you? Any general comments on PostgreSQL
with regards to your product?

merlin
On Fri, Apr 29, 2011 at 10:14:07AM -0500, Merlin Moncure wrote:
> I took a quick look at the gnumed schema and found it to be generally
> very thorough and excellent. If you're going to use surrogate keys,
> that's the way to do it.

Good to know, since I'm only a lowly medical doctor not having much
schooling in database matters beyond this list, the PostgreSQL docs, and
the Celko book.

> That's a neat trick btw to use inheritance
> for the auditing feature...how is it working out for you?

That works very nicely for us. Same thing with aggregating clinical
narrative across diverse tables.

> Any general comments on postgresql with regards to your product?

We have found it to be very dependable and professionally maintained.
We've never lost any patient data due to crashes (for what that's worth).
The breadth of constraints one can define has saved our behinds several
times by preventing buggy applications from storing faulty patient data.

One thing that'd be helpful to have would be ON CONNECT triggers - that
would make it much safer to support HIPAA requirements (I'm aware of the
apparent fallacy of a faulty ON CONNECT trigger preventing superuser
access - that can be overcome by not running ON CONNECT triggers in
single-user rescue mode). Another would be database-wide asserts on data.
Of course, better support of inheritance in terms of definably propagating
constraints and triggers would be nice :-) But that's a lot to ask.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On Thu, Apr 28, 2011 at 3:07 PM, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> I must agree with a recent poster
> that what appears to identify as a natural key often really
> isn't or else becomes not so later on. It's vastly easier to
> then deal with that by re-defining constraints without
> having to touch primary keys.

I have run into this. Often what happens is that the data model directly
affects the assumptions coders make. In a system I worked on, the username
was the unique / primary key. Code was written that assumed the username
would always be unique across the app. Then the app was deployed to
multiple non-associated groups, where user names from two different
organizations might be the same and couldn't be changed. Talk about some
ugly code changes needed to fix all the bugs. Ugh.

An initial assumption that a serial value would be unique, but usernames
might not be, would have resulted in a much cleaner design for that system.
Or even that the PK was org||username or something.
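The cleaner design described in the username anecdote, a serial value as the key with username uniqueness scoped per organization, might look like this. A sketch using SQLite from Python's standard library; the `app_user` table, `org` column, and sample names are all invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A serial surrogate is what other tables would reference; uniqueness
# is scoped to (org, username) from the start.
conn.execute("""
    CREATE TABLE app_user (
        id       INTEGER PRIMARY KEY,
        org      TEXT NOT NULL,
        username TEXT NOT NULL,
        UNIQUE (org, username)
    )
""")
conn.execute("INSERT INTO app_user (org, username) VALUES ('acme', 'bob')")
# The same username in a different organization is fine...
conn.execute("INSERT INTO app_user (org, username) VALUES ('globex', 'bob')")
# ...but a duplicate within one organization is still rejected.
try:
    conn.execute("INSERT INTO app_user (org, username) VALUES ('acme', 'bob')")
    within_org_dupe_allowed = True
except sqlite3.IntegrityError:
    within_org_dupe_allowed = False
print(within_org_dupe_allowed)  # False
```

Code that joins on `id` never notices when the uniqueness scope changes, which is exactly the painful rewrite the anecdote describes avoiding.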
On Thursday, April 28, 2011 08:31:09 PM Scott Ribe wrote:
> Well, natural keys are quite obviously the way to go, when they exist. The
> problem is, they usually don't really exist. What's usually proposed as a
> natural key will, upon further investigation, either not be guaranteed
> unique, or not guaranteed to be unchanging, or both.

There is no fundamental problem with changing primary keys. Sure, there are
challenges and situations where that's annoying and problematic, but it's
not as bad as often assumed.

Andres
On 05/01/2011 06:12 PM, Karsten Hilbert wrote:
> Good to know since I'm only a lowly medical doctor not
> having much schooling in database matters beyond this list,
> the PostgreSQL docs, and the Celko book.

This debate exists at all levels of experience, and the only thing that
changes as you get more experienced people involved is an increase in
anecdotes on each side. The sole time I ever found myself arguing with Joe
Celko was over an article he wrote recommending natural keys, using an
example from the automotive industry. Problem was, the specific example he
gave was flat out wrong. I was working in automotive MIS at the time, and
the thing he said would never change did, in fact, change every year--in
only a fraction of a percent of cases, in an extremely subtle way that
snuck up on people and wreaked much confusion. That's typical for an early
natural-key design: you get it working fine in V1.0, only to discover
months or years down the road that there's a case you never considered and
don't model correctly, and it may take some sort of conversion to fix.

The reason why there's a strong preference for surrogate keys is that they
always work, and you can avoid ever needing to come up with a better design
if you just use them and forget about it. The position Merlin has advocated
here, that there should always be a natural key available if you know the
data well enough, may be true. But few people are good enough designers to
be sure they've made the decision correctly, and the downsides of being
wrong can be a long, painful conversion process. It's easier for most
people to just eliminate the possibility of making a mistake by using
auto-generated surrogate keys, where the primary problem you'll run into is
merely using more space/resources than you might otherwise need. It
minimizes the worst case--a mistake made in the model, expensive
re-design--by adding overhead that makes the average case more expensive.
Software design usually has enough risks that any time you can eliminate
one just by throwing some resources at it, that's normally the right thing
to do.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Thanks very much for the thoughtful responses - very enlightening.
- Jim (the topic originator)
Jim Irrer irrer@umich.edu (734) 647-4409
University of Michigan Hospital Radiation Oncology
519 W. William St. Ann Arbor, MI 48103
On Mon, May 2, 2011 at 10:10 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> It minimizes the worst case--a mistake made in the model, expensive
> re-design--by adding overhead that makes the average case more expensive.
> Software design usually has enough risks that any time you can eliminate
> one just by throwing some resources at it, that's normally the right
> thing to do.

There are many practical arguments advocating the use of surrogate keys:
faster updates, easier schema changes, performance, maintenance costs down
the line, better tools integration, etc. These arguments basically involve
trade-offs that can justifiably be used to push you one way or the other.
That's all well and good.

However, I tend to disagree with arguments that you are better off not
identifying natural keys at all. To my mind, any database that has a table
with no discernible key, but whose records are referred to via another
table's foreign key, has a schema that is in a state of error. A surrogate
key is just added information to the database -- why should record X out of
N identical candidates be the one mapped? Is that logic repeatable? What
are the ramifications for joins that do not flow through the ID columns?
Typically what it means is that the rules that guard against duplicate data
entry are not, in fact, in the database at all but in the application, and
bad data can now get into your database through a much broader array of
causes. The last and best defense against a nasty and common class of data
errors has been removed.

The more complex your database is, the more it tends to be used by a large
array of clients, possibly even spanning multiple computer languages --
thus the need for a root system of constraint checking that is declarative
and easily understood. Sure, requirements change, models change, but at any
particular point in time a model with as little as possible (read: none) in
the way of outside inputs should be able to provably demonstrate verifiable
facts. With a natural-key database (or a surrogate-key database with
defined keys that are not used for RI) you have inherent constraint
checking that a purely surrogate database simply doesn't have. Whatever the
software maintenance costs are, which is itself a complex and debatable
topic, I'll go with the strategy that gives a better defense against bad or
ambiguous data.

merlin
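The "state of error" Merlin describes, a table whose only key is the surrogate, can be demonstrated directly: nothing stops two logically identical rows from coexisting under different ids. A minimal sketch using SQLite via Python's standard library; the table and data are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A purely surrogate-keyed table with no declared natural key.
conn.execute("""
    CREATE TABLE customer_bad (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL   -- no UNIQUE: duplicates sneak in silently
    )
""")
conn.execute("INSERT INTO customer_bad (email) VALUES ('jim@example.com')")
conn.execute("INSERT INTO customer_bad (email) VALUES ('jim@example.com')")
dupes = conn.execute(
    "SELECT count(*) FROM customer_bad WHERE email = 'jim@example.com'"
).fetchone()[0]
print(dupes)  # 2 -- duplicate enforcement now falls to the application
```

Which of the two rows other tables should reference is exactly the unanswerable "record X out of N identical candidates" question raised above.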
On Mon, 2011-05-02 at 11:10 -0400, Greg Smith wrote:
> The position Merlin has advocated here, that there should always be a
> natural key available if you know the data well enough, may be true. But
> few people are good enough designers to be sure they've made the decision
> correctly, and the downsides of being wrong can be a long, painful
> conversion process.

Once you really try to define "natural" and "surrogate" keys, I think a lot
of the arguments disappear. I wrote about this a few years back:

http://thoughts.j-davis.com/2007/12/11/terminology-confusion/

In particular, I think you are falsely assuming that a natural key must be
generated from an outside source (or some source outside of your control),
and is therefore not reliably unique.

You can generate your own keys, and if you hand them out to customers and
include them on paperwork, they are now a part of the reality that your
database models -- and therefore become natural keys. Invoice numbers,
driver's license numbers, etc., are all natural keys, because they are
known about, and used, in reality. Usernames are, too; the only difference
is that you let the user choose them.

In contrast, a pointer or a UUID typically does not represent reality,
because no humans ever see it and no computer systems outside yours know
about it. So it's merely an implementation detail and should not be part of
the model.

Regards,
Jeff Davis
On 03/05/11 08:25, Jeff Davis wrote:
> You can generate your own keys, and if you hand them out to customers
> and include them on paperwork, they are now a part of the reality that
> your database models -- and therefore become natural keys.

I've repeatedly run into situations where I generate a key that seems
entirely sensible, making a generated primary key part of the business
processes ... then external constraints force me to change the format of
that key or start accepting keys from outside. "Oh, we need to move to
14-digit client IDs because <x-system> that we interact with requires
them." "We want invoice numbers to include a Luhn check digit, can you add
that?" Etc.

I'm now strongly in favour of keeping an internal key that users never see,
and having separate user-visible identifiers. The users can demand that
those identifiers change format or generation method, and it's an easy
change in only one place. Want two different keys? I can do that too.
Record a key that matches some external system? That's easy. Want to be
able to edit/override/rename keys? Yep, that's fuss-free too, and it won't
affect my audit history (which uses the real internal keys) or have to
cascade to foreign key relationships in append-only ledger tables.

I use a mix of surrogate and natural keys, depending on the situation. I
see little point in surrogate keys for simple lookup tables, but find them
invaluable in audited tables with lots of foreign key relationships that
interact with other business systems.

--
Craig Ringer
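Craig's pattern, an internal key users never see plus a separate user-visible identifier, might be sketched as follows. This uses SQLite from Python's standard library for a self-contained example; the invoice table and number formats are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoice (
        id         INTEGER PRIMARY KEY,  -- internal key, never shown to users
        invoice_no TEXT NOT NULL UNIQUE  -- user-visible identifier
    )
""")
conn.execute("INSERT INTO invoice (invoice_no) VALUES ('INV-0001')")

# External constraints force a new user-visible format (say, 14 digits).
# Only the display identifier changes, in one place; the internal key,
# and anything referencing it, stay untouched.
conn.execute(
    "UPDATE invoice SET invoice_no = '00000000000001' WHERE id = 1"
)
row = conn.execute("SELECT id, invoice_no FROM invoice").fetchone()
print(row)  # (1, '00000000000001')
```

Audit history and foreign keys keyed on `id` survive the format change unchanged, which is the point of keeping the two identifiers separate.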
On Mon, May 2, 2011 at 7:43 PM, Craig Ringer <craig@postnewspapers.com.au> wrote:
> I'm now strongly in favour of keeping an internal key that users never
> see, and having separate user-visible identifiers.

It's far easier to change a unique constraint on the fly than a PK in a
busy production database.
Jeff Davis wrote: > On Mon, 2011-05-02 at 11:10 -0400, Greg Smith wrote: >> The position Merlin >> has advocated here, that there should always be a natural key available >> if you know the data well enough, may be true. But few people are good >> enough designers to be sure they've made the decision correctly, and the >> downsides of being wrong can be a long, painful conversion process. >> Easier for most people to just eliminate the possibility of making a >> mistake by using auto-generated surrogate keys, where the primary >> problem you'll run into is merely using more space/resources than you >> might otherwise need to have. It minimizes the worst-case--mistake make >> in the model, expensive re-design--by adding overhead that makes the >> average case more expensive. > > Once you really try to define "natural" and "surrogate" keys, I think a > lot of the arguments disappear. I wrote about this a few years back: > > http://thoughts.j-davis.com/2007/12/11/terminology-confusion/ > > In particular, I think you are falsely assuming that a natural key must > be generated from an outside source (or some source outside of your > control), and is therefore not reliably unique. > > You can generate your own keys, and if you hand them out to customers > and include them on paperwork, they are now a part of the reality that > your database models -- and therefore become natural keys. Invoice > numbers, driver's license numbers, etc., are all natural keys, because > they are known about, and used, in reality. Usernames are, too, the only > difference is that you let the user choose it. > > In contrast, a pointer or a UUID typically does not represent reality, > because no humans ever see it and no computer systems outside yours know > about it. So, it's merely an implementation detail and should not be a > part of the model. > > Regards, > Jeff Davis > My wife works (at the sql level) with shall we say "records about people". Real records, real people. 
Somewhere around 2 million unique individuals, several million source records. They don't all have ssn, they don't all have a drivers license. They don't all have an address, many have several addresses (especially over time) and separate people have at one time or another lived at the same address. You would be surprised how many "bob smith"s were born on the same day. But then they weren't all born in a hospital etc etc etc. A person may present on any of a birth record, a death record, a hospital record, a drivers license, a medical registry, a marriage record and so on. There simply is no natural key for a human. We won't even worry about the non-uniqueness of ssn. And please don't get her started on twins. :) I can only imagine that other equally complex entities are just as slippery when it comes time to pinpoint the natural key. rjs
> -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] On Behalf Of Rob Sargent > Sent: Monday, May 02, 2011 7:07 PM > To: Jeff Davis > Cc: Greg Smith; pgsql-general@postgresql.org > Subject: Re: [GENERAL] pervasiveness of surrogate (also called > synthetic) keys > > > > Jeff Davis wrote: > > On Mon, 2011-05-02 at 11:10 -0400, Greg Smith wrote: > >> The position Merlin > >> has advocated here, that there should always be a natural key > available > >> if you know the data well enough, may be true. But few people are > good > >> enough designers to be sure they've made the decision correctly, and > the > >> downsides of being wrong can be a long, painful conversion process. > >> Easier for most people to just eliminate the possibility of making a > >> mistake by using auto-generated surrogate keys, where the primary > >> problem you'll run into is merely using more space/resources than > you > >> might otherwise need to have. It minimizes the worst-case--mistake > make > >> in the model, expensive re-design--by adding overhead that makes the > >> average case more expensive. > > > > Once you really try to define "natural" and "surrogate" keys, I think > a > > lot of the arguments disappear. I wrote about this a few years back: > > > > http://thoughts.j-davis.com/2007/12/11/terminology-confusion/ > > > > In particular, I think you are falsely assuming that a natural key > must > > be generated from an outside source (or some source outside of your > > control), and is therefore not reliably unique. > > > > You can generate your own keys, and if you hand them out to customers > > and include them on paperwork, they are now a part of the reality > that > > your database models -- and therefore become natural keys. Invoice > > numbers, driver's license numbers, etc., are all natural keys, > because > > they are known about, and used, in reality. 
Usernames are, too, the > only > > difference is that you let the user choose it. > > > > In contrast, a pointer or a UUID typically does not represent > reality, > > because no humans ever see it and no computer systems outside yours > know > > about it. So, it's merely an implementation detail and should not be > a > > part of the model. > > > > Regards, > > Jeff Davis > > > My wife works (at the sql level) with shall we say "records about > people". Real records, real people. Somewhere around 2 million unique > individuals, several million source records. They don't all have ssn, > they don't all have a drivers license. They don't all have an address, > many have several addresses (especially over time) and separate people > have at one time or another lived at the same address. You would be > surprise how many "bob smith"s where born on the same day. But then > they weren't all born in a hospital etc etc etc. A person may present > on any of a birth record, a death record, a hospital record, a drivers > license, a medical registry, a marriage record and so on. There simply > is no natural key for a human. We won't even worry about the > non-uniqueness of ssn. And please don't get her started on twins. :) > > > I can only imagine that other equally complex entities are just as > slippery when it comes time to pinpoint the natural key. People are sometimes surprised to discover the social security numbers are not unique. There are fraudulent social security numbers: http://www2.nbc4i.com/news/2010/dec/06/2/study-finds-millions-duplicate-social-security-num-ar-316988/ There are numbers given out by the IRS by mistake: http://wnyt.com/article/stories/S1594530.shtml?cat=10115
On 03/05/11 10:06, Rob Sargent wrote: > My wife works (at the sql level) with shall we say "records about > people". Real records, real people. Somewhere around 2 million unique > individuals, several million source records. They don't all have ssn, > they don't all have a drivers license. They don't all have an address, > many have several addresses (especially over time) and separate people > have at one time or another lived at the same address. ... and that's before we get into the horror of "what is someone's name". Which name? Which spelling? Do they even have a single canonical name? Is their canonical name - if any - expressible in the character set used by the service? Is it even covered by Unicode?!? Does it make any sense to split their name up into the traditional english-speaking-recent-western "family" and "given" name parts? Is there a single consistent way to do so for their name even if it does? etc. SSN? What if they don't live in the US or aren't a citizen? What if they have more than one SSN (yes, it happens)? Or there's one being shared by more than one person (again, this happens) and they can't get it fixed or don't want to? My mother's postal address - before street numbering was introduced a few years ago on her road - used to be "Rural Delivery Area 2, SomeTownName, New Zealand". You'd be amazed how many systems could not cope with that; she used to have to register all sorts of things to her parents' address in the nearby town. People decide to key a database on US Social Security number - because it's a "unique natural key" - then discover the exciting problems with that. While they're battling those, business needs change and the database starts needing to accept people from other countries, who don't have a US SSN and don't know what one is or what it looks like. Records about people and companies are ideal candidates for synthetic keying with a solid split/merge procedure to handle duplicates, overlapping identity records, etc. 
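The "synthetic keying with a split/merge procedure" Craig mentions might look something like this (a sketch only; the table and column names are invented):

```sql
-- One synthetic identity per presumed person; no natural key assumed.
CREATE TABLE person (
    person_id  bigserial PRIMARY KEY
);

-- Each source document (birth record, licence, ...) is kept as-is and
-- merely points at the identity it is currently believed to belong to.
CREATE TABLE source_record (
    record_id  bigserial PRIMARY KEY,
    person_id  bigint NOT NULL REFERENCES person (person_id),
    source     text NOT NULL,   -- 'birth', 'death', 'licence', ...
    details    text
);

-- Merging two identities later is just repointing source records,
-- e.g. folding person 43 into person 42:
--   UPDATE source_record SET person_id = 42 WHERE person_id = 43;
--   DELETE FROM person WHERE person_id = 43;
```

Because identity is only ever an assignment on the source records, a bad merge can be undone by repointing rows back, which is exactly what makes the synthetic key safer than guessing a natural one.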
-- Craig Ringer
On 05/02/2011 10:06 PM, Rob Sargent wrote: > You would be surprise how many "bob smith"s where born on the same > day. But then they weren't all born in a hospital etc etc etc. I wouldn't be surprised. I once lived in a mile-square town (Hoboken; that's its nickname). In that town were 40K residents and three gyms. I forgot my ID card one day when going to mine, and they took my name and street name as alternate proof of identity. Some designer along the line figured that was unique enough. Number of Greg Smiths living on that street who were members of that one gym? Three. I see this whole area as being similar to SQL injection. The same way that you just can't trust data input by the user to ever be secure, you can't trust that inputs to your database will ever be unique in the way you expect them to be. So if you build a so-called "natural key" based on them, expect that to break one day. That doesn't mean you can't use them as a sort of foreign key indexing the data; it just means you can't make them the sole unique identifier for a particular entity, where that entity is a person, company, or part. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
otoh, there's plenty of places where natural keys are optimal. my company makes widgets, and we make damn sure our serial #s and part numbers are unique, and we use them as PKs for the various tables. further, the PN has an N digit prefix which is unique to a part family, then an M digit suffix which identifies a specific version of that PN. we use the N digit PN for the family tables, and the full N+M digit PN for the full PN tables. serial # is globally unique across all PNs so it's the PK of any table related directly to a widget.
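John's scheme might be modelled along these lines (a sketch; N=5 and M=3 are assumed purely for illustration, and all names are invented):

```sql
CREATE TABLE part_family (
    family_pn  char(5) PRIMARY KEY   -- the N digit prefix, one row per family
);

CREATE TABLE part (
    full_pn    char(8) PRIMARY KEY,  -- N+M digits: family prefix + version suffix
    family_pn  char(5) NOT NULL REFERENCES part_family (family_pn),
    CHECK (substr(full_pn, 1, 5) = family_pn)  -- suffix must extend its own family
);

CREATE TABLE widget (
    serial_no  text PRIMARY KEY,     -- globally unique across all PNs
    full_pn    char(8) NOT NULL REFERENCES part (full_pn)
);
```

The CHECK constraint encodes the prefix relationship John describes, so a full PN can never claim membership in a family its prefix doesn't match.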
On 03/05/11 11:07, Greg Smith wrote: > That doesn't mean you can't use > them as a sort of foreign key indexing the data; it just means you can't > make them the sole unique identifier for a particular entity, where that > entity is a person, company, or part. Classic case: a database here has several tables indexed by MAC address. It's used for asset reporting and software inventory. Problem: VMs generate random MAC addresses by default. They're not guaranteed to be globally unique. Collisions have happened and will probably happen again. In this case, it wasn't a big deal, but it just goes to show that even the "obviously" globally unique isn't necessarily so. -- Craig Ringer
John R Pierce wrote: > otoh, there's plenty of places where natural keys are optimal. my > company makes widgets, and we make damn sure our serial #s and part > numbers are unique, and we use them as PK's for the various tables. > further, the PN has a N digit prefix which is unique to a part family, > then a M digit suffix which identifies a specific version of that > PN. we use the N digit PN for the family tables, and the full N+M > digit PN for the full PN tables. serial # is globally unique across > all PNs so its the PK of any table related directly to a widget. > > > I'm surprised to see, as a defence of natural keys, an arbitrary identifier composed of references to arbitrary identifiers elsewhere in the system being described. How many ways is this just wrong? That you're assigning the serial numbers does not distinguish them from a UUID generator, oh except you've put semantics into the value. oh and you're at risk of having to reformat them when you buy out your competitor. Starts to look like the database-in-a-name scheme of which I'm oh so fond. I thought a natural key for a part would be more like "bolt=geometry:head-mm:head-depth:thread-per-inch:etc". And for something as describable as bolt indeed the complete record could well qualify as a natural key especially if none of the attributes are null-able (which is a believable concept for bolt). But bolt is not nut and both are parts. Gets messy quickly without arbitrary simple keys. I think you're short on the "simple" part in your defn of serial number. And interestingly you're dealing with types. What does one do when one has to track instances of bolts? Given that all bolts of a certain natural key are identical, save for the fact that one was made before the other, the job is to register every bolt against its (current) deployment. Naturally one assigns each bolt a non-upc barcode aka UUID.
Craig Ringer wrote: > On 03/05/11 11:07, Greg Smith wrote: > > >> That doesn't mean you can't use >> them as a sort of foreign key indexing the data; it just means you can't >> make them the sole unique identifier for a particular entity, where that >> entity is a person, company, or part. >> > > Classic case: a database here has several tables indexed by MAC address. > It's used for asset reporting and software inventory. > > Problem: VMs generate random MAC addresses by default. They're not > guaranteed to be globally unique. Collisions have happened and will > probably happen again. In this case, it wasn't a big deal, but it just > goes to show that even the "obviously" globally unique isn't necessarily so. > > -- > Craig Ringer > Hm.. Virtual machines as assets. Mortgage backed securities, anyone.
On 03/05/11 12:57, Rob Sargent wrote: > Hm.. Virtual machines as assets. Mortgage backed securities, anyone. Well, sure ... but the software running on them is tracked as part of licensing compliance efforts, whether or not the virtual hardware is itself an "asset". The DB designer chose to use the host's MAC address to identify the host, and the tracking software can't tell the difference between a VM and a physical host. The other problem with that is that it doesn't deal well with multi-homed hosts. All in all, it's just a dumb design, and was fixed a while ago, but it's an illustration of how something that seems like an obvious natural key doesn't always stay one. -- Craig Ringer
On May 2, 2011, at 10:53 PM, Rob Sargent wrote: > ...and you're at risk of having to reformat them when you buy out your competitor. The scheme described was awfully similar to one that a client of mine used: product family prefix, identifiers within the family. And guess what? The scheme, which had been stable for 20+ years, had to change when a new variant of product was introduced which cut across family & product. I don't remember the details. I do remember that I hadn't used the supposedly stable product ids as PKs ;-) -- Scott Ribe scott_ribe@elevated-dev.com http://www.elevated-dev.com/ (303) 722-0567 voice
John R Pierce wrote: > otoh, there's plenty of places where natural keys are optimal. my > company makes widgets, and we make damn sure our serial #s and part > numbers are unique, and we use them as PK's for the various tables. Sure; what I was commenting on is that you normally can't ever trust external sources for identifiers. If you want to come up with your own, internally unique keys for things, great. But one of the goals of using a natural key is often to avoid the overhead of storing both that ID and some made up internal number, too. And whether the number is made up by the computer (the classic SERIAL or similar surrogate key), or you make one up yourself, it's still another chunk of data that gets stored for every item. It's just one that means something more useful in your case. Probably going to take up more space in the process and possibly be slower though--part number strings can easily end up longer than SERIAL-like integers. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Tue, May 03, 2011 at 10:52:23AM +0800, Craig Ringer wrote: > ... and that's before we get into the horror of "what is someone's > name". Which name? Which spelling? Do they even have a single canonical > name? - people have, at least over time, several compound names - they have, at any one time, one active compound name - additional spellings can be tracked as additional names of that individual > Is their canonical name - if any - expressable in the character > set used by the service? Is it even covered by Unicode?!? - I haven't seen evidence to the contrary. - But then, I haven't had a need to store a Klingon name. - Yes, it's been difficult to come up with something sensible to store Spock's first name in the GNUmed database. > Does it make > any sense to split their name up into the traditional > english-speaking-recent-western "family" and "given" name parts? - any compound names I have come across work like this: - group name - individual name - nicknames (pseudonyms, warrior names, actor names, ...) The day-to-day usage of each part varies, though. > Is there a single consistent way to do so for their name even if it does? etc. Even in Japan, where the group counts for a lot more than the individual, you can clearly split into group name and individual name. Karsten -- GPG key ID E4071346 @ gpg-keyserver.de E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On May 2, 2011, at 10:52 PM, Craig Ringer wrote: > SSN? What if they don't live in the US or aren't a citizen? Non-citizens can have SSNs (they have to if they work in the US). -- Rick Genter rick.genter@gmail.com
On Mon, May 2, 2011 at 11:53 PM, Craig Ringer <craig@postnewspapers.com.au> wrote: > On 03/05/11 11:07, Greg Smith wrote: > >> That doesn't mean you can't use >> them as a sort of foreign key indexing the data; it just means you can't >> make them the sole unique identifier for a particular entity, where that >> entity is a person, company, or part. > > Classic case: a database here has several tables indexed by MAC address. > It's used for asset reporting and software inventory. > > Problem: VMs generate random MAC addresses by default. They're not > guaranteed to be globally unique. Collisions have happened and will > probably happen again. In this case, it wasn't a big deal, but it just > goes to show that even the "obviously" globally unique isn't necessarily so. It's precisely pathological cases like this where uniqueness constraints are important and should be used. By the way, we aren't debating the use of natural keys but whether to define uniqueness constraints. My main gripe with surrogates is that their use often leads directly to lazy schema design where uniqueness constraints are not defined, which leads to data problems exactly like the case you described above. In a purely surrogate table with no unique on the mac, suppose you have two records with the same value for the address, and there are no other interesting fields on the table or elsewhere in the database: *) who/what made the decision to place a second record on the table? *) is that decision verifiable? *) is that decision repeatable? *) are there other routes of data entry into the database that bypass that decision? will there ever be? *) what happens when the code that represents that decision has or develops a bug? *) why would you not want information supporting that decision in the database? *) how do you know the tables keying to your mac table are pointing to the correct record? *) what are the consequences for ad hoc queries that join directly against the mac? DISTINCT? 
If your data modeler made the assumption that a MAC is unique (a mistake, obviously), at least the other tables are protected from violations of that assumption because the database would reject them with an error, which is a Good Thing. Without a uniqueness constraint you now have ambiguous data, which is a Very Bad Thing. Without proper unique constraints, a generated key is effectively saying "well, I can't figure this out right now...I'll deal with it later". That unmanaged complexity is now in the application and all the queries that touch the database...you've created your own bug factory. With a uniqueness constraint, you have a rigorous definition of what your record represents, and other entities in the database can now rely on that definition. Natural key designs are good for a lot of reasons, but #1 on the list is that they force you to deal with problems in your data model up front, because they force you to define uniqueness. If the MAC turns out not to be unique and the problem is not in fact coming from the input data or the application, yes, you do have to correct the model, but at least the data inside the database is clean, and can be unambiguously mapped to the new model. I'll take schema changes over bad data. Correcting the model means you have to figure out whatever information is used to distinguish identical MACs #1 and #2 and store it in the database, because now your data and the corresponding decisions are verifiable, repeatable, unambiguous, etc. What extra field you have to add to your 'mac' table to make it unique would depend on certain things, but it's certainly a solvable problem, and when solved would give you a more robust database. merlin
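The difference Merlin is pointing at fits in a couple of lines of DDL (a hypothetical table; macaddr is PostgreSQL's built-in MAC address type):

```sql
-- Surrogate key only: nothing stops a second row with the same MAC,
-- and every query that assumes MAC uniqueness silently breaks.
CREATE TABLE host_no_constraint (
    host_id  bigserial PRIMARY KEY,
    mac      macaddr NOT NULL
);

-- Surrogate key plus an explicit uniqueness constraint: the bad insert
-- is rejected with an error at the moment the assumption first fails.
CREATE TABLE host (
    host_id  bigserial PRIMARY KEY,
    mac      macaddr NOT NULL UNIQUE
);
```

Both tables use a surrogate key; only the second one records the modeler's uniqueness assumption where the database can enforce it.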
On Mon, 2011-05-02 at 20:06 -0600, Rob Sargent wrote: > Jeff Davis wrote: > > In particular, I think you are falsely assuming that a natural key must > > be generated from an outside source (or some source outside of your > > control), and is therefore not reliably unique. > > > > You can generate your own keys... ... > My wife works (at the sql level) with shall we say "records about > people". Real records, real people. Somewhere around 2 million unique > individuals, several million source records. They don't all have ssn, > they don't all have a drivers license. They don't all have an address, > many have several addresses (especially over time) and separate people > have at one time or another lived at the same address. You would be > surprise how many "bob smith"s where born on the same day. But then > they weren't all born in a hospital etc etc etc. A person may present > on any of a birth record, a death record, a hospital record, a drivers > license, a medical registry, a marriage record and so on. There simply > is no natural key for a human. We won't even worry about the > non-uniqueness of ssn. And please don't get her started on twins. :) > > > I can only imagine that other equally complex entities are just as > slippery when it comes time to pinpoint the natural key. I think you missed my point. You don't have to rely on natural keys that come from somewhere else; you can make up your own, truly unique identifier. Regards, Jeff Davis
On 05/03/2011 12:51 PM, Jeff Davis wrote: > On Mon, 2011-05-02 at 20:06 -0600, Rob Sargent wrote: >> Jeff Davis wrote: >>> In particular, I think you are falsely assuming that a natural key must >>> be generated from an outside source (or some source outside of your >>> control), and is therefore not reliably unique. >>> >>> You can generate your own keys... > > ... > >> My wife works (at the sql level) with shall we say "records about >> people". Real records, real people. Somewhere around 2 million unique >> individuals, several million source records. They don't all have ssn, >> they don't all have a drivers license. They don't all have an address, >> many have several addresses (especially over time) and separate people >> have at one time or another lived at the same address. You would be >> surprise how many "bob smith"s where born on the same day. But then >> they weren't all born in a hospital etc etc etc. A person may present >> on any of a birth record, a death record, a hospital record, a drivers >> license, a medical registry, a marriage record and so on. There simply >> is no natural key for a human. We won't even worry about the >> non-uniqueness of ssn. And please don't get her started on twins. :) >> >> >> I can only imagine that other equally complex entities are just as >> slippery when it comes time to pinpoint the natural key. > > I think you missed my point. You don't have to rely on natural keys that > come from somewhere else; you can make up your own, truly unique > identifier. > > Regards, > Jeff Davis > Sorry, but I'm confused, but that's common. Isn't a "natural key" to be composed solely from the attributes of the entity? As in a subset of the columns of the table in a third-normalish world. Isn't tacking on another column with a concocted id joining the "pervasiveness"?
On Tue, 2011-05-03 at 13:35 -0600, Rob Sargent wrote: > Sorry, but I'm confused, but that's common. Isn't a "natural key" to be > compose solely from the attributes of the entity? As in a subset of the > columns of the table in a third-normalish world. Isn't tacking on > another column with a concocted id joining the "pervassiveness"? Not in my opinion. Before cars existed, there was no driver's license number. The DMV (as it's called in California, anyway) created it, and it's now a key that they can trust to be unique. It's also an attribute of the entity now, because it's printed on the cards you hand to people. The thing that I think is a mistake is to use generated IDs like an internal implementation detail (i.e. hide them like pointers); then at the same time mix them into the data model. Regards, Jeff Davis
On Mon, 2011-05-02 at 23:07 -0400, Greg Smith wrote: > I see this whole area as being similar to SQL injection. The same way > that you just can't trust data input by the user to ever be secure, you > can't trust inputs to your database will ever be unique in the way you > expect them to be. So, don't trust them to be unique then. Make up your own unique identifier, and use that. Regards, Jeff Davis
On 05/03/2011 03:08 PM, Jeff Davis wrote: > On Tue, 2011-05-03 at 13:35 -0600, Rob Sargent wrote: >> Sorry, but I'm confused, but that's common. Isn't a "natural key" to be >> compose solely from the attributes of the entity? As in a subset of the >> columns of the table in a third-normalish world. Isn't tacking on >> another column with a concocted id joining the "pervassiveness"? > > Not in my opinion. Before cars existed, there was no driver's license > number. The DMV (as it's called in California, anyway) created it, and > it's now a key that they can trust to be unique. It's also an attribute > of the entity now, because it's printed on the cards you hand to people. > > The thing that I think is a mistake is to use generated IDs like an > internal implementation detail (i.e. hide them like pointers); then at > the same time mix them into the data model. > > Regards, > Jeff Davis > > > Well yes, it does all depend on how you model things after all. I think a driver's license is an attribute of driver, not person. So before cars, one still had a hard time coming up with a natural key on person. Of course California's DMV only cares about California-licensed drivers, so they get to generate and assign license number as an arbitrary key for drivers, 'cause under that we're back to person.
Merlin Moncure wrote: > If your data modeler that made the the assumptions that a MAC is > unique (a mistake obviously) at least the other tables are protected > from violations of that assumption because the database would reject > them with an error, which is a Good Thing. Without a uniqueness > constraint you now have ambiguous data which is a Very Bad Thing. > With a uniqueness constraint in this situation, the unexpected data--row with a non unique MAC--will be rejected and possibly lost when the insertion happens. You say that's a good thing; plenty of people will say that's the worst possible thing that can happen. When dealing with external data, it's often impossible to know everything you're going to see later at design time. Approaching that problem with the idea that you're going to lose any data that doesn't fit into the original model is not what everyone finds reasonable behavior. I don't think it's possible to decide in a generic way which of these is the better approach: to reject unexpected data and force the problem back at the application immediately (commit failure), or to accept it because you're using a surrogate key and discover the problems down the line. Both are valid approaches with a very different type of risk associated with them. I think it's fair to say that real-world data is not always well known enough at design time to follow the idea you're suggesting, though, and that does factor into why there is such a preference for surrogate keys in the industry. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Jeff Davis wrote: > On Mon, 2011-05-02 at 23:07 -0400, Greg Smith wrote: > >> I see this whole area as being similar to SQL injection. The same way >> that you just can't trust data input by the user to ever be secure, you >> can't trust inputs to your database will ever be unique in the way you >> expect them to be. >> > > So, don't trust them to be unique then. Make up your own unique > identifier, and use that. > If you're making up your own unique identifier, that's closer to a surrogate key as far as I'm concerned, even though it doesn't fit the strict definition of that term (it doesn't have the subtle idea that "surrogate" implies "meaningless"). Now, there is some value to doing that well, instead of just using the typical incrementing integer "pointer" approach, as you've called it. But if it's not derived from external data you're storing anyway, it's not a true natural key either. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On May 3, 2011, at 22:03, Greg Smith <greg@2ndquadrant.com> wrote: > Merlin Moncure wrote: >> If your data modeler that made the the assumptions that a MAC is >> unique (a mistake obviously) at least the other tables are protected >> from violations of that assumption because the database would reject >> them with an error, which is a Good Thing. Without a uniqueness >> constraint you now have ambiguous data which is a Very Bad Thing. >> > > With a uniqueness constraint in this situation, the unexpected data--row with a non unique MAC--will be rejected and possibly lost when the insertion happens. You say that's a good thing, plenty of people will say that's the worst possible thing that can happen. When dealing with external data, it's often impossible to know everything you're going to see later at design time. Approaching that problem with the idea that you're going to lose any data that doesn't fit into the original model is not what everyone finds reasonable behavior. > > I don't think it's possible to decide in a generic way which of these is the better approach: to reject unexpected data and force the problem back at the application immediately (commit failure), or to accept it because you're using a surrogate key and discover the problems down the line. Both are valid approaches with a very different type of risk associated with them. I think it's fair to say that real-world data is not always well known enough at design time to follow the idea you're suggesting though, and that does factor into why there is such a preference for surrogate keys in the industry. 
> > -- > Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD > PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us > "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general If you implicitly expect MAC to be unique but do not enforce it then you are likely to write queries that rely upon that uniqueness. If you do enforce it then the constraint failure occurs anyway. A scalar sub-query that links via the MAC will fail when the duplicate data is encountered, and normal queries will return too many records. A detail report may be obvious but if you are summarizing the data the specific offending record is going to require some effort to find. I guess if you are the compromising type you can loosely enforce the uniqueness by running a check query periodically to see if supposedly unique values have been duplicated. I agree there is no right answer - the designer needs to make trade-offs - but I'd rather reject new data and leave the system in a status-quo stable state instead of introducing invalid data and putting the system into a state where it requires effort to get it functioning again. If you accept the invalid data the likely scenario, if something breaks, is someone finds the offending record and removes it until the application and database can be fixed properly - which is where we are at with validation. The common exception is where identifiers are reused over time and you remove the old record in order to keep/allow the newer record to remain. On a tangential course I've started considering a setup whereby you basically have two identifiers for a record. One is end-user facing and updatable whereas the other is static and used in intra-table relations. 
You can create a new record with the same user-facing id as an existing id, but the existing id will be replaced with its system id. This is useful when users will be using the id often and it can be reasonably assumed to be unique over a moderate period of time (say a year). Invoice numbers and customer numbers are two common examples. The lookup id itself may require additional fields in order to qualify as a primary (natural) key, but the static key wants to be a single field. Often simply putting a date with the original id (and parent identifiers) is sufficient due to the infrequency of updates. The downside is, with string-based parent identifiers the pk value can be quite long. I currently have PKs of 40-50 length, but during my new design my first pass on a couple of tables indicated a >100 character limit. Is there any rule-of-thumb on the performance of a PK as a function of key length? I like using varchar based identifiers since I tend to query tables directly and writing where clauses is much easier if you can avoid the joins. I'm likely better off creating views and querying those but am still curious on any basic thoughts on having a 100+ length primary key. David J.
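David's two-identifier setup might be sketched like this (all names are hypothetical):

```sql
CREATE TABLE invoice (
    invoice_sysid  bigserial PRIMARY KEY,  -- static; used for intra-table relations
    invoice_no     text NOT NULL,          -- end-user facing; updatable/reusable
    issued_on      date NOT NULL,
    UNIQUE (invoice_no, issued_on)         -- the lookup id needs a qualifier
);

CREATE TABLE invoice_line (
    invoice_sysid  bigint NOT NULL REFERENCES invoice (invoice_sysid),
    line_no        int    NOT NULL,
    PRIMARY KEY (invoice_sysid, line_no)
);

-- Reusing invoice_no '2011-0001' in a later year creates a new
-- invoice_sysid; nothing referencing the old invoice has to change.
```

This keeps the single-field static key for all joins while letting the multi-field user-facing identifier stay long, descriptive, and editable.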
David Johnston wrote:
> Are there any rules of thumb on the performance of a PK as a function of key length? I like using varchar-based identifiers since I tend to query tables directly and writing where clauses is much easier if you can avoid the joins. I'm likely better off creating views and querying those but am still curious on any basic thoughts on having a 100+ character primary key.

The shorter the better, but it may not be as bad as you fear. The way B-tree indexes are built, it isn't that expensive to hold a longer key so long as the unique part doesn't average out to be that long. So if you insert "123456666666666666666" and "12345777777777777777", that's not going to be much different than navigating "123456" and "123457", because once you get that far you've already reached a unique prefix. But if your entries have a really long common prefix, like "111111111111111112" and "111111111111111113", that's going to be more expensive to deal with--even though the strings are the same length.

If your identifiers become unique after only a few characters, it may not be so bad. But if they go many characters before you can distinguish between any two entries, you're probably not going to be happy with the performance or size of the indexes, relative to simple integer keys.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
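One way to check this empirically is to compare index sizes for a long-common-prefix text key versus a plain integer key. This is only a sketch - the table names are made up and exact sizes will vary by version and fill factor:

```sql
-- Hypothetical experiment: long common-prefix text PK vs. integer PK.
CREATE TABLE t_text (id text PRIMARY KEY);
CREATE TABLE t_int  (id integer PRIMARY KEY);

INSERT INTO t_text
    SELECT 'invoice/2011/dept-accounting/' || g
    FROM generate_series(1, 100000) g;
INSERT INTO t_int
    SELECT g FROM generate_series(1, 100000) g;

-- The text PK index should come out several times larger.
SELECT pg_size_pretty(pg_relation_size('t_text_pkey')) AS text_pk,
       pg_size_pretty(pg_relation_size('t_int_pkey'))  AS int_pk;
```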
On Thu, Apr 28, 2011 at 12:29 PM, Jim Irrer <irrer@umich.edu> wrote:

*) most tables don't have unique natural keys (let's see em)
etc
On Wed, May 4, 2011 at 7:14 AM, Misa Simic <misa.simic@gmail.com> wrote:
> 2011/4/28 Merlin Moncure <mmoncure@gmail.com>
>>
>> On Thu, Apr 28, 2011 at 12:29 PM, Jim Irrer <irrer@umich.edu> wrote:
>> *) most tables don't have unique natural keys (let's see em)
>> etc
>
> i.e. for an Invoice, we have at least 2 tables (more in practice...):
> Invoice Header
>  - Invoice Number
>  - Date
>  - CustomerID
>  - Currency
> Invoice Line
>  - ItemID
>  - qty
>  - Price
>
> This is not a real design for tables, just the basic info stored to represent 1 Invoice Document.
> Now, let's say for Invoice Header the "Natural" PK is Invoice Number, so we will add that column to the Invoice Line table as an FK...
> What should be the "Natural" PK for the Invoice Line table?
> Also, if we now add additional tables, i.e. Serial Numbers, related to Invoice Line, what should be the "Natural" key in it?

Most of the old school accounting systems maintained an invoice line number.

> Invoice Line
>  - Invoice Number
>  - LineNo
>  - ItemID
>  - qty
>  - Price

The line number started from 1 (the first line on the invoice) on every unique invoice. This has the added benefit of the primary key index (invoice_no, line_no) allowing you to pull up the invoice line records in correct order without requiring an extra sort and, if you don't need any fields from the invoice, a join.

Compare the two queries pulling up invoice lines over a range of invoice numbers.

merlin
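A sketch of the composite-natural-key layout Merlin describes (hypothetical names; illustrative, not anyone's production schema):

```sql
CREATE TABLE invoice_line (
    invoice_no integer NOT NULL,
    line_no    integer NOT NULL,   -- starts at 1 for each invoice
    item_id    integer NOT NULL,
    qty        numeric NOT NULL,
    price      numeric NOT NULL,
    PRIMARY KEY (invoice_no, line_no)
);

-- Pull lines over a range of invoices, already in invoice/line order:
-- the composite PK index satisfies both the range predicate and the
-- ORDER BY, so no extra sort (and no join to the header) is needed.
SELECT invoice_no, line_no, item_id, qty, price
FROM invoice_line
WHERE invoice_no BETWEEN 1000 AND 1100
ORDER BY invoice_no, line_no;
```

With a single surrogate `line_id` key instead, the same query would need a join to the header table for the invoice number plus a sort on the result.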
> Thanks, merlin,
> And in that case, what is "Natural" in LineNo? I would say, with adding LineNo we are creating a synthetic/surrogate key (just instead of 1 surrogate column it will be a compound key with more columns...)? The same is with all other tables that are "parts" of an Entity: Serial Numbers, Accounting Distribution produced by Invoice... etc etc...
Being the “first line” or the “second line” of a physical invoice is a property of that line. Identifying its position on the invoice is only natural.

By your reasoning all identifiers are synthetically generated if you consider there is never truly only a single instance of anything in the multi-verse. The only truly unique identifier would be the time+place of an object's creation.

“Hello - person born in Liverpool London, St. Whatever hospital, Room 101 @ 13:14:57 on the 5th of March 2001 – how may I direct your call?” (I guess you could use the conception date as well, although twins+ might be tough to distinguish in that case).
Generally it could be argued that any well-normalized compound key is inherently natural (whether left as multiple fields or concatenated into a single field). The identifier that is assigned to the “part” individually is likely to be “synthetic” but its membership in the hierarchy naturalizes it.
David J.
On Wed, May 04, 2011 at 09:33:57AM -0400, David Johnston wrote:
> “Hello - person born in Liverpool London, St. Whatever hospital, Room 101 @ 13:14:57 on the 5th of March 2001 – how may I direct your call?” (I guess you could use the conception date as well

That will rarely be known to any sufficient precision.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On Wed, May 4, 2011 at 7:50 AM, Misa Simic <misa.simic@gmail.com> wrote:
> 2011/5/4 Merlin Moncure <mmoncure@gmail.com>
>>
>> Most of the old school accounting systems maintained an invoice line number.
>> > Invoice Line
>> > -Invoice Number
>> > -LineNo
>> > -ItemID
>> > -qty
>> > -Price
>>
>> The line number started from 1 (the first line on the invoice) on every unique invoice. This has the added benefit of the primary key index (invoice_no, line_no) allowing you to pull up the invoice line records in correct order without requiring an extra sort and, if you don't need any fields from the invoice, a join.
>>
>> Compare the two queries pulling up invoice lines over a range of invoice numbers.
>>
>> merlin
>
> Thanks, merlin,
> And in that case, what is "Natural" in LineNo? I would say, with adding LineNo we are creating a synthetic/surrogate key (just instead of 1 surrogate column it will be a compound key with more columns...)? The same is with all other tables that are "parts" of an Entity: Serial Numbers, Accounting Distribution produced by Invoice... etc etc...

It's natural because it contains information that is not synthetic, and unique/well defined. Specifically, the position of the line on the invoice; you can't have two invoice lines at the second line of an invoice, for example. This information is visible and important to users because presumably the invoice lines as entered into an application are in the order you would want them printed out on a physical invoice.

The invoice number itself is a more interesting case because it's generated, and what it might represent (the 77th invoice for this customer) isn't very interesting. An actual, 'true' natural key for an invoice might be a combination of user, entry time, and customer, but this key is fairly weak and not very useful for identification purposes by human beings -- so we created the invoice number concept as a proxy for that information.
Also, timestamps and floating point values tend to suck for a primary key. This is what Jeff D was talking about in detail largely upthread; to wit: if your users (man or machine) really prefer to work with a generated identifier for some reason or another, by all means make one, just make sure the record has a rigorous definition.

A line number table would be a good candidate for adding an additional surrogate key, for purposes of relating, on performance grounds if you have foreign keys pointing to the record. Any shifts in invoice line position require renumbering the invoice, which would annoyingly cascade the updates.

The reason, by the way, that I *know* you aren't going to turn up many interesting cases of not being able to define a key, at least in the accounting and manufacturing world, is that I happen to have worked on a large ERP/CRM that dated from the pre-SQL era. Like many systems of the day, it was written in COBOL over an isam data store which didn't have the technical capability to do what sequences or uuids do today (or at least, it was more trouble than it was worth)...100% natural keys. Somehow those guys seemed to manage just fine working under hardware constraints much tighter than today's. Although there were some real headaches in that system, for example when, say, part numbers changed, it had an inner elegance that grew on me over time, and taught me the value of rigorous definition of records and their keys.

merlin
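The "additional surrogate key for FK purposes" idea above might look like this (hypothetical names - a sketch, not a recommendation for any particular schema):

```sql
-- Keep the natural (invoice_no, line_no) key for ordering and lookup,
-- but give each line a surrogate id so dependent tables are insulated
-- from line renumbering.
CREATE TABLE invoice_line (
    line_id    bigserial UNIQUE,   -- surrogate, used only by FKs
    invoice_no integer NOT NULL,
    line_no    integer NOT NULL,
    PRIMARY KEY (invoice_no, line_no)
);

CREATE TABLE serial_number (
    line_id bigint NOT NULL REFERENCES invoice_line (line_id),
    serial  text   NOT NULL,
    PRIMARY KEY (line_id, serial)
);

-- Renumbering lines now touches only invoice_line; serial_number rows
-- keep pointing at the same line_id, so no cascading updates.
```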
On Wed, May 4, 2011 at 2:25 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> David Johnston wrote:
>> Are there any rules of thumb on the performance of a PK as a function of key length? I like using varchar-based identifiers since I tend to query tables directly and writing where clauses is much easier if you can avoid the joins. I'm likely better off creating views and querying those but am still curious on any basic thoughts on having a 100+ character primary key.
>
> The shorter the better, but it may not be as bad as you fear. The way B-tree indexes are built, it isn't that expensive to hold a longer key so long as the unique part doesn't average out to be that long. So if you insert "123456666666666666666" and "12345777777777777777", that's not going to be much different than navigating "123456" and "123457", because once you get that far you've already reached a unique prefix. But if your entries have a really long common prefix, like "111111111111111112" and "111111111111111113", that's going to be more expensive to deal with--even though the strings are the same length.
>
> If your identifiers become unique after only a few characters, it may not be so bad. But if they go many characters before you can distinguish between any two entries, you're probably not going to be happy with the performance or size of the indexes, relative to simple integer keys.

yeah. The number of comparisons should be basically the same, but situational things are going to make or break you. As the in-house 'performance guy', it might interest you to work through them all -- indexing strategies are the key to good database performance and, modeling concepts and religious debates aside, this is an interesting discussion from a strictly performance point of view.
One reason why natural keys work much better than expected is you get much better index utilization and potentially *much* better tuple/page efficiency on certain very common classes of lookups/scans, especially if you cluster. Speaking of clustering, you no longer have to agonize in cases of, say, having to cluster on 'email' or 'email_id'. The identifying and ordering/searching criteria are in the same index, which can be an enormous win in some cases. Do not underestimate the value of this when the table is large and dependent scans are common.

If you key on email and the query coming from another table doesn't need any other email properties, you just saved yourself a join without having to de-normalize in the classic sense. You also get to cut out many sorts on similar principles.

OTOH, updates as noted tend to suck. In some cases hilariously so. The larger index is going to cause more cache pressure, which is a point against...sometimes the large index just isn't worth the cost for what you get. Also, you have a hard limit on key sizes imposed by postgres. It's rare to hit that in the real world, but it should be noted. Another negative point is that the postgres stats system also doesn't deal well with composite keys for range scans. I have a pretty good idea on how to fix this, but I haven't gotten around to it yet.

merlin
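The email clustering case might be sketched like this (hypothetical tables; clustering is a one-time physical reorder, not maintained automatically):

```sql
-- Key the table on email directly, then cluster on the PK index so
-- identifying and ordering criteria live in the same index.
CREATE TABLE subscriber (
    email      text PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

CLUSTER subscriber USING subscriber_pkey;

-- A dependent table that needs only the email itself requires no join
-- back to subscriber:
CREATE TABLE bounce (
    email      text NOT NULL REFERENCES subscriber (email),
    bounced_at timestamptz NOT NULL
);

SELECT email, bounced_at
FROM bounce
WHERE email LIKE 'a%';   -- no join needed to identify the subscriber
```

With an `email_id` surrogate instead, that last query would need a join to recover the address, and you would have to choose between clustering on the id or maintaining a second index on email.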
On Tue, May 3, 2011 at 8:03 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> With a uniqueness constraint in this situation, the unexpected data--row
> with a non unique MAC--will be rejected and possibly lost when the insertion
> happens. You say that's a good thing, plenty of people will say that's the
> worst possible thing that can happen.

But remember, the original discussion is on using these as PK/FK. That's where things get really ugly. I can change my data model to not have a unique MAC, or do something to make them unique (add IP or something), much more easily if they're NOT a PK/FK. That's the real issue to me.
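The "loosely enforced uniqueness" compromise mentioned earlier in the thread - no constraint, just a periodic check - can be as simple as this (hypothetical table and column names):

```sql
-- Find MAC addresses that were expected to be unique but are not.
-- Run periodically; returns no rows when the data is clean.
SELECT mac,
       count(*)             AS copies,
       array_agg(device_id) AS offending_rows   -- which rows to inspect
FROM device
GROUP BY mac
HAVING count(*) > 1;
```

This keeps the MAC out of the PK/FK role entirely, so discovering a duplicate becomes a data-quality ticket rather than a rejected insert or a broken join.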