Discussion: pervasiveness of surrogate (also called synthetic) keys

A colleague of mine insists that using surrogate keys is the
common practice by an overwhelming margin in relational databases and
that they are used in 99 percent of large installations. I agree that many
situations benefit from them, but are they really as pervasive
as he claims?

Thanks,

- Jim
On 4/28/2011 12:29 PM, Jim Irrer wrote:
> A colleague of mine insists that using surrogate keys is the
> common practice by an overwhelming margin in relational databases and
> that they are used in 99 percent of large installations. I agree that many
> situations benefit from them, but are they really as pervasive
> as he claims?

I don't see how you could know unless you went to all the "large
installations" and asked. But since it's a good idea, and you "should" do it
that way, and because I'm pessimistic, I'd say only 5% of RDB users do it
that way.

Oh! Joke: Why do DB admins make better lovers? They use surrogates!

Anyway, I'm not a large install, but I use 'em. That's gotta count for
something.

Really, how could you count? Was there a poll someplace? Ask for some data.
Otherwise it seems like BS to me.

-Andy
On 04/28/2011 11:44 AM, Andy Colson wrote:
> Really, how could you count? Was there a poll someplace? Ask for some
> data. Otherwise seems like BS to me.

Hm, I get the feeling that only the good folks at Hibernate seem to think
using a "natural key" is the _only_ way to go.
On Thu, Apr 28, 2011 at 01:29:31PM -0400, Jim Irrer wrote:
> common practice by an overwhelming margin in relational databases and
> that they are used in 99 percent of large installations.

94.68536% of all the claims I ever hear are obviously pulled out of thin
air. What conclusion does your colleague want to draw from this
overwhelming (if perhaps statistically dubious) penetration? Surely the
argument doesn't conclude, "Therefore we should do that too?" I seem to
recall my mother making some remark about others jumping off cliffs.

A

--
Andrew Sullivan
ajs@crankycanuck.ca
On 04/28/2011 10:29 AM, Jim Irrer wrote:
> A colleague of mine insists that using surrogate keys is the
> common practice by an overwhelming margin in relational databases and
> that they are used in 99 percent of large installations.

Well, there is no fact to back that up, but I will say that most toolkits
require the use of a synthetic key: Rails, Django, etc.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
Organizers of the PostgreSQL Conference - http://www.postgresqlconference.org/
@cmdpromptinc - @postgresconf - 509-416-6579
On Apr 28, 2011, at 11:53 AM, Rob Sargent wrote:
> Hm, I get the feeling that only the good folks at Hibernate seem to think
> using a "natural key" is the _only_ way to go.

Well, natural keys are quite obviously the way to go, when they exist. The
problem is, they usually don't really exist. What's usually proposed as a
natural key will, upon further investigation, either not be guaranteed
unique, or not be guaranteed to be unchanging, or both.

--
Scott Ribe
scott_ribe@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice
On Thu, Apr 28, 2011 at 7:26 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> Well there is no fact to back that up but, I will say that most toolkits
> require the use of a synthetic key, rails, django etc....

Usually such tools are born with surrogate keys only, because it's easier,
and either grow up developing natural keys (e.g. the Django ORM, SQLAlchemy)
or fade into uselessness (e.g. SQLObject). But this speaks more about the
tools than about the merits of natural keys: if a toolkit doesn't support
them, it's seriously getting in the way in this and probably in other
matters too.

--
Daniele
On Thu, Apr 28, 2011 at 12:29 PM, Jim Irrer <irrer@umich.edu> wrote:
> A colleague of mine insists that using surrogate keys is the
> common practice by an overwhelming margin in relational databases and
> that they are used in 99 percent of large installations.

They are fairly pervasive, and increasingly so, which I find to be really
unfortunate. Personally I think rote use of surrogate keys is terrible and
leads to bad table designs, especially if you don't identify the true
natural key with, say, a unique constraint. This pushes duplicate
enforcement out of the database and into application code, or (even worse)
onto the user. What everyone misses is that natural keys force good
database design: if you can't discover one, you probably have a design
problem.

There are of course exceptions, but they are limited. The main/best case
for surrogates is defense against cascading updates -- obviously, if a
single record change can turn into thousands of update statements, you
absolutely want to avoid that if you can. But in a typical database this
only applies to particular tables; 'one off' updates are of course no
problem. Even in an 'infinite performance' world, this would still be an
update, because of locks.

The performance case either way is complex. Generally, I find natural keys
to be faster all things considered, but this is going to be situationally
dependent. Although integer indexes are tighter and faster, natural keys
give back by skipping entire joins, sorts, etc. They also tend to have
better tuple/page efficiency, so that when searching for a range of records
you touch fewer pages. Essentially, you get more traffic routed through a
smaller number of (albeit bigger) indexes. Natural-key tables also tend to
be 'self documenting', which is a nice property.

A lot of bad arguments are made against natural keys, for example:
*) natural keys can change (so what? unless you are worried about cascades)
*) SSNs are reused! (SSN is obviously not a complete key if you want to
identify a person)
*) most tables don't have unique natural keys (let's see 'em)
etc.

merlin
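Merlin's suggestion of still identifying the true natural key with a unique constraint even when a surrogate key is present can be sketched as follows. This is a minimal illustration using SQLite via Python's standard library so it is self-contained (the thread itself concerns PostgreSQL, where the DDL would be essentially the same); the table and column names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Surrogate primary key, but the true natural key (code) is
# still declared UNIQUE so the database itself enforces it.
conn.execute("""
    CREATE TABLE country (
        id   INTEGER PRIMARY KEY,   -- surrogate key
        code TEXT NOT NULL UNIQUE,  -- natural key, still enforced
        name TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO country (code, name) VALUES ('DE', 'Germany')")

# A duplicate natural key is rejected by the database,
# not by application code.
try:
    conn.execute("INSERT INTO country (code, name) VALUES ('DE', 'Deutschland')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```

With this shape, duplicate enforcement stays in the database even though foreign keys reference the compact integer `id`.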
Any system that generates transactional data has to use some kind of
synthetic key. I guess you could rely upon some form of timestamp, but from
a usability standpoint that is not really a good decision. Inventory also
requires synthetic keys - whether you decide what they are or someone else
does. A serial field is the common form of a synthetic key, but any
assigned identifier can be defined in the same way. You can even argue that
a field that concatenates the parts of the natural key qualifies as
synthetic. To actually use the data in the database, it is often helpful to
be able to uniquely identify a row using a single field.

My personal take is that normalization is important. Once you have
normalized, you will find tables with compound natural primary keys. If
these tables are going to be queried in such a way as to return a single
record (WHERE clause), it is likely that a synthetic key would be
beneficial. If they are normally going to be used only within the context
of other core tables (basically the JOIN ON clause), then it is less
important to have a single representative value.

Justifying anything simply based upon what other people do is superficial.
As for the colleague: the basic response is "what is your source?". If they
are trying to convince you to include synthetic keys on specific tables,
ask them to explain what specific benefit will be gained by doing so, and
let them know that "you should always have a synthetic key" is not a
helpful response.

David J.
On Thu, Apr 28, 2011 at 03:39:19PM -0500, Merlin Moncure wrote:
> They are fairly pervasive, and increasingly so, which I find to be
> really unfortunate. Personally I think rote use of surrogate keys is
> terrible and leads to bad table designs, especially if you don't
> identify the true natural key with, say, a unique constraint.

I was recently asked on this list why GNUmed uses all those surrogate keys.

I should have added to my answer that we DO make *extensive* use of all
sorts of built-in constraints and custom triggers to enforce "natural"
keys. I must agree with a recent poster that what appears to identify as a
natural key often really isn't, or else becomes not so later on. It's
vastly easier to then deal with that by re-defining constraints without
having to touch primary keys.

> This pushes duplicate enforcement out of the database and into
> application code, or (even worse) the user. What everyone misses in the
> database is that natural keys force good database design...if you can't
> discover one, you probably have a design problem.

Sure, but that doesn't mean you need to actually *use* natural keys as
primary keys - enforce them with all sorts of constraints, sure - but don't
assume you properly figured out the perfect schema the first time around.

I've so far found it *good* to have duplicate enforcement:

- the database enforces what we agree the final data *should* look like

- the UI tries to lure the user into entering "valid" data

Now, those ("duplicative") database constraints have saved our butt more
than once by preventing faulty patient data from being inserted into
medical records.

> A lot of bad arguments made against natural keys are made, for example:
> *) natural keys can change (so what? unless you are worried about cascades)

I find it is not so much that they can change: sure, it doesn't matter
whether a certain code reads "C03EB21" or "C03EB22". It is rather that they
tend to go non-unique whenever the whims of meatspace decide it's now more
convenient to allow dupes: suddenly there must be two records with code
"C03EB21".

> *) SSN are reused! (SSN is obviously not a complete key if you want to
> identify a person)
> *) most tables don't have unique natural keys (let's see em)

Now, those two arguments are bogus, I agree.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
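Karsten's point, that it is vastly easier to re-define a uniqueness rule than to touch a primary key other tables reference, can be illustrated like this. A sketch using SQLite from Python's standard library rather than PostgreSQL; the drug-code tables and all names here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""
    CREATE TABLE drug_code (
        pk   INTEGER PRIMARY KEY,  -- surrogate key, referenced elsewhere
        code TEXT NOT NULL         -- "natural" key, enforced separately
    )
""")
# Natural-key uniqueness lives in a separate, droppable index,
# not in the primary key.
conn.execute("CREATE UNIQUE INDEX udx_drug_code ON drug_code(code)")
conn.execute("""
    CREATE TABLE prescription (
        pk      INTEGER PRIMARY KEY,
        drug_pk INTEGER NOT NULL REFERENCES drug_code(pk)
    )
""")
conn.execute("INSERT INTO drug_code (code) VALUES ('C03EB21')")
conn.execute("INSERT INTO prescription (drug_pk) VALUES (1)")

# Meatspace changes its mind: duplicates of 'C03EB21' must now be
# allowed. Only the uniqueness rule changes; the primary key and every
# foreign key referencing it are untouched.
conn.execute("DROP INDEX udx_drug_code")
conn.execute("INSERT INTO drug_code (code) VALUES ('C03EB21')")
n = conn.execute(
    "SELECT count(*) FROM drug_code WHERE code = 'C03EB21'"
).fetchone()[0]
print(n)  # 2
```

Had `code` been the primary key, allowing the duplicate would have meant rekeying `prescription` as well.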
On Thu, Apr 28, 2011 at 4:07 PM, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> I should have added to my answer that we DO make *extensive*
> use of all sorts of built-in constraints and custom triggers
> to enforce "natural" keys.

Pretty much agree on all points. I don't really think primary keys tend to
change very much in terms of schema, but when they do it can be a real
headache.

I took a quick look at the GNUmed schema and found it to be generally very
thorough and excellent. If you're going to use surrogate keys, that's the
way to do it. That's a neat trick, btw, to use inheritance for the auditing
feature...how is it working out for you? Any general comments on PostgreSQL
with regards to your product?

merlin
On Fri, Apr 29, 2011 at 10:14:07AM -0500, Merlin Moncure wrote:
> I took a quick look at the gnumed schema and found it to be generally
> very thorough and excellent. If you're going to use surrogate keys,
> that's the way to do it.

Good to know, since I'm only a lowly medical doctor not having much
schooling in database matters beyond this list, the PostgreSQL docs, and
the Celko book.

> That's a neat trick btw to use inheritance
> for the auditing feature...how is it working out for you?

That works very nicely for us. Same thing with aggregating clinical
narrative across diverse tables.

> Any general comments on postgresql with regards to your product?

We have found it to be very dependable and professionally maintained.
We've never lost any patient data due to crashes (for what that's worth).
The breadth of constraints one can define has saved our behinds several
times by preventing buggy applications from storing faulty patient data.

One thing that'd be helpful to have would be ON CONNECT triggers - that
would make it much safer to support HIPAA requirements (I'm aware of the
apparent fallacy of a faulty ON CONNECT trigger preventing superuser
access - that can be overcome by not running ON CONNECT triggers in
single-user rescue mode). Another would be database-wide asserts on data.
Of course, better support of inheritance in terms of definably propagating
constraints and triggers would be nice :-) But that's a lot to ask.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On Thu, Apr 28, 2011 at 3:07 PM, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> I must agree with a recent poster
> that what appears to identify as a natural key often really
> isn't or else becomes not so later on. It's vastly easier to
> then deal with that by re-defining constraints without
> having to touch primary keys.

I have run into this. Often what happens is that the data model directly
affects the assumptions coders make. In a system I worked on, the username
was the unique / primary key. Code was written that assumed the username
would always be unique across the app. Then the app was deployed to
multiple non-associated groups, where user names from two different
organizations might be the same and couldn't be changed. Talk about some
ugly code changes needed to fix all the bugs. Ugh.

An initial assumption that a serial value would be unique, but usernames
might not be, would have resulted in a much cleaner design for that system.
Or even that the PK was org||username or something.
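The cleaner design described in the username anecdote, a serial value as the key with username uniqueness scoped per organization, might look like this. A sketch using SQLite from Python's standard library; the `app_user` table, `org` column, and sample names are all invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A serial surrogate is what other tables would reference; uniqueness
# is scoped to (org, username) from the start.
conn.execute("""
    CREATE TABLE app_user (
        id       INTEGER PRIMARY KEY,
        org      TEXT NOT NULL,
        username TEXT NOT NULL,
        UNIQUE (org, username)
    )
""")
conn.execute("INSERT INTO app_user (org, username) VALUES ('acme', 'bob')")
# The same username in a different organization is fine...
conn.execute("INSERT INTO app_user (org, username) VALUES ('globex', 'bob')")
# ...but a duplicate within one organization is still rejected.
try:
    conn.execute("INSERT INTO app_user (org, username) VALUES ('acme', 'bob')")
    within_org_dupe_allowed = True
except sqlite3.IntegrityError:
    within_org_dupe_allowed = False
print(within_org_dupe_allowed)  # False
```

Code that joins on `id` never notices when the uniqueness scope changes, which is exactly the painful rewrite the anecdote describes avoiding.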
On Thursday, April 28, 2011 08:31:09 PM Scott Ribe wrote:
> Well, natural keys are quite obviously the way to go, when they exist. The
> problem is, they usually don't really exist. What's usually proposed as a
> natural key will, upon further investigation, either not be guaranteed
> unique, or not guaranteed to be unchanging, or both.

There is no fundamental problem with changing primary keys. Sure, there are
challenges and situations where that's annoying and problematic, but it's
not as bad as often assumed.

Andres
On 05/01/2011 06:12 PM, Karsten Hilbert wrote:
> Good to know since I'm only a lowly medical doctor not
> having much schooling in database matters beyond this list,
> the PostgreSQL docs, and the Celko book.

This debate exists at all levels of experience, and the only thing that
changes as you get more experienced people involved is an increase in
anecdotes on each side. The sole time I ever found myself arguing with Joe
Celko was over an article he wrote recommending natural keys, using an
example from the automotive industry. Problem was, the specific example he
gave was flat out wrong. I was working in automotive MIS at the time, and
the thing he said would never change did, in fact, change every year--in
only a fraction of a percent of cases, in an extremely subtle way that
snuck up on people and wreaked much confusion. That's typical for an early
natural-key design: you get it working fine in V1.0, only to discover
months or years down the road that there's a case you never considered and
don't model correctly, and it may take some sort of conversion to fix.

The reason why there's a strong preference for surrogate keys is that they
always work, and you can avoid ever needing to come up with a better design
if you just use them and forget about it. The position Merlin has advocated
here, that there should always be a natural key available if you know the
data well enough, may be true. But few people are good enough designers to
be sure they've made the decision correctly, and the downsides of being
wrong can be a long, painful conversion process. It's easier for most
people to just eliminate the possibility of making a mistake by using
auto-generated surrogate keys, where the primary problem you'll run into is
merely using more space/resources than you might otherwise need. It
minimizes the worst case--a mistake made in the model, expensive
re-design--by adding overhead that makes the average case more expensive.
Software design usually has enough risks that any time you can eliminate
one just by throwing some resources at it, that's normally the right thing
to do.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Thanks very much for the thoughtful responses - very enlightening.
- Jim (the topic originator)
Jim Irrer irrer@umich.edu (734) 647-4409
University of Michigan Hospital Radiation Oncology
519 W. William St. Ann Arbor, MI 48103
On Mon, May 2, 2011 at 10:10 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> It minimizes the worst case--a mistake made in the model, expensive
> re-design--by adding overhead that makes the average case more expensive.
> Software design usually has enough risks that any time you can eliminate
> one just by throwing some resources at it, that's normally the right
> thing to do.

There are many practical arguments advocating the use of surrogate keys:
faster updates, easier schema changes, performance, maintenance costs down
the line, better tools integration, etc. These arguments basically involve
trade-offs that can justifiably be used to push you one way or the other.
That's all well and good.

However, I tend to disagree with arguments that you are better off not
identifying natural keys at all. To my mind, any database that has a table
with no discernible key, but whose records are referred to via another
table's foreign key, has a schema that is in a state of error. A surrogate
key is just added information to the database -- why should record X out of
N identical candidates be the one mapped? Is that logic repeatable? What
are the ramifications for joins that do not flow through the ID columns?
Typically what it means is that the rules that guard against duplicate data
entry are not, in fact, in the database at all but in the application, and
bad data can now get into your database through a much broader array of
causes. The last and best defense against a nasty and common class of data
errors has been removed.

The more complex your database is, the more it tends to be used by a large
array of clients, possibly even spanning multiple computer languages --
thus the need for a root system of constraint checking that is declarative
and easily understood. Sure, requirements change, models change, but at any
particular point in time a model with as little as possible (read: none) in
the way of outside inputs should be able to provably demonstrate verifiable
facts. With a natural-key database (or a surrogate-key database with
defined keys that are not used for RI) you have inherent constraint
checking that a purely surrogate database simply doesn't have. Whatever the
software maintenance costs are, which is itself a complex and debatable
topic, I'll go with the strategy that gives a better defense against bad or
ambiguous data.

merlin
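The "state of error" Merlin describes, a table whose only key is the surrogate, can be demonstrated directly: nothing stops two logically identical rows from coexisting under different ids. A minimal sketch using SQLite via Python's standard library; the table and data are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A purely surrogate-keyed table with no declared natural key.
conn.execute("""
    CREATE TABLE customer_bad (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL   -- no UNIQUE: duplicates sneak in silently
    )
""")
conn.execute("INSERT INTO customer_bad (email) VALUES ('jim@example.com')")
conn.execute("INSERT INTO customer_bad (email) VALUES ('jim@example.com')")
dupes = conn.execute(
    "SELECT count(*) FROM customer_bad WHERE email = 'jim@example.com'"
).fetchone()[0]
print(dupes)  # 2 -- duplicate enforcement now falls to the application
```

Which of the two rows other tables should reference is exactly the unanswerable "record X out of N identical candidates" question raised above.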
On Mon, 2011-05-02 at 11:10 -0400, Greg Smith wrote:
> The position Merlin has advocated here, that there should always be a
> natural key available if you know the data well enough, may be true. But
> few people are good enough designers to be sure they've made the decision
> correctly, and the downsides of being wrong can be a long, painful
> conversion process.

Once you really try to define "natural" and "surrogate" keys, I think a lot
of the arguments disappear. I wrote about this a few years back:

http://thoughts.j-davis.com/2007/12/11/terminology-confusion/

In particular, I think you are falsely assuming that a natural key must be
generated from an outside source (or some source outside of your control),
and is therefore not reliably unique.

You can generate your own keys, and if you hand them out to customers and
include them on paperwork, they are now a part of the reality that your
database models -- and therefore become natural keys. Invoice numbers,
driver's license numbers, etc., are all natural keys, because they are
known about, and used, in reality. Usernames are, too; the only difference
is that you let the user choose them.

In contrast, a pointer or a UUID typically does not represent reality,
because no humans ever see it and no computer systems outside yours know
about it. So it's merely an implementation detail and should not be part of
the model.

Regards,
Jeff Davis
On 03/05/11 08:25, Jeff Davis wrote:
> You can generate your own keys, and if you hand them out to customers
> and include them on paperwork, they are now a part of the reality that
> your database models -- and therefore become natural keys.

I've repeatedly run into situations where I generate a key that seems
entirely sensible, making a generated primary key part of the business
processes ... then external constraints force me to change the format of
that key or start accepting keys from outside. "Oh, we need to move to
14-digit client IDs because <x-system> that we interact with requires
them." "We want invoice numbers to include a Luhn check digit, can you add
that?" Etc.

I'm now strongly in favour of keeping an internal key that users never see,
and having separate user-visible identifiers. The users can demand that
those identifiers change format or generation method, and it's an easy
change in only one place. Want two different keys? I can do that too.
Record a key that matches some external system? That's easy. Want to be
able to edit/override/rename keys? Yep, that's fuss-free too, and it won't
affect my audit history (which uses the real internal keys) or have to
cascade to foreign key relationships in append-only ledger tables.

I use a mix of surrogate and natural keys, depending on the situation. I
see little point in surrogate keys for simple lookup tables, but find them
invaluable in audited tables with lots of foreign key relationships that
interact with other business systems.

--
Craig Ringer
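Craig's pattern, an internal key users never see plus a separate user-visible identifier, might be sketched as follows. This uses SQLite from Python's standard library for a self-contained example; the invoice table and number formats are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoice (
        id         INTEGER PRIMARY KEY,  -- internal key, never shown to users
        invoice_no TEXT NOT NULL UNIQUE  -- user-visible identifier
    )
""")
conn.execute("INSERT INTO invoice (invoice_no) VALUES ('INV-0001')")

# External constraints force a new user-visible format (say, 14 digits).
# Only the display identifier changes, in one place; the internal key,
# and anything referencing it, stay untouched.
conn.execute(
    "UPDATE invoice SET invoice_no = '00000000000001' WHERE id = 1"
)
row = conn.execute("SELECT id, invoice_no FROM invoice").fetchone()
print(row)  # (1, '00000000000001')
```

Audit history and foreign keys keyed on `id` survive the format change unchanged, which is the point of keeping the two identifiers separate.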
On Mon, May 2, 2011 at 7:43 PM, Craig Ringer <craig@postnewspapers.com.au> wrote:
> I'm now strongly in favour of keeping an internal key that users never
> see, and having separate user-visible identifiers.

It's far easier to change a unique constraint on the fly than a PK in a
busy production database.
Jeff Davis wrote: > On Mon, 2011-05-02 at 11:10 -0400, Greg Smith wrote: >> The position Merlin >> has advocated here, that there should always be a natural key available >> if you know the data well enough, may be true. But few people are good >> enough designers to be sure they've made the decision correctly, and the >> downsides of being wrong can be a long, painful conversion process. >> Easier for most people to just eliminate the possibility of making a >> mistake by using auto-generated surrogate keys, where the primary >> problem you'll run into is merely using more space/resources than you >> might otherwise need to have. It minimizes the worst-case--mistake make >> in the model, expensive re-design--by adding overhead that makes the >> average case more expensive. > > Once you really try to define "natural" and "surrogate" keys, I think a > lot of the arguments disappear. I wrote about this a few years back: > > http://thoughts.j-davis.com/2007/12/11/terminology-confusion/ > > In particular, I think you are falsely assuming that a natural key must > be generated from an outside source (or some source outside of your > control), and is therefore not reliably unique. > > You can generate your own keys, and if you hand them out to customers > and include them on paperwork, they are now a part of the reality that > your database models -- and therefore become natural keys. Invoice > numbers, driver's license numbers, etc., are all natural keys, because > they are known about, and used, in reality. Usernames are, too, the only > difference is that you let the user choose it. > > In contrast, a pointer or a UUID typically does not represent reality, > because no humans ever see it and no computer systems outside yours know > about it. So, it's merely an implementation detail and should not be a > part of the model. > > Regards, > Jeff Davis > My wife works (at the sql level) with shall we say "records about people". Real records, real people. 
Somewhere around 2 million unique individuals, several million source records. They don't all have ssn, they don't all have a drivers license. They don't all have an address, many have several addresses (especially over time) and separate people have at one time or another lived at the same address. You would be surprised how many "bob smith"s were born on the same day. But then they weren't all born in a hospital etc etc etc. A person may present on any of a birth record, a death record, a hospital record, a drivers license, a medical registry, a marriage record and so on. There simply is no natural key for a human. We won't even worry about the non-uniqueness of ssn. And please don't get her started on twins. :) I can only imagine that other equally complex entities are just as slippery when it comes time to pinpoint the natural key. rjs
> -----Original Message----- > From: pgsql-general-owner@postgresql.org [mailto:pgsql-general- > owner@postgresql.org] On Behalf Of Rob Sargent > Sent: Monday, May 02, 2011 7:07 PM > To: Jeff Davis > Cc: Greg Smith; pgsql-general@postgresql.org > Subject: Re: [GENERAL] pervasiveness of surrogate (also called > synthetic) keys > > > > Jeff Davis wrote: > > On Mon, 2011-05-02 at 11:10 -0400, Greg Smith wrote: > >> The position Merlin > >> has advocated here, that there should always be a natural key > available > >> if you know the data well enough, may be true. But few people are > good > >> enough designers to be sure they've made the decision correctly, and > the > >> downsides of being wrong can be a long, painful conversion process. > >> Easier for most people to just eliminate the possibility of making a > >> mistake by using auto-generated surrogate keys, where the primary > >> problem you'll run into is merely using more space/resources than > you > >> might otherwise need to have. It minimizes the worst-case--mistake > make > >> in the model, expensive re-design--by adding overhead that makes the > >> average case more expensive. > > > > Once you really try to define "natural" and "surrogate" keys, I think > a > > lot of the arguments disappear. I wrote about this a few years back: > > > > http://thoughts.j-davis.com/2007/12/11/terminology-confusion/ > > > > In particular, I think you are falsely assuming that a natural key > must > > be generated from an outside source (or some source outside of your > > control), and is therefore not reliably unique. > > > > You can generate your own keys, and if you hand them out to customers > > and include them on paperwork, they are now a part of the reality > that > > your database models -- and therefore become natural keys. Invoice > > numbers, driver's license numbers, etc., are all natural keys, > because > > they are known about, and used, in reality. 
Usernames are, too, the > only > > difference is that you let the user choose it. > > > > In contrast, a pointer or a UUID typically does not represent > reality, > > because no humans ever see it and no computer systems outside yours > know > > about it. So, it's merely an implementation detail and should not be > a > > part of the model. > > > > Regards, > > Jeff Davis > > > My wife works (at the sql level) with shall we say "records about > people". Real records, real people. Somewhere around 2 million unique > individuals, several million source records. They don't all have ssn, > they don't all have a drivers license. They don't all have an address, > many have several addresses (especially over time) and separate people > have at one time or another lived at the same address. You would be > surprise how many "bob smith"s where born on the same day. But then > they weren't all born in a hospital etc etc etc. A person may present > on any of a birth record, a death record, a hospital record, a drivers > license, a medical registry, a marriage record and so on. There simply > is no natural key for a human. We won't even worry about the > non-uniqueness of ssn. And please don't get her started on twins. :) > > > I can only imagine that other equally complex entities are just as > slippery when it comes time to pinpoint the natural key. People are sometimes surprised to discover the social security numbers are not unique. There are fraudulent social security numbers: http://www2.nbc4i.com/news/2010/dec/06/2/study-finds-millions-duplicate-social-security-num-ar-316988/ There are numbers given out by the IRS by mistake: http://wnyt.com/article/stories/S1594530.shtml?cat=10115
On 03/05/11 10:06, Rob Sargent wrote: > My wife works (at the sql level) with shall we say "records about > people". Real records, real people. Somewhere around 2 million unique > individuals, several million source records. They don't all have ssn, > they don't all have a drivers license. They don't all have an address, > many have several addresses (especially over time) and separate people > have at one time or another lived at the same address. ... and that's before we get into the horror of "what is someone's name". Which name? Which spelling? Do they even have a single canonical name? Is their canonical name - if any - expressible in the character set used by the service? Is it even covered by Unicode?!? Does it make any sense to split their name up into the traditional english-speaking-recent-western "family" and "given" name parts? Is there a single consistent way to do so for their name even if it does? etc. SSN? What if they don't live in the US or aren't a citizen? What if they have more than one SSN (yes, it happens)? Or there's one being shared by more than one person (again, this happens) and they can't get it fixed or don't want to? My mother's postal address - before street numbering was introduced a few years ago on her road - used to be "Rural Delivery Area 2, SomeTownName, New Zealand". You'd be amazed how many systems could not cope with that; she used to have to register all sorts of things to her parents' address in the nearby town. People decide to key a database on US Social Security number - because it's a "unique natural key" - then discover the exciting problems with that. While they're battling those, business needs change and the database starts needing to accept people from other countries, who don't have a US SSN and don't know what one is or what it looks like. Records about people and companies are ideal candidates for synthetic keying with a solid split/merge procedure to handle duplicates, overlapping identity records, etc. 
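The "synthetic keying with a split/merge procedure" Craig mentions might look something like this (a sketch only; the table and column names are invented):

```sql
-- One synthetic identity per presumed person; no natural key assumed.
CREATE TABLE person (
    person_id  bigserial PRIMARY KEY
);

-- Each source document (birth record, licence, ...) is kept as-is and
-- merely points at the identity it is currently believed to belong to.
CREATE TABLE source_record (
    record_id  bigserial PRIMARY KEY,
    person_id  bigint NOT NULL REFERENCES person (person_id),
    source     text NOT NULL,   -- 'birth', 'death', 'licence', ...
    details    text
);

-- Merging two identities later is just repointing source records,
-- e.g. folding person 43 into person 42:
--   UPDATE source_record SET person_id = 42 WHERE person_id = 43;
--   DELETE FROM person WHERE person_id = 43;
```

Because identity is only ever an assignment on the source records, a bad merge can be undone by repointing rows back, which is exactly what makes the synthetic key safer than guessing a natural one.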
-- Craig Ringer
On 05/02/2011 10:06 PM, Rob Sargent wrote: > You would be surprise how many "bob smith"s where born on the same > day. But then they weren't all born in a hospital etc etc etc. I wouldn't be surprised. I once lived in a mile-square town (Hoboken; that's its nickname). In that town were 40K residents and three gyms. I forgot my ID card one day when going to mine, and they took my name and street name as alternate proof of identity. Some designer along the line figured that was unique enough. Number of Greg Smiths living on that street who were members of that one gym? Three. I see this whole area as being similar to SQL injection. The same way that you just can't trust data input by the user to ever be secure, you can't trust that inputs to your database will ever be unique in the way you expect them to be. So if you build a so-called "natural key" based on them, expect that to break one day. That doesn't mean you can't use them as a sort of foreign key indexing the data; it just means you can't make them the sole unique identifier for a particular entity, where that entity is a person, company, or part. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
otoh, there's plenty of places where natural keys are optimal. my company makes widgets, and we make damn sure our serial #s and part numbers are unique, and we use them as PKs for the various tables. further, the PN has an N digit prefix which is unique to a part family, then an M digit suffix which identifies a specific version of that PN. we use the N digit PN for the family tables, and the full N+M digit PN for the full PN tables. serial # is globally unique across all PNs so it's the PK of any table related directly to a widget.
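John's scheme might be modelled along these lines (a sketch; N=5 and M=3 are assumed purely for illustration, and all names are invented):

```sql
CREATE TABLE part_family (
    family_pn  char(5) PRIMARY KEY   -- the N digit prefix, one row per family
);

CREATE TABLE part (
    full_pn    char(8) PRIMARY KEY,  -- N+M digits: family prefix + version suffix
    family_pn  char(5) NOT NULL REFERENCES part_family (family_pn),
    CHECK (substr(full_pn, 1, 5) = family_pn)  -- suffix must extend its own family
);

CREATE TABLE widget (
    serial_no  text PRIMARY KEY,     -- globally unique across all PNs
    full_pn    char(8) NOT NULL REFERENCES part (full_pn)
);
```

The CHECK constraint encodes the prefix relationship John describes, so a full PN can never claim membership in a family its prefix doesn't match.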
On 03/05/11 11:07, Greg Smith wrote: > That doesn't mean you can't use > them as a sort of foreign key indexing the data; it just means you can't > make them the sole unique identifier for a particular entity, where that > entity is a person, company, or part. Classic case: a database here has several tables indexed by MAC address. It's used for asset reporting and software inventory. Problem: VMs generate random MAC addresses by default. They're not guaranteed to be globally unique. Collisions have happened and will probably happen again. In this case, it wasn't a big deal, but it just goes to show that even the "obviously" globally unique isn't necessarily so. -- Craig Ringer
John R Pierce wrote: > otoh, there's plenty of places where natural keys are optimal. my > company makes widgets, and we make damn sure our serial #s and part > numbers are unique, and we use them as PK's for the various tables. > further, the PN has a N digit prefix which is unique to a part family, > then a M digit suffix which identifies a specific version of that > PN. we use the N digit PN for the family tables, and the full N+M > digit PN for the full PN tables. serial # is globally unique across > all PNs so its the PK of any table related directly to a widget. > > > I'm surprised to see, as a defence of natural keys, an arbitrary identifier composed of references to arbitrary identifiers elsewhere in the system being described. How many ways is this just wrong? That you're assigning the serial numbers does not distinguish them from a UUID generator, oh except you've put semantics into the value. oh and you're at risk of having to reformat them when you buy out your competitor. Starts to look like the database-in-a-name scheme of which I'm oh so fond. I thought a natural key for a part would be more like "bolt=geometry:head-mm:head-depth:thread-per-inch:etc". And for something as describable as bolt indeed the complete record could well qualify as a natural key especially if none of the attributes are null-able (which is a believable concept for bolt). But bolt is not nut and both are parts. Gets messy quickly without arbitrary simple keys. I think you're short on the "simple" part in your defn of serial number. And interestingly you're dealing with types. What does one do when one has to track instances of bolts? Given that all bolts of a certain natural key are identical, save for the fact that one was made before the other, the job is to register every bolt against its (current) deployment. Naturally one assigns each bolt a non-upc barcode aka UUID.
Craig Ringer wrote: > On 03/05/11 11:07, Greg Smith wrote: > > >> That doesn't mean you can't use >> them as a sort of foreign key indexing the data; it just means you can't >> make them the sole unique identifier for a particular entity, where that >> entity is a person, company, or part. >> > > Classic case: a database here has several tables indexed by MAC address. > It's used for asset reporting and software inventory. > > Problem: VMs generate random MAC addresses by default. They're not > guaranteed to be globally unique. Collisions have happened and will > probably happen again. In this case, it wasn't a big deal, but it just > goes to show that even the "obviously" globally unique isn't necessarily so. > > -- > Craig Ringer > Hm.. Virtual machines as assets. Mortgage backed securities, anyone.
On 03/05/11 12:57, Rob Sargent wrote: > Hm.. Virtual machines as assets. Mortgage backed securities, anyone. Well, sure ... but the software running on them is tracked as part of licensing compliance efforts, whether or not the virtual hardware is itself an "asset". The DB designer chose to use the host's MAC address to identify the host, and the tracking software can't tell the difference between a VM and a physical host. The other problem with that is that it doesn't deal well with multi-homed hosts. All in all, it's just a dumb design, and was fixed a while ago, but it's an illustration of how something that seems like an obvious natural key doesn't always stay one. -- Craig Ringer
On May 2, 2011, at 10:53 PM, Rob Sargent wrote: > ...and you're at risk of having to reformat them when you buy out your competitor. The scheme described was awfully similar to one that a client of mine used: product family prefix, identifiers within the family. And guess what? The scheme, which had been stable for 20+ years, had to change when a new variant of product was introduced which cut across family & product. I don't remember the details. I do remember that I hadn't used the supposedly stable product ids as PKs ;-) -- Scott Ribe scott_ribe@elevated-dev.com http://www.elevated-dev.com/ (303) 722-0567 voice
John R Pierce wrote: > otoh, there's plenty of places where natural keys are optimal. my > company makes widgets, and we make damn sure our serial #s and part > numbers are unique, and we use them as PK's for the various tables. Sure; what I was commenting on is that you normally can't ever trust external sources for identifiers. If you want to come up with your own, internally unique keys for things, great. But one of the goals of using a natural key is often to avoid the overhead of storing both that ID and some made up internal number, too. And whether the number is made up by the computer (the classic SERIAL or similar surrogate key), or you make one up yourself, it's still another chunk of data that gets stored for every item. It's just one that means something more useful in your case. Probably going to take up more space in the process and possibly be slower though--part number strings can easily end up longer than SERIAL-like integers. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Tue, May 03, 2011 at 10:52:23AM +0800, Craig Ringer wrote: > ... and that's before we get into the horror of "what is someone's > name". Which name? Which spelling? Do they even have a single canonical > name? - people have, at least over time, several compound names - they have, at any one time, one active compound name - additional spellings can be tracked as additional names of that individual > Is their canonical name - if any - expressable in the character > set used by the service? Is it even covered by Unicode?!? - I haven't seen evidence to the contrary. - But then, I haven't had a need to store a Klingon name. - Yes, it's been difficult to come up with something sensible to store Spock's first name in the GNUmed database. > Does it make > any sense to split their name up into the traditional > english-speaking-recent-western "family" and "given" name parts? - any compound names I have come across work like this: - group name - individual name - nicknames (pseudonyms, warrior names, actor names, ...) The day-to-day usage of each part varies, though. > Is there a single consistent way to do so for their name even if it does? etc. Even in Japan, where the group counts for a lot more than the individual, you can clearly split into group name and individual name. Karsten -- GPG key ID E4071346 @ gpg-keyserver.de E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On May 2, 2011, at 10:52 PM, Craig Ringer wrote: > SSN? What if they don't live in the US or aren't a citizen? Non-citizens can have SSNs (they have to if they work in the US). -- Rick Genter rick.genter@gmail.com
On Mon, May 2, 2011 at 11:53 PM, Craig Ringer <craig@postnewspapers.com.au> wrote: > On 03/05/11 11:07, Greg Smith wrote: > >> That doesn't mean you can't use >> them as a sort of foreign key indexing the data; it just means you can't >> make them the sole unique identifier for a particular entity, where that >> entity is a person, company, or part. > > Classic case: a database here has several tables indexed by MAC address. > It's used for asset reporting and software inventory. > > Problem: VMs generate random MAC addresses by default. They're not > guaranteed to be globally unique. Collisions have happened and will > probably happen again. In this case, it wasn't a big deal, but it just > goes to show that even the "obviously" globally unique isn't necessarily so. It's precisely pathological cases like this where uniqueness constraints are important and should be used. By the way, we aren't debating the use of natural keys but whether to define uniqueness constraints. My main gripe with surrogates is that their use often leads directly to lazy schema design where uniqueness constraints are not defined, which leads to data problems exactly like the case you described above. In a purely surrogate table with no unique on the mac, suppose you have two records with the same value for the address, and there are no other interesting fields on the table or elsewhere in the database: *) who/what made the decision to place a second record on the table? *) is that decision verifiable? *) is that decision repeatable? *) are there other routes of data entry into the database that bypass that decision? will there ever be? *) what happens when the code that represents that decision has or develops a bug? *) why would you not want information supporting that decision in the database? *) how do you know the tables keying to your mac table are pointing to the correct record? *) what are the consequences for ad hoc queries that join directly against the mac? DISTINCT? 
If your data modeler made the assumption that a MAC is unique (a mistake, obviously), at least the other tables are protected from violations of that assumption because the database would reject them with an error, which is a Good Thing. Without a uniqueness constraint you now have ambiguous data, which is a Very Bad Thing. Without proper unique constraints, a generated key is effectively saying "well, I can't figure this out right now...I'll deal with it later". That unmanaged complexity is now in the application and all the queries that touch the database...you've created your own bug factory. With a uniqueness constraint, you have a rigorous definition of what your record represents, and other entities in the database can now rely on that definition. Natural key designs are good for a lot of reasons, but #1 on the list is that they force you to deal with problems in your data model up front, because they force you to define uniqueness. If the MAC turns out not to be unique and the problem is not in fact coming from the input data or the application, yes, you do have to correct the model, but at least the data inside the database is clean, and can be unambiguously mapped to the new model. I'll take schema changes over bad data. Correcting the model means you have to figure out whatever information is used to distinguish identical MACs #1 and #2 and store it in the database, because now your data and the corresponding decisions are verifiable, repeatable, unambiguous, etc. What extra field you have to add to your 'mac' table to make it unique would depend on certain things, but it's certainly a solvable problem, and when solved would give you a more robust database. merlin
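The difference Merlin is pointing at fits in a couple of lines of DDL (a hypothetical table; macaddr is PostgreSQL's built-in MAC address type):

```sql
-- Surrogate key only: nothing stops a second row with the same MAC,
-- and every query that assumes MAC uniqueness silently breaks.
CREATE TABLE host_no_constraint (
    host_id  bigserial PRIMARY KEY,
    mac      macaddr NOT NULL
);

-- Surrogate key plus an explicit uniqueness constraint: the bad insert
-- is rejected with an error at the moment the assumption first fails.
CREATE TABLE host (
    host_id  bigserial PRIMARY KEY,
    mac      macaddr NOT NULL UNIQUE
);
```

Both tables use a surrogate key; only the second one records the modeler's uniqueness assumption where the database can enforce it.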
On Mon, 2011-05-02 at 20:06 -0600, Rob Sargent wrote: > Jeff Davis wrote: > > In particular, I think you are falsely assuming that a natural key must > > be generated from an outside source (or some source outside of your > > control), and is therefore not reliably unique. > > > > You can generate your own keys... ... > My wife works (at the sql level) with shall we say "records about > people". Real records, real people. Somewhere around 2 million unique > individuals, several million source records. They don't all have ssn, > they don't all have a drivers license. They don't all have an address, > many have several addresses (especially over time) and separate people > have at one time or another lived at the same address. You would be > surprise how many "bob smith"s where born on the same day. But then > they weren't all born in a hospital etc etc etc. A person may present > on any of a birth record, a death record, a hospital record, a drivers > license, a medical registry, a marriage record and so on. There simply > is no natural key for a human. We won't even worry about the > non-uniqueness of ssn. And please don't get her started on twins. :) > > > I can only imagine that other equally complex entities are just as > slippery when it comes time to pinpoint the natural key. I think you missed my point. You don't have to rely on natural keys that come from somewhere else; you can make up your own, truly unique identifier. Regards, Jeff Davis
On 05/03/2011 12:51 PM, Jeff Davis wrote: > On Mon, 2011-05-02 at 20:06 -0600, Rob Sargent wrote: >> Jeff Davis wrote: >>> In particular, I think you are falsely assuming that a natural key must >>> be generated from an outside source (or some source outside of your >>> control), and is therefore not reliably unique. >>> >>> You can generate your own keys... > > ... > >> My wife works (at the sql level) with shall we say "records about >> people". Real records, real people. Somewhere around 2 million unique >> individuals, several million source records. They don't all have ssn, >> they don't all have a drivers license. They don't all have an address, >> many have several addresses (especially over time) and separate people >> have at one time or another lived at the same address. You would be >> surprise how many "bob smith"s where born on the same day. But then >> they weren't all born in a hospital etc etc etc. A person may present >> on any of a birth record, a death record, a hospital record, a drivers >> license, a medical registry, a marriage record and so on. There simply >> is no natural key for a human. We won't even worry about the >> non-uniqueness of ssn. And please don't get her started on twins. :) >> >> >> I can only imagine that other equally complex entities are just as >> slippery when it comes time to pinpoint the natural key. > > I think you missed my point. You don't have to rely on natural keys that > come from somewhere else; you can make up your own, truly unique > identifier. > > Regards, > Jeff Davis > Sorry, but I'm confused, but that's common. Isn't a "natural key" to be composed solely from the attributes of the entity? As in a subset of the columns of the table in a third-normalish world. Isn't tacking on another column with a concocted id joining the "pervasiveness"?
On Tue, 2011-05-03 at 13:35 -0600, Rob Sargent wrote: > Sorry, but I'm confused, but that's common. Isn't a "natural key" to be > compose solely from the attributes of the entity? As in a subset of the > columns of the table in a third-normalish world. Isn't tacking on > another column with a concocted id joining the "pervassiveness"? Not in my opinion. Before cars existed, there was no driver's license number. The DMV (as it's called in California, anyway) created it, and it's now a key that they can trust to be unique. It's also an attribute of the entity now, because it's printed on the cards you hand to people. The thing that I think is a mistake is to use generated IDs like an internal implementation detail (i.e. hide them like pointers); then at the same time mix them into the data model. Regards, Jeff Davis
On Mon, 2011-05-02 at 23:07 -0400, Greg Smith wrote: > I see this whole area as being similar to SQL injection. The same way > that you just can't trust data input by the user to ever be secure, you > can't trust inputs to your database will ever be unique in the way you > expect them to be. So, don't trust them to be unique then. Make up your own unique identifier, and use that. Regards, Jeff Davis
On 05/03/2011 03:08 PM, Jeff Davis wrote: > On Tue, 2011-05-03 at 13:35 -0600, Rob Sargent wrote: >> Sorry, but I'm confused, but that's common. Isn't a "natural key" to be >> compose solely from the attributes of the entity? As in a subset of the >> columns of the table in a third-normalish world. Isn't tacking on >> another column with a concocted id joining the "pervassiveness"? > > Not in my opinion. Before cars existed, there was no driver's license > number. The DMV (as it's called in California, anyway) created it, and > it's now a key that they can trust to be unique. It's also an attribute > of the entity now, because it's printed on the cards you hand to people. > > The thing that I think is a mistake is to use generated IDs like an > internal implementation detail (i.e. hide them like pointers); then at > the same time mix them into the data model. > > Regards, > Jeff Davis > > > Well yes, it does all depend on how you model things after all. I think a driver's license is an attribute of driver, not person. So before cars, one still had a hard time coming up with a natural key on person. Of course California's DMV only cares about California-licensed drivers, so they get to generate and assign license number as an arbitrary key for drivers, 'cause under that we're back to person.
Merlin Moncure wrote: > If your data modeler that made the the assumptions that a MAC is > unique (a mistake obviously) at least the other tables are protected > from violations of that assumption because the database would reject > them with an error, which is a Good Thing. Without a uniqueness > constraint you now have ambiguous data which is a Very Bad Thing. > With a uniqueness constraint in this situation, the unexpected data--row with a non unique MAC--will be rejected and possibly lost when the insertion happens. You say that's a good thing; plenty of people will say that's the worst possible thing that can happen. When dealing with external data, it's often impossible to know everything you're going to see later at design time. Approaching that problem with the idea that you're going to lose any data that doesn't fit into the original model is not what everyone finds reasonable behavior. I don't think it's possible to decide in a generic way which of these is the better approach: to reject unexpected data and force the problem back at the application immediately (commit failure), or to accept it because you're using a surrogate key and discover the problems down the line. Both are valid approaches with a very different type of risk associated with them. I think it's fair to say that real-world data is not always well known enough at design time to follow the idea you're suggesting, though, and that does factor into why there is such a preference for surrogate keys in the industry. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Jeff Davis wrote: > On Mon, 2011-05-02 at 23:07 -0400, Greg Smith wrote: > >> I see this whole area as being similar to SQL injection. The same way >> that you just can't trust data input by the user to ever be secure, you >> can't trust inputs to your database will ever be unique in the way you >> expect them to be. >> > > So, don't trust them to be unique then. Make up your own unique > identifier, and use that. > If you're making up your own unique identifier, that's closer to a surrogate key as far as I'm concerned, even though it doesn't fit the strict definition of that term (it doesn't have the subtle idea that "surrogate" implies "meaningless"). Now, there is some value to doing that well, instead of just using the typical incrementing integer "pointer" approach, as you've called it. But if it's not derived from external data you're storing anyway, it's not a true natural key either. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On May 3, 2011, at 22:03, Greg Smith <greg@2ndquadrant.com> wrote: > Merlin Moncure wrote: >> If your data modeler that made the the assumptions that a MAC is >> unique (a mistake obviously) at least the other tables are protected >> from violations of that assumption because the database would reject >> them with an error, which is a Good Thing. Without a uniqueness >> constraint you now have ambiguous data which is a Very Bad Thing. >> > > With a uniqueness constraint in this situation, the unexpected data--row with a non unique MAC--will be rejected and possibly lost when the insertion happens. You say that's a good thing, plenty of people will say that's the worst possible thing that can happen. When dealing with external data, it's often impossible to know everything you're going to see later at design time. Approaching that problem with the idea that you're going to lose any data that doesn't fit into the original model is not what everyone finds reasonable behavior. > > I don't think it's possible to decide in a generic way which of these is the better approach: to reject unexpected data and force the problem back at the application immediately (commit failure), or to accept it because you're using a surrogate key and discover the problems down the line. Both are valid approaches with a very different type of risk associated with them. I think it's fair to say that real-world data is not always well known enough at design time to follow the idea you're suggesting though, and that does factor into why there is such a preference for surrogate keys in the industry. 
> > -- > Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD > PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us > "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general If you implicitly expect MAC to be unique but do not enforce it then you are likely to write queries that rely upon that uniqueness. If you do enforce it then the constraint failure occurs anyway. A scalar sub-query that links via the MAC will fail when the duplicate data is encountered, and normal queries will return too many records. A detail report may be obvious but if you are summarizing the data the specific offending record is going to require some effort to find. I guess if you are the compromising type you can loosely enforce the uniqueness by running a check query periodically to see if supposedly unique values have been duplicated. I agree there is no right answer - the designer needs to make trade-offs - but I'd rather reject new data and leave the system in a status-quo stable state instead of introducing invalid data and putting the system into a state where it requires effort to get it functioning again. If you accept the invalid data the likely scenario, if something breaks, is someone finds the offending record and removes it until the application and database can be fixed properly - which is where we are at with validation. The common exception is where identifiers are reused over time and you remove the old record in order to keep/allow the newer record to remain. On a tangential course I've started considering a setup whereby you basically have two identifiers for a record. One is end-user facing and updatable whereas the other is static and used in intra-table relations. 
You can create a new record with the same user-facing id as an existing id, but the existing id will be replaced with its system id. This is useful when users will be using the id often and it can be reasonably assumed to be unique over a moderate period of time (say a year). Invoice numbers and customer numbers are two common examples. The lookup id itself may require additional fields in order to qualify as a primary (natural) key, but the static key wants to be a single field. Often simply putting a date with the original id (and parent identifiers) is sufficient due to the infrequency of updates. The downside is, with string-based parent identifiers the pk value can be quite long. I currently have PKs of 40-50 length, but during my new design my first pass on a couple of tables indicated a >100 character limit. Is there any rule-of-thumb on the performance of a PK as a function of key length? I like using varchar based identifiers since I tend to query tables directly and writing where clauses is much easier if you can avoid the joins. I'm likely better off creating views and querying those but am still curious on any basic thoughts on having a 100+ length primary key. David J.
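David's two-identifier setup might be sketched like this (all names are hypothetical):

```sql
CREATE TABLE invoice (
    invoice_sysid  bigserial PRIMARY KEY,  -- static; used for intra-table relations
    invoice_no     text NOT NULL,          -- end-user facing; updatable/reusable
    issued_on      date NOT NULL,
    UNIQUE (invoice_no, issued_on)         -- the lookup id needs a qualifier
);

CREATE TABLE invoice_line (
    invoice_sysid  bigint NOT NULL REFERENCES invoice (invoice_sysid),
    line_no        int    NOT NULL,
    PRIMARY KEY (invoice_sysid, line_no)
);

-- Reusing invoice_no '2011-0001' in a later year creates a new
-- invoice_sysid; nothing referencing the old invoice has to change.
```

This keeps the single-field static key for all joins while letting the multi-field user-facing identifier stay long, descriptive, and editable.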
David Johnston wrote:
> Are there any rules of thumb on the performance of a PK as a function of key length? I like using varchar-based identifiers since I tend to query tables directly and writing where clauses is much easier if you can avoid the joins. I'm likely better off creating views and querying those but am still curious on any basic thoughts on having a 100+ character primary key.

The shorter the better, but it may not be as bad as you fear. The way B-tree indexes are built, it isn't that expensive to hold a longer key so long as the unique part doesn't average out to be that long. So if you insert "123456666666666666666" and "12345777777777777777", that's not going to be much different than navigating "123456" and "123457", because once you get that far you've already reached a unique prefix. But if your entries have a really long common prefix, like "111111111111111112" and "111111111111111113", that's going to be more expensive to deal with--even though the strings are the same length.

If your identifiers become unique after only a few characters, it may not be so bad. But if they go many characters before you can distinguish between any two entries, you're probably not going to be happy with the performance or size of the indexes, relative to simple integer keys.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
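One way to check this empirically is to compare index sizes for a long-common-prefix text key versus a plain integer key. This is only a sketch - the table names are made up and exact sizes will vary by version and fill factor:

```sql
-- Hypothetical experiment: long common-prefix text PK vs. integer PK.
CREATE TABLE t_text (id text PRIMARY KEY);
CREATE TABLE t_int  (id integer PRIMARY KEY);

INSERT INTO t_text
    SELECT 'invoice/2011/dept-accounting/' || g
    FROM generate_series(1, 100000) g;
INSERT INTO t_int
    SELECT g FROM generate_series(1, 100000) g;

-- The text PK index should come out several times larger.
SELECT pg_size_pretty(pg_relation_size('t_text_pkey')) AS text_pk,
       pg_size_pretty(pg_relation_size('t_int_pkey'))  AS int_pk;
```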
On Thu, Apr 28, 2011 at 12:29 PM, Jim Irrer <irrer@umich.edu> wrote:

*) most tables don't have unique natural keys (let's see em)
etc
On Wed, May 4, 2011 at 7:14 AM, Misa Simic <misa.simic@gmail.com> wrote:
> 2011/4/28 Merlin Moncure <mmoncure@gmail.com>
>>
>> On Thu, Apr 28, 2011 at 12:29 PM, Jim Irrer <irrer@umich.edu> wrote:
>> *) most tables don't have unique natural keys (let's see em)
>> etc
>
> i.e. for an Invoice, we have at least 2 tables (more in practice...):
> Invoice Header
>  - Invoice Number
>  - Date
>  - CustomerID
>  - Currency
> Invoice Line
>  - ItemID
>  - qty
>  - Price
>
> This is not a real design for tables, just the basic info stored to represent 1 Invoice Document.
> Now, let's say for Invoice Header the "Natural" PK is Invoice Number, so we will add that column to the Invoice Line table as an FK...
> What should be the "Natural" PK for the Invoice Line table?
> Also, if we now add additional tables, i.e. Serial Numbers, related to Invoice Line, what should be the "Natural" key in it?

Most of the old school accounting systems maintained an invoice line number.

> Invoice Line
>  - Invoice Number
>  - LineNo
>  - ItemID
>  - qty
>  - Price

The line number started from 1 (the first line on the invoice) on every unique invoice. This has the added benefit of the primary key index (invoice_no, line_no) allowing you to pull up the invoice line records in correct order without requiring an extra sort and, if you don't need any fields from the invoice, a join.

Compare the two queries pulling up invoice lines over a range of invoice numbers.

merlin
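A sketch of the composite-natural-key layout Merlin describes (hypothetical names; illustrative, not anyone's production schema):

```sql
CREATE TABLE invoice_line (
    invoice_no integer NOT NULL,
    line_no    integer NOT NULL,   -- starts at 1 for each invoice
    item_id    integer NOT NULL,
    qty        numeric NOT NULL,
    price      numeric NOT NULL,
    PRIMARY KEY (invoice_no, line_no)
);

-- Pull lines over a range of invoices, already in invoice/line order:
-- the composite PK index satisfies both the range predicate and the
-- ORDER BY, so no extra sort (and no join to the header) is needed.
SELECT invoice_no, line_no, item_id, qty, price
FROM invoice_line
WHERE invoice_no BETWEEN 1000 AND 1100
ORDER BY invoice_no, line_no;
```

With a single surrogate `line_id` key instead, the same query would need a join to the header table for the invoice number plus a sort on the result.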
> Thanks, merlin,
> And in that case, what is "Natural" in LineNo? I would say, with adding LineNo we are creating a synthetic/surrogate key (just instead of 1 surrogate column it will be a compound key with more columns...)? The same is with all other tables that are "parts" of an Entity: Serial Numbers, Accounting Distribution produced by Invoice... etc etc...
Being the “first line” or the “second line” of a physical invoice is a property of that line. Identifying its position on the invoice is only natural.

By your reasoning all identifiers are synthetically generated if you consider there is never truly only a single instance of anything in the multi-verse. The only truly unique identifier would be the time+place of an object's creation.

“Hello - person born in Liverpool London, St. Whatever hospital, Room 101 @ 13:14:57 on the 5th of March 2001 – how may I direct your call?” (I guess you could use the conception date as well, although twins+ might be tough to distinguish in that case).
Generally it could be argued that any well-normalized compound key is inherently natural (whether left as multiple fields or concatenated into a single field). The identifier that is assigned to the “part” individually is likely to be “synthetic” but its membership in the hierarchy naturalizes it.
David J.
On Wed, May 04, 2011 at 09:33:57AM -0400, David Johnston wrote:
> “Hello - person born in Liverpool London, St. Whatever hospital, Room 101 @ 13:14:57 on the 5th of March 2001 – how may I direct your call?” (I guess you could use the conception date as well

That will rarely be known to any sufficient precision.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On Wed, May 4, 2011 at 7:50 AM, Misa Simic <misa.simic@gmail.com> wrote:
> 2011/5/4 Merlin Moncure <mmoncure@gmail.com>
>>
>> Most of the old school accounting systems maintained an invoice line number.
>> > Invoice Line
>> > -Invoice Number
>> > -LineNo
>> > -ItemID
>> > -qty
>> > -Price
>>
>> The line number started from 1 (the first line on the invoice) on every unique invoice. This has the added benefit of the primary key index (invoice_no, line_no) allowing you to pull up the invoice line records in correct order without requiring an extra sort and, if you don't need any fields from the invoice, a join.
>>
>> Compare the two queries pulling up invoice lines over a range of invoice numbers.
>>
>> merlin
>
> Thanks, merlin,
> And in that case, what is "Natural" in LineNo? I would say, with adding LineNo we are creating a synthetic/surrogate key (just instead of 1 surrogate column it will be a compound key with more columns...)? The same is with all other tables that are "parts" of an Entity: Serial Numbers, Accounting Distribution produced by Invoice... etc etc...

It's natural because it contains information that is not synthetic, and unique/well defined. Specifically, the position of the line on the invoice; you can't have two invoice lines at the second line of an invoice, for example. This information is visible and important to users because presumably the invoice lines as entered into an application are in the order you would want them printed out on a physical invoice.

The invoice number itself is a more interesting case because it's generated, and what it might represent (the 77th invoice for this customer) isn't very interesting. An actual, 'true' natural key for an invoice might be a combination of user, entry time, and customer, but this key is fairly weak and not very useful for identification purposes by human beings -- so we created the invoice number concept as a proxy for that information.
Also, timestamps and floating point values tend to suck for a primary key. This is what Jeff D was talking about in detail largely upthread; to wit: if your users (man or machine) really prefer to work with a generated identifier for some reason or another, by all means make one, just make sure the record has a rigorous definition.

A line number table would be a good candidate for adding an additional surrogate key, for purposes of relating, on performance grounds if you have foreign keys pointing to the record. Any shifts in invoice line position require renumbering the invoice, which would annoyingly cascade the updates.

The reason, by the way, that I *know* you aren't going to turn up many interesting cases of not being able to define a key, at least in the accounting and manufacturing world, is that I happen to have worked on a large ERP/CRM that dated from the pre-SQL era. Like many systems of the day, it was written in COBOL over an isam data store which didn't have the technical capability to do what sequences or uuids do today (or at least, it was more trouble than it was worth)...100% natural keys. Somehow those guys seemed to manage just fine working under hardware constraints much tighter than today's. Although there were some real headaches in that system, for example when, say, part numbers changed, it had an inner elegance that grew on me over time, and taught me the value of rigorous definition of records and their keys.

merlin
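The "additional surrogate key for FK purposes" idea above might look like this (hypothetical names - a sketch, not a recommendation for any particular schema):

```sql
-- Keep the natural (invoice_no, line_no) key for ordering and lookup,
-- but give each line a surrogate id so dependent tables are insulated
-- from line renumbering.
CREATE TABLE invoice_line (
    line_id    bigserial UNIQUE,   -- surrogate, used only by FKs
    invoice_no integer NOT NULL,
    line_no    integer NOT NULL,
    PRIMARY KEY (invoice_no, line_no)
);

CREATE TABLE serial_number (
    line_id bigint NOT NULL REFERENCES invoice_line (line_id),
    serial  text   NOT NULL,
    PRIMARY KEY (line_id, serial)
);

-- Renumbering lines now touches only invoice_line; serial_number rows
-- keep pointing at the same line_id, so no cascading updates.
```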
On Wed, May 4, 2011 at 2:25 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> David Johnston wrote:
>> Are there any rules of thumb on the performance of a PK as a function of key length? I like using varchar-based identifiers since I tend to query tables directly and writing where clauses is much easier if you can avoid the joins. I'm likely better off creating views and querying those but am still curious on any basic thoughts on having a 100+ character primary key.
>
> The shorter the better, but it may not be as bad as you fear. The way B-tree indexes are built, it isn't that expensive to hold a longer key so long as the unique part doesn't average out to be that long. So if you insert "123456666666666666666" and "12345777777777777777", that's not going to be much different than navigating "123456" and "123457", because once you get that far you've already reached a unique prefix. But if your entries have a really long common prefix, like "111111111111111112" and "111111111111111113", that's going to be more expensive to deal with--even though the strings are the same length.
>
> If your identifiers become unique after only a few characters, it may not be so bad. But if they go many characters before you can distinguish between any two entries, you're probably not going to be happy with the performance or size of the indexes, relative to simple integer keys.

yeah. The number of comparisons should be basically the same, but situational things are going to make or break you. As the in-house 'performance guy', it might interest you to work through them all -- indexing strategies are the key to good database performance and, modeling concepts and religious debates aside, this is an interesting discussion from a strictly performance point of view.
One reason why natural keys work much better than expected is you get much better index utilization and potentially *much* better tuple/page efficiency on certain very common classes of lookups/scans, especially if you cluster. Speaking of clustering, you no longer have to agonize in cases of, say, having to cluster on 'email' or 'email_id'. The identifying and ordering/searching criteria are in the same index, which can be an enormous win in some cases. Do not underestimate the value of this when the table is large and dependent scans are common.

If you key on email and the query coming from another table doesn't need any other email properties, you just saved yourself a join without having to de-normalize in the classic sense. You also get to cut out many sorts on similar principles.

OTOH, updates as noted tend to suck. In some cases hilariously so. The larger index is going to cause more cache pressure, which is a point against...sometimes the large index just isn't worth the cost for what you get. Also, you have a hard limit on key sizes imposed by postgres. It's rare to hit that in the real world, but it should be noted. Another negative point is that the postgres stats system also doesn't deal well with composite keys for range scans. I have a pretty good idea on how to fix this, but I haven't gotten around to it yet.

merlin
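The email clustering case might be sketched like this (hypothetical tables; clustering is a one-time physical reorder, not maintained automatically):

```sql
-- Key the table on email directly, then cluster on the PK index so
-- identifying and ordering criteria live in the same index.
CREATE TABLE subscriber (
    email      text PRIMARY KEY,
    created_at timestamptz NOT NULL DEFAULT now()
);

CLUSTER subscriber USING subscriber_pkey;

-- A dependent table that needs only the email itself requires no join
-- back to subscriber:
CREATE TABLE bounce (
    email      text NOT NULL REFERENCES subscriber (email),
    bounced_at timestamptz NOT NULL
);

SELECT email, bounced_at
FROM bounce
WHERE email LIKE 'a%';   -- no join needed to identify the subscriber
```

With an `email_id` surrogate instead, that last query would need a join to recover the address, and you would have to choose between clustering on the id or maintaining a second index on email.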
On Tue, May 3, 2011 at 8:03 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> With a uniqueness constraint in this situation, the unexpected data--row
> with a non unique MAC--will be rejected and possibly lost when the insertion
> happens. You say that's a good thing, plenty of people will say that's the
> worst possible thing that can happen.

But remember, the original discussion is on using these as PK/FK. That's where things get really ugly. I can change my data model to not have a unique MAC, or do something to make them unique (add IP or something), much more easily if they're NOT a PK/FK. That's the real issue to me.
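The "loosely enforced uniqueness" compromise mentioned earlier in the thread - no constraint, just a periodic check - can be as simple as this (hypothetical table and column names):

```sql
-- Find MAC addresses that were expected to be unique but are not.
-- Run periodically; returns no rows when the data is clean.
SELECT mac,
       count(*)             AS copies,
       array_agg(device_id) AS offending_rows   -- which rows to inspect
FROM device
GROUP BY mac
HAVING count(*) > 1;
```

This keeps the MAC out of the PK/FK role entirely, so discovering a duplicate becomes a data-quality ticket rather than a rejected insert or a broken join.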