Re: FW: Surrogate keys (Was: enums)

Поиск
Список
Период
Сортировка
От Leandro Guimarães Faria Corcete Dutra
Тема Re: FW: Surrogate keys (Was: enums)
Дата
Msg-id 1137630669.12989.46.camel@bege.exemplo.com
обсуждение исходный текст
Ответ на Re: FW: Surrogate keys (Was: enums)  ("Jim C. Nasby" <jnasby@pervasive.com>)
Ответы Re: Surrogate keys (Was: enums)  (Michael Glaesemann <grzm@myrealbox.com>)
Список pgsql-hackers
Em Qua, 2006-01-18 às 17:22 -0600, Jim C. Nasby escreveu:
> >
> >    Forgive me my ignorance, but are ints inherently faster to compare than
> >strings, or is it just an implementation detail?  Ideally, if this is so
> >a fully data-independent system would create a hash behind the back of
> >user in order to get performance.
> 
> The CPU can do an integer comparison with one instruction; it can't do
> that with a text string.
OK.  Again, data independence should be the goal here.


> >    OK, hardly a typical example.  As I think I left clear, my problem is
> >not using surrogate keys, but using them by default, or even
> >exclusively.
> 
> No? It's certainly not uncommon to have tables with 100M+ rows.
No, but neither are they *that* common.
Certainly, lots of database have a few of them.  But then, they have
dozens, hundreds, thousands of much smaller tables.


> And keep
> in mind that this applies to every row of every table that has foreign
> keys. I'd bet it's actually common to save 1G or more with surrogate
> keys in moderately sized databases.
Only if you have quite some children, because otherwise, in the main
tables, the surrogate keys add a field, an index and a sequence to an
otherwise smaller table and index.


> Of course, you do have to be intelligent here, too. The only key defined
> on the table in my example is participant_id, project_id, date; there is
> no surrogate key because there's no real reason to have one.
Quite.


> >> (In actuality, there isn't participant_name... participants are
> >> identified by email address (not a great idea, but I wasn't around
> >> when that was chosen). As you can imagine, email addresses are
> >> substantially longer than 4 bytes. When we normalized email out of
> >> that main table things got substantially faster. That was a number of
> >> years ago, so the table was probably 15-25% of it's current size, but
> >> it still made a huge difference.)
> >
> >    This isn't normalisation at all, as far as I understand it.  It is just
> 
> I don't have the rules of normalization memorized enough to know what
> form this breaks, but I'm 99% certain it breaks at least one of them.
No, never.
Normalisation is about eliminating redundancy and, therefore, update
anomalies.  Making all the table dependent on only the keys and the
whole keys, by projecting relations to eliminate entity mixups.
What you mention is actually exposing an implementation detail, namely
an integer that serves as a hash of the key.


> Look at it this way: if someone wants to change their email address,
> best case scenario is that you have cascading RI setup and it updates
> thousands of rows in that table. Worst case scenario, you just de-linked
> a whole bunch of data. But with a surrogate key, all you have to do is
> update one row in one table and you're done.
OK, if you have lots of linked data.  But most tables are really dead
ends.


> >that we don't have data independence... so you had to expose an
> >implementation detail?
> 
> Expose to what? The application? First, this is a pretty minor thing to
> expose; second, if it's that big a concern you can completely hide it by
> using a view.
As someone said, you end up with ids everywhere, and no
user-understandable data at all...


> But the reality is, dealing with a numeric ID can be a heck of a lot
> easier than an email address. Look at URLs that embbed one versus the
> other for a good example.
Again, implementation details... levels mixup.


> >> Why should it? It's trivial to create views that abstract surrogate
> >> keys out, and if you really want to you can even make the views
> >> updatable. But here's two other things to consider:
> >
> >    These views, in heavy querying environments, can be prohibitive.
> 
> "Normalize 'til it hurts; denormalize 'til it works."
Lack of data implementation biting us again.


> Yes, the added overhead of rules for updates/inserts/deletes could start
> to add up in performance-critical code. But if performance is that
> critical you're far more likely to run into other bottlenecks first. And
> worst-case, you abstract behind a stored procedure that just has the
> right queries hard-coded.
> 
> As for select-only views you'll have a hard time showing any meaningful
> performance penalty.
Yet real user-defined data types could make it all much simpler.

-- 
+55 (11) 5685 2219               xmpp:leandrod@jabber.org
+55 (11) 9406 7191                       Yahoo!: lgcdutra    
+55 (11) 5686 9607         MSN: leandro@dutra.fastmail.fm
+55 (11) 4390 5383                      ICQ/AIM: 61287803



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: Re: pgxs/windows
Следующее
От: Tom Lane
Дата:
Сообщение: Re: No heap lookups on index