The "many nulls" problem

Поиск

Список

Период

Сортировка

От	Kynn Jones
Тема	The "many nulls" problem
Дата	14 марта 2008 г. 18:05:56
Msg-id	c2350ba40803141105o1d2a9a30o60f5108b2fc15f1@mail.gmail.com обсуждение исходный текст
Ответы	Re: The "many nulls" problem Re: The "many nulls" problem
Список	pgsql-performance

Дерево обсуждения

It often happens that a particular pieces of information is non-null for a small minority of cases. A superficially different manifestation of this is when two pieces of information are identical in all but a small minority of cases. This can be easily mapped to the previous description by defining a null in one column to mean that its contents should be obtained from those of another column. A further variant of this is when one piece of information is a simple function of another one in all but a small minority of cases.

(BTW, I vaguely recall that RDb theorists have a technical term for this particular design issue, but I don't remember it.)

In all these cases, the design choice, at least according to RDb's 101, is between including a column in the table that will be NULL most of the time, or defining a second auxiliary column that references the first one and holds the non-redundant information for the minority of cases for which this is necessary (and maybe define a VIEW that includes all the columns).

But for me it is a frequent occurrence that my quaint and simple RDb's 101 reasoning doesn't really apply for PostgreSQL. Basically, Pg is too smart for it! For example, does a large proportion of NULLs really imply a lot of wasted space? Maybe this is true for fixed-length data types, but what about for type TEXT or VARCHAR?

Just to be concrete, consider the case of a customers database for some home shopping website. Suppose that, as it happens, for the majority of this site's customers, the shipping and billing addresses are identical. Or consider the scenario of a company in which, for most employees, the email address can be readily computed from the first and last name using the rule First M. Last => first_last@acme.com, but the company allows some flexibility for special cases (e.g. for people like Yasuhiro Tanaka who's known to everyone by his nickname, Yaz, the email is yaz_tanaka@acme.com because hardly anyone remembers or even knows his full name.)

What's your schema design approach for such situations? How would you go about deciding whether the number of exceptional cases is small enough to warrant a second table? Of course, one could do a systematic profiling of various possible scenarios, but as a first approximation what's your rule-of-thumb?

TIA!

Kynn

В списке pgsql-performance по дате отправления:

Предыдущее

От: Glyn Astill
Дата: 14 марта 2008 г., 17:20:47
Сообщение: Adaptec 5805 SAS Raid

Следующее

От: Oleg Bartunov
Дата: 14 марта 2008 г., 18:59:23
Сообщение: Re: The "many nulls" problem

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

The "many nulls" problem

Предыдущее

Следующее