Re: Geographic data sources, queries and questions

Поиск
Список
Период
Сортировка
От Michael Glaesemann
Тема Re: Geographic data sources, queries and questions
Дата
Msg-id 06E237E2-1C8E-449D-8EDE-5AF513DD00D2@seespotcode.net
обсуждение исходный текст
Ответ на Re: Geographic data sources, queries and questions  ("btober" <btober@ct.metrocast.net>)
Ответы Re: Geographic data sources, queries and questions  (Oliver Elphick <olly@lfix.co.uk>)
Список pgsql-general
On May 24, 2007, at 8:57 , btober wrote:

> I'm not sure it is a bad design. Country has a country_id.
> That's the primary key. State has a state_id, and exactly
> one country, so really state has a compound primary key,
> namely (country_id, state_id).

While each state may have a single state_id and a single country,
that does not imply a composite (or compound) primary key. There's
been a lot written on database normalization, so I won't go into
depth here, but a quick way to think about it is how each row is
uniquely identified. For example;

Is each state uniquely identified by state_id? If so, that's a
(primary) key for the states table. That's often what people are
trying to do when they make a table of the form:

-- Listing 1
CREATE TABLE states
(
    state_id INTEGER PRIMARY KEY
    , state_name TEXT NOT NULL
);

If you are only dealing with one country, each state is (hopefully)
uniquely identified by its name as well, so you could add a UNIQUE
constraint to the state_name column, e..g,

-- Listing 2
CREATE TABLE states
(
    state_id INTEGER PRIMARY KEY
    , state_name TEXT NOT NULL UNIQUE
);

 From a logical point of view, PRIMARY KEY is equivalent to NOT NULL
UNIQUE, so there's no logical difference between Listing 2 and the
following:

-- Listing 3
CREATE TABLE states
(
    state_id INTEGER NOT NULL UNIQUE
    , state_name TEXT PRIMARY KEY
);

The state_id column is what is often referred to as a surrogate key:
it holds no information that really identifies the state in any real
sense. One integer is as good as another to identify the state. On
the other hand, the state_name column *is* associated with each state
in a real sense. Assigning arbitrary names to states would be less
than useful.

If you're handling more than one country, you'll most likely want to
associate the states with their respective countries.

-- Listing 4
CREATE TABLE countries
(
     country_id INTEGER PRIMARY KEY
);

CREATE TABLE states
(
     state_id INTEGER PRIMARY KEY
     , state_name TEXT NOT NULL
     , country_id INTEGER NOT NULL
         REFERENCES countries (country_id)
);

Note that there's no UNIQUE constraint on state_name. You may have
more than one state with the same state_name around the world so you
may want to make sure that for each country, each state_name is unique:

-- Listing 5
CREATE TABLE states
(
     state_id INTEGER PRIMARY KEY
     , state_name TEXT NOT NULL
     , country_id INTEGER NOT NULL
         REFERENCES countries (country_id)
     , UNIQUE (country_id, state_name)
);


> And similarly to produce
> relational integrity between state and city, city needs to
> reference the state primary key, which means state has to
> have all three (country_id, state_id, city_id) as it's
> primary key.

While each city does belong to a given state, each state in turn is
associated with a given country. To find out what country a given
city belongs to you'd just join through the states table. For example:

-- Listing 6
CREATE TABLE cities
(
     city_id INTEGER PRIMARY KEY
     , city_name TEXT NOT NULL
     , state_id INTEGER NOT NULL
         REFERENCES states (state_id)
     , UNIQUE (state_id, city_name)
);

I've also gone ahead and provided a UNIQUE constraint to prevent
city_name duplicates in the same state.

A PRIMARY KEY constraint of the form PRIMARY KEY (country_id,
state_id, city_id) would mean that the for each country_id and
state_id combination each city_id is unique. This means you could
potentially have the same city in multiple states in the same country
or in various countries and states. And there's nothing to prevent
something along the lines of (Mexico City, Nebraska, Canada). Note
that (Omaha, Nebraska, United States of America) would happily exist
in the same cities table!

To find the countries for each city:

-- Listing 7

SELECT city_name, state_name, country_name
FROM cities
NATURAL JOIN states
NATURAL JOIN countries;


> While it would be temptingly easy to simply declare all
> three separately in each table as
>
> country:
> country_id SERIAL, ...
>
> state:
> country_id integer,
> state_id SERIAL, ...
>
> city:
> country_id integer,
> state_id integer,
> city_id SERIAL, ...
>
> with that naive approach, every row in state has a unique
> state_id, and every row in city has a unique city_id. Then
> you'll notice that values of country_id are repeated in
> state, and state_id values are repeated in city.

This "naive" approach (other than adding country_id to the city
table) actually looks like proper normalization. The repetition you
have here is just providing the country for each state and the state
for each city. That's not duplication of information, if you want to
associate states with countries and cities with states.

> And then you'll realize that really it is the combination of
> (country_id, state_id) that defines a unique state, and
> (country_id, state_id, city_id) that defines a unique city.
> It would require the use of stored programs and triggers to
> manage these compound keys. But that's what stored programs
> and triggers are for.

As above, if you've got your database schema designed properly, you
don't need any stored procedures or triggers (other than those
provided under the covers by the foreign keys) to maintain the proper
referential integrity. Each city has a unique state (the state_id
column) and a unique country (joined through the states table). But
enforcing (city_id, state_id, country_id) uniqueness allows all kinds
of city/state/country mismatches.

> If you allow the fact that two countries could lay claim to
> the same geographic sub-region, then you need a separate
> table for an n-m relation.

This could be interesting :) Especially if the regions don't share
exactly the same borders! :)

> Then throw in postal codes. Not only can a city have
> multiple postal codes, but a postal code can serve more than
> one city.

-- Listing 8

CREATE TABLE postal_codes
(
    postal_code TEXT PRIMARY KEY
);

CREATE TABLE city_postal_codes
(
     city_id INTEGER NOT NULL
         REFERENCES cities (city_id)
     , postal_code TEXT NOT NULL
         REFERENCES postal_codes (postal_code)
     , PRIMARY KEY (city_id, postal_code)
);

> And the cities served might be in different
> states! (I used to have an example of that, but I can't find
> it right now.)

In Listing 8 there's nothing enforcing unique (state_id, postal_code)
associations, so you wouldn't run into trouble there. If you wanted
to find out which postal codes service which states you can use a join:

-- Listing 9

SELECT DISTINCT state_name, postal_code
FROM states
NATURAL JOIN city_postal_codes;

> Same with telephone area codes. (You are going there,
> eventually, right?)

Given the mobility of telephone numbers nowadays, you may not be
concerned with strict associations with telephone numbers,
subscribers, cities, and states.

Anyway, this has ended up much longer than I intended, but I didn't
want this to go unanswered. I've found Chris Date's books very
helpful, in particular "Introduction to Database Systems"[1] and
"Database in Depth: Relational Theory for Practice"[2].

Hope this helps.

Michael Glaesemann
grzm seespotcode net

[1](http://www.amazon.com/Introduction-Database-Systems-Eighth/dp/
0321197844/)
[2](http://www.amazon.com/Database-Depth-Relational-Theory-
Practitioners/dp/0596100124/)


В списке pgsql-general по дате отправления:

Предыдущее
От: John DeSoi
Дата:
Сообщение: Re: psql Tab Completion in Windows
Следующее
От: Lew
Дата:
Сообщение: Re: optimisation for a table with frequently used query