Обсуждение: design of queries for sparse data

Поиск
Список
Период
Сортировка

design of queries for sparse data

От
Andreas
Дата:
Hi,

I need some help to improve my design skills.   :)

I lately read an article about table design, that teached one shouldn't
designe tables where it's clear that some columns aren't relevant for
every row. It didn't span into the dirty usage details beyond the table
design, though.

E.g. a really simple example like a school that stores pupils like this:
pupil (pupil_id, pupil_name, attends_english, attends_history,
attends_maths, attends_football, attends_swimming)

1)   Some pupils don't attend to football, swimming or both.
2)   Occasionally there will be new classes added and others get
dropped. Say in a year a column "attends_knitting" gets introduced. Now
all those 50,000 existing rows get a column where the person hadn't even
the occasion to apply.
If for some reason the knitting class gets discontinued every row in the
future will still get this column.

So it was better to create 3 normalized tables:
pupil  (pupil_id,  pupil_name,  start_date,  exit_date)
classes  (class_id,  class_name,  is_available,  output_order)
attends_to  (pupil_id,  class_id,  in_year)   as an n:m-relation

Fine. Now I got rid off those empty columns in the pupil table.


MY QUESTIONS:

1)   How would I SELECT a report that looks like the first version of
the pupil table out of the 3 table design?
There must be a nontrivial SELECT statement that combines all 3 tables.
E.g. I want the result:
pupil_id, pupil_name, attends_to_english, ....., attends_to_football,
attends_to_swimming, attends_to_knitting
(42, Frank Miller, yes, ...., no, yes, yes)
(43, Suzy Smith, yes, ..., yes, yes, no)
...

2)   Could I control the order in which those attends_to-columns appear
by a numerical field output_order?

3)   Could I restrict the classes list so that only those appear when
there are pupils actually attending them in a given time frame?
3) a)   Like "competitve knitting" was only available from 2000-2005.
Now I'd produce a list of 2007 so there shouldn't appear an empty
knitting-column.  -->  classes.is_availlable
3) b)   Or it is availlable but no one has chosen it in 2007. -->
attends_to.in_year


Regards
Andreas



Re: design of queries for sparse data

От
"Sean Davis"
Дата:
On Nov 11, 2007 1:09 PM, Andreas <maps.on@gmx.net> wrote:
> Hi,
>
> I need some help to improve my design skills.   :)
>
> I lately read an article about table design, that teached one shouldn't
> designe tables where it's clear that some columns aren't relevant for
> every row. It didn't span into the dirty usage details beyond the table
> design, though.
>
> E.g. a really simple example like a school that stores pupils like this:
> pupil (pupil_id, pupil_name, attends_english, attends_history,
> attends_maths, attends_football, attends_swimming)
>
> 1)   Some pupils don't attend to football, swimming or both.
> 2)   Occasionally there will be new classes added and others get
> dropped. Say in a year a column "attends_knitting" gets introduced. Now
> all those 50,000 existing rows get a column where the person hadn't even
> the occasion to apply.
> If for some reason the knitting class gets discontinued every row in the
> future will still get this column.
>
> So it was better to create 3 normalized tables:
> pupil  (pupil_id,  pupil_name,  start_date,  exit_date)
> classes  (class_id,  class_name,  is_available,  output_order)
> attends_to  (pupil_id,  class_id,  in_year)   as an n:m-relation
>
> Fine. Now I got rid off those empty columns in the pupil table.
>
>
> MY QUESTIONS:
>
> 1)   How would I SELECT a report that looks like the first version of
> the pupil table out of the 3 table design?
> There must be a nontrivial SELECT statement that combines all 3 tables.
> E.g. I want the result:
> pupil_id, pupil_name, attends_to_english, ....., attends_to_football,
> attends_to_swimming, attends_to_knitting
> (42, Frank Miller, yes, ...., no, yes, yes)
> (43, Suzy Smith, yes, ..., yes, yes, no)

This type is called a crosstab query.  There are several ways to
accomplish this, but adding a class will result in having to change
the query, generally.  However, you can easily construct these results
in your client application.  For flexibility reasons, it might be best
to do things this latter way.

> 2)   Could I control the order in which those attends_to-columns appear
> by a numerical field output_order?

See the answer to number 1.  This is most flexibly done in a client application.

> 3)   Could I restrict the classes list so that only those appear when
> there are pupils actually attending them in a given time frame?

You could use SQL to get the list of relevant classes and then use
those in your client application to build the appropriate
representation of the data.

> 3) a)   Like "competitve knitting" was only available from 2000-2005.
> Now I'd produce a list of 2007 so there shouldn't appear an empty
> knitting-column.  -->  classes.is_availlable
> 3) b)   Or it is availlable but no one has chosen it in 2007. -->
> attends_to.in_year

Re: design of queries for sparse data

От
Hans-Peter Oeri
Дата:
Hi!

Andreas wrote:
> 1)   How would I SELECT a report that looks like the first version of
> the pupil table out of the 3 table design?
I agree with Sean that such a crosstab would best be created in the
application. What you get from the db would be something like:

42, Frank Miller, Knitting
42, Frank Miller, Basketball
43, Suzy Smith, Wrestling
43, Suzy Smith, House Cleaning

(and that's keeping it to one query... Would probably be more efficient
to split)
Like that you simply loop over the rows concerning the "same" student
and add classes to your structure.
> 2)   Could I control the order in which those attends_to-columns appear
> by a numerical field output_order?
As soon as you have "out-normalized" classes, that table may have an
output_order, of course.
> 3)   Could I restrict the classes list so that only those appear when
> there are pupils actually attending them in a given time frame?
Using an MN relation between students and classes, of course
> 3) a)   Like "competitve knitting" was only available from 2000-2005.
> Now I'd produce a list of 2007 so there shouldn't appear an empty
> knitting-column.  -->  classes.is_availlable
is_available seems not a wise choice. You would probably prefer a
temporal structure (Ask yourself in 2007: Did Suzy attend House Cleaning
in 2004? Was Knitting available back then?). Why don't you split
"classes" even further: a) an abstract definition of the contents/title
and b) the concrete class by year, referring the abstract definition
(and the professor of that year).

HPO