Обсуждение: counting query

Поиск
Список
Период
Сортировка

counting query

От
garry saddington
Дата:
I have a table definition such as:

CREATE TABLE attendance
(
  attendanceid serial primary key,
  entered date DEFAULT current_date NOT NULL,
  absent boolean,
  authorization text default 'N',
  timeperiod char(2) check(timeperiod in('AM','PM')),
  days varchar(10),
  studentid int,
  unique(entered,timeperiod,studentid)
)

Which is used to record school attendance data. I am now trying to write
a query to identify trends in absences by counting the days column and
returning any student that has repeated absences on certain days. I am
struggling to return anything that does not need further manipulation in
Python before being useful.
Does anyone have any ideas?


Re: counting query

От
"Joris Dobbelsteen"
Дата:
>-----Original Message-----
>From: pgsql-general-owner@postgresql.org
>[mailto:pgsql-general-owner@postgresql.org] On Behalf Of garry
>saddington
>Sent: zondag 28 januari 2007 14:06
>To: pgsql-general@postgresql.org
>Subject: [GENERAL] counting query
>
>I have a table definition such as:
>
>CREATE TABLE attendance
>(
>  attendanceid serial primary key,

Why you have this??? You already have (entered,timeperiod,studentid)
that you can use, since that must be unique too. Try to avoid surrogate
keys as much as possible (it really increases performance and
ease-of-writing for complex queries!

>  entered date DEFAULT current_date NOT NULL,
>  absent boolean,
>  authorization text default 'N',
>  timeperiod char(2) check(timeperiod in('AM','PM')),
>  days varchar(10),
>  studentid int,
>  unique(entered,timeperiod,studentid)
>)

Guessing the meaning a bit, not too self-explaining.

>Which is used to record school attendance data. I am now
>trying to write a query to identify trends in absences by
>counting the days column and returning any student that has
>repeated absences on certain days. I am struggling to return
>anything that does not need further manipulation in Python
>before being useful.
>Does anyone have any ideas?

Yes, before starting you must have a well-defined idea on what you want
to know. What should the result look like? In most situations start
simple things, and eventually combine these to something more complex,
but always know what you are querying. You should have some idea of what
results you will be getting.
Of course, this is a proces of discovery rather than following set
rules. Some background on probabilities and statistics really helps.
Personally I like visualization quite a lot to help me with this.

Personally I've found nothing that will beat Excel for doing data
analysis. Learn to use the pivot table and pivot charts. They are
extremely powerful.
However, it can be a bit tricky to transform the input into something
the tool can use. A good starting point is to split the dates into
seperate year, month, day, week values.

Some idea's that might work to get it started:
* A graph with days vs occurrences (count).
* Graph of total sick days per student vs occurrences.
* Graph of Check the occurrences per month/day/week

Next try filtering of the data to form sequential periods and more funny
things.

I believe all this can be done with plain SQL and you don't need any
python or plsql or other languages.

- Joris Dobbelsteen

Re: counting query

От
John Meyer
Дата:
Joris Dobbelsteen wrote:
>>
>> CREATE TABLE attendance
>> (
>>  attendanceid serial primary key,
>
> Why you have this??? You already have (entered,timeperiod,studentid)
> that you can use, since that must be unique too. Try to avoid surrogate
> keys as much as possible (it really increases performance and
> ease-of-writing for complex queries!


Correct me if I am wrong, but wouldn't a simple number be a lot easier
to look up than a composite key?

Re: counting query

От
"Joris Dobbelsteen"
Дата:
>-----Original Message-----
>From: pgsql-general-owner@postgresql.org
>[mailto:pgsql-general-owner@postgresql.org] On Behalf Of John Meyer
>Sent: zondag 28 januari 2007 15:36
>To: pgsql-general@postgresql.org
>Subject: Re: [GENERAL] counting query
>
>Joris Dobbelsteen wrote:
>>>
>>> CREATE TABLE attendance
>>> (
>>>  attendanceid serial primary key,
>>
>> Why you have this??? You already have (entered,timeperiod,studentid)
>> that you can use, since that must be unique too. Try to avoid
>> surrogate keys as much as possible (it really increases performance
>> and ease-of-writing for complex queries!
>
>
>Correct me if I am wrong, but wouldn't a simple number be a
>lot easier to look up than a composite key?

No, it is not.
Better, it is, It might be, until you go just a bit larger.

I've learned it when I've build a production database (that's still used
in production and still performs excellent. At the time I was only 15 or
16 or 17 years old. That was 7 to 10 years ago. Since then I've learned
a lot.)

The trouble is, the database consists of well over 40 tables (with
nearly surrogate keys) and joining a bit of data on the far ends of the
database requires you to join arround 10 to 15 tables. If you are doing
something complex you will get lost at some point and really need
graphical tools to just grasp what you are trying to query.
So a bit of a complex query easily results in a excessive number of
tables that must be queried.

Why I did this. At this time I was not aware that you could build a
primary key consisting of multiple columns. And if you look arround you
at the Internet you see (nearly) all databases of free 'web
applications' making excessive use of surrogate keys.
This resulted in a lot of excessive surrogate keys that could have been
easily avoided and where not a required atrifact of the inconsistent
data I had to import.

What would have been better without surrogate keys all-over:
* Easier to write complex queries with much fewer tables to be queried.
* Much faster query performance, as fewer tables need to be referenced.
* Better integrity enforcement with simple foreign key constraints.

If fact, in (guessed) 50% of the queries I could have avoided at least 2
table joins!

Think big. That is why...

- Joris Dobbelsteen

Re: counting query

От
Douglas McNaught
Дата:
"Joris Dobbelsteen" <Joris@familiedobbelsteen.nl> writes:

> What would have been better without surrogate keys all-over:
> * Easier to write complex queries with much fewer tables to be queried.
> * Much faster query performance, as fewer tables need to be referenced.
> * Better integrity enforcement with simple foreign key constraints.

Not this debeta again.  ;)

Surrugate vs natural keys shouldn't make a difference in how many
yables you have--they depends on the degree of normalization.  Sounds
like you denormalized your database and happened to eliminate
surrogate keys at the same time.  Using that to say "surrogate keys
are bad" is kind of misleading.

-Doug

Re: counting query

От
Ron Johnson
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/28/07 08:36, John Meyer wrote:
> Joris Dobbelsteen wrote:
>>> CREATE TABLE attendance
>>> (
>>>  attendanceid serial primary key,
>> Why you have this??? You already have (entered,timeperiod,studentid)
>> that you can use, since that must be unique too. Try to avoid surrogate
>> keys as much as possible (it really increases performance and
>> ease-of-writing for complex queries!
>
>
> Correct me if I am wrong, but wouldn't a simple number be a lot easier
> to look up than a composite key?

This is the great synthetic-vs-natural key debate.

Sure, it's easier to write queries that join on a synthetic integer
field.  However, adding 3 extra fields to a few other tables is not
onerous, and it adds useful information to the other tables, since
(entered,timeperiod,studentid) are what *really* makes a record
unique.  Also, synthetic keys mean that you have to do more joins,
since if you want to know about "entered" and table T_FOO, you'd
have to join "attendance" to T_FOO.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFvMH/S9HxQb37XmcRAoefAKDpf/6TG5WzP4nBIEcqVHE1dmb4/gCgxkZd
5fxfG4NoBR/Ul3fhqmpuTFQ=
=g/F2
-----END PGP SIGNATURE-----

Re: counting query

От
Ron Johnson
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/28/07 07:05, garry saddington wrote:
> I have a table definition such as:
>
> CREATE TABLE attendance
> (
>   attendanceid serial primary key,
>   entered date DEFAULT current_date NOT NULL,
>   absent boolean,
>   authorization text default 'N',
>   timeperiod char(2) check(timeperiod in('AM','PM')),
>   days varchar(10),
>   studentid int,
>   unique(entered,timeperiod,studentid)
> )
>
> Which is used to record school attendance data. I am now trying to write
> a query to identify trends in absences by counting the days column and
> returning any student that has repeated absences on certain days. I am
> struggling to return anything that does not need further manipulation in
> Python before being useful.
> Does anyone have any ideas?

When you say "certain days", you mean "days of the week"?

If so, create a view like:
CREATE VIEW V_DAY_ABSENCES AS
SELECT ENTERED,
       AUTHORIZATION,
       TIMEPERIOD,
       DAYS,
       STUDENTID,
       DOW(CAST(ENTERED AS TIMESTAMP)) AS WEEKDAY
FROM ATTENDANCE
WHERE ABSENT = TRUE;

Then, this query should do what you want:
SELECT STUDENTID,
       TIMEPERIOD,
       WEEKDAY,
       COUNT(*)
FROM V_DAY_ABSENSES
GROUP BY STUDENTID,
         TIMEPERIOD,
         WEEKDAY
HAVING COUNT(*) > 3;






-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFvMgCS9HxQb37XmcRAkvrAJ9ZiOJCM8GGE7ptIzcZsUJc7T2fnQCgpkUn
/9nkR9BO04WB0XThPlx+254=
=9D2A
-----END PGP SIGNATURE-----

Re: counting query

От
"Chad Wagner"
Дата:
On 1/28/07, Ron Johnson <ron.l.johnson@cox.net> wrote:
This is the great synthetic-vs-natural key debate.

Truly.  But what the heck!

Surrogate keys are not evil, and they do have value.  I see no value in proclaiming "surrogate keys are evil, do not use them".

Surrogate keys do have advantages:
- Disassociation of natural data to other referential tables (which can also be "confusing")
    Imagine a social security number, drivers license number, or any other natural key.  Now imagine that
    key value has changed for a specific person, and you have used it as a natural key throughout your data
    structures.  (and they do change)
- Reduced storage requirements (yields better performance)
    It is cheaper to store a 50 byte field + a 4 byte surrogate key once, then it is to store it a million times:
    (surrogate key)   54 bytes + (4 bytes * 1 million) = 4MB
    vs.
    (natural key)   50 bytes * 1 million = 50 MB


Natural keys are not evil either, and they have their own advantages.  But when your modeling very large databases (around 10TB range) then you absolutely have to consider every single decision, and natural keys (in my opinion) is not always a good one as a single natural key could result in another 100GB of storage requirements.

There should be some thought when you are modeling and these are some of the things to consider.  I don't see a 10 table join being a major performance penalty, especially when 8 of the tables may be a few MB in size.


--
Chad
http://www.postgresqlforums.com/

Re: counting query

От
"Joris Dobbelsteen"
Дата:
>-----Original Message-----
>From: Douglas McNaught [mailto:doug@mcnaught.org]
>Sent: zondag 28 januari 2007 16:29
>To: Joris Dobbelsteen
>Cc: John Meyer; pgsql-general@postgresql.org
>Subject: Re: [GENERAL] counting query
>
>"Joris Dobbelsteen" <Joris@familiedobbelsteen.nl> writes:
>
>> What would have been better without surrogate keys all-over:
>> * Easier to write complex queries with much fewer tables to
>be queried.
>> * Much faster query performance, as fewer tables need to be
>referenced.
>> * Better integrity enforcement with simple foreign key constraints.
>
>Not this debeta again.  ;)
>
>Surrugate vs natural keys shouldn't make a difference in how
>many yables you have--they depends on the degree of
>normalization.  Sounds like you denormalized your database and
>happened to eliminate surrogate keys at the same time.  Using
>that to say "surrogate keys are bad" is kind of misleading.

I have perhaps formulated it quite extreme. It was not intended to take
such a extreme stance. I appologize if it will lead to such a
discussion. I'll try to do it a bit more careful next time.

My point is only, be careful with surrogate keys and try not the use
them for everything. In my sole opinion I see them getting used too
much. But then again, what is good and wrong will always be subjective.

So perhaps <quote>What would have been better without surrogate keys
all-over<quote> should have been "My database where I extremely overdid
it with surrogate keys".

Lets leave it to this.

- Joris

Re: counting query

От
Douglas McNaught
Дата:
"Joris Dobbelsteen" <Joris@familiedobbelsteen.nl> writes:

> So perhaps <quote>What would have been better without surrogate keys
> all-over<quote> should have been "My database where I extremely overdid
> it with surrogate keys".

Fair enough.  It's generally true that going to extremes with anything
causes problems.  :)

-Doug

Re: counting query

От
garry saddington
Дата:
On Sun, 2007-01-28 at 09:57 -0600, Ron Johnson wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 01/28/07 07:05, garry saddington wrote:
> > I have a table definition such as:
> >
> > CREATE TABLE attendance
> > (
> >   attendanceid serial primary key,
> >   entered date DEFAULT current_date NOT NULL,
> >   absent boolean,
> >   authorization text default 'N',
> >   timeperiod char(2) check(timeperiod in('AM','PM')),
> >   days varchar(10),
> >   studentid int,
> >   unique(entered,timeperiod,studentid)
> > )
> >
> > Which is used to record school attendance data. I am now trying to write
> > a query to identify trends in absences by counting the days column and
> > returning any student that has repeated absences on certain days. I am
> > struggling to return anything that does not need further manipulation in
> > Python before being useful.
> > Does anyone have any ideas?
>
> When you say "certain days", you mean "days of the week"?
>
> If so, create a view like:
> CREATE VIEW V_DAY_ABSENCES AS
> SELECT ENTERED,
>        AUTHORIZATION,
>        TIMEPERIOD,
>        DAYS,
>        STUDENTID,
>        DOW(CAST(ENTERED AS TIMESTAMP)) AS WEEKDAY
> FROM ATTENDANCE
> WHERE ABSENT = TRUE;
>
> Then, this query should do what you want:
> SELECT STUDENTID,
>        TIMEPERIOD,
>        WEEKDAY,
>        COUNT(*)
> FROM V_DAY_ABSENSES
> GROUP BY STUDENTID,
>          TIMEPERIOD,
>          WEEKDAY
> HAVING COUNT(*) > 3;
Thank you, this works great. But I have another problem: Is it possible
to identify absences in consecutive weeks on the same day. EG. If a
pupil has a pattern of having every monday AM off school, how could that
be identified?
Regards
Garry


Re: counting query

От
Ron Johnson
Дата:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 01/28/07 15:18, garry saddington wrote:
> On Sun, 2007-01-28 at 09:57 -0600, Ron Johnson wrote:
>> On 01/28/07 07:05, garry saddington wrote:
[snip]
>> When you say "certain days", you mean "days of the week"?
>>
>> If so, create a view like:
>> CREATE VIEW V_DAY_ABSENCES AS
>> SELECT ENTERED,
>>        AUTHORIZATION,
>>        TIMEPERIOD,
>>        DAYS,
>>        STUDENTID,
>>        DOW(CAST(ENTERED AS TIMESTAMP)) AS WEEKDAY
>> FROM ATTENDANCE
>> WHERE ABSENT = TRUE;
>>
>> Then, this query should do what you want:
>> SELECT STUDENTID,
>>        TIMEPERIOD,
>>        WEEKDAY,
>>        COUNT(*)
>> FROM V_DAY_ABSENSES
>> GROUP BY STUDENTID,
>>          TIMEPERIOD,
>>          WEEKDAY
>> HAVING COUNT(*) > 3;
>>
> Thank you, this works great. But I have another problem: Is it possible
> to identify absences in consecutive weeks on the same day. EG. If a
> pupil has a pattern of having every monday AM off school, how could that
> be identified?

I'd use the T_CALENDAR table, modified for your purposes.  (It's a
"static" that we create on every database.)  We populate it with 22
years of dates.  You'll have to write a small procedure to do it.

CREATE TABLE T_CALENDAR (
DATE_ANSI     DATE,
YEAR_NUM      SMALLINT,
MONTH_NUM     SMALLINT,
DAY_OF_MONTH  SMALLINT,
DAY_OF_WEEK   SMALLINT,
JULIAN_DAY    SMALLINT,
DAY_OF_WEEK   SMALLINT,
IS_SCHOOL_DAY BOOL,
SCHOOL_YEAR   SMALLINT,  -- "2006" for the 2006/07 school year
SCHOOL_MONTH  SMALLINT); -- 1 for August, 2 for September, etc

Then, join T_CALENDAR to ATTENDANCE, WHERE DATE_ANSI = ENTERED
AND DAY_OF_WEEK = 1
AND IS_SCHOOL_DAY = TRUE
AND SCHOOL_YEAR = 2006;

Making that join into a view and then, as Joris suggested, connect
it to a spreadsheet.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFFvRmAS9HxQb37XmcRAvpcAKDQ7kjs9/rMb39w5JYRRTl65mYoKQCfVVUm
NTv6r6Kzu8T5D+SS8vxwFjs=
=VDXa
-----END PGP SIGNATURE-----

Re: counting query

От
"John D. Burger"
Дата:
Joris Dobbelsteen wrote:

> Personally I've found nothing that will beat Excel for doing data
> analysis. Learn to use the pivot table and pivot charts. They are
> extremely powerful.

Funny, there is an on-going discussion about this on one of our
internal mailing lists.  Excel is perhaps okay for simple analysis,
and drawing pretty pictures, but the statistics routines, necessary
for any kind of detailed analysis, have historically had lots of
problems.  Here're a few explications of the many ways Excel seems to
fall down:

   http://www.cs.uiowa.edu/~jcryer/JSMTalk2001.pdf
   http://pages.stern.nyu.edu/~jsimonof/classes/1305/pdf/excelreg.pdf

- John Burger