Re: newbie table design question

From: Andrew Smith
Subject: Re: newbie table design question
Date:
Msg-id: 300497a30905312341l4ac7bd5cxd4b10849c305d869@mail.gmail.com
In reply to: Re: newbie table design question  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-general


On Mon, Jun 1, 2009 at 1:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Smith <laconical@gmail.com> writes:
> I'm a beginner when it comes to Postgresql, and have a table design question
> about a project I'm currently working on.  I have 1500 data items that need
> to be copied every minute from an external system into my database.  The
> items have a timestamp, an identifier and a value.  For example:

> 12/10/2008 05:00       ID_ABC      14
> 12/10/2008 05:01       ID_ABC      17
> 12/10/2008 05:02       ID_ABC      13

> Pretty simple stuff.  The current solution (implemented using SQL Server a
> few years ago) looks like this (an approximation using Postgresql syntax):

> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "ID_ABC" integer NOT NULL,
>   "ID_DEF" integer NOT NULL,
>   "ID_HIJ" integer NOT NULL,
>   etc
> );

So the table would have ~1500 columns?  You definitely don't want to do
it that way in Postgres --- you'd be way too close to the maximum column
count limitation. 

After my original post, I found out that the current SQL Server solution actually had to be split into two tables due to a similar column-count limitation.

> My initial thought for the design of the new solution looks like this:

> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "Identifier" text NOT NULL,
>   "Value" integer NOT NULL
> );

DataImportID couldn't be a primary key here, could it?  Or is it just
meant as an artificial primary key?  If so, consider not bothering with
it at all --- (Time, Identifier) seems like a perfectly good natural
key, and with such short records the extra space for a serial column
is not negligible.

You're right, I guess there isn't much point having that surrogate key in place...and it does take up space. I did a quick test and got the following figures:

1 million records with DataImportID = ~80 MB
1 million records without DataImportID = ~50 MB

That'll certainly add up over time.
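For reference, here's a sketch of what the narrow table looks like with the surrogate key dropped and (Time, Identifier) promoted to the natural primary key (names mirror the earlier example):

```sql
CREATE TABLE "DataImport"
(
  "Time" timestamp without time zone NOT NULL,
  "Identifier" text NOT NULL,
  "Value" integer NOT NULL,
  PRIMARY KEY ("Time", "Identifier")
);
```

The primary key also gives you a unique index for free, so duplicate imports of the same minute/identifier pair are rejected automatically.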


Anyway, my answer would definitely lean towards using this normalized
representation, if all the data values are integers.  (If they're not,
it gets messier...)

> Users will then be doing regular queries on this data (say, a few hundred
> times per day), such as:

> SELECT "Time", "Value" FROM "DataImport" WHERE "Identifier" = 'ID_ABC' AND
> "Time" between '2008-11-07' and '2008-11-11';

An index on (Identifier, Time) (not the other way around) should work
well for that type of query.

I'll give it a try.

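In case it helps anyone else reading the thread, the index Tom describes would be something like this (the index name is arbitrary):

```sql
-- Identifier first so the equality match narrows the index,
-- then Time so the BETWEEN becomes a range scan within it
CREATE INDEX "DataImport_Identifier_Time_idx"
    ON "DataImport" ("Identifier", "Time");
```

With the columns the other way around, the planner would have to scan the whole date range for all 1500 identifiers before filtering.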
> My concern is that 1500 values * 14400 minutes per day = 21,600,000
> records.  Add this up over the course of a month (the length of time I need
> to keep the data in this table) and I'll have more than half a billion
> records being stored in there.

That's not an especially big table.  However it might be worth your
trouble to use partitioning.  Not to speed searches, particularly, but
to make it easier to drop 1/30th of the data each day.

Re-reading this I noticed that I had an extra 0 in one of my figures - there are only 1440 minutes in a day, so my table gets down to a much more manageable 2.16 million records per day instead of 21.6 million. I'll have a look into partitioning too - thanks for your help!
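As a sketch of the drop-a-day-at-a-time approach Tom suggests (this uses the declarative partitioning syntax from PostgreSQL 10 and later; on the 8.x releases current at the time of this thread, the same effect is achieved with table inheritance plus CHECK constraints):

```sql
CREATE TABLE "DataImport"
(
  "Time" timestamp without time zone NOT NULL,
  "Identifier" text NOT NULL,
  "Value" integer NOT NULL
) PARTITION BY RANGE ("Time");

-- one child table per day
CREATE TABLE "DataImport_2008_11_07" PARTITION OF "DataImport"
    FOR VALUES FROM ('2008-11-07') TO ('2008-11-08');

-- expiring a day's data is then a cheap metadata operation,
-- not a DELETE over ~2 million rows
DROP TABLE "DataImport_2008_11_07";
```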

Cheers,

Andrew
