Discussion: newbie table design question


newbie table design question

From: Andrew Smith
Date:
Hi all,

I'm a beginner when it comes to PostgreSQL, and have a table design question about a project I'm currently working on.  I have 1500 data items that need to be copied every minute from an external system into my database.  The items have a timestamp, an identifier and a value.  For example:

12/10/2008 05:00       ID_ABC      14
12/10/2008 05:01       ID_ABC      17
12/10/2008 05:02       ID_ABC      13

Pretty simple stuff.  The current solution (implemented using SQL Server a few years ago) looks like this (an approximation using PostgreSQL syntax):

CREATE TABLE "DataImport"
(
  "DataImportID" serial NOT NULL PRIMARY KEY,
  "Time" timestamp without time zone NOT NULL,
  "ID_ABC" integer NOT NULL,
  "ID_DEF" integer NOT NULL,
  "ID_HIJ" integer NOT NULL,
  etc
);

While this design results in only 14400 records being present in the table per day, I don't like it.  One problem is that every time a data item is added or removed from the import process, the structure of the table needs to be altered.  I don't know what sort of overhead that involves in the PostgreSQL world, but I'm thinking that once I define the structure of the table, I don't want to be messing with it very often, if at all.

My initial thought for the design of the new solution looks like this:

CREATE TABLE "DataImport"
(
  "DataImportID" serial NOT NULL PRIMARY KEY,
  "Time" timestamp without time zone NOT NULL,
  "Identifier" text NOT NULL,
  "Value" integer NOT NULL
);

Users will then be doing regular queries on this data (say, a few hundred times per day), such as:

SELECT "Time", "Value" FROM "DataImport" WHERE "Identifier" = 'ID_ABC' AND "Time" between '2008-11-07' and '2008-11-11';

My concern is that 1500 values * 14400 minutes per day = 21,600,000 records.  Add this up over the course of a month (the length of time I need to keep the data in this table) and I'll have more than half a billion records being stored in there. 

I guess my question is: is my approach reasonable?  I haven't dealt with tables of this size before (using any DBMS) - should I expect really slow queries due to the sheer number of records present?  Is there some better way I should be structuring my imported data?  All suggestions welcome.

Cheers,

Andrew


Re: newbie table design question

From: Tom Lane
Date:
Andrew Smith <laconical@gmail.com> writes:
> I'm a beginner when it comes to Postgresql, and have a table design question
> about a project I'm currently working on.  I have 1500 data items that need
> to be copied every minute from an external system into my database.  The
> items have a timestamp, an identifier and a value.  For example:

> 12/10/2008 05:00       ID_ABC      14
> 12/10/2008 05:01       ID_ABC      17
> 12/10/2008 05:02       ID_ABC      13

> Pretty simple stuff.  The current solution (implemented using SQL Server a
> few years ago) looks like this (an approximation using Postgresql syntax):

> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "ID_ABC" integer NOT NULL,
>   "ID_DEF" integer NOT NULL,
>   "ID_HIJ" integer NOT NULL,
>   etc
> );

So the table would have ~1500 columns?  You definitely don't want to do
it that way in Postgres --- you'd be way too close to the maximum column
count limitation.

> My initial thought for the design of the new solution looks like this:

> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "Identifier" text NOT NULL,
>   "Value" integer NOT NULL
> );

DataImportID couldn't be a primary key here, could it?  Or is it just
meant as an artificial primary key?  If so, consider not bothering with
it at all --- (Time, Identifier) seems like a perfectly good natural
key, and with such short records the extra space for a serial column
is not negligible.

Anyway, my answer would definitely lean towards using this normalized
representation, if all the data values are integers.  (If they're not,
it gets messier...)

> Users will then be doing regular queries on this data (say, a few hundred
> times per day), such as:

> SELECT "Time", "Value" FROM "DataImport" WHERE "Identifier" = 'ID_ABC' AND
> "Time" between '2008-11-07' and '2008-11-11';

An index on (Identifier, Time) (not the other way around) should work
well for that type of query.
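
A sketch of that index against the proposed table (the index name here is arbitrary):

  CREATE INDEX "DataImport_Identifier_Time_idx"
    ON "DataImport" ("Identifier", "Time");

With Identifier leading, the equality condition narrows the scan first and the range on Time reads a contiguous chunk of the index.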

> My concern is that 1500 values * 14400 minutes per day = 21,600,000
> records.  Add this up over the course of a month (the length of time I need
> to keep the data in this table) and I'll have more than half a billion
> records being stored in there.

That's not an especially big table.  However it might be worth your
trouble to use partitioning.  Not to speed searches, particularly, but
to make it easier to drop 1/30th of the data each day.
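
As a sketch, using the inheritance-based partitioning available at the time (table names, dates, and constraint bounds are illustrative):

  -- one child table per day; the CHECK constraint lets the planner
  -- skip irrelevant partitions via constraint exclusion
  CREATE TABLE "DataImport_20090601" (
    CHECK ("Time" >= '2009-06-01' AND "Time" < '2009-06-02')
  ) INHERITS ("DataImport");

  -- expiring a day's data then becomes a cheap metadata operation
  -- instead of a bulk DELETE
  DROP TABLE "DataImport_20090501";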

            regards, tom lane

Re: newbie table design question

From: Andrew Smith
Date:


On Mon, Jun 1, 2009 at 1:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Smith <laconical@gmail.com> writes:
> I'm a beginner when it comes to Postgresql, and have a table design question
> about a project I'm currently working on.  I have 1500 data items that need
> to be copied every minute from an external system into my database.  The
> items have a timestamp, an identifier and a value.  For example:

> 12/10/2008 05:00       ID_ABC      14
> 12/10/2008 05:01       ID_ABC      17
> 12/10/2008 05:02       ID_ABC      13

> Pretty simple stuff.  The current solution (implemented using SQL Server a
> few years ago) looks like this (an approximation using Postgresql syntax):

> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "ID_ABC" integer NOT NULL,
>   "ID_DEF" integer NOT NULL,
>   "ID_HIJ" integer NOT NULL,
>   etc
> );

So the table would have ~1500 columns?  You definitely don't want to do
it that way in Postgres --- you'd be way too close to the maximum column
count limitation. 

After my original post, I found out that the current solution in SQL Server actually had to be split into two different tables due to a similar limitation.

> My initial thought for the design of the new solution looks like this:

> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "Identifier" text NOT NULL,
>   "Value" integer NOT NULL
> );

DataImportID couldn't be a primary key here, could it?  Or is it just
meant as an artificial primary key?  If so, consider not bothering with
it at all --- (Time, Identifier) seems like a perfectly good natural
key, and with such short records the extra space for a serial column
is not negligible.

You're right; I guess there isn't much point having that surrogate key in place... and it does take up space.  I did a quick test and got the following figures:

1 million records with DataImportID = ~80 MB
1 million records without DataImportID = ~50 MB

That'll certainly add up over time.


Anyway, my answer would definitely lean towards using this normalized
representation, if all the data values are integers.  (If they're not,
it gets messier...)

> Users will then be doing regular queries on this data (say, a few hundred
> times per day), such as:

> SELECT "Time", "Value" FROM "DataImport" WHERE "Identifier" = 'ID_ABC' AND
> "Time" between '2008-11-07' and '2008-11-11';

An index on (Identifier, Time) (not the other way around) should work
well for that type of query.

I'll give it a try.

> My concern is that 1500 values * 14400 minutes per day = 21,600,000
> records.  Add this up over the course of a month (the length of time I need
> to keep the data in this table) and I'll have more than half a billion
> records being stored in there.

That's not an especially big table.  However it might be worth your
trouble to use partitioning.  Not to speed searches, particularly, but
to make it easier to drop 1/30th of the data each day.

Re-reading this, I noticed that I had an extra 0 in one of my figures: there are only 1440 minutes in a day, so my table gets down to a much more manageable 2.16 million records per day instead of 21.6 million.  I'll have a look into partitioning too, thanks for your help!

Cheers,

Andrew

Re: newbie table design question

From: Sam Mason
Date:
On Sun, May 31, 2009 at 11:54:09PM +0800, Andrew Smith wrote:
> I'm a beginner when it comes to Postgresql, and have a table design question
> about a project I'm currently working on.  I have 1500 data items that need
> to be copied every minute from an external system into my database.  The
> items have a timestamp, an identifier and a value.

Not sure if it would help, but maybe an array of integers would be
appropriate.  I.e. you'd have the following table structure:

  CREATE TABLE data (
    time TIMESTAMP PRIMARY KEY,
    vals INTEGER[]
  );

and you'd be inserting something like:

  INSERT INTO data (time,vals) VALUES
    ('2009-06-01 10:54', '{1,2,3,4}');

This would have much lower overhead than having one row per value and
will make some queries easier and some more difficult.  It also relies
upon having the values in the same order each time.

Another thing you can get PG to do would be a constraint like:

  CHECK (time = date_trunc('minute',time))

this would ensure that you get at most one entry per minute and that
it's on the minute.  This sort of thing should make some sorts of
queries easier.
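
Reading one identifier's series back out then depends on knowing its fixed position in the array; for example (assuming ID_ABC is always stored at position 1):

  -- pull ID_ABC's time series by subscripting the array
  SELECT time, vals[1] AS id_abc
    FROM data
   WHERE time BETWEEN '2008-11-07' AND '2008-11-11';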

--
  Sam  http://samason.me.uk/

Re: newbie table design question

From: "Chris Spotts"
Date:
I just finished doing something very close to this - not quite once per
minute, but close.  I started off with an array of integers and after about
a month of it, I'm having to redesign my way out of it.  It would have
worked fine, but you just have to be sure that simple searches are all you're
ever going to want to do.  I started needing to do some more complex
analysis and it took forever and a day.  I'd recommend the normalized
approach with partitioning personally.  Just my two cents.

-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Sam Mason
Sent: Monday, June 01, 2009 5:00 AM
To: pgsql-general@postgresql.org
Subject: Re: [GENERAL] newbie table design question

On Sun, May 31, 2009 at 11:54:09PM +0800, Andrew Smith wrote:
> I'm a beginner when it comes to Postgresql, and have a table design
> question
> about a project I'm currently working on.  I have 1500 data items that
> need
> to be copied every minute from an external system into my database.  The
> items have a timestamp, an identifier and a value.

Not sure if it would help, but maybe an array of integers would be
appropriate.  I.e. you'd have the following table structure:

  CREATE TABLE data (
    time TIMESTAMP PRIMARY KEY,
    vals INTEGER[]
  );

and you'd be inserting something like:

  INSERT INTO data (time,vals) VALUES
    ('2009-06-01 10:54', '{1,2,3,4}');

This would have much lower overhead than having one row per value and
will make some queries easier and some more difficult.  It also relies
upon having the values in the same order each time.

Another thing you can get PG to do would be a constraint like:

  CHECK (time = date_trunc('minute',time))

this would ensure that you get at-most one entry per minute and that
it's on the minute.  This sort of thing should make some sorts of
queries easier.

--
  Sam  http://samason.me.uk/



Re: newbie table design question

From: Sam Mason
Date:
On Mon, Jun 01, 2009 at 06:53:30AM -0500, Chris Spotts wrote:
> I just finished doing something very close to this - not quite once per
> minute, but close.  I started off with an array of integers and after about
> a month of it, I'm having to redesign my way out of it.

I've had to go the other way before; started with a nicely normalized
design but the performance overhead (and query complexity to a lesser
extent) made the design unworkable.  As always the design depends on the
specifics of the problem, but as the OP said he was new to PG I thought
it good to point out its support for arrays.

--
  Sam  http://samason.me.uk/

Re: newbie table design question

From: Grzegorz Jaśkiewicz
Date:
In short: a horizontal structure.  You are trying to store 1500 columns
per entry, instead of creating 1500 rows per entry...

Re: newbie table design question

From: björn lundin
Date:
> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "ID_ABC" integer NOT NULL,
>   "ID_DEF" integer NOT NULL,
>   "ID_HIJ" integer NOT NULL,
>   etc
> );

Perhaps you don't want to use the "" around the table and column names.
It makes them case-sensitive.
--
Björn Lundin

Re: newbie table design question

From: Scott Marlowe
Date:
2009/6/1 björn lundin <b.f.lundin@gmail.com>:
>
>> CREATE TABLE "DataImport"
>> (
>>   "DataImportID" serial NOT NULL PRIMARY KEY,
>>   "Time" timestamp without time zone NOT NULL,
>>   "ID_ABC" integer NOT NULL,
>>   "ID_DEF" integer NOT NULL,
>>   "ID_HIJ" integer NOT NULL,
>>   etc
>> );
>
> Perhaps you want to not use the "" around the table and column names.
> It makes them case sensitive

OR always always always use them.  But yeah, I find having to quote my
identifiers all the time a gigantic pain.

Re: newbie table design question

From: Andrew Smith
Date:


2009/6/2 björn lundin <b.f.lundin@gmail.com>

> CREATE TABLE "DataImport"
> (
>   "DataImportID" serial NOT NULL PRIMARY KEY,
>   "Time" timestamp without time zone NOT NULL,
>   "ID_ABC" integer NOT NULL,
>   "ID_DEF" integer NOT NULL,
>   "ID_HIJ" integer NOT NULL,
>   etc
> );

Perhaps you want to not use the "" around the table and column names.
It makes them case sensitive

Yes, that's exactly why I've got them there. I can't stand lowercase field names in tables.

Re: newbie table design question

From: Merlin Moncure
Date:
On Mon, Jun 1, 2009 at 8:41 AM, Sam Mason <sam@samason.me.uk> wrote:
> On Mon, Jun 01, 2009 at 06:53:30AM -0500, Chris Spotts wrote:
>> I just finished doing something very close to this - not quite once per
>> minute, but close.  I started off with an array of integers and after about
>> a month of it, I'm having to redesign my way out of it.
>
> I've had to go the other way before; started with a nicely normalized
> design but the performance overhead (and query complexity to a lesser
> extent) made the design unworkable.  As always the design depends on the
> specifics of the problem, but as the OP said he was new to PG I thought
> it good to point out its support for arrays.

+1

_IF_ arrays fit your requirements, they are an extremely compact and
efficient way to store your data.  The requirements are exact: you
can't update or extract a single piece of data out of a block
efficiently.  While indexing techniques are possible with GiST, they
are a moderate help at best.  But if arrays fit, your table will be a
fraction of the size of the 'row per datum' approach, at the cost of
flexibility.  Arrays are an enormous footgun when used in tables, but
if used properly they can work well.

merlin

Re: newbie table design question

From: Grzegorz Jaśkiewicz
Date:
On Tue, Jun 2, 2009 at 2:27 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

> _IF_ arrays fit your requirements, they are an extremely compact and
> efficient way to store your data.  The requirements are exact..you
> can't update or extract a single piece of data out of a block
> efficiently.  While indexing techniques are possible with gist, they
> are moderate help at best.  But if arrays fit, your table will be a
> fraction of size of the 'row per datum' approach, at the cost of
> flexibility.  arrays are an enormous footgun when used in tables, but
> if used properly can work well.

GIN indexes work with arrays too.
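
A sketch against the array table from earlier in the thread (the index name is arbitrary); a GIN index supports containment queries over the whole array:

  CREATE INDEX data_vals_gin ON data USING gin (vals);

  -- e.g. find the minutes where the value 17 appears anywhere
  SELECT time FROM data WHERE vals @> ARRAY[17];

Note this answers "which rows contain this value", not "what is identifier N's value", so it only helps some query shapes.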

--
GJ

Re: newbie table design question

From: Merlin Moncure
Date:
2009/6/1 Andrew Smith <laconical@gmail.com>:
> 2009/6/2 björn lundin <b.f.lundin@gmail.com>
>>
>> > CREATE TABLE "DataImport"
>> > (
>> >   "DataImportID" serial NOT NULL PRIMARY KEY,
>> >   "Time" timestamp without time zone NOT NULL,
>> >   "ID_ABC" integer NOT NULL,
>> >   "ID_DEF" integer NOT NULL,
>> >   "ID_HIJ" integer NOT NULL,
>> >   etc
>> > );
>>
>> Perhaps you want to not use the "" around the table and column names.
>> It makes them case sensitive
>
> Yes, that's exactly why I've got them there. I can't stand lowercase field
> names in tables.

well, beauty is in the eye of the beholder and all that, but I'd
advise rethinking your style.  On technical grounds, I'd argue that you
may run into third-party software that is not aware of PostgreSQL's
case-folding conventions, which could lead to trouble.

On style grounds, I'd argue that forcing all your field names to be
explicitly cased is basically crazy.  It's problematic for some client
languages, requiring extra escaping when you otherwise wouldn't need
it.  Style and conventions are supposed to make things easier for the
programmer, not harder.  I'd also advise moving 'ID' and other
name-clarifying words from the prefix to the suffix.  Good for you,
though, for spelling out DataImportID and not using plain 'ID', which is
problematic for other reasons.

merlin