Discussion: designing time dimension for star schema


designing time dimension for star schema

From: Mark Wong

Hello everybody,

I was wondering if anyone had any experiences they can share when
designing the time dimension for a star schema and the like.  I'm
curious about how well it would work to use a timestamp for the
attribute key, as opposed to a surrogate key, and populating the time
dimension with triggers on insert to the fact tables.  This is
something that would have data streaming in (as opposed to bulk
loading) and I think we want time granularity to the minute.

A simplified example:

-- Time dimension
CREATE TABLE time (
    datetime TIMESTAMP WITH TIME ZONE NOT NULL,
    day_of_week SMALLINT NOT NULL
);
CREATE UNIQUE INDEX ON time (datetime);

-- Fact
CREATE TABLE fact(
    datetime TIMESTAMP WITH TIME ZONE NOT NULL,
    FOREIGN KEY (datetime) REFERENCES time(datetime)
);

-- Function to populate the time dimension
CREATE OR REPLACE FUNCTION decompose_timestamp() RETURNS TRIGGER AS $$
BEGIN
    NEW.datetime = date_trunc('minutes', NEW.datetime);
    INSERT INTO time (datetime, day_of_week)
    VALUES (NEW.datetime, date_part('dow', NEW.datetime));
    RETURN NEW;
EXCEPTION
    WHEN unique_violation THEN
        -- Do nothing if the timestamp already exists in the dimension table.
        RETURN NEW;
END; $$
LANGUAGE plpgsql;

CREATE TRIGGER populate_time BEFORE INSERT
ON fact FOR EACH ROW
EXECUTE PROCEDURE decompose_timestamp();
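
The sort of rollup query this layout is meant to serve would look
something like the following (illustrative only):

-- Count fact rows per day of week by joining through the time dimension.
SELECT t.day_of_week, count(*)
FROM fact f
JOIN time t ON t.datetime = f.datetime
GROUP BY t.day_of_week
ORDER BY t.day_of_week;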

Regards,
Mark


Re: designing time dimension for star schema

From: CS DBA

I've done a lot of DSS architecture. A couple of thoughts:

- In most cases the ETL process figures out the time IDs as part of the
preparation and then does bulk loads into the fact tables. I would be
very concerned about the performance of a trigger that fires for every
row inserted into the fact table.

      You mention you want to do data streaming instead of bulk loads;
can you elaborate?


- When querying a star schema, one of the performance features is the
fact that all joins to the dimension tables are performed via a numeric
key, such as:
"select * from fact, time_dim, geo_dim
   where fact.time_id = time_dim.time_id..."

In the case of this being a timestamp, I suspect the performance would
take a hit.  Depending on the size of your fact table and the
scope/volume of your DSS queries, this could easily be a show stopper,
based on the assumption that the database can do a numeric binary search
much faster than a timestamp search.  A rough sketch of the conventional
layout follows.
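
Something along these lines (table and column names here are just
illustrative, and the staging table is hypothetical):

-- Time dimension keyed by an integer surrogate, pre-populated rather
-- than maintained by triggers: one row per minute for the period of
-- interest.
CREATE TABLE time_dim (
    time_id     INTEGER PRIMARY KEY,
    datetime    TIMESTAMP WITH TIME ZONE NOT NULL UNIQUE,
    day_of_week SMALLINT NOT NULL
);

INSERT INTO time_dim (time_id, datetime, day_of_week)
SELECT row_number() OVER (ORDER BY ts), ts, date_part('dow', ts)
FROM generate_series('2014-01-01 00:00+00'::timestamptz,
                     '2014-12-31 23:59+00'::timestamptz,
                     '1 minute') AS g(ts);

-- Fact rows carry only the numeric key.
CREATE TABLE fact (
    time_id INTEGER NOT NULL REFERENCES time_dim (time_id)
    -- plus the measures
);

-- Hypothetical staging table the raw log rows are loaded into; the ETL
-- step resolves timestamps to keys and bulk loads the facts.
CREATE TABLE staging (event_time TIMESTAMP WITH TIME ZONE NOT NULL);

INSERT INTO fact (time_id)
SELECT d.time_id
FROM staging s
JOIN time_dim d ON d.datetime = date_trunc('minute', s.event_time);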





On 2/10/14, 9:45 AM, Mark Wong wrote:
> Hello everybody,
>
> I was wondering if anyone had any experiences they can share when
> designing the time dimension for a star schema and the like.  I'm
> curious about how well it would work to use a timestamp for the
> attribute key, as opposed to a surrogate key, and populating the time
> dimension with triggers on insert to the fact tables.  This is
> something that would have data streaming in (as opposed to bulk
> loading) and I think we want time granularity to the minute.
>
> A simplified example:
>
> -- Time dimension
> CREATE TABLE time (
>      datetime TIMESTAMP WITH TIME ZONE NOT NULL,
>      day_of_week SMALLINT NOT NULL
> );
> CREATE UNIQUE INDEX ON time (datetime);
>
> -- Fact
> CREATE TABLE fact(
>      datetime TIMESTAMP WITH TIME ZONE NOT NULL,
>      FOREIGN KEY (datetime) REFERENCES time(datetime)
> );
>
> -- Function to populate the time dimension
> CREATE OR REPLACE FUNCTION decompose_timestamp() RETURNS TRIGGER AS $$
> BEGIN
>      NEW.datetime = date_trunc('minutes', NEW.datetime);
>      INSERT INTO time (datetime, day_of_week)
>      VALUES (NEW.datetime, date_part('dow', NEW.datetime));
>      RETURN NEW;
> EXCEPTION
>      WHEN unique_violation THEN
>          -- Do nothing if the timestamp already exists in the dimension table.
>          RETURN new;
> END; $$
> LANGUAGE 'plpgsql';
>
> CREATE TRIGGER populate_time BEFORE INSERT
> ON fact FOR EACH ROW
> EXECUTE PROCEDURE decompose_timestamp();
>
> Regards,
> Mark
>
>



Re: designing time dimension for star schema

From: Mark Wong

On Mon, Feb 10, 2014 at 9:20 AM, CS DBA <cs_dba@consistentstate.com> wrote:
> I've done a lot of DSS architecture. A couple of thoughts:
>
> - In most cases the ETL process figures out the time IDs as part of the
> preparation and then does bulk loads into the fact tables. I would be
> very concerned about the performance of a trigger that fires for every
> row inserted into the fact table.
>
>       You mention you want to do data streaming instead of bulk loads;
> can you elaborate?

We have processes inserting data from log files as they are written.

> - When querying a star schema one of the performance features is the
> fact that all joins to the dimension tables are performed via a numeric
> key, such as:
> "select * from fact, time_dim, geo_dim
>    where fact.time_id = time_dim.time_id..."
>
> In the case of this being a timestamp, I suspect the performance would
> take a hit.  Depending on the size of your fact table and the
> scope/volume of your DSS queries, this could easily be a show stopper,
> based on the assumption that the database can do a numeric binary search
> much faster than a timestamp search.

I guess I was hoping the extra 4 bytes from a timestamp, compared to a
bigint, wouldn't be too significant, yet I didn't consider that Postgres
might do a binary search faster on an integer type than on a timestamp,
even with 1 billion rows.  But maybe that's wishful thinking.  Maybe a
regular integer at 4 bytes would be good enough.  I would estimate a
query would touch on the order of 1 million rows at a time.

Regards,
Mark


Re: designing time dimension for star schema

From: Tom Lane

Mark Wong <markwkm@gmail.com> writes:
> On Mon, Feb 10, 2014 at 9:20 AM, CS DBA <cs_dba@consistentstate.com> wrote:
>> In the case of this being a timestamp, I suspect the performance would
>> take a hit.  Depending on the size of your fact table and the
>> scope/volume of your DSS queries, this could easily be a show stopper,
>> based on the assumption that the database can do a numeric binary search
>> much faster than a timestamp search.

> I guess I was hoping the extra 4 bytes from a timestamp, compared to a
> bigint, wouldn't be too significant, yet I didn't consider that Postgres
> might do a binary search faster on an integer type than on a timestamp.

Actually, Postgres timestamps *are* bigints under the hood, and
comparisons will be about the same speed as for bigints.  Now, I/O is a
lot more expensive for timestamps ... but not index searches.
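
(A quick way to see that: pg_column_size() reports eight bytes for both
a timestamptz value and a bigint.)

SELECT pg_column_size('2014-02-10 09:45:00-08'::timestamptz) AS timestamptz_bytes,
       pg_column_size(9223372036854775807::bigint) AS bigint_bytes;
-- both columns come back as 8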

            regards, tom lane


Re: designing time dimension for star schema

From: Merlin Moncure

On Mon, Feb 10, 2014 at 8:45 AM, Mark Wong <markwkm@gmail.com> wrote:
> Hello everybody,
>
> I was wondering if anyone had any experiences they can share when
> designing the time dimension for a star schema and the like.  I'm
> curious about how well it would work to use a timestamp for the
> attribute key, as opposed to a surrogate key, and populating the time
> dimension with triggers on insert to the fact tables.  This is
> something that would have data streaming in (as opposed to bulk
> loading) and I think we want time granularity to the minute.
>
> A simplified example:
>
> -- Time dimension
> CREATE TABLE time (
>     datetime TIMESTAMP WITH TIME ZONE NOT NULL,
>     day_of_week SMALLINT NOT NULL
> );
> CREATE UNIQUE INDEX ON time (datetime);
>
> -- Fact
> CREATE TABLE fact(
>     datetime TIMESTAMP WITH TIME ZONE NOT NULL,
>     FOREIGN KEY (datetime) REFERENCES time(datetime)
> );
>
> -- Function to populate the time dimension
> CREATE OR REPLACE FUNCTION decompose_timestamp() RETURNS TRIGGER AS $$
> BEGIN
>     NEW.datetime = date_trunc('minutes', NEW.datetime);
>     INSERT INTO time (datetime, day_of_week)
>     VALUES (NEW.datetime, date_part('dow', NEW.datetime));
>     RETURN NEW;
> EXCEPTION
>     WHEN unique_violation THEN
>         -- Do nothing if the timestamp already exists in the dimension table.
>         RETURN new;
> END; $$
> LANGUAGE 'plpgsql';

Failure to inline the date/time, pushing it out to a dimension table
instead, is a terrible practice IMO.  You add a lookup and an expensive
subtransaction to each insert.  When querying the fact table you tack on
a join to every query that does not need to be there (for no benefit I
can see).
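
In other words, something along these lines (sketch only; the index and
the example query are just to show that the calendar attributes can be
derived at query time with no join):

-- Keep the (minute-truncated) timestamp directly on the fact table.
CREATE TABLE fact (
    datetime TIMESTAMP WITH TIME ZONE NOT NULL
    -- plus the measures
);
CREATE INDEX ON fact (datetime);

-- Day-of-week comes from the timestamp itself; no dimension join needed.
SELECT date_part('dow', datetime) AS day_of_week, count(*)
FROM fact
WHERE datetime >= '2014-02-01' AND datetime < '2014-03-01'
GROUP BY 1
ORDER BY 1;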

merlin


Re: designing time dimension for star schema

From: Mark Wong

On Mon, Feb 10, 2014 at 8:45 AM, Mark Wong <markwkm@gmail.com> wrote:
> Hello everybody,
>
> I was wondering if anyone had any experiences they can share when
> designing the time dimension for a star schema and the like.  I'm
> curious about how well it would work to use a timestamp for the
> attribute key, as opposed to a surrogate key, and populating the time
> dimension with triggers on insert to the fact tables.  This is
> something that would have data streaming in (as opposed to bulk
> loading) and I think we want time granularity to the minute.

Hello everybody,

I did a simple experiment and just wanted to share; hopefully it wasn't
too simplistic.  On a 72GB 15K RPM 2.5" drive, I measured how long it
would take to insert (committing after each insert) 100,000 rows of
bigints, of timestamps with time zone, and of timestamps with time zone
with the insert trigger.  The bigints and the plain timestamps each took
about 10 minutes to insert 100,000 rows, and adding the trigger
increased that to about 11 minutes.
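
The setup was along these lines (a sketch with illustrative table names,
reusing the time table and decompose_timestamp() trigger function from
the first message):

CREATE TABLE fact_bigint (val BIGINT NOT NULL);
CREATE TABLE fact_ts     (datetime TIMESTAMP WITH TIME ZONE NOT NULL);
CREATE TABLE fact_ts_trg (datetime TIMESTAMP WITH TIME ZONE NOT NULL);

CREATE TRIGGER populate_time BEFORE INSERT
ON fact_ts_trg FOR EACH ROW
EXECUTE PROCEDURE decompose_timestamp();

-- The client then issued 100,000 single-row INSERTs against each table,
-- one transaction per insert (autocommit), e.g.:
--   INSERT INTO fact_ts VALUES (now());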

Regards,
Mark