finding gaps in temporal data

Поиск

Список

Период

Сортировка

От	Samuel Gendler
Тема	finding gaps in temporal data
Дата	15 июня 2011 г. 14:24:04
Msg-id	BANLkTim=aT7_R+cwUUz2eAan+dytcU3jbg@mail.gmail.com обсуждение исходный текст
Ответы	Re: finding gaps in temporal data
Список	pgsql-sql

Дерево обсуждения

I have a fact table that looks like this:

dim1_fk bigint,

time_fk bigint,

tstamp timestamp without timezone

value numeric(16,2)

The tstamp column is completely redundant to a colume in the time_utc table, but I like to leave it there for convenience when writing ad-hoc queries in psql - it allows me to skip the join to a time dimension table. The fact table is actually partitioned into 1-month child tables, with check constraints on both time_fk and the tstamp column, since there is a 1:1 relationship between those columns.

dim1_fk refers to a dim1 table which has two varchar columns we care about:

dim1_pk bigint,

label1 varchar,

label2 varchar

the time_utc table contains the usual time dimension columns, and I've structured the primary key to be an integer in the form YYYYMMDDHH24mm - so 2011-06-15 15:35 has a primary key of 201106151535 and there is a row every 5 minutes. All data in the fact table is assigned to a given 5 minute window. There is a row in the time_utc table for every possible time value, regardless of whether there is data in a fact table for that 5 minute interval. For our purposes, we only need care about 2 columns

time_pk bigint,

tstamp timstamp without time zone

I'm looking to run a report which will show me any gaps in the data for any label1/label2 pair that appears in the dim1 table - there are 0 or more rows for each label1/label2 combination in each 5 minute window and I don't actually care about the duplicates (all my queries aggregate multiple rows for a given timestamp via the avg() function). I can find every missing row in the fact table data by just cross joining dim1 and time_utc for a given date range and then left joining between that and the fact table. Since every possible time value appears in the time table and every possible value1/value2 combination appears in the dim1 table, I will see NULLs in the fact table columns for any row that is missing in the data set.

However, it would be far nicer to get a report which just lists gaps, with a start and end, rather than one row for each missing 5 minute interval, even if they are sequential. Is there any way to accomplish this? I'm running 8.4.x at the moment, but would be willing to upgrade to 9.0.x, if necessary. If I can do it without the left outer join from time_utc to the fact table, then so much the better.

A query to grab just the missing rows looks something like this:

SELECT t.tstamp,

d.label1,

d.label2

FROM dimensions.dim1 d

INNER JOIN dimensions.time_utc t

ON t.tstamp between '2011-06-01' and '2011-06-05'

LEFT OUTER JOIN facts.parent_fact f

ON f.time_fk = t.time_pk

AND f.dim1_fk = d.dim1_pk

GROUP BY 3,2,1

HAVING avg(value) IS NULL

ORDER BY 3,2,1

so all I really need to do is figure out how to combine sequential rows into a single row with a start and end time. Rows are sequential if row1.tstamp - row2.tstamp = 5 minutes

В списке pgsql-sql по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

finding gaps in temporal data