Re: finding gaps in temporal data

Поиск

Список

Период

Сортировка

От	Samuel Gendler
Тема	Re: finding gaps in temporal data
Дата	15 июня 2011 г. 21:00:40
Msg-id	BANLkTi=vD=MgfDjCaN7TrThbE3=1-KHXEw@mail.gmail.com обсуждение исходный текст
Ответ на	finding gaps in temporal data (Samuel Gendler <sgendler@ideasculptor.com>)
Список	pgsql-sql

Дерево обсуждения

On Wed, Jun 15, 2011 at 10:23 AM, Samuel Gendler <sgendler@ideasculptor.com> wrote:

I have a fact table that looks like this:

dim1_fk bigint,
time_fk bigint,
tstamp timestamp without timezone
value numeric(16,2)

The tstamp column is completely redundant to a colume in the time_utc table, but I like to leave it there for convenience when writing ad-hoc queries in psql - it allows me to skip the join to a time dimension table. The fact table is actually partitioned into 1-month child tables, with check constraints on both time_fk and the tstamp column, since there is a 1:1 relationship between those columns.

dim1_fk refers to a dim1 table which has two varchar columns we care about:

dim1_pk bigint,
label1 varchar,
label2 varchar

the time_utc table contains the usual time dimension columns, and I've structured the primary key to be an integer in the form YYYYMMDDHH24mm - so 2011-06-15 15:35 has a primary key of 201106151535 and there is a row every 5 minutes. All data in the fact table is assigned to a given 5 minute window. There is a row in the time_utc table for every possible time value, regardless of whether there is data in a fact table for that 5 minute interval. For our purposes, we only need care about 2 columns

time_pk bigint,
tstamp timstamp without time zone

I'm looking to run a report which will show me any gaps in the data for any label1/label2 pair that appears in the dim1 table - there are 0 or more rows for each label1/label2 combination in each 5 minute window and I don't actually care about the duplicates (all my queries aggregate multiple rows for a given timestamp via the avg() function).

OK, I figured this one out on my own. It looks like this (plus a union all to a query to show entries from dim1 for which there is no data at all):

SELECT d2.label1,

d2.label2,

tstamp - gap as start_time,

tstamp - '5 minute'::interval as end_time

FROM (

SELECT d.dim1_pk,

t.tstamp,

t.tstamp - lag(t.tstamp,1, '2011-05-15 00:00:00') OVER w AS gap

FROM

dim1 d

JOIN facts.fact_tbl f

ON f.dim1_fk = d.dim1_pk

JOIN time_utc t

ON t.time_pk = f.time_fk

WHERE f.tstamp between '2011-05-15 00:00:00' and '2011-06-03 23:55:00'

GROUP BY 1,2

WINDOW w AS (PARTITION BY d.dim1_pk

ORDER BY d.dim1_pk)

ORDER BY 1,2

) AS q JOIN dim1 d2 ON d2.dim1_pk = q.dim1_pk

WHERE q.gap > '5 minute'::interval

;

That subtracts the tstamp of the previous row from the tstamp of the current row, within a window defined on individual dim1_pk values. The outer query then selects only rows where the gap is greater than 5 minutes, since sequential values will show a 5 minute interval. It also joins to dim1 again in order to pull out the label1 and label2 values, since including those in the window instead of dim1_pk resulted in a much slower query, presumably because I don't have indexes on the label columns.

This avoids all of the cross join and left join craziness I was doing, which is useful when attempting to plug gaps with default values, but a pain in the neck when just attempting to determine where the gaps are and how large they are.

window functions are a seriously useful tool!

--sam

В списке pgsql-sql по дате отправления:

Предыдущее

От: Jasen Betts
Дата: 15 июня 2011 г., 20:30:33
Сообщение: Re: Storage of Indian Language text in postgresql through PHP

Следующее

От: Clem Dickey
Дата: 16 июня 2011 г., 05:29:28
Сообщение: " ::= " grammar rule not accepted by Postgres

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: finding gaps in temporal data

Предыдущее

Следующее