Обсуждение: Statistics query

Поиск
Список
Период
Сортировка

Statistics query

От
Steve Crawford
Дата:
I'm seeking ideas on the best way to craft the following query. I've
stripped everything down to the bare essentials and simplified it below.

Input data has a timestamp (actually an int received from the system in
the form of a Unix epoch), a unit identifier and a status:

  event_time | unit_id | status
------------+---------+--------
  1357056011 |      60 |      1
  1357056012 |     178 |      0
  1357056019 |     168 |      0
  1357056021 |       3 |      0
  1357056021 |       4 |      1
  1357056021 |     179 |      0
  1357056022 |       0 |      1
  1357056022 |       1 |      0
  1357056023 |       2 |      0
  1357056024 |       9 |      0
  1357056025 |       5 |      0
  1357056025 |       6 |      0
  1357056026 |       7 |      1
...

A given unit_id cannot have two events at the same time (enforced by
constraints).

Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in the
preceding hour then

2. Count both the total events and sum the status=1 events for the most
recent 50 events for each unit_id that fall within a limited period
(e.g. don't look at data earlier than midnight). So unit_id 60 might
have 50 events in the last 15 minutes while unit_id 4 might have only 12
events after midnight.

The output would look something like:

  unit_id | events | status_1_count
---------+--------+----------------
       1  |     50 |             34
       2  |     27 |             18
       1  |     50 |             34
       1  |      2 |              0
...

Each sub-portion is easy and while I could use external processing or
set-returning functions I was hoping first to find the secret-sauce to
glue everything together into a single query.

Cheers,
Steve






Re: Statistics query

От
Chris Curvey
Дата:

On Wed, Apr 10, 2013 at 12:30 PM, Steve Crawford <scrawford@pinpointresearch.com> wrote:
I'm seeking ideas on the best way to craft the following query. I've stripped everything down to the bare essentials and simplified it below.

Input data has a timestamp (actually an int received from the system in the form of a Unix epoch), a unit identifier and a status:

 event_time | unit_id | status
------------+---------+--------
 1357056011 |      60 |      1
 1357056012 |     178 |      0
 1357056019 |     168 |      0
 1357056021 |       3 |      0
 1357056021 |       4 |      1
 1357056021 |     179 |      0
 1357056022 |       0 |      1
 1357056022 |       1 |      0
 1357056023 |       2 |      0
 1357056024 |       9 |      0
 1357056025 |       5 |      0
 1357056025 |       6 |      0
 1357056026 |       7 |      1
...

A given unit_id cannot have two events at the same time (enforced by constraints).

Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in the preceding hour then

2. Count both the total events and sum the status=1 events for the most recent 50 events for each unit_id that fall within a limited period (e.g. don't look at data earlier than midnight). So unit_id 60 might have 50 events in the last 15 minutes while unit_id 4 might have only 12 events after midnight.

The output would look something like:

 unit_id | events | status_1_count
---------+--------+----------------
      1  |     50 |             34
      2  |     27 |             18
      1  |     50 |             34
      1  |      2 |              0
...

Each sub-portion is easy and while I could use external processing or set-returning functions I was hoping first to find the secret-sauce to glue everything together into a single query.

Cheers,
Steve


something like 

select unit_id, count(*), sum(status)
from mytable a
where event_time >= [whatever unix epoch translates to "last midnight"]
and exists
(  select *
   from mytable b
   where b.unit_id = a.unit_id
   and b.epoch >= [unix epoch that translates to "one hour ago"])
group by unit _id;

1) I think figuring out the unix epoch should be reasonable...but I don't know how to do it off the top of my head.
2) I could completely be misunderstanding this.  I'm not sure why the example results would have unit id 1 repeated. (which my suggestion WON'T do)

Re: Statistics query

От
Steve Crawford
Дата:
On 04/10/2013 10:31 AM, Chris Curvey wrote:

On Wed, Apr 10, 2013 at 12:30 PM, Steve Crawford <scrawford@pinpointresearch.com> wrote:
I'm seeking ideas on the best way to craft the following query. I've stripped everything down to the bare essentials and simplified it below.

Input data has a timestamp (actually an int received from the system in the form of a Unix epoch), a unit identifier and a status:

 event_time | unit_id | status
------------+---------+--------
 1357056011 |      60 |      1
 1357056012 |     178 |      0
 1357056019 |     168 |      0
 1357056021 |       3 |      0
 1357056021 |       4 |      1
 1357056021 |     179 |      0
 1357056022 |       0 |      1
 1357056022 |       1 |      0
 1357056023 |       2 |      0
 1357056024 |       9 |      0
 1357056025 |       5 |      0
 1357056025 |       6 |      0
 1357056026 |       7 |      1
...

A given unit_id cannot have two events at the same time (enforced by constraints).

Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in the preceding hour then

2. Count both the total events and sum the status=1 events for the most recent 50 events for each unit_id that fall within a limited period (e.g. don't look at data earlier than midnight). So unit_id 60 might have 50 events in the last 15 minutes while unit_id 4 might have only 12 events after midnight.

The output would look something like:

 unit_id | events | status_1_count
---------+--------+----------------
      1  |     50 |             34
      2  |     27 |             18
      1  |     50 |             34
      1  |      2 |              0
...

Each sub-portion is easy and while I could use external processing or set-returning functions I was hoping first to find the secret-sauce to glue everything together into a single query.

Cheers,
Steve


something like 

select unit_id, count(*), sum(status)
from mytable a
where event_time >= [whatever unix epoch translates to "last midnight"]
and exists
(  select *
   from mytable b
   where b.unit_id = a.unit_id
   and b.epoch >= [unix epoch that translates to "one hour ago"])
group by unit _id;

1) I think figuring out the unix epoch should be reasonable...but I don't know how to do it off the top of my head.
2) I could completely be misunderstanding this.  I'm not sure why the example results would have unit id 1 repeated. (which my suggestion WON'T do)

Because I screwed up cutting and pasting to make an example. The unit_id in the output should, in fact, be distinct:

 unit_id | events | status_1_count
---------+--------+----------------
      1  |     50 |             34
      2  |     27 |             18
      3  |     50 |             34
      4  |      2 |              0

You are correct, epoch is easy:
abstime(epoch)
or
extract(epoch from timestamptz)
depending on which direction you are going or for an hour difference just subtract 3600.

The solution, however, misses the important complicating gotcha. The units I want listed are only those that have had at least one event in the last hour. But for each such unit, I only want the statistics to reflect the most-recent 50 events (even if those events occurred earlier than the current hour) provided the event occurred on the current date. So the events column can never be less than 1 nor more than 50.

For example...

One unit might have a single event at the start of the last hour but 49 more in the preceding 10 minutes. I want to see that unit and the stats for those 50 events.

Same thing if a unit has 50 events clustered at the end of an hour - I don't want the earlier ones.

Another might have 50 events early in the day but none this hour. I don't want to see that one.

But I do want to see the one that had an event in the last hour late in the day along with the 48 other events that have accumulated since midnight.

Cheers,
Steve

Re: SOLVED Statistics query

От
Steve Crawford
Дата:
On 04/10/2013 10:51 AM, Steve Crawford wrote:
...
Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in the preceding hour then

2. Count both the total events and sum the status=1 events for the most recent 50 events for each unit_id that fall within a limited period (e.g. don't look at data earlier than midnight). So unit_id 60 might have 50 events in the last 15 minutes while unit_id 4 might have only 12 events after midnight.

...

Guess I needed to wait for the coffee to absorb. I've come up with an initial working solution (perhaps excess use of CTE but it's useful for testing/debugging over different portions of collected data):

with

report_time as (
select
    1365526800::int as list_end
),

report_ranges as (
select
    extract(epoch from date_trunc('day', abstime(list_end)))::int as day_start,
    greatest(list_end-3600, extract(epoch from date_trunc('day', abstime(list_end)))::int) as list_start,
    list_end
from
    report_time
),

today_events as (
select
    unit_id,
    event_time,
    status
from
    event_log d,
    report_ranges r
where
    d.event_time >= r.day_start and
    d.event_time <= r.list_end
),

unit_id_list as (
select
    distinct unit_id,
    coalesce((select
         i.event_time
     from
         today_events i
     where
         i.unit_id = o.unit_id and
         i.event_time <= r.list_end
     order by
         event_time desc
     limit 1
     offset 49), r.day_start) as first_event
from
    event_log o,
    report_ranges r
where
    event_time between r.list_start and r.list_end
)

select
    unit_id,
    (select
         count(*)
     from
         today_events ii
     where
         ii.unit_id = oo.unit_id and
         ii.event_time >= oo.first_event) as events,
    (select
         sum (case when status = -6 then 1 else 0 end)
     from
         today_events ii
     where
         ii.unit_id = oo.unit_id and
         ii.event_time >= oo.first_event) as live_answer
from
    unit_id_list oo
order by
    unit_id
;

Cheers,
Steve