Обсуждение: Statistics query

Поиск

Список

Период

Сортировка

Statistics query

От

Steve Crawford

Дата:

10 апреля 2013 г., 19:31:03

I'm seeking ideas on the best way to craft the following query. I've
stripped everything down to the bare essentials and simplified it below.

Input data has a timestamp (actually an int received from the system in
the form of a Unix epoch), a unit identifier and a status:

  event_time | unit_id | status
------------+---------+--------
  1357056011 |      60 |      1
  1357056012 |     178 |      0
  1357056019 |     168 |      0
  1357056021 |       3 |      0
  1357056021 |       4 |      1
  1357056021 |     179 |      0
  1357056022 |       0 |      1
  1357056022 |       1 |      0
  1357056023 |       2 |      0
  1357056024 |       9 |      0
  1357056025 |       5 |      0
  1357056025 |       6 |      0
  1357056026 |       7 |      1
...

A given unit_id cannot have two events at the same time (enforced by
constraints).

Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in the
preceding hour then

2. Count both the total events and sum the status=1 events for the most
recent 50 events for each unit_id that fall within a limited period
(e.g. don't look at data earlier than midnight). So unit_id 60 might
have 50 events in the last 15 minutes while unit_id 4 might have only 12
events after midnight.

The output would look something like:

  unit_id | events | status_1_count
---------+--------+----------------
       1  |     50 |             34
       2  |     27 |             18
       1  |     50 |             34
       1  |      2 |              0
...

Each sub-portion is easy and while I could use external processing or
set-returning functions I was hoping first to find the secret-sauce to
glue everything together into a single query.

Cheers,
Steve

Re: Statistics query

От

Chris Curvey

Дата:

10 апреля 2013 г., 20:31:55

On Wed, Apr 10, 2013 at 12:30 PM, Steve Crawford <scrawford@pinpointresearch.com> wrote:

I'm seeking ideas on the best way to craft the following query. I've stripped everything down to the bare essentials and simplified it below.

Input data has a timestamp (actually an int received from the system in the form of a Unix epoch), a unit identifier and a status:

event_time | unit_id | status
------------+---------+--------
1357056011 | 60 | 1
1357056012 | 178 | 0
1357056019 | 168 | 0
1357056021 | 3 | 0
1357056021 | 4 | 1
1357056021 | 179 | 0
1357056022 | 0 | 1
1357056022 | 1 | 0
1357056023 | 2 | 0
1357056024 | 9 | 0
1357056025 | 5 | 0
1357056025 | 6 | 0
1357056026 | 7 | 1
...

A given unit_id cannot have two events at the same time (enforced by constraints).

Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in the preceding hour then

2. Count both the total events and sum the status=1 events for the most recent 50 events for each unit_id that fall within a limited period (e.g. don't look at data earlier than midnight). So unit_id 60 might have 50 events in the last 15 minutes while unit_id 4 might have only 12 events after midnight.

The output would look something like:

unit_id | events | status_1_count
---------+--------+----------------
1 | 50 | 34
2 | 27 | 18
1 | 50 | 34
1 | 2 | 0
...

Each sub-portion is easy and while I could use external processing or set-returning functions I was hoping first to find the secret-sauce to glue everything together into a single query.

Cheers,
Steve

something like

select unit_id, count(*), sum(status)

from mytable a

where event_time >= [whatever unix epoch translates to "last midnight"]

and exists

( select *

from mytable b

where b.unit_id = a.unit_id

and b.epoch >= [unix epoch that translates to "one hour ago"])

group by unit _id;

1) I think figuring out the unix epoch should be reasonable...but I don't know how to do it off the top of my head.

2) I could completely be misunderstanding this. I'm not sure why the example results would have unit id 1 repeated. (which my suggestion WON'T do)

Re: Statistics query

От

Steve Crawford

Дата:

10 апреля 2013 г., 20:51:58

On 04/10/2013 10:31 AM, Chris Curvey wrote:

On Wed, Apr 10, 2013 at 12:30 PM, Steve Crawford <scrawford@pinpointresearch.com> wrote:
I'm seeking ideas on the best way to craft the following query. I've stripped everything down to the bare essentials and simplified it below.

Input data has a timestamp (actually an int received from the system in the form of a Unix epoch), a unit identifier and a status:

event_time | unit_id | status
------------+---------+--------
1357056011 | 60 | 1
1357056012 | 178 | 0
1357056019 | 168 | 0
1357056021 | 3 | 0
1357056021 | 4 | 1
1357056021 | 179 | 0
1357056022 | 0 | 1
1357056022 | 1 | 0
1357056023 | 2 | 0
1357056024 | 9 | 0
1357056025 | 5 | 0
1357056025 | 6 | 0
1357056026 | 7 | 1
...

A given unit_id cannot have two events at the same time (enforced by constraints).

Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in the preceding hour then

2. Count both the total events and sum the status=1 events for the most recent 50 events for each unit_id that fall within a limited period (e.g. don't look at data earlier than midnight). So unit_id 60 might have 50 events in the last 15 minutes while unit_id 4 might have only 12 events after midnight.

The output would look something like:

unit_id | events | status_1_count
---------+--------+----------------
1 | 50 | 34
2 | 27 | 18
1 | 50 | 34
1 | 2 | 0
...

Each sub-portion is easy and while I could use external processing or set-returning functions I was hoping first to find the secret-sauce to glue everything together into a single query.

Cheers,
Steve

something like

select unit_id, count(*), sum(status)
from mytable a
where event_time >= [whatever unix epoch translates to "last midnight"]
and exists
( select *
from mytable b
where b.unit_id = a.unit_id
and b.epoch >= [unix epoch that translates to "one hour ago"])
group by unit _id;

1) I think figuring out the unix epoch should be reasonable...but I don't know how to do it off the top of my head.
2) I could completely be misunderstanding this. I'm not sure why the example results would have unit id 1 repeated. (which my suggestion WON'T do)

Because I screwed up cutting and pasting to make an example. The unit_id in the output should, in fact, be distinct:

unit_id | events | status_1_count ---------+--------+---------------- 1 | 50 | 34 2 | 27 | 18 3 | 50 | 34 4 | 2 | 0

You are correct, epoch is easy:
abstime(epoch)
or
extract(epoch from timestamptz)
depending on which direction you are going or for an hour difference just subtract 3600.

The solution, however, misses the important complicating gotcha. The units I want listed are only those that have had at least one event in the last hour. But for each such unit, I only want the statistics to reflect the most-recent 50 events (even if those events occurred earlier than the current hour) provided the event occurred on the current date. So the events column can never be less than 1 nor more than 50.

For example...

One unit might have a single event at the start of the last hour but 49 more in the preceding 10 minutes. I want to see that unit and the stats for those 50 events.

Same thing if a unit has 50 events clustered at the end of an hour - I don't want the earlier ones.

Another might have 50 events early in the day but none this hour. I don't want to see that one.

But I do want to see the one that had an event in the last hour late in the day along with the 48 other events that have accumulated since midnight.

Cheers,
Steve

Re: SOLVED Statistics query

От

Steve Crawford

Дата:

10 апреля 2013 г., 22:49:28

On 04/10/2013 10:51 AM, Steve Crawford wrote:

...
Given a point in time I would like to:

1. Identify all distinct unit_ids with an entry that exists in the preceding hour then

2. Count both the total events and sum the status=1 events for the most recent 50 events for each unit_id that fall within a limited period (e.g. don't look at data earlier than midnight). So unit_id 60 might have 50 events in the last 15 minutes while unit_id 4 might have only 12 events after midnight.

...

Guess I needed to wait for the coffee to absorb. I've come up with an initial working solution (perhaps excess use of CTE but it's useful for testing/debugging over different portions of collected data):

withreport_time as (select 1365526800::int as list_end),report_ranges as (select extract(epoch from date_trunc('day', abstime(list_end)))::int as day_start, greatest(list_end-3600, extract(epoch from date_trunc('day', abstime(list_end)))::int) as list_start, list_endfrom report_time),today_events as (select unit_id, event_time, statusfrom event_log d, report_ranges rwhere d.event_time >= r.day_start and d.event_time <= r.list_end),unit_id_list as (select distinct unit_id, coalesce((select i.event_time from today_events i where i.unit_id = o.unit_id and i.event_time <= r.list_end order by event_time desc limit 1 offset 49), r.day_start) as first_eventfrom event_log o, report_ranges rwhere event_time between r.list_start and r.list_end)select unit_id, (select count(*) from today_events ii where ii.unit_id = oo.unit_id and ii.event_time >= oo.first_event) as events, (select sum (case when status = -6 then 1 else 0 end) from today_events ii where ii.unit_id = oo.unit_id and ii.event_time >= oo.first_event) as live_answerfrom unit_id_list ooorder by unit_id;
Cheers,
Steve

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Statistics query

Statistics query

Re: Statistics query

Re: Statistics query

Re: SOLVED Statistics query