Re: [GENERAL] Approach to Data Summary and Analysis

Поиск
Список
Период
Сортировка
От Chris Curvey
Тема Re: [GENERAL] Approach to Data Summary and Analysis
Дата
Msg-id CADfwSsD1w-p3SH=woCPrcbrzkJ0mzJmSzfT61JSCaTk9y-jL0w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [GENERAL] Approach to Data Summary and Analysis  (Chris Curvey <chris@chriscurvey.com>)
Ответы Re: [GENERAL] Approach to Data Summary and Analysis
Список pgsql-performance



On Tue, Apr 15, 2014 at 10:56 AM, Chris Curvey <chris@chriscurvey.com> wrote:
On Mon, Apr 14, 2014 at 12:27 PM, Robert DiFalco <robert.difalco@gmail.com> wrote:
I have several related tables that represent a call state. Let's think of these as phone calls to simplify things. Sometimes I need to determine the last time a user was called, the last time a user answered a call, or the last time a user completed a call. 

The basic schema is something like this:

CREATE TABLE calls (
  id      BIGINT NOT NULL, // sequence generator
  user_id BIGINT NOT NULL,
  called  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

  PRIMARY KEY (id),
  FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

CREATE TABLE calls_answered (
  id        BIGINT NOT NULL,
  answered  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,

  PRIMARY KEY (id),
  FOREIGN KEY (id) REFERENCES calls(id) ON DELETE CASCADE
);

And so on for calls_connected, calls_completed, call_errors, etc.

Occasionally I will want to know things like "When was the last time a user answered a call" or "How many times has a user been called". 

I can do these queries using a combination of MAX or COUNT. But I'm concerned about the performance. 

SELECT MAX(a.id
FROM calls_answered a JOIN calls c ON c.id = a.id 
WHERE c.user_id = ?;

Or the number of answered calls:

SELECT MAX(a.id
FROM calls_answered a JOIN calls c ON c.id = a.id 
WHERE c.user_id = ?;

Sometimes I might want to get this data for a whole bunch of users. For example, "give me all users whose have not answered a call in the last 5 days." Or even "what percentage of users called actually answered a call." This approach could become a performance issue. So the other option is to create a call_summary table that is updated with triggers.

The summary table would need fields like "user_id", "last_call_id", "call_count", "last_answered_id", "answered_count", "last_completed_id", "last_completed_count", etc.

My only issue with a summary table is that I don't want a bunch of null fields. For example, if the user was called but they have never answered at call then the last_call_id and call_count fields on the summary table would be non-NULL but the last_answer_id and answer_count fields WOULD be NULL. But over time all fields would eventually become non-NULL.

So that leads me to a summary table for EACH call state. Each summary table would have a user id, a ref_id, and a count -- one summary table for each state e.g. call_summary, call_answered_summary, etc.   

This approach has the down side that it creates a lot of tables and triggers. It has the upside of being pretty efficient without having to deal with NULL values.  It's also pretty easy to reason about. 

So for my question -- is the choice between these a personal preference sort of thing or is there a right or wrong approach? Am I missing another approach that would be better?  I'm okay with SQL but I'm not expert so I'm not sure if there is an accepted DESIGN PATTERN for this that I am missing. 

Thanks!



(Sorry, fat-fingered and hit "send too early"...)

CREATE TABLE calls (
  id      BIGINT NOT NULL, // sequence generator
  user_id BIGINT NOT NULL,
  called  TIMESTAMPTZ NOT NULL DEFAULT CURRENT_TIMESTAMP,
  answered  TIMESTAMPTZ NULL,
  connected  TIMESTAMPTZ NULL,
  completed  TIMESTAMPTZ NULL,


  PRIMARY KEY (id),
  FOREIGN KEY (user_id) REFERENCES my_users(id) ON DELETE CASCADE
);

Then  your queries end up looking like this:

--last time john answered
SELECT MAX(a.id
FROM calls
where answered is not null
and user_id = ?

-- users that have not answered a call in the last five days (I can think of a few ways to interpret that phrase)
select myusers.*
from myusers
where not exists
(  select *
  from calls
  where calls.user_id = myusers.user_id
  and answered >= <five days ago>)

-- average ring time
select avg(extract ('seconds' from called - answered))
where answered is not null



--
I asked the Internet how to train my cat, and the Internet told me to get a dog.

В списке pgsql-performance по дате отправления:

Предыдущее
От: Chris Curvey
Дата:
Сообщение: Re: [GENERAL] Approach to Data Summary and Analysis
Следующее
От: Robert DiFalco
Дата:
Сообщение: Re: [GENERAL] Approach to Data Summary and Analysis