Discussion: Better way to compute moving averages?

Better way to compute moving averages?

From
Jason Aleski
Date:
I've been asked to compute various moving averages of end of day sales by
store.  I can do this for all rows with no problem (same query without 
the WHERE clause towards the end).  That query took 10-15 minutes to run 
over approx 3.4 million rows.  I'm sure they will want this information 
to be added to the daily end of day reports. I can run the query below 
(excluding the WHERE clause) but it takes almost as long to run one day 
as it does the entire dataset.  It looks like when I do the inner 
select, it is still running over the entire dataset.  I have added a 
"WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '365 days'" (as below) to 
the inner query, which allows the query to run in 1-2 minutes.

Question 1)  This seems to work, but I was curious if there is a better way.

Question 2)  Is there a way to specify a date as a variable, instead of
using current_date and current_timestamp, and use that in the query?  I
know I can do that in my Java program using variables, but I wasn't sure
if there was a way to do this with a function or stored procedure.


INSERT INTO historical_data_avg (store_id, date, avg7sales, avg14sales,
    avg30sales, avg60sales, avg90sales, avg180sales)
(
  SELECT t1.store_id, t1.eod_ts, t1.avg5sales, t1.avg10sales,
         t1.avg20sales, t1.avg50sales, t1.avg100sales, t1.avg180sales
  FROM (
    SELECT
      store_id,
      eod_ts,
      avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
        ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING) AS avg5sales,
      avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
        ROWS BETWEEN CURRENT ROW AND 9 FOLLOWING) AS avg10sales,
      avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
        ROWS BETWEEN CURRENT ROW AND 19 FOLLOWING) AS avg20sales,
      avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
        ROWS BETWEEN CURRENT ROW AND 49 FOLLOWING) AS avg50sales,
      avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
        ROWS BETWEEN CURRENT ROW AND 99 FOLLOWING) AS avg100sales,
      avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
        ROWS BETWEEN CURRENT ROW AND 179 FOLLOWING) AS avg180sales
    FROM end_of_day_data
    WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '260 days'
    GROUP BY store_id, eod_ts, eod_sales
    ORDER BY ticker_id, eod_ts
  ) AS t1
  WHERE t1.eod_ts = current_date
);



Re: Better way to compute moving averages?

From
David G Johnston
Date:
Jason Aleski wrote
> I've been asked to compute various moving averages of end of day sales by
> store.  I can do this for all rows with no problem (same query without 
> the WHERE clause towards the end).  That query took 10-15 minutes to run 
> over approx 3.4 million rows.  I'm sure they will want this information 
> to be added to the daily end of day reports. I can run the query below 
> (excluding the WHERE clause) but it takes almost as long to run one day 
> as it does the entire dataset.  It looks like when I do the inner 
> select, it is still running over the entire dataset.  I have added a 
> "WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '365 days'" (as below) to 
> the inner query, which allows the query to run in 1-2 minutes.
> 
> Question 1)  This seems to work, but I was curious if there is a better way.

Possibly, though I'm not sure you've given enough info...

Regardless, are you positive you need "FOLLOWING" instead of "PRECEDING"
frames?
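
For what it's worth: since you order by eod_ts DESC, "CURRENT ROW AND 4
FOLLOWING" should cover the same rows as "4 PRECEDING AND CURRENT ROW" over
eod_ts ASC.  A minimal, untested sketch against your end_of_day_data table to
illustrate (column aliases are just for the example):

-- Both columns should average the same five rows per store: the current
-- day plus the four most recent earlier days (ignoring ties on eod_ts).
SELECT store_id,
       eod_ts,
       avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
         ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING) AS avg5_desc,
       avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts ASC
         ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS avg5_asc
FROM end_of_day_data;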


> Question 2)  Is there a way to specify a date as a variable, instead of
> using current_date and current_timestamp, and use that in the query?  I
> know I can do that in my Java program using variables, but I wasn't sure
> if there was a way to do this with a function or stored procedure.

Have you read about functions in the documentation?

CREATE FUNCTION echo_text(input_var text)
RETURNS text
AS $$
SELECT input_var;
$$
LANGUAGE sql
STRICT
;
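
Applied to your case, you could wrap the INSERT in a function that takes the
report date as a parameter and use it in place of current_date /
CURRENT_TIMESTAMP.  A rough, untested sketch - the function name is made up,
and only the 5-row average is shown; the other windows and target columns
follow the same pattern:

-- Hypothetical wrapper: pass in the day you want instead of relying on
-- current_date inside the query.
CREATE FUNCTION insert_daily_averages(report_date date)
RETURNS void
AS $$
  INSERT INTO historical_data_avg (store_id, date, avg7sales)
  SELECT t1.store_id, t1.eod_ts, t1.avg5sales
  FROM (
    SELECT store_id,
           eod_ts,
           avg(eod_sales) OVER (PARTITION BY store_id ORDER BY eod_ts DESC
             ROWS BETWEEN CURRENT ROW AND 4 FOLLOWING) AS avg5sales
    FROM end_of_day_data
    WHERE eod_ts >  report_date - INTERVAL '260 days'
      AND eod_ts <  report_date + INTERVAL '1 day'
  ) AS t1
  WHERE t1.eod_ts::date = report_date;
$$
LANGUAGE sql;

-- The nightly job (or a backfill) would then just call, e.g.:
-- SELECT insert_daily_averages(current_date);
-- SELECT insert_daily_averages(DATE '2015-01-22');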


> It looks like when I do the inner 
> select, it is still running over the entire dataset.
> I have added a 
> "WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '365 days'" (as below) to 
> the inner query, which allows the query to run between 1-2 minutes.
> [...]
>      FROM end_of_day_data
>      WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '260 days'
>      GROUP BY store_id, eod_ts, eod_sales
>      ORDER BY ticker_id, eod_ts
>    ) as t1
>    WHERE t1.eod_ts = current_date
> );

Did you expect differently?  GROUP BY (among other things) prevents any kind
of push-down of quals into the subquery.

Try something like this (not tested or tried - just an idea I had this moment):

WITH scope_of_source_with_aging AS (
SELECT eod_ts,
first(eod_ts) OVER () AS ref_eod, 
eod_ts - first(eod_ts) OVER (ORDER BY eod_ts) AS eod_age
FROM end_of_day_data WHERE eod_ts >= current_timestamp - interval '206 days'
)
SELECT ref_eod, 
avg(CASE WHEN eod_age BETWEEN 0 AND 7 THEN ... ELSE NULL END) AS
average_7day
[, ...]
FROM scope_of_source_with_aging
GROUP BY ref_eod
;

The first, and most important, part is to limit your source data to only the
desired time period.  The rest of the above simply tries to limit how much
work needs to be done.  Window functions, while great, are fairly slow, so
if you can minimize their use - in this case by limiting them to simply
calculating an age - it should help performance.

The concept above should be applicable to your situation generally but you
will need to clarify your needs and then modify the WHERE and window clauses
as appropriate.

I also don't recall if there is a native "first" function though you can
either write one yourself or possibly add an appropriate range clause that
will accomplish the same goal.
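
Thinking about it more, the built-in first_value() window function might
cover it.  An untested variant of the sketch above, with eod_sales carried
through and the timestamps cast to date so eod_age comes out as whole days:

-- Same shape as the earlier sketch, but with first_value() in place of a
-- hand-rolled first() aggregate.
WITH scope_of_source_with_aging AS (
  SELECT eod_ts,
         eod_sales,
         first_value(eod_ts) OVER (ORDER BY eod_ts) AS ref_eod,
         eod_ts::date - first_value(eod_ts::date) OVER (ORDER BY eod_ts) AS eod_age
  FROM end_of_day_data
  WHERE eod_ts >= current_timestamp - interval '206 days'
)
SELECT ref_eod,
       avg(CASE WHEN eod_age BETWEEN 0 AND 7 THEN eod_sales ELSE NULL END) AS average_7day
FROM scope_of_source_with_aging
GROUP BY ref_eod;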

David J.






--
View this message in context:
http://postgresql.nabble.com/Better-way-to-compute-moving-averages-tp5835291p5835295.html
Sent from the PostgreSQL - sql mailing list archive at Nabble.com.



Re: Better way to compute moving averages?

From
Jason Aleski
Date:
On 1/23/2015 6:50 PM, David G Johnston wrote:
> Jason Aleski wrote
>> I've been asked to compute various moving averages of end of day sales by
>> store.  I can do this for all rows with no problem (same query without
>> the WHERE clause towards the end).  That query took 10-15 minutes to run
>> over approx 3.4 million rows.  I'm sure they will want this information
>> to be added to the daily end of day reports. I can run the query below
>> (excluding the WHERE clause) but it takes almost as long to run one day
>> as it does the entire dataset.  It looks like when I do the inner
>> select, it is still running over the entire dataset.  I have added a
>> "WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '365 days'" (as below) to
>> the inner query, which allows the query to run in 1-2 minutes.
>>
>> Question 1)  This seems to work, but I was curious if there is a better way.
> Possibly, though I'm not sure you've given enough info...
>
> Regardless, are you positive you need "FOLLOWING" instead of "PRECEDING"
> frames?
>
>
>> Question 2)  Is there a way to specify a date as a variable, instead of
>> using current_date and current_timestamp, and use that in the query?  I
>> know I can do that in my Java program using variables, but I wasn't sure
>> if there was a way to do this with a function or stored procedure.
> Have you read about functions in the documentation?
>
> CREATE FUNCTION echo_text(input_var text)
> RETURNS text
> AS $$
> SELECT input_var;
> $$
> LANGUAGE sql
> STRICT
> ;
>
>
>> It looks like when I do the inner
>> select, it is still running over the entire dataset.
>> I have added a
>> "WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '365 days'" (as below) to
>> the inner query, which allows the query to run between 1-2 minutes.
>> [...]
>>       FROM end_of_day_data
>>       WHERE eod_ts > CURRENT_TIMESTAMP - INTERVAL '260 days'
>>       GROUP BY store_id, eod_ts, eod_sales
>>       ORDER BY ticker_id, eod_ts
>>     ) as t1
>>     WHERE t1.eod_ts = current_date
>> );
> Did you expect differently?  GROUP BY (among other things) prevents any kind
> of push-down of quals into the subquery.
>
> Try something like this (not tested or tried - just an idea I had this moment):
>
> WITH scope_of_source_with_aging AS (
> SELECT eod_ts,
> first(eod_ts) OVER () AS ref_eod,
> eod_ts - first(eod_ts) OVER (ORDER BY eod_ts) AS eod_age
> FROM end_of_day_data WHERE eod_ts >= current_timestamp - interval '206 days'
> )
> SELECT ref_eod,
> avg(CASE WHEN eod_age BETWEEN 0 AND 7 THEN ... ELSE NULL END) AS
> average_7day
> [, ...]
> FROM scope_of_source_with_aging
> GROUP BY ref_eod
> ;
>
> The first, and most important, part is to limit your source data to only the
> desired time period.  The rest of the above simply tries to limit how much
> work needs to be done.  Window functions, while great, are fairly slow, so
> if you can minimize their use - in this case by limiting them to simply
> calculating an age - it should help performance.
>
> The concept above should be applicable to your situation generally but you
> will need to clarify your needs and then modify the WHERE and window clauses
> as appropriate.
>
> I also don't recall if there is a native "first" function though you can
> either write one yourself or possibly add an appropriate range clause that
> will accomplish the same goal.
>
> David J.
>
>
>
>
>
>
> --
> View this message in context:
http://postgresql.nabble.com/Better-way-to-compute-moving-averages-tp5835291p5835295.html
> Sent from the PostgreSQL - sql mailing list archive at Nabble.com.
>
>
David J.,

Thanks for the info.

1)  Yes, I'm pretty sure I need FOLLOWING because I ordered the eod_ts
(timestamp) column using DESC.

2)  The only functions I've created in the past have primarily related to
queries that don't return any data - things such as "purge duplicate
rows".  Nothing that used input and output; only functions that execute
tasks on the backend.

Also, I'll look at the WITH method again.  I saw that in an earlier 
search, but didn't fully understand it and looked for another way.

In both cases, thanks for the reassurance and for pointing me in a
direction to explore.

Regards,
JA