Re: Implementing STDDEV and VARIANCE

Поиск
Список
Период
Сортировка
От Bruce Momjian
Тема Re: Implementing STDDEV and VARIANCE
Дата
Msg-id 200006080257.WAA09717@candle.pha.pa.us
обсуждение исходный текст
Ответ на Implementing STDDEV and VARIANCE  (Jeroen van Vianen <jeroen@design.nl>)
Ответы Re: Implementing STDDEV and VARIANCE  (JanWieck@t-online.de (Jan Wieck))
Список pgsql-hackers
Has this gone anywhere?

> I'd like to implement stddev and variance aggregates in Postgres. This is a 
> long standing TODO item.
> 
> There already has been some discussion on implementing this, see 
> http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html
> 
> There are two definitions for standard deviation and variance:
>                                          _
> population variance = sigma^2 = SUM(X - X)^2 / N
> 
> population stddev = sqrt(population variance)
>                                  _
> sample variance = s^2 = SUM(X - X)^2 / (N-1)
> 
> sample stddev = sqrt(sample variance)
> 
> These statistics can be calculated in one pass when three variables are 
> kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only 
> two variables are kept.
> 
> E.g. avg() is calculated as follows:
> 
> sx = 0
> n = 0
> for every row {
>     sx = sx + value in row                   // transition function 1
>     n = n+1                                  // transition function 2
> }
> avg = sum(x) / n
> 
> stddev / variance might be calculated as follows:
> 
> sx = 0
> n = 0
> sx2 = 0
> for every row {
>     sx = sx + value in row                   // transition function 1
>     n = n+1                                  // transition function 2
>     sx2 = sx2 + value in row^2               // transition function 3
> }
> var = (1/n) * (sx2 - (1/n) * sx^2)              // Population
> 
> or
> 
> var = (1/(n-1)) * (sx2 - (1/n) * sx^2)          // Sample
> 
> and
> 
> stddev = sqrt(var)
> 
> I've looked through the code and the following things need to be implemented:
> 
> 1. Add three columns to pg_aggregate for the additional third transition 
> function.
> 
> Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
>  >All that you need to implement this is room to keep two running
>  >sums instead of one.  I haven't looked at pgsql's aggregate functions,
>  >but I'd hope that the working state can be a struct not just a
>  >single number.
> 
> I saw no other way than adding another transition function and logic, as 
> this might break user-defined aggregates (are there any around?).
> 
> 2. Add logic to nodeAgg.c to execute the third transition function and 
> finalize function with three rather than two parameters
> 3. Add functions f(a,b) that returns a + b^2 for selected types
> 4. Add four finalize functions to calculate the variance / stddev
> 5. Update the code for create aggregate, to include the definition of the 
> third transition function
> 6. Update the documentation
> 
> My questions are:
> 1. Is this the correct way to continue? What am I missing? Any errors in my 
> reasoning?
> 2. I am proposing the names stddev(x) and variance(x) for population and 
> samplestddev(x) and
> samplevariance(x) for sample statistics. Any comments?
> 3. I'm planning to implement this for types float4, float8 and numeric. Any 
> other types also? int[2,4,8] don't seem logical, as these would introduce 
> serious rounding errors.
> 
> Let me know what you think,
> 
> 
> Jeroen
> 
> ************
> 


--  Bruce Momjian                        |  http://www.op.net/~candle pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Bruce Momjian
Дата:
Сообщение: Re: column aliases
Следующее
От: Bruce Momjian
Дата:
Сообщение: Re: column aliases