Re: Implementing STDDEV and VARIANCE
От | Bruce Momjian |
---|---|
Тема | Re: Implementing STDDEV and VARIANCE |
Дата | |
Msg-id | 200006080257.WAA09717@candle.pha.pa.us обсуждение исходный текст |
Ответ на | Implementing STDDEV and VARIANCE (Jeroen van Vianen <jeroen@design.nl>) |
Ответы |
Re: Implementing STDDEV and VARIANCE
(JanWieck@t-online.de (Jan Wieck))
|
Список | pgsql-hackers |
Has this gone anywhere? > I'd like to implement stddev and variance aggregates in Postgres. This is a > long standing TODO item. > > There already has been some discussion on implementing this, see > http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html > > There are two definitions for standard deviation and variance: > _ > population variance = sigma^2 = SUM(X - X)^2 / N > > population stddev = sqrt(population variance) > _ > sample variance = s^2 = SUM(X - X)^2 / (N-1) > > sample stddev = sqrt(sample variance) > > These statistics can be calculated in one pass when three variables are > kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only > two variables are kept. > > E.g. avg() is calculated as follows: > > sx = 0 > n = 0 > for every row { > sx = sx + value in row // transition function 1 > n = n+1 // transition function 2 > } > avg = sum(x) / n > > stddev / variance might be calculated as follows: > > sx = 0 > n = 0 > sx2 = 0 > for every row { > sx = sx + value in row // transition function 1 > n = n+1 // transition function 2 > sx2 = sx2 + value in row^2 // transition function 3 > } > var = (1/n) * (sx2 - (1/n) * sx^2) // Population > > or > > var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample > > and > > stddev = sqrt(var) > > I've looked through the code and the following things need to be implemented: > > 1. Add three columns to pg_aggregate for the additional third transition > function. > > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400: > >All that you need to implement this is room to keep two running > >sums instead of one. I haven't looked at pgsql's aggregate functions, > >but I'd hope that the working state can be a struct not just a > >single number. > > I saw no other way than adding another transition function and logic, as > this might break user-defined aggregates (are there any around?). > > 2. Add logic to nodeAgg.c to execute the third transition function and > finalize function with three rather than two parameters > 3. Add functions f(a,b) that returns a + b^2 for selected types > 4. Add four finalize functions to calculate the variance / stddev > 5. Update the code for create aggregate, to include the definition of the > third transition function > 6. Update the documentation > > My questions are: > 1. Is this the correct way to continue? What am I missing? Any errors in my > reasoning? > 2. I am proposing the names stddev(x) and variance(x) for population and > samplestddev(x) and > samplevariance(x) for sample statistics. Any comments? > 3. I'm planning to implement this for types float4, float8 and numeric. Any > other types also? int[2,4,8] don't seem logical, as these would introduce > serious rounding errors. > > Let me know what you think, > > > Jeroen > > ************ > -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
В списке pgsql-hackers по дате отправления: