Implementing STDDEV and VARIANCE
От | Jeroen van Vianen |
---|---|
Тема | Implementing STDDEV and VARIANCE |
Дата | |
Msg-id | 4.2.0.58.20000123170505.00954560@mail.design.nl обсуждение исходный текст |
Ответы |
Re: [HACKERS] Implementing STDDEV and VARIANCE
(Tom Lane <tgl@sss.pgh.pa.us>)
Re: Implementing STDDEV and VARIANCE (Bruce Momjian <pgman@candle.pha.pa.us>) Re: Implementing STDDEV and VARIANCE (Bruce Momjian <pgman@candle.pha.pa.us>) |
Список | pgsql-hackers |
I'd like to implement stddev and variance aggregates in Postgres. This is a long standing TODO item. There already has been some discussion on implementing this, see http://www.postgresql.org/mhonarc/pgsql-hackers/1998-06/msg00175.html There are two definitions for standard deviation and variance: _ population variance = sigma^2 = SUM(X - X)^2 / N population stddev = sqrt(population variance) _ sample variance = s^2 = SUM(X - X)^2 / (N-1) sample stddev = sqrt(sample variance) These statistics can be calculated in one pass when three variables are kept when scanning through the rows, n, sum(x), sum(x^2). Currently, only two variables are kept. E.g. avg() is calculated as follows: sx = 0 n = 0 for every row {sx = sx + value in row // transition function 1n = n+1 // transition function 2 } avg = sum(x) / n stddev / variance might be calculated as follows: sx = 0 n = 0 sx2 = 0 for every row {sx = sx + value in row // transition function 1n = n+1 // transition function 2sx2 = sx2 + value in row^2 // transition function 3 } var = (1/n) * (sx2 - (1/n) * sx^2) // Population or var = (1/(n-1)) * (sx2 - (1/n) * sx^2) // Sample and stddev = sqrt(var) I've looked through the code and the following things need to be implemented: 1. Add three columns to pg_aggregate for the additional third transition function. Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:>All that you need to implement this is room to keep two running>sums insteadof one. I haven't looked at pgsql's aggregate functions,>but I'd hope that the working state can be a struct notjust a>single number. I saw no other way than adding another transition function and logic, as this might break user-defined aggregates (are there any around?). 2. Add logic to nodeAgg.c to execute the third transition function and finalize function with three rather than two parameters 3. Add functions f(a,b) that returns a + b^2 for selected types 4. Add four finalize functions to calculate the variance / stddev 5. Update the code for create aggregate, to include the definition of the third transition function 6. Update the documentation My questions are: 1. Is this the correct way to continue? What am I missing? Any errors in my reasoning? 2. I am proposing the names stddev(x) and variance(x) for population and samplestddev(x) and samplevariance(x) for sample statistics. Any comments? 3. I'm planning to implement this for types float4, float8 and numeric. Any other types also? int[2,4,8] don't seem logical, as these would introduce serious rounding errors. Let me know what you think, Jeroen
В списке pgsql-hackers по дате отправления:
Следующее
От: Tom LaneДата:
Сообщение: Re: pg_dump possible fix, need testers. (was: Re: [HACKERS] pg_dump disaster)