Re: [HACKERS] Implementing STDDEV and VARIANCE
От | Tom Lane |
---|---|
Тема | Re: [HACKERS] Implementing STDDEV and VARIANCE |
Дата | |
Msg-id | 6270.948652021@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Implementing STDDEV and VARIANCE (Jeroen van Vianen <jeroen@design.nl>) |
Ответы |
Re: [HACKERS] Implementing STDDEV and VARIANCE
(Tom Lane <tgl@sss.pgh.pa.us>)
|
Список | pgsql-hackers |
Jeroen van Vianen <jeroen@design.nl> writes: > I'd like to implement stddev and variance aggregates in Postgres. This is a > long standing TODO item. > 1. Add three columns to pg_aggregate for the additional third transition > function. > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400: >> All that you need to implement this is room to keep two running >> sums instead of one. I haven't looked at pgsql's aggregate functions, >> but I'd hope that the working state can be a struct not just a >> single number. > I saw no other way than adding another transition function and logic, as > this might break user-defined aggregates (are there any around?). Yes, there are some, and no you do not need a third transition function. What you do need is a datatype holding two values that you can use as the transition datatype, plus appropriate functions for the transition functions. The reason there are two transition functions at all is that it allows some of the standard aggregate functions to be built using arithmetic functions that exist anyway --- for example, float8 AVG is built from float8 addition, float8 increment, and float8 divide, with only float8 increment being a function you wouldn't have anyway. However, the whole thing is really a kluge; nodeAvg.c has all sorts of weird little hacks that are necessary to make AVG have the right behavior in boundary conditions such as no-tuples-in. (A blind application of float8 divide would lead to a divide-by-zero exception in that case.) These hacks limit the ability of user-defined aggregates to control their behavior at the boundary conditions. Nor can an aggregate control its response to NULL data values; that's hardwired into nodeAvg.c as well. A cleaner solution would have just one transition function and one transition data value, plus an optional finalization function that takes only the one data value. For AVG the transition data type would be a two-field struct and the transition function would update both fields. This would halve the function-call overhead per tuple. We'd have to provide specialized transition and finalization functions for AVG and probably a couple of the other standard aggregates, but that would allow us to rely on those functions to do the right things at the boundary conditions; nodeAvg.c could stop foreclosing the choices. I have been thinking about proposing such a change along with the function manager rewrite that is now planned for 7.1. That would be a good time because user-defined aggregates would need to be revisited anyway. Also, if the transition functions are to determine the behavior for no-tuples and for NULL data values, they had better be able to pass and return NULLs cleanly; which depends on the function manager rewrite. In short, I'd suggest thinking about implementing STDDEV with a single transition function and transition data value. You'll need specialized functions for it anyway, so I don't see that you're saving any work by proposing a third transition function. What you will need instead is a pg_type entry for the transition data type, but since that data type needn't have any operators, there's not much work needed. > 3. I'm planning to implement this for types float4, float8 and numeric. Any > other types also? int[2,4,8] don't seem logical, as these would introduce > serious rounding errors. I'd suggest just two basic implementations, with float8 and numeric internal calculations respectively. Data of other numeric types can be type-coerced to one of these (that might even happen automatically), but the output would always be either float8 or numeric. (I don't think float4 has enough precision to generate reliable stddev numbers except in very narrow conditions...) regards, tom lane
В списке pgsql-hackers по дате отправления: