Re: [HACKERS] Implementing STDDEV and VARIANCE
От | Jeroen van Vianen |
---|---|
Тема | Re: [HACKERS] Implementing STDDEV and VARIANCE |
Дата | |
Msg-id | 388B677F.381B10E4@design.nl обсуждение исходный текст |
Ответ на | Implementing STDDEV and VARIANCE (Jeroen van Vianen <jeroen@design.nl>) |
Ответы |
Re: [HACKERS] Implementing STDDEV and VARIANCE
(Tom Lane <tgl@sss.pgh.pa.us>)
|
Список | pgsql-hackers |
Tom Lane wrote: > Jeroen van Vianen <jeroen@design.nl> writes: > > I'd like to implement stddev and variance aggregates in Postgres. This is a > > long standing TODO item. > > > 1. Add three columns to pg_aggregate for the additional third transition > > function. > > > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400: > >> All that you need to implement this is room to keep two running > >> sums instead of one. I haven't looked at pgsql's aggregate functions, > >> but I'd hope that the working state can be a struct not just a > >> single number. > > > I saw no other way than adding another transition function and logic, as > > this might break user-defined aggregates (are there any around?). > > Yes, there are some, and no you do not need a third transition > function. What you do need is a datatype holding two values that > you can use as the transition datatype, plus appropriate functions > for the transition functions. So it might be better to have this type hold three values (n, sum(x) and sum(x^2)) and only use one transition function to update all three values at once and have the finalization function do the necessary calculations. > The reason there are two transition functions at all is that it allows > some of the standard aggregate functions to be built using arithmetic > functions that exist anyway --- for example, float8 AVG is built from > float8 addition, float8 increment, and float8 divide, with only float8 > increment being a function you wouldn't have anyway. However, the > whole thing is really a kluge; nodeAvg.c has all sorts of weird little > hacks that are necessary to make AVG have the right behavior in boundary > conditions such as no-tuples-in. (A blind application of float8 divide > would lead to a divide-by-zero exception in that case.) These hacks > limit the ability of user-defined aggregates to control their behavior > at the boundary conditions. Nor can an aggregate control its response > to NULL data values; that's hardwired into nodeAvg.c as well. Yes, I saw these little hacks. And there are boundary conditions with stddev and variance with no rows and one row (for sample stddev and sample variance). > A cleaner solution would have just one transition function and one > transition data value, plus an optional finalization function that takes > only the one data value. For AVG the transition data type would be a > two-field struct and the transition function would update both fields. > This would halve the function-call overhead per tuple. We'd have to > provide specialized transition and finalization functions for AVG and > probably a couple of the other standard aggregates, but that would allow > us to rely on those functions to do the right things at the boundary > conditions; nodeAvg.c could stop foreclosing the choices. So you suggest changing all transition functions for 7.1 to keep all the state they need? > I have been thinking about proposing such a change along with the > function manager rewrite that is now planned for 7.1. That would be > a good time because user-defined aggregates would need to be revisited > anyway. Also, if the transition functions are to determine the behavior > for no-tuples and for NULL data values, they had better be able to pass > and return NULLs cleanly; which depends on the function manager rewrite. Are you suggesting also to change the lay-out of pg_attribute in 7.1 to something like this and do updates for all built-in types and aggregates? aggname aggowner aggtype aggtranstype [ n, sx, sx2 ] agginitfunction function that does ( n = 0, sx = 0.0, sx2 = 0.0 ) aggtransfunction function that does ( n = n + 1, sx = sx + x, sx2 = sx2 + x * x ) aggfinalizefunction function that returns (sx2 - (1/n) * sx * sx ) / n Might it be better for me to wait for 7.1 before implementing stddev and variance? > In short, I'd suggest thinking about implementing STDDEV with a > single transition function and transition data value. You'll need > specialized functions for it anyway, so I don't see that you're saving > any work by proposing a third transition function. What you will need > instead is a pg_type entry for the transition data type, but since that > data type needn't have any operators, there's not much work needed. OK, clear. > > 3. I'm planning to implement this for types float4, float8 and numeric. Any > > other types also? int[2,4,8] don't seem logical, as these would introduce > > serious rounding errors. > > I'd suggest just two basic implementations, with float8 and numeric > internal calculations respectively. Data of other numeric types can > be type-coerced to one of these (that might even happen automatically), > but the output would always be either float8 or numeric. (I don't think > float4 has enough precision to generate reliable stddev numbers except > in very narrow conditions...) OK, only float8 and numeric. Jeroen
В списке pgsql-hackers по дате отправления: