Re: [HACKERS] Implementing STDDEV and VARIANCE

Поиск
Список
Период
Сортировка
От Jeroen van Vianen
Тема Re: [HACKERS] Implementing STDDEV and VARIANCE
Дата
Msg-id 388B677F.381B10E4@design.nl
обсуждение исходный текст
Ответ на Implementing STDDEV and VARIANCE  (Jeroen van Vianen <jeroen@design.nl>)
Ответы Re: [HACKERS] Implementing STDDEV and VARIANCE  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Tom Lane wrote:
> Jeroen van Vianen <jeroen@design.nl> writes:
> > I'd like to implement stddev and variance aggregates in Postgres. This is a
> > long standing TODO item.
> 
> > 1. Add three columns to pg_aggregate for the additional third transition
> > function.
> 
> > Tom Lane wrote at Fri, 05 Jun 1998 11:24:04 -0400:
> >> All that you need to implement this is room to keep two running
> >> sums instead of one.  I haven't looked at pgsql's aggregate functions,
> >> but I'd hope that the working state can be a struct not just a
> >> single number.
> 
> > I saw no other way than adding another transition function and logic, as
> > this might break user-defined aggregates (are there any around?).
> 
> Yes, there are some, and no you do not need a third transition
> function.  What you do need is a datatype holding two values that
> you can use as the transition datatype, plus appropriate functions
> for the transition functions.

So it might be better to have this type hold three values (n, sum(x) and
sum(x^2)) and only use one transition function to update all three
values at once and have the finalization function do the necessary
calculations.

> The reason there are two transition functions at all is that it allows
> some of the standard aggregate functions to be built using arithmetic
> functions that exist anyway --- for example, float8 AVG is built from
> float8 addition, float8 increment, and float8 divide, with only float8
> increment being a function you wouldn't have anyway.  However, the
> whole thing is really a kluge; nodeAvg.c has all sorts of weird little
> hacks that are necessary to make AVG have the right behavior in boundary
> conditions such as no-tuples-in.  (A blind application of float8 divide
> would lead to a divide-by-zero exception in that case.)  These hacks
> limit the ability of user-defined aggregates to control their behavior
> at the boundary conditions.  Nor can an aggregate control its response
> to NULL data values; that's hardwired into nodeAvg.c as well.

Yes, I saw these little hacks. And there are boundary conditions with
stddev and variance with no rows and one row (for sample stddev and
sample variance).

> A cleaner solution would have just one transition function and one
> transition data value, plus an optional finalization function that takes
> only the one data value.  For AVG the transition data type would be a
> two-field struct and the transition function would update both fields.
> This would halve the function-call overhead per tuple.  We'd have to
> provide specialized transition and finalization functions for AVG and
> probably a couple of the other standard aggregates, but that would allow
> us to rely on those functions to do the right things at the boundary
> conditions; nodeAvg.c could stop foreclosing the choices.

So you suggest changing all transition functions for 7.1 to keep all the
state they need?

> I have been thinking about proposing such a change along with the
> function manager rewrite that is now planned for 7.1.  That would be
> a good time because user-defined aggregates would need to be revisited
> anyway.  Also, if the transition functions are to determine the behavior
> for no-tuples and for NULL data values, they had better be able to pass
> and return NULLs cleanly; which depends on the function manager rewrite.

Are you suggesting also to change the lay-out of pg_attribute in 7.1 to
something like this and do updates for all built-in types and
aggregates?

aggname                
aggowner
aggtype
aggtranstype           [ n, sx, sx2 ]
agginitfunction        function that does ( n = 0, sx = 0.0, sx2 = 0.0 )
aggtransfunction       function that does ( n = n + 1, sx = sx + x,                        sx2 = sx2 + x * x )
aggfinalizefunction    function that returns (sx2 - (1/n) * sx * sx ) /
n

Might it be better for me to wait for 7.1 before implementing stddev and
variance?

> In short, I'd suggest thinking about implementing STDDEV with a
> single transition function and transition data value.  You'll need
> specialized functions for it anyway, so I don't see that you're saving
> any work by proposing a third transition function.  What you will need
> instead is a pg_type entry for the transition data type, but since that
> data type needn't have any operators, there's not much work needed.

OK, clear.

> > 3. I'm planning to implement this for types float4, float8 and numeric. Any
> > other types also? int[2,4,8] don't seem logical, as these would introduce
> > serious rounding errors.
> 
> I'd suggest just two basic implementations, with float8 and numeric
> internal calculations respectively.  Data of other numeric types can
> be type-coerced to one of these (that might even happen automatically),
> but the output would always be either float8 or numeric.  (I don't think
> float4 has enough precision to generate reliable stddev numbers except
> in very narrow conditions...)

OK, only float8 and numeric.


Jeroen


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: [HACKERS] Happy column dropping
Следующее
От: Bruce Momjian
Дата:
Сообщение: Re: [HACKERS] Happy column dropping