Обсуждение: Re: [HACKERS] Standard Deviation function.

Поиск
Список
Период
Сортировка

Re: [HACKERS] Standard Deviation function.

От
Andreas Zeugswetter
Дата:
David Gould wrote:
>The Perl Module "Statistics/Descriptive" has on the fly variance calculation.
>
>sub add_data {
>  my $self = shift;  ##Myself
>  my $oldmean;
>  my ($min,$mindex,$max,$maxdex);
>
>  ##Take care of appending to an existing data set
>  $min    = (defined ($self->{min}) ? $self->{min} : $_[0]);
>  $max    = (defined ($self->{max}) ? $self->{max} : $_[0]);
>  $maxdex = $self->{maxdex} || 0;
>  $mindex = $self->{mindex} || 0;
>
>  ##Calculate new mean, pseudo-variance, min and max;
>  foreach (@_) {
>    $oldmean = $self->{mean};
>    $self->{sum} += $_;
>    $self->{count}++;
>    if ($_ >= $max) {
>      $max = $_;
>      $maxdex = $self->{count}-1;
>    }
>    if ($_ <= $min) {
>      $min = $_;
>      $mindex = $self->{count}-1;
>    }
>    $self->{mean} += ($_ - $oldmean) / $self->{count};
>    $self->{pseudo_variance} += ($_ - $oldmean) * ($_ - $self->{mean});
>  }
>
>  $self->{min}          = $min;
>  $self->{mindex}       = $mindex;
>  $self->{max}          = $max;
>  $self->{maxdex}       = $maxdex;
>  $self->{sample_range} = $self->{max} - $self->{min};
>  if ($self->{count} > 1) {
>    $self->{variance}     = $self->{pseudo_variance} / ($self->{count} -1);
>    $self->{standard_deviation}  = sqrt( $self->{variance});

Wow, this is it. But as I said, the above line is wrong (By the way: this is a very common mistake).
It should read:
    $self->{standard_deviation}  = sqrt( $self->{pseudo_variance} / $self->{count} )
Note: The - 1 is missing

>  }
>  return 1;
>}



Re: [HACKERS] Standard Deviation function.

От
Tom Lane
Дата:
Andreas Zeugswetter <andreas.zeugswetter@telecom.at> writes:
> Wow, this is it. But as I said, the above line is wrong (By the way:
> this is a very common mistake).
> It should read:
>     $self->{standard_deviation}  = sqrt( $self->{pseudo_variance} / $self->{count} )
> Note: The - 1 is missing

The formula with N-1 in the divisor is correct for the "sample standard
deviation".  That is what you use when your N data points represent a
sample from a larger population, and you want to estimate the standard
deviation of the whole population.

If your N data points in fact are the *whole* population of interest,
then you calculate the "population standard deviation" which has just N
in the divisor.  So both versions of the formula are correct depending
on the situation, and you really ought to provide both.

(To justify the difference intuitively: if you have exactly one data
point, and it is the *whole* population, then the mean equals the
data value and the standard deviation is zero.  That is what you get
with N in the divisor.  But if your one data point is a sample from
a larger population, you cannot estimate the population's standard
deviation; you need more data.  The N-1 equation gives 0/0 in this
case, correctly signifying that the value is indeterminate.)

I think the Perl code given earlier in the thread pretty much sucks
from a numerical accuracy point of view.  The running mean calculation
suffers from accumulation of errors, and that propagates into the
pseudo-variance in a big way.  It's particularly bad if the data is
tightly clustered about the mean; the code ends up doing lots of
subtractions of nearly equal values.

The accepted way to do sample standard deviation in one pass is this:

STDDEV = SQRT( (N*SIGMA(Xi^2) - SIGMA(Xi)^2) / (N*(N-1)) )

where N is the number of data points and SIGMA(Xi) means the sum
of the data values Xi.  You keep running sums of Xi and Xi^2 as
you pass over the data, then you apply the above equation once
at the end.  (For population standard deviation, you use N^2 as
the denominator.  For variance, you just leave off the SQRT().)

All that you need to implement this is room to keep two running
sums instead of one.  I haven't looked at pgsql's aggregate functions,
but I'd hope that the working state can be a struct not just a
single number.

            regards, tom lane