Обсуждение: Definitional issue: stddev_pop (and related) for 1 input

Поиск
Список
Период
Сортировка

Definitional issue: stddev_pop (and related) for 1 input

От
Tom Lane
Дата:
Before v12, stddev_pop() had the following behavior with just a
single input value:

regression=# SELECT stddev_pop('42'::float8);
 stddev_pop 
------------
          0
(1 row)

regression=# SELECT stddev_pop('inf'::float8);
 stddev_pop 
------------
        NaN
(1 row)

regression=# SELECT stddev_pop('nan'::float8);
 stddev_pop 
------------
        NaN
(1 row)

As of v12, though, all three cases produce 0.  I am not sure what
to think about that with respect to an infinity input, but I'm
quite sure I don't like it for NaN input.

It looks like the culprit is the introduction of the "Youngs-Cramer"
algorithm in float8_accum: nothing is done to Sxx at the first iteration,
even if the input is inf or NaN.  I'd be inclined to force Sxx to NaN
when the first input is NaN, and perhaps also when it's Inf.
Alternatively we could clean up in the finalization routine by noting
that Sx is Inf/NaN, but that seems messier.  Thoughts?

(I came across this by noting that the results don't agree with
numeric accumulation, which isn't using Youngs-Cramer.)

            regards, tom lane



Re: Definitional issue: stddev_pop (and related) for 1 input

От
Tom Lane
Дата:
I wrote:
> Before v12, stddev_pop() had the following behavior with just a
> single input value:
> ...
> As of v12, though, all three cases produce 0.  I am not sure what
> to think about that with respect to an infinity input, but I'm
> quite sure I don't like it for NaN input.

While I'm still not sure whether there's an academic argument that
zero is a reasonable stddev value for a single input that is Inf,
it seems to me that backwards compatibility is a sufficient reason
for going back to producing NaN for that.

Hence, attached are some proposed patches.  0001 just adds test
cases demonstrating the current behavior; then 0002 makes the
proposed code change.  It's easy to check that the test case results
after 0002 match what v11 produces.

0003 deals with a different problem which I noted in [1]: the numeric
variants of var_samp and stddev_samp also do the wrong thing for a
single special input.  Their disease is that they produce NaN for a
single NaN input, where it seems more sensible to produce NULL.
At least, NULL is what we get for the same case with the float
aggregates, so we have to change one or the other set of functions
if we want consistency.

I propose back-patching 0001/0002 as far as v12, since the failure
to match the old outputs seems like a pretty clear bug/regression.
However, I'd be content to apply 0003 only to HEAD.  That misbehavior
is very ancient, and the lack of complaints suggests that few people
really care about this fine point.

            regards, tom lane

[1] https://www.postgresql.org/message-id/606717.1591924582%40sss.pgh.pa.us

diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index d659013e41..0a6884d382 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -127,7 +127,79 @@ SELECT var_samp(b::numeric) FROM aggtest;

 -- population variance is defined for a single tuple, sample variance
 -- is not
-SELECT var_pop(1.0), var_samp(2.0);
+SELECT var_pop(1.0::float8), var_samp(2.0::float8);
+ var_pop | var_samp
+---------+----------
+       0 |
+(1 row)
+
+SELECT stddev_pop(3.0::float8), stddev_samp(4.0::float8);
+ stddev_pop | stddev_samp
+------------+-------------
+          0 |
+(1 row)
+
+SELECT var_pop('inf'::float8), var_samp('inf'::float8);
+ var_pop | var_samp
+---------+----------
+       0 |
+(1 row)
+
+SELECT stddev_pop('inf'::float8), stddev_samp('inf'::float8);
+ stddev_pop | stddev_samp
+------------+-------------
+          0 |
+(1 row)
+
+SELECT var_pop('nan'::float8), var_samp('nan'::float8);
+ var_pop | var_samp
+---------+----------
+       0 |
+(1 row)
+
+SELECT stddev_pop('nan'::float8), stddev_samp('nan'::float8);
+ stddev_pop | stddev_samp
+------------+-------------
+          0 |
+(1 row)
+
+SELECT var_pop(1.0::float4), var_samp(2.0::float4);
+ var_pop | var_samp
+---------+----------
+       0 |
+(1 row)
+
+SELECT stddev_pop(3.0::float4), stddev_samp(4.0::float4);
+ stddev_pop | stddev_samp
+------------+-------------
+          0 |
+(1 row)
+
+SELECT var_pop('inf'::float4), var_samp('inf'::float4);
+ var_pop | var_samp
+---------+----------
+       0 |
+(1 row)
+
+SELECT stddev_pop('inf'::float4), stddev_samp('inf'::float4);
+ stddev_pop | stddev_samp
+------------+-------------
+          0 |
+(1 row)
+
+SELECT var_pop('nan'::float4), var_samp('nan'::float4);
+ var_pop | var_samp
+---------+----------
+       0 |
+(1 row)
+
+SELECT stddev_pop('nan'::float4), stddev_samp('nan'::float4);
+ stddev_pop | stddev_samp
+------------+-------------
+          0 |
+(1 row)
+
+SELECT var_pop(1.0::numeric), var_samp(2.0::numeric);
  var_pop | var_samp
 ---------+----------
        0 |
@@ -139,6 +211,18 @@ SELECT stddev_pop(3.0::numeric), stddev_samp(4.0::numeric);
           0 |
 (1 row)

+SELECT var_pop('nan'::numeric), var_samp('nan'::numeric);
+ var_pop | var_samp
+---------+----------
+     NaN |      NaN
+(1 row)
+
+SELECT stddev_pop('nan'::numeric), stddev_samp('nan'::numeric);
+ stddev_pop | stddev_samp
+------------+-------------
+        NaN |         NaN
+(1 row)
+
 -- verify correct results for null and NaN inputs
 select sum(null::int4) from generate_series(1,3);
  sum
@@ -299,6 +383,25 @@ SELECT corr(b, a) FROM aggtest;
  0.139634516517873
 (1 row)

+-- check single-tuple behavior
+SELECT covar_pop(1::float8,2::float8), covar_samp(3::float8,4::float8);
+ covar_pop | covar_samp
+-----------+------------
+         0 |
+(1 row)
+
+SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8);
+ covar_pop | covar_samp
+-----------+------------
+         0 |
+(1 row)
+
+SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
+ covar_pop | covar_samp
+-----------+------------
+         0 |
+(1 row)
+
 -- test accum and combine functions directly
 CREATE TABLE regr_test (x float8, y float8);
 INSERT INTO regr_test VALUES (10,150),(20,250),(30,350),(80,540),(100,200);
diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql
index 2a066f5a3a..044d515507 100644
--- a/src/test/regress/sql/aggregates.sql
+++ b/src/test/regress/sql/aggregates.sql
@@ -39,8 +39,22 @@ SELECT var_samp(b::numeric) FROM aggtest;

 -- population variance is defined for a single tuple, sample variance
 -- is not
-SELECT var_pop(1.0), var_samp(2.0);
+SELECT var_pop(1.0::float8), var_samp(2.0::float8);
+SELECT stddev_pop(3.0::float8), stddev_samp(4.0::float8);
+SELECT var_pop('inf'::float8), var_samp('inf'::float8);
+SELECT stddev_pop('inf'::float8), stddev_samp('inf'::float8);
+SELECT var_pop('nan'::float8), var_samp('nan'::float8);
+SELECT stddev_pop('nan'::float8), stddev_samp('nan'::float8);
+SELECT var_pop(1.0::float4), var_samp(2.0::float4);
+SELECT stddev_pop(3.0::float4), stddev_samp(4.0::float4);
+SELECT var_pop('inf'::float4), var_samp('inf'::float4);
+SELECT stddev_pop('inf'::float4), stddev_samp('inf'::float4);
+SELECT var_pop('nan'::float4), var_samp('nan'::float4);
+SELECT stddev_pop('nan'::float4), stddev_samp('nan'::float4);
+SELECT var_pop(1.0::numeric), var_samp(2.0::numeric);
 SELECT stddev_pop(3.0::numeric), stddev_samp(4.0::numeric);
+SELECT var_pop('nan'::numeric), var_samp('nan'::numeric);
+SELECT stddev_pop('nan'::numeric), stddev_samp('nan'::numeric);

 -- verify correct results for null and NaN inputs
 select sum(null::int4) from generate_series(1,3);
@@ -81,6 +95,11 @@ SELECT regr_slope(b, a), regr_intercept(b, a) FROM aggtest;
 SELECT covar_pop(b, a), covar_samp(b, a) FROM aggtest;
 SELECT corr(b, a) FROM aggtest;

+-- check single-tuple behavior
+SELECT covar_pop(1::float8,2::float8), covar_samp(3::float8,4::float8);
+SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8);
+SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
+
 -- test accum and combine functions directly
 CREATE TABLE regr_test (x float8, y float8);
 INSERT INTO regr_test VALUES (10,150),(20,250),(30,350),(80,540),(100,200);
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index 2101d58674..6a717f19bb 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -2925,6 +2925,17 @@ float8_accum(PG_FUNCTION_ARGS)
             Sxx = get_float8_nan();
         }
     }
+    else
+    {
+        /*
+         * At the first input, we normally can leave Sxx as 0.  However, if
+         * the first input is Inf or NaN, we'd better force Sxx to NaN;
+         * otherwise we will falsely report variance zero when there are no
+         * more inputs.
+         */
+        if (isnan(newval) || isinf(newval))
+            Sxx = get_float8_nan();
+    }

     /*
      * If we're invoked as an aggregate, we can cheat and modify our first
@@ -2999,6 +3010,17 @@ float4_accum(PG_FUNCTION_ARGS)
             Sxx = get_float8_nan();
         }
     }
+    else
+    {
+        /*
+         * At the first input, we normally can leave Sxx as 0.  However, if
+         * the first input is Inf or NaN, we'd better force Sxx to NaN;
+         * otherwise we will falsely report variance zero when there are no
+         * more inputs.
+         */
+        if (isnan(newval) || isinf(newval))
+            Sxx = get_float8_nan();
+    }

     /*
      * If we're invoked as an aggregate, we can cheat and modify our first
@@ -3225,6 +3247,19 @@ float8_regr_accum(PG_FUNCTION_ARGS)
                 Sxy = get_float8_nan();
         }
     }
+    else
+    {
+        /*
+         * At the first input, we normally can leave Sxx et al as 0.  However,
+         * if the first input is Inf or NaN, we'd better force the dependent
+         * sums to NaN; otherwise we will falsely report variance zero when
+         * there are no more inputs.
+         */
+        if (isnan(newvalX) || isinf(newvalX))
+            Sxx = Sxy = get_float8_nan();
+        if (isnan(newvalY) || isinf(newvalY))
+            Syy = Sxy = get_float8_nan();
+    }

     /*
      * If we're invoked as an aggregate, we can cheat and modify our first
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index 0a6884d382..e4ffa5ee42 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -142,25 +142,25 @@ SELECT stddev_pop(3.0::float8), stddev_samp(4.0::float8);
 SELECT var_pop('inf'::float8), var_samp('inf'::float8);
  var_pop | var_samp
 ---------+----------
-       0 |
+     NaN |
 (1 row)

 SELECT stddev_pop('inf'::float8), stddev_samp('inf'::float8);
  stddev_pop | stddev_samp
 ------------+-------------
-          0 |
+        NaN |
 (1 row)

 SELECT var_pop('nan'::float8), var_samp('nan'::float8);
  var_pop | var_samp
 ---------+----------
-       0 |
+     NaN |
 (1 row)

 SELECT stddev_pop('nan'::float8), stddev_samp('nan'::float8);
  stddev_pop | stddev_samp
 ------------+-------------
-          0 |
+        NaN |
 (1 row)

 SELECT var_pop(1.0::float4), var_samp(2.0::float4);
@@ -178,25 +178,25 @@ SELECT stddev_pop(3.0::float4), stddev_samp(4.0::float4);
 SELECT var_pop('inf'::float4), var_samp('inf'::float4);
  var_pop | var_samp
 ---------+----------
-       0 |
+     NaN |
 (1 row)

 SELECT stddev_pop('inf'::float4), stddev_samp('inf'::float4);
  stddev_pop | stddev_samp
 ------------+-------------
-          0 |
+        NaN |
 (1 row)

 SELECT var_pop('nan'::float4), var_samp('nan'::float4);
  var_pop | var_samp
 ---------+----------
-       0 |
+     NaN |
 (1 row)

 SELECT stddev_pop('nan'::float4), stddev_samp('nan'::float4);
  stddev_pop | stddev_samp
 ------------+-------------
-          0 |
+        NaN |
 (1 row)

 SELECT var_pop(1.0::numeric), var_samp(2.0::numeric);
@@ -393,13 +393,13 @@ SELECT covar_pop(1::float8,2::float8), covar_samp(3::float8,4::float8);
 SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8);
  covar_pop | covar_samp
 -----------+------------
-         0 |
+       NaN |
 (1 row)

 SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8);
  covar_pop | covar_samp
 -----------+------------
-         0 |
+       NaN |
 (1 row)

 -- test accum and combine functions directly
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 553e261ed0..eea4239854 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -5172,11 +5172,21 @@ numeric_stddev_internal(NumericAggState *state,
                 vsumX,
                 vsumX2,
                 vNminus1;
-    const NumericVar *comp;
+    int64        totCount;
     int            rscale;

-    /* Deal with empty input and NaN-input cases */
-    if (state == NULL || (state->N + state->NaNcount) == 0)
+    /*
+     * Sample stddev and variance are undefined when N <= 1; population stddev
+     * is undefined when N == 0.  Return NULL in either case (note that NaNs
+     * count as normal inputs for this purpose).
+     */
+    if (state == NULL || (totCount = state->N + state->NaNcount) == 0)
+    {
+        *is_null = true;
+        return NULL;
+    }
+
+    if (sample && totCount <= 1)
     {
         *is_null = true;
         return NULL;
@@ -5184,9 +5194,13 @@ numeric_stddev_internal(NumericAggState *state,

     *is_null = false;

+    /*
+     * Deal with NaN inputs.
+     */
     if (state->NaNcount > 0)
         return make_result(&const_nan);

+    /* OK, normal calculation applies */
     init_var(&vN);
     init_var(&vsumX);
     init_var(&vsumX2);
@@ -5195,21 +5209,6 @@ numeric_stddev_internal(NumericAggState *state,
     accum_sum_final(&(state->sumX), &vsumX);
     accum_sum_final(&(state->sumX2), &vsumX2);

-    /*
-     * Sample stddev and variance are undefined when N <= 1; population stddev
-     * is undefined when N == 0. Return NULL in either case.
-     */
-    if (sample)
-        comp = &const_one;
-    else
-        comp = &const_zero;
-
-    if (cmp_var(&vN, comp) <= 0)
-    {
-        *is_null = true;
-        return NULL;
-    }
-
     init_var(&vNminus1);
     sub_var(&vN, &const_one, &vNminus1);

diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index e4ffa5ee42..3bd184ae29 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -214,13 +214,13 @@ SELECT stddev_pop(3.0::numeric), stddev_samp(4.0::numeric);
 SELECT var_pop('nan'::numeric), var_samp('nan'::numeric);
  var_pop | var_samp
 ---------+----------
-     NaN |      NaN
+     NaN |
 (1 row)

 SELECT stddev_pop('nan'::numeric), stddev_samp('nan'::numeric);
  stddev_pop | stddev_samp
 ------------+-------------
-        NaN |         NaN
+        NaN |
 (1 row)

 -- verify correct results for null and NaN inputs

Re: Definitional issue: stddev_pop (and related) for 1 input

От
Dean Rasheed
Дата:
On Fri, 12 Jun 2020 at 20:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Before v12, stddev_pop() had the following behavior with just a
> > single input value:
> > ...
> > As of v12, though, all three cases produce 0.  I am not sure what
> > to think about that with respect to an infinity input, but I'm
> > quite sure I don't like it for NaN input.
>
> While I'm still not sure whether there's an academic argument that
> zero is a reasonable stddev value for a single input that is Inf,
> it seems to me that backwards compatibility is a sufficient reason
> for going back to producing NaN for that.

Yeah, it was an oversight, not considering that case. I think the
academic argument could equally well be made that the result should be
NaN if any input is Inf or NaN, even if there's only one input (it's
effectively "Inf - Inf" or "NaN - NaN"), so I agree that backwards
compatibility clinches it.

> Hence, attached are some proposed patches.  0001 just adds test
> cases demonstrating the current behavior; then 0002 makes the
> proposed code change.  It's easy to check that the test case results
> after 0002 match what v11 produces.

Those both look reasonable to me.

> 0003 deals with a different problem which I noted in [1]: the numeric
> variants of var_samp and stddev_samp also do the wrong thing for a
> single special input.  Their disease is that they produce NaN for a
> single NaN input, where it seems more sensible to produce NULL.
> At least, NULL is what we get for the same case with the float
> aggregates, so we have to change one or the other set of functions
> if we want consistency.

Hmm, yeah it's a bit annoying that they're different. NULL seems like
the more logical result -- sample standard deviation isn't defined for
a sample of 1, so why should it be different if that one value is NaN.

The patch looks reasonable, except I wonder if all compilers are smart
enough to realise that totCount is always initialised.

> I propose back-patching 0001/0002 as far as v12, since the failure
> to match the old outputs seems like a pretty clear bug/regression.
> However, I'd be content to apply 0003 only to HEAD.  That misbehavior
> is very ancient, and the lack of complaints suggests that few people
> really care about this fine point.

Makes sense.

Regards,
Dean



Re: Definitional issue: stddev_pop (and related) for 1 input

От
Tom Lane
Дата:
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> The patch looks reasonable, except I wonder if all compilers are smart
> enough to realise that totCount is always initialised.

I think they should be, since that if-block ends with a return;
the only way to get to the use of totCount is for both parts of the
first if-condition to be executed.

In any case, we do have an effective policy of ignoring
uninitialized-variable warnings from very old/stupid compilers.
locust and prairiedog, which I think use the same ancient gcc
version, emit a couple dozen such warnings.  If they are the only
ones that complain about this new code, I'll not worry.

Thanks for looking at the patch!

            regards, tom lane