Re: Definitional issue: stddev_pop (and related) for 1 input
От | Tom Lane |
---|---|
Тема | Re: Definitional issue: stddev_pop (and related) for 1 input |
Дата | |
Msg-id | 780404.1591991618@sss.pgh.pa.us обсуждение исходный текст |
Ответ на | Definitional issue: stddev_pop (and related) for 1 input (Tom Lane <tgl@sss.pgh.pa.us>) |
Ответы |
Re: Definitional issue: stddev_pop (and related) for 1 input
(Dean Rasheed <dean.a.rasheed@gmail.com>)
|
Список | pgsql-hackers |
I wrote: > Before v12, stddev_pop() had the following behavior with just a > single input value: > ... > As of v12, though, all three cases produce 0. I am not sure what > to think about that with respect to an infinity input, but I'm > quite sure I don't like it for NaN input. While I'm still not sure whether there's an academic argument that zero is a reasonable stddev value for a single input that is Inf, it seems to me that backwards compatibility is a sufficient reason for going back to producing NaN for that. Hence, attached are some proposed patches. 0001 just adds test cases demonstrating the current behavior; then 0002 makes the proposed code change. It's easy to check that the test case results after 0002 match what v11 produces. 0003 deals with a different problem which I noted in [1]: the numeric variants of var_samp and stddev_samp also do the wrong thing for a single special input. Their disease is that they produce NaN for a single NaN input, where it seems more sensible to produce NULL. At least, NULL is what we get for the same case with the float aggregates, so we have to change one or the other set of functions if we want consistency. I propose back-patching 0001/0002 as far as v12, since the failure to match the old outputs seems like a pretty clear bug/regression. However, I'd be content to apply 0003 only to HEAD. That misbehavior is very ancient, and the lack of complaints suggests that few people really care about this fine point. regards, tom lane [1] https://www.postgresql.org/message-id/606717.1591924582%40sss.pgh.pa.us diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index d659013e41..0a6884d382 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -127,7 +127,79 @@ SELECT var_samp(b::numeric) FROM aggtest; -- population variance is defined for a single tuple, sample variance -- is not -SELECT var_pop(1.0), var_samp(2.0); +SELECT var_pop(1.0::float8), var_samp(2.0::float8); + var_pop | var_samp +---------+---------- + 0 | +(1 row) + +SELECT stddev_pop(3.0::float8), stddev_samp(4.0::float8); + stddev_pop | stddev_samp +------------+------------- + 0 | +(1 row) + +SELECT var_pop('inf'::float8), var_samp('inf'::float8); + var_pop | var_samp +---------+---------- + 0 | +(1 row) + +SELECT stddev_pop('inf'::float8), stddev_samp('inf'::float8); + stddev_pop | stddev_samp +------------+------------- + 0 | +(1 row) + +SELECT var_pop('nan'::float8), var_samp('nan'::float8); + var_pop | var_samp +---------+---------- + 0 | +(1 row) + +SELECT stddev_pop('nan'::float8), stddev_samp('nan'::float8); + stddev_pop | stddev_samp +------------+------------- + 0 | +(1 row) + +SELECT var_pop(1.0::float4), var_samp(2.0::float4); + var_pop | var_samp +---------+---------- + 0 | +(1 row) + +SELECT stddev_pop(3.0::float4), stddev_samp(4.0::float4); + stddev_pop | stddev_samp +------------+------------- + 0 | +(1 row) + +SELECT var_pop('inf'::float4), var_samp('inf'::float4); + var_pop | var_samp +---------+---------- + 0 | +(1 row) + +SELECT stddev_pop('inf'::float4), stddev_samp('inf'::float4); + stddev_pop | stddev_samp +------------+------------- + 0 | +(1 row) + +SELECT var_pop('nan'::float4), var_samp('nan'::float4); + var_pop | var_samp +---------+---------- + 0 | +(1 row) + +SELECT stddev_pop('nan'::float4), stddev_samp('nan'::float4); + stddev_pop | stddev_samp +------------+------------- + 0 | +(1 row) + +SELECT var_pop(1.0::numeric), var_samp(2.0::numeric); var_pop | var_samp ---------+---------- 0 | @@ -139,6 +211,18 @@ SELECT stddev_pop(3.0::numeric), stddev_samp(4.0::numeric); 0 | (1 row) +SELECT var_pop('nan'::numeric), var_samp('nan'::numeric); + var_pop | var_samp +---------+---------- + NaN | NaN +(1 row) + +SELECT stddev_pop('nan'::numeric), stddev_samp('nan'::numeric); + stddev_pop | stddev_samp +------------+------------- + NaN | NaN +(1 row) + -- verify correct results for null and NaN inputs select sum(null::int4) from generate_series(1,3); sum @@ -299,6 +383,25 @@ SELECT corr(b, a) FROM aggtest; 0.139634516517873 (1 row) +-- check single-tuple behavior +SELECT covar_pop(1::float8,2::float8), covar_samp(3::float8,4::float8); + covar_pop | covar_samp +-----------+------------ + 0 | +(1 row) + +SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8); + covar_pop | covar_samp +-----------+------------ + 0 | +(1 row) + +SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8); + covar_pop | covar_samp +-----------+------------ + 0 | +(1 row) + -- test accum and combine functions directly CREATE TABLE regr_test (x float8, y float8); INSERT INTO regr_test VALUES (10,150),(20,250),(30,350),(80,540),(100,200); diff --git a/src/test/regress/sql/aggregates.sql b/src/test/regress/sql/aggregates.sql index 2a066f5a3a..044d515507 100644 --- a/src/test/regress/sql/aggregates.sql +++ b/src/test/regress/sql/aggregates.sql @@ -39,8 +39,22 @@ SELECT var_samp(b::numeric) FROM aggtest; -- population variance is defined for a single tuple, sample variance -- is not -SELECT var_pop(1.0), var_samp(2.0); +SELECT var_pop(1.0::float8), var_samp(2.0::float8); +SELECT stddev_pop(3.0::float8), stddev_samp(4.0::float8); +SELECT var_pop('inf'::float8), var_samp('inf'::float8); +SELECT stddev_pop('inf'::float8), stddev_samp('inf'::float8); +SELECT var_pop('nan'::float8), var_samp('nan'::float8); +SELECT stddev_pop('nan'::float8), stddev_samp('nan'::float8); +SELECT var_pop(1.0::float4), var_samp(2.0::float4); +SELECT stddev_pop(3.0::float4), stddev_samp(4.0::float4); +SELECT var_pop('inf'::float4), var_samp('inf'::float4); +SELECT stddev_pop('inf'::float4), stddev_samp('inf'::float4); +SELECT var_pop('nan'::float4), var_samp('nan'::float4); +SELECT stddev_pop('nan'::float4), stddev_samp('nan'::float4); +SELECT var_pop(1.0::numeric), var_samp(2.0::numeric); SELECT stddev_pop(3.0::numeric), stddev_samp(4.0::numeric); +SELECT var_pop('nan'::numeric), var_samp('nan'::numeric); +SELECT stddev_pop('nan'::numeric), stddev_samp('nan'::numeric); -- verify correct results for null and NaN inputs select sum(null::int4) from generate_series(1,3); @@ -81,6 +95,11 @@ SELECT regr_slope(b, a), regr_intercept(b, a) FROM aggtest; SELECT covar_pop(b, a), covar_samp(b, a) FROM aggtest; SELECT corr(b, a) FROM aggtest; +-- check single-tuple behavior +SELECT covar_pop(1::float8,2::float8), covar_samp(3::float8,4::float8); +SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8); +SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8); + -- test accum and combine functions directly CREATE TABLE regr_test (x float8, y float8); INSERT INTO regr_test VALUES (10,150),(20,250),(30,350),(80,540),(100,200); diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c index 2101d58674..6a717f19bb 100644 --- a/src/backend/utils/adt/float.c +++ b/src/backend/utils/adt/float.c @@ -2925,6 +2925,17 @@ float8_accum(PG_FUNCTION_ARGS) Sxx = get_float8_nan(); } } + else + { + /* + * At the first input, we normally can leave Sxx as 0. However, if + * the first input is Inf or NaN, we'd better force Sxx to NaN; + * otherwise we will falsely report variance zero when there are no + * more inputs. + */ + if (isnan(newval) || isinf(newval)) + Sxx = get_float8_nan(); + } /* * If we're invoked as an aggregate, we can cheat and modify our first @@ -2999,6 +3010,17 @@ float4_accum(PG_FUNCTION_ARGS) Sxx = get_float8_nan(); } } + else + { + /* + * At the first input, we normally can leave Sxx as 0. However, if + * the first input is Inf or NaN, we'd better force Sxx to NaN; + * otherwise we will falsely report variance zero when there are no + * more inputs. + */ + if (isnan(newval) || isinf(newval)) + Sxx = get_float8_nan(); + } /* * If we're invoked as an aggregate, we can cheat and modify our first @@ -3225,6 +3247,19 @@ float8_regr_accum(PG_FUNCTION_ARGS) Sxy = get_float8_nan(); } } + else + { + /* + * At the first input, we normally can leave Sxx et al as 0. However, + * if the first input is Inf or NaN, we'd better force the dependent + * sums to NaN; otherwise we will falsely report variance zero when + * there are no more inputs. + */ + if (isnan(newvalX) || isinf(newvalX)) + Sxx = Sxy = get_float8_nan(); + if (isnan(newvalY) || isinf(newvalY)) + Syy = Sxy = get_float8_nan(); + } /* * If we're invoked as an aggregate, we can cheat and modify our first diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index 0a6884d382..e4ffa5ee42 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -142,25 +142,25 @@ SELECT stddev_pop(3.0::float8), stddev_samp(4.0::float8); SELECT var_pop('inf'::float8), var_samp('inf'::float8); var_pop | var_samp ---------+---------- - 0 | + NaN | (1 row) SELECT stddev_pop('inf'::float8), stddev_samp('inf'::float8); stddev_pop | stddev_samp ------------+------------- - 0 | + NaN | (1 row) SELECT var_pop('nan'::float8), var_samp('nan'::float8); var_pop | var_samp ---------+---------- - 0 | + NaN | (1 row) SELECT stddev_pop('nan'::float8), stddev_samp('nan'::float8); stddev_pop | stddev_samp ------------+------------- - 0 | + NaN | (1 row) SELECT var_pop(1.0::float4), var_samp(2.0::float4); @@ -178,25 +178,25 @@ SELECT stddev_pop(3.0::float4), stddev_samp(4.0::float4); SELECT var_pop('inf'::float4), var_samp('inf'::float4); var_pop | var_samp ---------+---------- - 0 | + NaN | (1 row) SELECT stddev_pop('inf'::float4), stddev_samp('inf'::float4); stddev_pop | stddev_samp ------------+------------- - 0 | + NaN | (1 row) SELECT var_pop('nan'::float4), var_samp('nan'::float4); var_pop | var_samp ---------+---------- - 0 | + NaN | (1 row) SELECT stddev_pop('nan'::float4), stddev_samp('nan'::float4); stddev_pop | stddev_samp ------------+------------- - 0 | + NaN | (1 row) SELECT var_pop(1.0::numeric), var_samp(2.0::numeric); @@ -393,13 +393,13 @@ SELECT covar_pop(1::float8,2::float8), covar_samp(3::float8,4::float8); SELECT covar_pop(1::float8,'inf'::float8), covar_samp(3::float8,'inf'::float8); covar_pop | covar_samp -----------+------------ - 0 | + NaN | (1 row) SELECT covar_pop(1::float8,'nan'::float8), covar_samp(3::float8,'nan'::float8); covar_pop | covar_samp -----------+------------ - 0 | + NaN | (1 row) -- test accum and combine functions directly diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c index 553e261ed0..eea4239854 100644 --- a/src/backend/utils/adt/numeric.c +++ b/src/backend/utils/adt/numeric.c @@ -5172,11 +5172,21 @@ numeric_stddev_internal(NumericAggState *state, vsumX, vsumX2, vNminus1; - const NumericVar *comp; + int64 totCount; int rscale; - /* Deal with empty input and NaN-input cases */ - if (state == NULL || (state->N + state->NaNcount) == 0) + /* + * Sample stddev and variance are undefined when N <= 1; population stddev + * is undefined when N == 0. Return NULL in either case (note that NaNs + * count as normal inputs for this purpose). + */ + if (state == NULL || (totCount = state->N + state->NaNcount) == 0) + { + *is_null = true; + return NULL; + } + + if (sample && totCount <= 1) { *is_null = true; return NULL; @@ -5184,9 +5194,13 @@ numeric_stddev_internal(NumericAggState *state, *is_null = false; + /* + * Deal with NaN inputs. + */ if (state->NaNcount > 0) return make_result(&const_nan); + /* OK, normal calculation applies */ init_var(&vN); init_var(&vsumX); init_var(&vsumX2); @@ -5195,21 +5209,6 @@ numeric_stddev_internal(NumericAggState *state, accum_sum_final(&(state->sumX), &vsumX); accum_sum_final(&(state->sumX2), &vsumX2); - /* - * Sample stddev and variance are undefined when N <= 1; population stddev - * is undefined when N == 0. Return NULL in either case. - */ - if (sample) - comp = &const_one; - else - comp = &const_zero; - - if (cmp_var(&vN, comp) <= 0) - { - *is_null = true; - return NULL; - } - init_var(&vNminus1); sub_var(&vN, &const_one, &vNminus1); diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index e4ffa5ee42..3bd184ae29 100644 --- a/src/test/regress/expected/aggregates.out +++ b/src/test/regress/expected/aggregates.out @@ -214,13 +214,13 @@ SELECT stddev_pop(3.0::numeric), stddev_samp(4.0::numeric); SELECT var_pop('nan'::numeric), var_samp('nan'::numeric); var_pop | var_samp ---------+---------- - NaN | NaN + NaN | (1 row) SELECT stddev_pop('nan'::numeric), stddev_samp('nan'::numeric); stddev_pop | stddev_samp ------------+------------- - NaN | NaN + NaN | (1 row) -- verify correct results for null and NaN inputs
В списке pgsql-hackers по дате отправления: