Discussion: BUG #19340: Wrong result from CORR() function
The following bug has been logged on the website:
Bug reference: 19340
Logged by: Oleg Ivanov
Email address: o15611@gmail.com
PostgreSQL version: 18.1
Operating system: all
Description:
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,24) ;
corr
------

(1 row)
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,25) ;
corr
---------------------
-0.1877294297321991
(1 row)
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,31) ;
corr
----------------------
-0.03366532289960463
(1 row)
postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,32) ;
corr
----------------------
0.037939234274452456
(1 row)
Another example:
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,36) ;
corr
------

(1 row)
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,37) ;
corr
---------------------
0.37167954745944803
(1 row)
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,113) ;
corr
----------------------
0.022884787550167176
(1 row)
postgres=# SELECT corr( 0.9 , 0.91 ) FROM generate_series(1,114) ;
corr
-----------------------
-0.004981175111303341
(1 row)
For comparison, in Oracle Database:
SQL> select corr( 0.09 , 0.09000001 ) FROM (select rownum as id from dual
connect by level <=330);
CORR(0.09,0.09000001)
---------------------
If the arguments are constant, the CORR() function must return 0 or NaN.
Consequences of this bug: statistical functions are used to make business
decisions, and wrong or wildly differing results can lead to mistakes.
On Tue, 2025-12-02 at 07:57 +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference: 19340
> Logged by: Oleg Ivanov
> Email address: o15611@gmail.com
> PostgreSQL version: 18.1
> Operating system: all
> Description:
>
> postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,24) ;
> corr
> ------
>
> (1 row)
>
> If the arguments are constant, the CORR() function must return 0 or NaN.
> Consequences of this bug: statistical functions are used to make business
> decisions, and wrong or wildly differing results can lead to mistakes.

The documentation is pretty clear about that:

    In all cases, null is returned if the computation is meaningless,
    for example when N is zero.

Yours,
Laurenz Albe
Yes, it must be NULL in all the queries I have provided!
But PostgreSQL corr() returns numbers, which is incorrect.
On Tue, Dec 2, 2025 at 2:30 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Tue, 2025-12-02 at 07:57 +0000, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference: 19340
> Logged by: Oleg Ivanov
> Email address: o15611@gmail.com
> PostgreSQL version: 18.1
> Operating system: all
> Description:
>
> postgres=# SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,24) ;
> corr
> ------
>
> (1 row)
>
> If the arguments are constant, the CORR() function must return 0 or NaN.
> Consequences of this bug: statistical functions are used to make business
> decisions, and wrong or wildly differing results can lead to mistakes.
The documentation is pretty clear about that:
    In all cases, null is returned if the computation is meaningless,
    for example when N is zero.
Yours,
Laurenz Albe
Oleg Ivanov <o15611@gmail.com> writes:
> Yes, it must be NULL in all the queries I have provided!
> But PostgreSQL corr() returns numbers, which is incorrect.

Yeah, looks like roundoff error to me.  In your example

SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,25) ;

at the end of float8_corr we have

3754            PG_RETURN_FLOAT8(Sxy / sqrt(Sxx * Syy));
(gdb) i locals
transarray = <optimized out>
transvalues = 0x1b96da8
N = 25
Sxx = 3.2869204384208827e-34
Syy = 9.3266240309214617e-33
Sxy = -3.2869204384208827e-34

where ideally those three values would be zero (and we would have
fallen out with a NULL result at the preceding line).

It's fundamentally impossible to guarantee exact results with
floating-point arithmetic, so if you are expecting that you need
to readjust your expectations.  But having said that, it does
seem a bit sad that we can't detect constant-input cases exactly.
I wonder whether it'd be worth carrying additional state to
check that explicitly (instead of assuming that "if (Sxx == 0 ||
Syy == 0)" will catch it).

You might find the previous discussion interesting:

https://www.postgresql.org/message-id/flat/153313051300.1397.9594490737341194671%40wrigleys.postgresql.org

regards, tom lane
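To make the failure mode concrete, here is a minimal standalone sketch of the Youngs-Cramer accumulation that float8_regr_accum() performs, fed the constant pair from the report. It is an illustration under IEEE-754 double assumptions, transcribed from the update rule visible in the patches later in this thread, not the PostgreSQL source itself; it should produce residues of the same order as the gdb dump above, and shows why their quotient is order-1 garbage rather than the NULL that an exact 0/0 state would signal.

/*
 * Standalone sketch (not PostgreSQL source) of the Youngs-Cramer
 * accumulation performed by float8_regr_accum(), fed 25 copies of
 * the constant pair from corr( 0.09 , 0.09000001 ).  Note that Y is
 * the aggregate's first argument.  Assumes IEEE-754 doubles.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double N = 0, Sx = 0, Sxx = 0, Sy = 0, Syy = 0, Sxy = 0;

    for (int i = 0; i < 25; i++)
    {
        double newvalY = 0.09;          /* first aggregate argument */
        double newvalX = 0.09000001;    /* second aggregate argument */

        N += 1.0;
        Sx += newvalX;
        Sy += newvalY;
        if (N > 1.0)
        {
            double tmpX = newvalX * N - Sx;
            double tmpY = newvalY * N - Sy;
            double scale = 1.0 / (N * (N - 1.0));

            Sxx += tmpX * tmpX * scale;
            Syy += tmpY * tmpY * scale;
            Sxy += tmpX * tmpY * scale;
        }
    }

    /* Ideally all three are exactly zero; rounding leaves tiny residues. */
    printf("Sxx=%g Syy=%g Sxy=%g\n", Sxx, Syy, Sxy);

    /* Numerator and denominator are both pure noise of similar size,
     * so the quotient is garbage of order 1 instead of NULL. */
    printf("corr=%.17g\n", Sxy / sqrt(Sxx * Syy));
    return 0;
}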
On Tue, 2025-12-02 at 17:05 +0300, Oleg Ivanov wrote:
> Yes, it must be NULL in all the queries I have provided!
> But PostgreSQL corr() returns numbers, which is incorrect.

I see.  I guess these are rounding errors, which are to be expected
with floating point numbers.

But perhaps we could do better.

Yours,
Laurenz Albe
On Tue, 2 Dec 2025 at 17:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> It's fundamentally impossible to guarantee exact results with
> floating-point arithmetic, so if you are expecting that you need
> to readjust your expectations. But having said that, it does
> seem a bit sad that we can't detect constant-input cases exactly.
Yes, indeed. I tested the following query:
SELECT n,
(SELECT variance(1.3::float8) FROM generate_series(1, n)),
(SELECT corr(1.3, 1.3) FROM generate_series(1, n))
FROM generate_series(1, 10) g(n);
In v11 (with the old algorithm) this produces
n | variance | corr
----+----------------------+------
1 | |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 3.5527136788005e-16 | 1
6 | 2.368475785867e-16 | 1
7 | 3.38353683695286e-16 | 1
8 | 0 |
9 | 0 |
10 | 0 |
(10 rows)
whereas in HEAD (with the Youngs-Cramer algorithm) it produces
n | variance | corr
----+------------------------+------
1 | |
2 | 0 |
3 | 0 |
4 | 0 |
5 | 0 |
6 | 5.259072701473412e-33 | 1
7 | 4.382560584561177e-33 | 1
8 | 3.756480501052437e-33 | 1
9 | 3.2869204384208825e-33 | 1
10 | 6.817316464872942e-33 | 1
(10 rows)
so the errors in the variance are smaller, but any non-zero error
makes the correlation completely wrong.
> I wonder whether it'd be worth carrying additional state to
> check that explicitly (instead of assuming that "if (Sxx == 0 ||
> Syy == 0)" will catch it).
I wondered the same thing. It's not nice to have to do that, but
clearly the existing test for constant inputs is no good. The question
is, do we really want to spend extra cycles on every query just to
catch this odd corner case?
Regards,
Dean
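The contrast between the two algorithms can be reproduced outside the server. Below is a rough standalone sketch under IEEE-754 double assumptions; the pre-v12 transition state is reconstructed here as sum(X) and sum(X*X), the usual description of the old algorithm, so this is an illustration rather than the exact code of either release.

/*
 * Rough standalone comparison of the two transition states for
 * variance-style aggregates, fed n copies of the constant 1.3:
 * the old state (sum of values and sum of squares) versus the
 * Youngs-Cramer state.  Assumes IEEE-754 doubles.
 */
#include <stdio.h>

int main(void)
{
    for (int n = 2; n <= 10; n++)
    {
        double sum = 0, sumsq = 0;      /* old state: sum(X), sum(X*X) */
        double N = 0, Sx = 0, Sxx = 0;  /* new state: Youngs-Cramer */

        for (int i = 0; i < n; i++)
        {
            double x = 1.3;

            sum += x;
            sumsq += x * x;

            N += 1.0;
            Sx += x;
            if (N > 1.0)
            {
                double tmp = x * N - Sx;

                Sxx += tmp * tmp / (N * (N - 1.0));
            }
        }

        /* Both should be exactly zero; neither is guaranteed to be. */
        printf("n=%2d  old Sxx=%.6g  new Sxx=%.6g\n",
               n, sumsq - sum * sum / N, Sxx);
    }
    return 0;
}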
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> On Tue, 2 Dec 2025 at 17:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder whether it'd be worth carrying additional state to
>> check that explicitly (instead of assuming that "if (Sxx == 0 ||
>> Syy == 0)" will catch it).
> I wondered the same thing. It's not nice to have to do that, but
> clearly the existing test for constant inputs is no good. The question
> is, do we really want to spend extra cycles on every query just to
> catch this odd corner case?
I experimented with the attached patch, which is very incomplete;
I just carried it far enough to be able to run performance checks on
the modified code, and so all the binary statistics aggregates except
corr() are broken. I observe about 2% slowdown on this test case:
SELECT corr( 0.09 , 0.09000001 ) FROM generate_series(1,100000000);
I think that any real-world usage is going to expend more effort
obtaining the input data than this test does, so 2% should be a
conservative upper bound on the cost. Seems to me that getting
NULL-or-not right is probably worth a percent or so.
If anyone feels differently, another idea could be to use a
separate state transition function for corr() that skips the
accumulation steps that corr() doesn't use. But I agree with
the pre-existing decision to use just one transition function
for all the binary aggregates.
If this seems like a reasonable approach, I'll see about finishing
out the patch.
regards, tom lane
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index 7b97d2be6ca..5d8954a34d0 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -3319,9 +3319,15 @@ float8_stddev_samp(PG_FUNCTION_ARGS)
* As with the preceding aggregates, we use the Youngs-Cramer algorithm to
* reduce rounding errors in the aggregate final functions.
*
- * The transition datatype for all these aggregates is a 6-element array of
+ * The transition datatype for all these aggregates is a 9-element array of
* float8, holding the values N, Sx=sum(X), Sxx=sum((X-Sx/N)^2), Sy=sum(Y),
- * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)) in that order.
+ * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)), firstX, firstY, DIFF,
+ * in that order.
+ *
+ * DIFF, like N, is treated as an integer. Bit 0 is set if we saw distinct
+ * X inputs, and bit 1 is set if we saw distinct Y inputs. This allows us
+ * to detect constant inputs exactly, which is important for deciding whether
+ * some outputs should be NULL.
*
* Note that Y is the first argument to all these aggregates!
*
@@ -3345,17 +3351,23 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sy,
Syy,
Sxy,
+ firstX,
+ firstY,
tmpX,
tmpY,
scale;
+ int diff;
- transvalues = check_float8_array(transarray, "float8_regr_accum", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_accum", 9);
N = transvalues[0];
Sx = transvalues[1];
Sxx = transvalues[2];
Sy = transvalues[3];
Syy = transvalues[4];
Sxy = transvalues[5];
+ firstX = transvalues[6];
+ firstY = transvalues[7];
+ diff = transvalues[8];
/*
* Use the Youngs-Cramer algorithm to incorporate the new values into the
@@ -3373,6 +3385,19 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Syy += tmpY * tmpY * scale;
Sxy += tmpX * tmpY * scale;
+ /*
+ * Check to see if we have seen distinct inputs. In normal use, diff
+ * will reach 3 very soon and then we can stop checking.
+ */
+ if (diff != 3)
+ {
+ /* Need SQL-style comparison of NaNs here */
+ if (float8_ne(newvalX, firstX))
+ diff |= 1;
+ if (float8_ne(newvalY, firstY))
+ diff |= 2;
+ }
+
/*
* Overflow check. We only report an overflow error when finite
* inputs lead to infinite results. Note also that Sxx, Syy and Sxy
@@ -3410,6 +3435,8 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sxx = Sxy = get_float8_nan();
if (isnan(newvalY) || isinf(newvalY))
Syy = Sxy = get_float8_nan();
+ firstX = newvalX;
+ firstY = newvalY;
}
/*
@@ -3425,12 +3452,15 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transvalues[3] = Sy;
transvalues[4] = Syy;
transvalues[5] = Sxy;
+ transvalues[6] = firstX;
+ transvalues[7] = firstY;
+ transvalues[8] = diff;
PG_RETURN_ARRAYTYPE_P(transarray);
}
else
{
- Datum transdatums[6];
+ Datum transdatums[9];
ArrayType *result;
transdatums[0] = Float8GetDatumFast(N);
@@ -3439,8 +3469,11 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transdatums[3] = Float8GetDatumFast(Sy);
transdatums[4] = Float8GetDatumFast(Syy);
transdatums[5] = Float8GetDatumFast(Sxy);
+ transdatums[6] = Float8GetDatumFast(firstX);
+ transdatums[7] = Float8GetDatumFast(firstY);
+ transdatums[8] = Float8GetDatum(diff);
- result = construct_array_builtin(transdatums, 6, FLOAT8OID);
+ result = construct_array_builtin(transdatums, 9, FLOAT8OID);
PG_RETURN_ARRAYTYPE_P(result);
}
@@ -3730,27 +3763,25 @@ float8_corr(PG_FUNCTION_ARGS)
{
ArrayType *transarray = PG_GETARG_ARRAYTYPE_P(0);
float8 *transvalues;
- float8 N,
- Sxx,
+ float8 Sxx,
Syy,
Sxy;
+ int diff;
- transvalues = check_float8_array(transarray, "float8_corr", 6);
- N = transvalues[0];
+ transvalues = check_float8_array(transarray, "float8_corr", 9);
Sxx = transvalues[2];
Syy = transvalues[4];
Sxy = transvalues[5];
+ diff = transvalues[8];
- /* if N is 0 we should return NULL */
- if (N < 1.0)
+ /*
+ * Per spec, we must return NULL if N is zero, all X inputs are equal, or
+ * all Y inputs are equal. Checking the diff mask covers all three cases.
+ */
+ if (diff != 3)
PG_RETURN_NULL();
/* Note that Sxx and Syy are guaranteed to be non-negative */
-
- /* per spec, return NULL for horizontal and vertical lines */
- if (Sxx == 0 || Syy == 0)
- PG_RETURN_NULL();
-
PG_RETURN_FLOAT8(Sxy / sqrt(Sxx * Syy));
}
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 870769e8f14..68dc1329ea0 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -505,7 +505,7 @@
aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
{ aggfnoid => 'corr', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_corr', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0,0}' },
# boolean-and and boolean-or
{ aggfnoid => 'bool_and', aggtransfn => 'booland_statefunc',
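In isolation, the DIFF bookkeeping the patch adds looks like the following standalone sketch. float8_ne_sql() is a hypothetical stand-in for float8_ne()'s SQL-style semantics, under which two NaNs compare equal; that is why an all-NaN column counts as constant here and leads to a NULL result.

/*
 * Standalone sketch of the DIFF bitmask bookkeeping from the patch
 * above: firstX/firstY remember the first row, and diff accumulates
 * bit 0 (distinct X seen) and bit 1 (distinct Y seen).  corr() then
 * returns NULL unless diff == 3.
 */
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

static bool float8_ne_sql(double a, double b)
{
    if (isnan(a) && isnan(b))
        return false;           /* SQL treats NaN as equal to NaN */
    return a != b;              /* also true if exactly one is NaN */
}

int main(void)
{
    double xs[] = {0.09, 0.10, 0.09};   /* distinct X inputs */
    double ys[] = {NAN, NAN, NAN};      /* constant (all-NaN) Y */
    double firstX = xs[0], firstY = ys[0];  /* first row initializes */
    int diff = 0;

    for (int i = 1; i < 3; i++)
    {
        if (diff != 3)
        {
            if (float8_ne_sql(xs[i], firstX))
                diff |= 1;
            if (float8_ne_sql(ys[i], firstY))
                diff |= 2;
        }
    }

    /* diff == 1 here: Y never varied, so corr() would return NULL */
    printf("diff = %d -> %s\n", diff, diff == 3 ? "compute corr" : "NULL");
    return 0;
}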
On Tue, 2 Dec 2025 at 20:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I experimented with the attached patch, which is very incomplete;
> I just carried it far enough to be able to run performance checks on
> the modified code, and so all the binary statistics aggregates except
> corr() are broken.  I observe about 2% slowdown on this test case:

I played around with having just 2 extra array elements, constX and
constY, equal to the common value if all the values are the same, and
NaN otherwise.  For me, that was slightly faster, which I put down to
floating point comparison being faster than converting back and forth
between floating point and integer.  Either way, it seems like a
difference that no one is likely to notice.

Doing it that way does lead to one difference though: all-NaN inputs
lead to a NaN result, whereas your patch produces NULL for that case.

Regards,
Dean
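A minimal sketch of that constX/constY idea, using a hypothetical common_accum() helper rather than Dean's actual patch: the tracked value collapses to NaN as soon as an input differs from it or is itself NaN. Since NaN != NaN in C, an all-NaN column also ends up NaN, which is exactly the behavioral difference described above.

/*
 * Sketch of the constX/constY tracking in isolation.  Once the
 * common value has collapsed to NaN, the != test is true for every
 * later input anyway, so no extra isnan() check of the state is
 * needed.  Hypothetical helper for illustration only.
 */
#include <math.h>
#include <stdio.h>

static double common_accum(double common, double newval, int is_first)
{
    if (is_first)
        return newval;              /* first row initializes the state */
    if (newval != common || isnan(newval))
        return NAN;                 /* a differing or NaN input was seen */
    return common;
}

int main(void)
{
    double xs[] = {0.09, 0.09, 0.09};   /* constant X */
    double ys[] = {0.9, NAN, 0.9};      /* one NaN among the Y inputs */
    double commonX = 0, commonY = 0;

    for (int i = 0; i < 3; i++)
    {
        commonX = common_accum(commonX, xs[i], i == 0);
        commonY = common_accum(commonY, ys[i], i == 0);
    }

    /* commonX stays 0.09 (constant input); commonY becomes NaN. */
    printf("commonX=%g commonY=%g\n", commonX, commonY);
    return 0;
}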
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> I played around with having just 2 extra array elements, constX and
> constY, equal to the common value if all the values are the same, and
> NaN otherwise.
Hmm.
> Doing it that way does lead to one difference though: all-NaN inputs
> lead to a NaN result, whereas your patch produces NULL for that case.
Yeah, I did it as I did precisely because I wanted all-NaN-input to be
seen as a constant. But you could make an argument that NaN is not
really a fixed value but has more kinship to the "we don't know what
the value is" interpretation of SQL NULL. In that case your proposal
is semantically reasonable on the grounds that maybe the NaNs aren't
really all equal, and I agree it ought to be a little faster than
mine.
regards, tom lane
On Tue, 2 Dec 2025 at 23:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> > I played around with having just 2 extra array elements, constX and
> > constY, equal to the common value if all the values are the same,
> > and NaN otherwise.
>
> Hmm.
>
> > Doing it that way does lead to one difference though: all-NaN inputs
> > lead to a NaN result, whereas your patch produces NULL for that case.
>
> Yeah, I did it as I did precisely because I wanted all-NaN-input to be
> seen as a constant.  But you could make an argument that NaN is not
> really a fixed value but has more kinship to the "we don't know what
> the value is" interpretation of SQL NULL.

I think that's more consistent with the general policy that most
(all?) math functions have, where if any input is NaN, the result
is NaN.

Regards,
Dean
I tried the attached implementation of your idea, and it does seem
very marginally faster than what I did; but it's like a 1.8%
slowdown instead of 2%. Might be different on a different
machine of course. But I think we should choose on the basis
of which semantics we like better, rather than such a tiny
performance difference.
I'm coming around to the conclusion that your way is better,
though. It seems good that "any NaN in the input results in
NaN output", which your way does and mine doesn't.
regards, tom lane
diff --git a/src/backend/utils/adt/float.c b/src/backend/utils/adt/float.c
index 7b97d2be6ca..c2173f378bf 100644
--- a/src/backend/utils/adt/float.c
+++ b/src/backend/utils/adt/float.c
@@ -3319,9 +3319,16 @@ float8_stddev_samp(PG_FUNCTION_ARGS)
* As with the preceding aggregates, we use the Youngs-Cramer algorithm to
* reduce rounding errors in the aggregate final functions.
*
- * The transition datatype for all these aggregates is a 6-element array of
+ * The transition datatype for all these aggregates is an 8-element array of
* float8, holding the values N, Sx=sum(X), Sxx=sum((X-Sx/N)^2), Sy=sum(Y),
- * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)) in that order.
+ * Syy=sum((Y-Sy/N)^2), Sxy=sum((X-Sx/N)*(Y-Sy/N)), commonX, and commonY
+ * in that order.
+ *
+ * commonX is defined as the common X value if all the X values were the
+ * same, else NaN; likewise for commonY. This is useful for deciding
+ * whether corr() should return NULL. Note that this representation cannot
+ * distinguish all-the-values-were-NaN from the-values-weren't-all-the-same,
+ * but that's okay because we want to return NaN for all-NaN input.
*
* Note that Y is the first argument to all these aggregates!
*
@@ -3345,17 +3352,21 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sy,
Syy,
Sxy,
+ commonX,
+ commonY,
tmpX,
tmpY,
scale;
- transvalues = check_float8_array(transarray, "float8_regr_accum", 6);
+ transvalues = check_float8_array(transarray, "float8_regr_accum", 8);
N = transvalues[0];
Sx = transvalues[1];
Sxx = transvalues[2];
Sy = transvalues[3];
Syy = transvalues[4];
Sxy = transvalues[5];
+ commonX = transvalues[6];
+ commonY = transvalues[7];
/*
* Use the Youngs-Cramer algorithm to incorporate the new values into the
@@ -3373,6 +3384,16 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Syy += tmpY * tmpY * scale;
Sxy += tmpX * tmpY * scale;
+ /*
+ * Check to see if we have seen distinct inputs. We can use a test
+ * that's a bit cheaper than float8_ne() because if commonX is already
+ * NaN, it does not matter whether the != test returns true or not.
+ */
+ if (newvalX != commonX || isnan(newvalX))
+ commonX = get_float8_nan();
+ if (newvalY != commonY || isnan(newvalY))
+ commonY = get_float8_nan();
+
/*
* Overflow check. We only report an overflow error when finite
* inputs lead to infinite results. Note also that Sxx, Syy and Sxy
@@ -3410,6 +3431,8 @@ float8_regr_accum(PG_FUNCTION_ARGS)
Sxx = Sxy = get_float8_nan();
if (isnan(newvalY) || isinf(newvalY))
Syy = Sxy = get_float8_nan();
+ commonX = newvalX;
+ commonY = newvalY;
}
/*
@@ -3425,12 +3448,14 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transvalues[3] = Sy;
transvalues[4] = Syy;
transvalues[5] = Sxy;
+ transvalues[6] = commonX;
+ transvalues[7] = commonY;
PG_RETURN_ARRAYTYPE_P(transarray);
}
else
{
- Datum transdatums[6];
+ Datum transdatums[8];
ArrayType *result;
transdatums[0] = Float8GetDatumFast(N);
@@ -3439,8 +3464,10 @@ float8_regr_accum(PG_FUNCTION_ARGS)
transdatums[3] = Float8GetDatumFast(Sy);
transdatums[4] = Float8GetDatumFast(Syy);
transdatums[5] = Float8GetDatumFast(Sxy);
+ transdatums[6] = Float8GetDatumFast(commonX);
+ transdatums[7] = Float8GetDatumFast(commonY);
- result = construct_array_builtin(transdatums, 6, FLOAT8OID);
+ result = construct_array_builtin(transdatums, 8, FLOAT8OID);
PG_RETURN_ARRAYTYPE_P(result);
}
@@ -3733,24 +3760,27 @@ float8_corr(PG_FUNCTION_ARGS)
float8 N,
Sxx,
Syy,
- Sxy;
+ Sxy,
+ commonX,
+ commonY;
- transvalues = check_float8_array(transarray, "float8_corr", 6);
+ transvalues = check_float8_array(transarray, "float8_corr", 8);
N = transvalues[0];
Sxx = transvalues[2];
Syy = transvalues[4];
Sxy = transvalues[5];
+ commonX = transvalues[6];
+ commonY = transvalues[7];
/* if N is 0 we should return NULL */
if (N < 1.0)
PG_RETURN_NULL();
- /* Note that Sxx and Syy are guaranteed to be non-negative */
-
/* per spec, return NULL for horizontal and vertical lines */
- if (Sxx == 0 || Syy == 0)
+ if (!isnan(commonX) || !isnan(commonY))
PG_RETURN_NULL();
+ /* at this point, Sxx and Syy cannot be zero or negative */
PG_RETURN_FLOAT8(Sxy / sqrt(Sxx * Syy));
}
diff --git a/src/include/catalog/pg_aggregate.dat b/src/include/catalog/pg_aggregate.dat
index 870769e8f14..d1342f7bb94 100644
--- a/src/include/catalog/pg_aggregate.dat
+++ b/src/include/catalog/pg_aggregate.dat
@@ -505,7 +505,7 @@
aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
{ aggfnoid => 'corr', aggtransfn => 'float8_regr_accum',
aggfinalfn => 'float8_corr', aggcombinefn => 'float8_regr_combine',
- aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0}' },
+ aggtranstype => '_float8', agginitval => '{0,0,0,0,0,0,0,0}' },
# boolean-and and boolean-or
{ aggfnoid => 'bool_and', aggtransfn => 'booland_statefunc',
I wrote:
> I'm coming around to the conclusion that your way is better,
> though. It seems good that "any NaN in the input results in
> NaN output", which your way does and mine doesn't.
Poking further at this, I found that my v2 patch fails that principle
in one case:
regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,1000) g;
corr
------

(1 row)
We see that Y is constant and therefore return NULL, despite the
other NaN input.
I think we can fix that along these lines:
@@ -3776,8 +3776,12 @@ float8_corr(PG_FUNCTION_ARGS)
if (N < 1.0)
PG_RETURN_NULL();
- /* per spec, return NULL for horizontal and vertical lines */
- if (!isnan(commonX) || !isnan(commonY))
+ /*
+ * per spec, return NULL for horizontal and vertical lines; but not if the
+ * result would otherwise be NaN
+ */
+ if ((!isnan(commonX) || !isnan(commonY)) &&
+ (!isnan(Sxx) && !isnan(Syy)))
PG_RETURN_NULL();
/* at this point, Sxx and Syy cannot be zero or negative */
(don't think it should be necessary to also check Sxy)
BTW, HEAD is inconsistent: it will return NaN for this example, but
only because it's confused by roundoff error into thinking that Y
isn't constant. With few enough inputs, it produces NULL too:
regression=# SELECT corr( 0.1 , 'nan' ) FROM generate_series(1,3) g;
corr
------

(1 row)
regards, tom lane
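Putting the pieces together, the final-function logic with this tweak folded in can be sketched as follows. corr_final() is a hypothetical standalone helper, not the committed code; it returns false where the aggregate would return SQL NULL, and the state values in the demo calls are taken from the examples earlier in the thread.

/*
 * Sketch of corr()'s final-function decision with the proposed fix:
 * NULL when X or Y was constant, unless NaN inputs already poisoned
 * Sxx/Syy, in which case NaN should propagate instead.
 */
#include <math.h>
#include <stdbool.h>
#include <stdio.h>

static bool corr_final(double N, double Sxx, double Syy, double Sxy,
                       double commonX, double commonY, double *out)
{
    if (N < 1.0)
        return false;                   /* no input rows: NULL */

    /*
     * Per spec, NULL for horizontal and vertical lines, but not if
     * the result would otherwise be NaN.
     */
    if ((!isnan(commonX) || !isnan(commonY)) &&
        (!isnan(Sxx) && !isnan(Syy)))
        return false;

    *out = Sxy / sqrt(Sxx * Syy);       /* NaN if any input was NaN */
    return true;
}

int main(void)
{
    double r;

    /* corr( 0.1 , 'nan' ): Y constant, X all-NaN => NaN, not NULL */
    if (corr_final(1000, NAN, 0, NAN, NAN, 0.1, &r))
        printf("corr = %g\n", r);
    else
        printf("corr is NULL\n");

    /* corr( 0.09 , 0.09000001 ): both constant => NULL, even though
     * roundoff left the sums marginally nonzero */
    if (corr_final(25, 3.29e-34, 9.33e-33, -3.29e-34, 0.09000001, 0.09, &r))
        printf("corr = %g\n", r);
    else
        printf("corr is NULL\n");

    return 0;
}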