Обсуждение: comment regarding double timestamps; and, infinite timestamps and NaN

Поиск
Список
Период
Сортировка

comment regarding double timestamps; and, infinite timestamps and NaN

От
Justin Pryzby
Дата:
selfuncs.c convert_to_scalar() says:

|* The several datatypes representing absolute times are all converted
|* to Timestamp, which is actually a double, and then we just use that
|* double value.  Note this will give correct results even for the "special"
|* values of Timestamp, since those are chosen to compare correctly;
|* see timestamp_cmp.

But:
https://www.postgresql.org/docs/10/release-10.html
|Remove support for floating-point timestamps and intervals (Tom Lane)
|This removes configure's --disable-integer-datetimes option. Floating-point timestamps have few advantages and have
notbeen the default since PostgreSQL 8.3.
 
|b6aa17e De-support floating-point timestamps.
|configure                                                    | 18 ++++++------------
|configure.in                                                 | 12 ++++++------
|doc/src/sgml/config.sgml                                     |  8 +++-----
|doc/src/sgml/datatype.sgml                                   | 55
+++++++++++--------------------------------------------
|doc/src/sgml/installation.sgml                               | 22 ----------------------
|src/include/c.h                                              |  7 ++++---
|src/include/pg_config.h.in                                   |  4 ----
|src/include/pg_config.h.win32                                |  4 ----
|src/interfaces/ecpg/include/ecpg_config.h.in                 |  4 ----
|src/interfaces/ecpg/include/pgtypes_interval.h               |  2 --
|src/interfaces/ecpg/test/expected/pgtypeslib-dt_test2.c      |  6 ++----
|src/interfaces/ecpg/test/expected/pgtypeslib-dt_test2.stdout |  2 ++
|src/interfaces/ecpg/test/pgtypeslib/dt_test2.pgc             |  6 ++----
|src/tools/msvc/Solution.pm                                   |  9 ---------
|src/tools/msvc/config_default.pl                             |  1 -
|15 files changed, 36 insertions(+), 124 deletions(-)

It's true that convert_to_scalar sees doubles:
|static double
|convert_timevalue_to_scalar(Datum value, Oid typid, bool *failure)
|{
|        switch (typid)
|        {
|                case TIMESTAMPOID:
|                        return DatumGetTimestamp(value);

But:
$ git grep DatumGetTimestamp src/include/
src/include/utils/timestamp.h:#define DatumGetTimestamp(X)  ((Timestamp) DatumGetInt64(X))

So I propose it should say something like:

|* The several datatypes representing absolute times are all converted
|* to Timestamp, which is actually an int64, and then we just promote that
|* to double.  Note this will give correct results even for the "special"
|* values of Timestamp, since those are chosen to compare correctly;
|* see timestamp_cmp.

That seems to be only used for ineq_histogram_selectivity() interpolation of
histogram bins.  It looks to me that at least isn't working for "special
values", and needs to use something other than isnan().  I added debugging code
and tested the attached like:

DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
INSERT INTO t VALUES('-infinity');
ALTER TABLE t ALTER t SET STATISTICS 1;
ANALYZE t;
explain SELECT * FROM t WHERE t>='2010-12-29';

Вложения

Re: comment regarding double timestamps; and, infinite timestamps and NaN

От
Tom Lane
Дата:
Justin Pryzby <pryzby@telsasoft.com> writes:
> selfuncs.c convert_to_scalar() says:
> |* The several datatypes representing absolute times are all converted
> |* to Timestamp, which is actually a double, and then we just use that
> |* double value.

> So I propose it should say something like:

> |* The several datatypes representing absolute times are all converted
> |* to Timestamp, which is actually an int64, and then we just promote that
> |* to double.

Check, obviously this comment never got updated.

> That seems to be only used for ineq_histogram_selectivity() interpolation of
> histogram bins.  It looks to me that at least isn't working for "special
> values", and needs to use something other than isnan().

Uh, what?  This seems completely wrong to me.  We could possibly
promote DT_NOBEGIN and DT_NOEND to +/- infinity (not NaN), but
I don't really see the point.  They'll compare to other timestamp
values correctly without that, cf timestamp_cmp_internal().
The example you give seems to me to be working sanely, or at least
as sanely as it can given the number of histogram points available,
with the existing code.  In any case, shoving NaNs into the
computation is not going to make anything better.

            regards, tom lane



Re: comment regarding double timestamps; and, infinite timestampsand NaN

От
Justin Pryzby
Дата:
On Mon, Dec 30, 2019 at 09:05:24AM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > That seems to be only used for ineq_histogram_selectivity() interpolation of
> > histogram bins.  It looks to me that at least isn't working for "special
> > values", and needs to use something other than isnan().
> 
> Uh, what?  This seems completely wrong to me.  We could possibly
> promote DT_NOBEGIN and DT_NOEND to +/- infinity (not NaN), but
> I don't really see the point.  They'll compare to other timestamp
> values correctly without that, cf timestamp_cmp_internal().
> The example you give seems to me to be working sanely, or at least
> as sanely as it can given the number of histogram points available,
> with the existing code.  In any case, shoving NaNs into the
> computation is not going to make anything better.

As I see it, the problem is that the existing code tests for isnan(), but
infinite timestamps are PG_INT64_MIN/MAX (here, stored in a double), so there's
absurdly large values being used as if they were isnormal().

src/include/datatype/timestamp.h:#define DT_NOBEGIN             PG_INT64_MIN
src/include/datatype/timestamp.h-#define DT_NOEND               PG_INT64_MAX

On v12, my test gives:
|DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
|INSERT INTO t VALUES('-infinity');
|ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
|explain analyze SELECT * FROM t WHERE t>='2010-12-29';
| Seq Scan on t  (cost=0.00..5.62 rows=3 width=8) (actual time=0.012..0.042 rows=289 loops=1)

vs patched master:
|DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
|INSERT INTO t VALUES('-infinity');
|ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
|explain analyze SELECT * FROM t WHERE t>='2010-12-29';
| Seq Scan on t  (cost=0.00..5.62 rows=146 width=8) (actual time=0.048..0.444 rows=289 loops=1)

IMO 146 rows is a reasonable estimate given a single histogram bucket of
infinite width, and 3 rows is a less reasonable result of returning INT64_MAX
in one place and then handling it as a normal value.  The comments in
ineq_histogram seem to indicate that this case is intended to get binfrac=0.5:

|  Watch out for the possibility that we got a NaN or Infinity from the
|  division.  This can happen despite the previous checks, if for example "low" is
|  -Infinity.

I changed to use INFINITY, -INFINITY and !isnormal() rather than nan() and
isnan() (although binfrac is actually NAN at that point so the existing test is
ok).

Justin



Re: comment regarding double timestamps; and, infinite timestamps and NaN

От
Tom Lane
Дата:
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Mon, Dec 30, 2019 at 09:05:24AM -0500, Tom Lane wrote:
>> Uh, what?  This seems completely wrong to me.  We could possibly
>> promote DT_NOBEGIN and DT_NOEND to +/- infinity (not NaN), but
>> I don't really see the point.  They'll compare to other timestamp
>> values correctly without that, cf timestamp_cmp_internal().
>> The example you give seems to me to be working sanely, or at least
>> as sanely as it can given the number of histogram points available,
>> with the existing code.  In any case, shoving NaNs into the
>> computation is not going to make anything better.

> As I see it, the problem is that the existing code tests for isnan(), but
> infinite timestamps are PG_INT64_MIN/MAX (here, stored in a double), so there's
> absurdly large values being used as if they were isnormal().

I still say that (1) you're confusing NaN with Infinity, and (2)
you haven't actually shown that there's a problem to fix.
These endpoint values are *not* NaNs.

> On v12, my test gives:
> |DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
> |INSERT INTO t VALUES('-infinity');
> |ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
> |explain analyze SELECT * FROM t WHERE t>='2010-12-29';
> | Seq Scan on t  (cost=0.00..5.62 rows=3 width=8) (actual time=0.012..0.042 rows=289 loops=1)

This is what it should do.  There's only one histogram bucket, and
it extends down to -infinity, so the conclusion is going to be that
the WHERE clause excludes all but a small part of the bucket.  This
is the correct answer based on the available stats; the problem is
not with the calculation, but with the miserable granularity of the
available stats.

> vs patched master:
> |DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
> |INSERT INTO t VALUES('-infinity');
> |ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
> |explain analyze SELECT * FROM t WHERE t>='2010-12-29';
> | Seq Scan on t  (cost=0.00..5.62 rows=146 width=8) (actual time=0.048..0.444 rows=289 loops=1)

This answer is simply broken.  You've caused it to estimate half
of the bucket, which is an insane estimate for the given bucket
boundaries and WHERE constraint.

> IMO 146 rows is a reasonable estimate given a single histogram bucket of
> infinite width,

No, it isn't.

            regards, tom lane



On Mon, Dec 30, 2019 at 02:18:17PM -0500, Tom Lane wrote:
> > On v12, my test gives:
> > |DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
> > |INSERT INTO t VALUES('-infinity');
> > |ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
> > |explain analyze SELECT * FROM t WHERE t>='2010-12-29';
> > | Seq Scan on t  (cost=0.00..5.62 rows=3 width=8) (actual time=0.012..0.042 rows=289 loops=1)
> 
> This is what it should do.  There's only one histogram bucket, and
> it extends down to -infinity, so the conclusion is going to be that
> the WHERE clause excludes all but a small part of the bucket.  This
> is the correct answer based on the available stats; the problem is
> not with the calculation, but with the miserable granularity of the
> available stats.
> 
> > vs patched master:
> > |DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(now(), now()+'1 day', '5 minutes');
> > |INSERT INTO t VALUES('-infinity');
> > |ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
> > |explain analyze SELECT * FROM t WHERE t>='2010-12-29';
> > | Seq Scan on t  (cost=0.00..5.62 rows=146 width=8) (actual time=0.048..0.444 rows=289 loops=1)
> 
> This answer is simply broken.  You've caused it to estimate half
> of the bucket, which is an insane estimate for the given bucket
> boundaries and WHERE constraint.
> 
> > IMO 146 rows is a reasonable estimate given a single histogram bucket of
> > infinite width,
> 
> No, it isn't.

When using floats, v12 also returns half the histogram:

 DROP TABLE t; CREATE TABLE t(t) AS SELECT generate_series(0, 99, 1)::float;
 INSERT INTO t VALUES('-Infinity');
 ALTER TABLE t ALTER t SET STATISTICS 1; ANALYZE t;
 explain analyze SELECT * FROM t WHERE t>='50';
 Seq Scan on t  (cost=0.00..2.26 rows=51 width=8) (actual time=0.014..0.020 rows=50 loops=1)

I'm fine if the isnan() logic changes, but the comment indicates it's intended
to be hit for an infinite histogram bound, but that doesn't work for timestamps
(convert_to_scalar() should return (double)INFINITY and not
(double)INT64_MIN/MAX).

On Mon, Dec 30, 2019 at 02:18:17PM -0500, Tom Lane wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Mon, Dec 30, 2019 at 09:05:24AM -0500, Tom Lane wrote:
> >> Uh, what?  This seems completely wrong to me.  We could possibly
> >> promote DT_NOBEGIN and DT_NOEND to +/- infinity (not NaN), but
> >> I don't really see the point.  They'll compare to other timestamp
> >> values correctly without that, cf timestamp_cmp_internal().
> >> The example you give seems to me to be working sanely, or at least
> >> as sanely as it can given the number of histogram points available,
> >> with the existing code.  In any case, shoving NaNs into the
> >> computation is not going to make anything better.
> 
> > As I see it, the problem is that the existing code tests for isnan(), but
> > infinite timestamps are PG_INT64_MIN/MAX (here, stored in a double), so there's
> > absurdly large values being used as if they were isnormal().
> 
> I still say that (1) you're confusing NaN with Infinity, and (2)
> you haven't actually shown that there's a problem to fix.
> These endpoint values are *not* NaNs.

I probably did confuse it while trying to make the behavior match the comment
for timestamps.
The Subject says NAN since isnan(binfrac) is what's supposed to be hit for that
case.

The NAN is intended to come from:

|binfrac = (val - low) / (high - low);

which is some variation of -inf / inf.

Justin



Justin Pryzby <pryzby@telsasoft.com> writes:
> On Mon, Dec 30, 2019 at 02:18:17PM -0500, Tom Lane wrote:
>> This answer is simply broken.  You've caused it to estimate half
>> of the bucket, which is an insane estimate for the given bucket
>> boundaries and WHERE constraint.

> I'm fine if the isnan() logic changes, but the comment indicates it's intended
> to be hit for an infinite histogram bound, but that doesn't work for timestamps
> (convert_to_scalar() should return (double)INFINITY and not
> (double)INT64_MIN/MAX).

I suppose the code you're looking at is

                        binfrac = (val - low) / (high - low);

                        /*
                         * Watch out for the possibility that we got a NaN or
                         * Infinity from the division.  This can happen
                         * despite the previous checks, if for example "low"
                         * is -Infinity.
                         */
                        if (isnan(binfrac) ||
                            binfrac < 0.0 || binfrac > 1.0)
                            binfrac = 0.5;

This doesn't really have any goals beyond "make sure we get a result
between 0.0 and 1.0, even if the calculation went pear-shaped for
some reason".  You could make an argument that it should be like

                        if (isnan(binfrac))
                            binfrac = 0.5;   /* throw up our hands for NaN */
                        else if (binfrac <= 0.0)
                            binfrac = 0.0;   /* clamp in case of -Inf or -0 */
                        else if (binfrac > 1.0)
                            binfrac = 1.0;   /* clamp in case of +Inf */

which would probably produce saner results in edge cases like these.
I think it'd also obviate the need for fooling with the conversion in
convert_to_scalar: while DT_NOBEGIN/DT_NOEND wouldn't produce exactly
the same result (hard 0.0 or 1.0) as an infinity, they'd produce
results very close to that.

            regards, tom lane