Обсуждение: BUG #15141: Faulty ISO 8601 parsing

Поиск
Список
Период
Сортировка

BUG #15141: Faulty ISO 8601 parsing

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      15141
Logged by:          defanor
Email address:      defanor@uberspace.net
PostgreSQL version: 10.0
Operating system:   Any, apparently
Description:

The time parsing fails on some valid ISO times, with some locales, e.g.:

# select to_timestamp('2018-04-03T01:45:00,728456785+0000')::timestamp with
time zone;
ERROR:  invalid input syntax for type double precision:
"2018-04-03T01:45:00,728456785+0000"
LINE 1: select to_timestamp('2018-04-03T01:45:00,728456785+0000')::t...
                            ^

Apparently the parsing is locale-dependent (using the locale-dependent
strtod function), while ISO 8601 permits both comma and full stop, with a
preference for comma (and without mentioning locales). Would be nice to
handle both, so that any valid ISO times would get parsed.


Re: BUG #15141: Faulty ISO 8601 parsing

От
"David G. Johnston"
Дата:
On Mon, Apr 2, 2018 at 4:31 PM, PG Bug reporting form <noreply@postgresql.org> wrote:
The following bug has been logged on the website:

Bug reference:      15141
Logged by:          defanor
Email address:      defanor@uberspace.net
PostgreSQL version: 10.0
Operating system:   Any, apparently
Description:

The time parsing fails on some valid ISO times, with some locales, e.g.:

# select to_timestamp('2018-04-03T01:45:00,728456785+0000')::timestamp with
time zone;
ERROR:  invalid input syntax for type double precision:
"2018-04-03T01:45:00,728456785+0000"
LINE 1: select to_timestamp('
​​
2018-04-03T01:45:00,728456785+0000')::t...
                            ^

Apparently the parsing is locale-dependent (using the locale-dependent
strtod function), while ISO 8601 permits both comma and full stop, with a
preference for comma (and without mentioning locales). Would be nice to
handle both, so that any valid ISO times would get parsed.

​The observed problem here is that you've called the single-argument version of to_timestamp, which takes a double​, and the literal that you've supplied doesn't look like a double (i.e., it contains hyphens, the letter T, colons, a comma, and a plus sign).  IOW, you've implicitly asked PostgreSQL to do: "SELECT '
2018-04-03T01:45:00,728456785+0000 '::double" and it rightly complains that it cannot.

If you want to covert a string literal to a timestamp you need to use the two-argument version of the to_timestamp function and pass a format string that looks like the ISO 8601 standard.

David J.

Re: BUG #15141: Faulty ISO 8601 parsing

От
Tom Lane
Дата:
=?utf-8?q?PG_Bug_reporting_form?= <noreply@postgresql.org> writes:
> The time parsing fails on some valid ISO times, with some locales, e.g.:

> # select to_timestamp('2018-04-03T01:45:00,728456785+0000')::timestamp with
> time zone;
> ERROR:  invalid input syntax for type double precision:
> "2018-04-03T01:45:00,728456785+0000"

This is confused: the single-argument form of to_timestamp() takes a
float8 argument, not a timestamp, which is why the error message is
phrased the way it is.

I think you meant that this fails:

# select '2018-04-03T01:45:00,728456785+0000'::timestamp with time zone;
ERROR:  invalid input syntax for type timestamp with time zone: "2018-04-03T01:45:00,728456785+0000"

which it does, but I don't think we should do anything about it.
There is not and never has been any dependency on LC_TIME properties in
PG's timestamp I/O.  Considering that we also have DateStyle to cope with,
as well as a lot more flexibility in the input parser than ISO 8601
contemplates, I think allowing a comma instead of decimal point here
would probably create more confusion than benefit.

(Also, this is not the only aspect of 8601 that we don't support;
for instance the option to specify fractional minutes or hours
instead of fractional seconds.  There again, I think the potential
for confusion and error outweighs any benefit from being nominally
more standards-compliant.)

You probably could read data formatted this way using the two-argument
form of to_timestamp() with a suitable format string.

            regards, tom lane


Re: BUG #15141: Faulty ISO 8601 parsing

От
"David G. Johnston"
Дата:
On Mon, Apr 2, 2018 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
PG Bug reporting form <noreply@postgresql.org> writes:
> The time parsing fails on some valid ISO times, with some locales, e.g.:

> # select to_timestamp('2018-04-03T01:45:00,728456785+0000')::timestamp with
> time zone;
> ERROR:  invalid input syntax for type double precision:
> "2018-04-03T01:45:00,728456785+0000"

This is confused: the single-argument form of to_timestamp() takes a
float8 argument, not a timestamp, which is why the error message is
phrased the way it is.

I think you meant that this fails:

# select '2018-04-03T01:45:00,728456785+0000'::timestamp with time zone;
ERROR:  invalid input syntax for type timestamp with time zone: "2018-04-03T01:45:00,728456785+0000"

which it does, but I don't think we should do anything about it.
There is not and never has been any dependency on LC_TIME properties in
PG's timestamp I/O.  Considering that we also have DateStyle to cope with,
as well as a lot more flexibility in the input parser than ISO 8601
contemplates, I think allowing a comma instead of decimal point here
would probably create more confusion than benefit.

​Ideally it would be as simple as:

diff --git a/src/backend/utils/adt/datetime.c b/src/backend/utils/adt/datetime.c
index 8375b93c39..4a3c7382f2 100644
--- a/src/backend/utils/adt/datetime.c
+++ b/src/backend/utils/adt/datetime.c
@@ -653,8 +653,8 @@ ParseDateTime(const char *timestr, char *workbuf, size_t buflen,
  else
  ftype[nf] = DTK_NUMBER;
  }
- /* Leading decimal point? Then fractional seconds... */
- else if (*cp == '.')
+ /* Leading decimal point or comma? Then fractional seconds... */
+ else if (*cp == '.' || *cp  == ',')
  {
  APPEND_CHAR(bufp, bufend, *cp++);
  while (isdigit((unsigned char) *cp))

But then one needs to contemplate the impact that has on:

​/* ignore other punctuation but use as delimiter */
else if (ispunct((unsigned char) *cp))
{
cp++;
continue;
}

Its not LC_TIME dependent but a bit more complex implementation wise to detect commas used as separators in custom formats and a comma used as a fractional separator.

I don't foresee much end-user confusion involved here - especially if we only allow for fractional seconds at a relatively fixed location.

David J.

Re: BUG #15141: Faulty ISO 8601 parsing

От
defanor
Дата:
> I think you meant that this fails:
>
> # select '2018-04-03T01:45:00,728456785+0000'::timestamp with time zone;
> ERROR:  invalid input syntax for type timestamp with time zone: "2018-04-03T01:45:00,728456785+0000"

Indeed, sorry – got that mixed up, copied a wrong part.


Re: BUG #15141: Faulty ISO 8601 parsing

От
defanor
Дата:
> (Also, this is not the only aspect of 8601 that we don't support;
> for instance the option to specify fractional minutes or hours
> instead of fractional seconds.  There again, I think the potential
> for confusion and error outweighs any benefit from being nominally
> more standards-compliant.)
>
> You probably could read data formatted this way using the two-argument
> form of to_timestamp() with a suitable format string.

FWIW, the use case on which I've stumbled onto that was use of COPY: as
I understand it, to_timestamp() won't help there, and it's not just
about standards compliance for the sake of it; when reading
(particularly test data) from stdin or a program, it is tempting to use
date(1) from a shell script, or arbitrary ISO 8601 formatting functions.
It's not hard to get around that either, but still would be nice to be
able to rely on standard formats being handled (or perhaps to see it
mentioned in the docs that the support is incomplete).


Re: BUG #15141: Faulty ISO 8601 parsing

От
Tom Lane
Дата:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Mon, Apr 2, 2018 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think you meant that this fails:
>> # select '2018-04-03T01:45:00,728456785+0000'::timestamp with time zone;
>> ERROR:  invalid input syntax for type timestamp with time zone:
>> "2018-04-03T01:45:00,728456785+0000"
>> which it does, but I don't think we should do anything about it.
>> There is not and never has been any dependency on LC_TIME properties in
>> PG's timestamp I/O.  Considering that we also have DateStyle to cope with,
>> as well as a lot more flexibility in the input parser than ISO 8601
>> contemplates, I think allowing a comma instead of decimal point here
>> would probably create more confusion than benefit.

> ​Ideally it would be as simple as:

> - /* Leading decimal point? Then fractional seconds... */
> - else if (*cp == '.')
> + /* Leading decimal point or comma? Then fractional seconds... */
> + else if (*cp == '.' || *cp  == ',')

Well, it's not.

The core problem here is that the datetime input parser treats commas as
ignorable punctuation, equivalent to whitespace.  For instance, we accept

# select 'sunday, april 1, 2018'::timestamp;
      timestamp      
---------------------
 2018-04-01 00:00:00
(1 row)

but not

# select 'sunday. april 1, 2018'::timestamp;
ERROR:  invalid input syntax for type timestamp: "sunday. april 1, 2018"

But surely it would not be a good idea to consider "T01:45:00,728456785"
the same as "T01:45:00 728456785".

Changing this would entail all sorts of side-effects on non-ISO
data formats.  Considering that this code is about old enough to
vote, and this is the first complaint on this specific issue
that I can recall hearing, I don't even think it's worth the
time to investigate just what the side-effects would be ...
there's no way that any proposed change would get past the
backwards-compatibility hurdles.

            regards, tom lane