Обсуждение: Allow to_date() and to_timestamp() to accept localized names

Поиск
Список
Период
Сортировка

Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:
Hello,

Going through the current items in the wiki's todo list, I have been
looking into: "Allow to_date () and to_timestamp () to accept
localized month names".

It seems to me that the code is almost there, so I would like to know
if something like the attached patch would be a reasonable way to go.

Regards,

Juan José Santamaría Flecha

Вложения

Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:
On Sun, Aug 18, 2019 at 10:42 AM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:
>
> Going through the current items in the wiki's todo list, I have been
> looking into: "Allow to_date () and to_timestamp () to accept
> localized month names".
>

I have gone through a second take on this, trying to give it a better
shape but it would surely benefit from some review, so I will open an
item in the commitfest.

Regards,

Juan José Santamaría Flecha

Вложения

Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:
On Thu, Aug 22, 2019 at 9:38 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:
>
> >
> > Going through the current items in the wiki's todo list, I have been
> > looking into: "Allow to_date () and to_timestamp () to accept
> > localized month names".
> >
>
> I have gone through a second take on this, trying to give it a better
> shape but it would surely benefit from some review, so I will open an
> item in the commitfest.

For reviewers, the current aproach for this patch is:

Break seq_search() into two functions:
  * seq_search_sqlascii() that supports seq_search() current usage.
  * seq_search_localized() similar to the previous but supports multibyte input.
To avoid code duplication most of current seq_search() logic has been
moved to a new function str_compare().
from_char_seq_search() is now responsible to choose between
seq_search_sqlascii() or seq_search_localized().
Also, since localized names is not a null terminated array,
seq_search() now receives the dimension as input and terminating nulls
have been removed from static arrays.

The commitfest item is:

https://commitfest.postgresql.org/24/2255/

Regards,

Juan José Santamaría Flecha



Re: Allow to_date() and to_timestamp() to accept localized names

От
Alvaro Herrera
Дата:
On 2019-Aug-22, Juan José Santamaría Flecha wrote:

> On Sun, Aug 18, 2019 at 10:42 AM Juan José Santamaría Flecha
> <juanjo.santamaria@gmail.com> wrote:
> >
> > Going through the current items in the wiki's todo list, I have been
> > looking into: "Allow to_date () and to_timestamp () to accept
> > localized month names".
> 
> I have gone through a second take on this, trying to give it a better
> shape but it would surely benefit from some review, so I will open an
> item in the commitfest.

I'm confused why we acquire the MONTH_DIM / etc definitions.  Can't we
just use lengthof() of the corresponding array?  AFAICS it should work
just as well.

I wonder if the "compare first char" thing (seq_search_localized) really
works when there are multibyte chars in the day/month names.  I think
the code compares just the first char ... but what if the original
string uses those funny Unicode non-normalized letters and the locale
translation uses normalized letters?  My guess is that the first-char
comparison will fail, but surely you'll want the name to match.
(There's no month/day name in Spanish that doesn't start with an ASCII
letter, but I bet there are some in other languages.)  I think the
localized comparison should avoid the first-char optimization, just
compare the whole string all the time, and avoid possible weird issues.

But then, I'm not 100% sure of this code.  formatting.c is madness
distilled.

Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:
On Fri, Sep 13, 2019 at 10:31 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>
Thanks for taking a look at this.

> I'm confused why we acquire the MONTH_DIM / etc definitions.  Can't we
> just use lengthof() of the corresponding array?  AFAICS it should work
> just as well.
>

It was because of the length difference between ascii-name arrays,
which were all null-ended, and localized-name arrays. The attached
version uses lengthof().

> I wonder if the "compare first char" thing (seq_search_localized) really
> works when there are multibyte chars in the day/month names.  I think
> the code compares just the first char ... but what if the original
> string uses those funny Unicode non-normalized letters and the locale
> translation uses normalized letters?  My guess is that the first-char
> comparison will fail, but surely you'll want the name to match.
> (There's no month/day name in Spanish that doesn't start with an ASCII
> letter, but I bet there are some in other languages.)  I think the
> localized comparison should avoid the first-char optimization, just
> compare the whole string all the time, and avoid possible weird issues.

The localized search is reformulated in this version to deal with
multibyte normalization. A regression test for this issue is included.


Regards,

Juan José Santamaría Flecha

Вложения

Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:
On Wed, Sep 18, 2019 at 11:09 AM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:
>
> On Fri, Sep 13, 2019 at 10:31 PM Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> >
> Thanks for taking a look at this.
>
> > I'm confused why we acquire the MONTH_DIM / etc definitions.  Can't we
> > just use lengthof() of the corresponding array?  AFAICS it should work
> > just as well.
> >
>
> It was because of the length difference between ascii-name arrays,
> which were all null-ended, and localized-name arrays. The attached
> version uses lengthof().
>
> > I wonder if the "compare first char" thing (seq_search_localized) really
> > works when there are multibyte chars in the day/month names.  I think
> > the code compares just the first char ... but what if the original
> > string uses those funny Unicode non-normalized letters and the locale
> > translation uses normalized letters?  My guess is that the first-char
> > comparison will fail, but surely you'll want the name to match.
> > (There's no month/day name in Spanish that doesn't start with an ASCII
> > letter, but I bet there are some in other languages.)  I think the
> > localized comparison should avoid the first-char optimization, just
> > compare the whole string all the time, and avoid possible weird issues.
>
> The localized search is reformulated in this version to deal with
> multibyte normalization. A regression test for this issue is included.

This version is updated to optimize the need for dynamic allocation.


> Regards,
>
> Juan José Santamaría Flecha

Вложения

Re: Allow to_date() and to_timestamp() to accept localized names

От
Alvaro Herrera
Дата:
This patch no longer applies.  Can you please rebase?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:
On Wed, Sep 25, 2019 at 9:57 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> This patch no longer applies.  Can you please rebase?

Thank you for the notification.

The patch rot after commit 1a950f3, a rebased version is attached.

Regards,

Juan José Santamaría Flecha

Вложения

Re: Allow to_date() and to_timestamp() to accept localized names

От
Tomas Vondra
Дата:
On Thu, Sep 26, 2019 at 08:36:25PM +0200, Juan José Santamaría Flecha wrote:
>On Wed, Sep 25, 2019 at 9:57 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>> This patch no longer applies.  Can you please rebase?
>
>Thank you for the notification.
>
>The patch rot after commit 1a950f3, a rebased version is attached.
>

Thanks. I did a quick review of this patch, and I think it's almost RFC.
I only found a couple of minor issue:

- In func.sgml, it seems we've lost this bit:

       <para>
        <literal>TM</literal> does not include trailing blanks.
        <function>to_timestamp</function> and <function>to_date</function> ignore
        the <literal>TM</literal> modifier.
       </para>

   Does that mean the function no longer ignore the TM modifier? That
   would be somewhat problematic (i.e. it might break some user code).
   But looking at the code I don't see why the patch would have this
   effect, so I suppose it's merely a doc bug.

- I don't think we need to include examples how to_timestmap ignores
   case, I'd say just stating the fact is clear enough. But if we want to
   have examples, I think we should not inline in the para but use the
   established pattern:

   <para>
     Some examples:
<programlisting>
...
</programlisting>
    </para>

   which is used elsewhere in the func.sgml file.

- In formatting.c the "common/unicode_norm.h" should be right after
   includes from "catalog/" to follow the alphabetic order (unless
   there's a reason why that does not work).

- I rather dislike the "dim" parameter name, because I immediately think
   "dimension" which makes little sense. I suggest renaming to "nitems"
   or "nelements" or something like that. 


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:

On Sat, Jan 11, 2020 at 5:06 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

Thanks. I did a quick review of this patch, and I think it's almost RFC.


Thanks for reviewing.
 
- In func.sgml, it seems we've lost this bit:

       <para>
        <literal>TM</literal> does not include trailing blanks.
        <function>to_timestamp</function> and <function>to_date</function> ignore
        the <literal>TM</literal> modifier.
       </para>

   Does that mean the function no longer ignore the TM modifier? That
   would be somewhat problematic (i.e. it might break some user code).
   But looking at the code I don't see why the patch would have this
   effect, so I suppose it's merely a doc bug.


It is intentional. This patch uses the TM modifier to identify the usage of localized names as input for to_timestamp() and to_date().
 
- I don't think we need to include examples how to_timestmap ignores
   case, I'd say just stating the fact is clear enough. But if we want to
   have examples, I think we should not inline in the para but use the
   established pattern:

   <para>
     Some examples:
<programlisting>
...
</programlisting>
    </para>

   which is used elsewhere in the func.sgml file.

I was trying to match the style surrounding the usage notes for date/time formatting [1]. Agreed that it is not worth an example on its own, so dropped.
 

- In formatting.c the "common/unicode_norm.h" should be right after
   includes from "catalog/" to follow the alphabetic order (unless
   there's a reason why that does not work).

Fixed.
 

- I rather dislike the "dim" parameter name, because I immediately think
   "dimension" which makes little sense. I suggest renaming to "nitems"
   or "nelements" or something like that.
 

Agreed, using "nelements" as a better style matchup.

Please, find attached a version addressing the above mentioned.


Regards,

Juan José Santamaría Flecha
 
Вложения

Re: Allow to_date() and to_timestamp() to accept localized names

От
Arthur Zakirov
Дата:
Hello!

On 2020/01/13 21:04, Juan José Santamaría Flecha wrote:
> Please, find attached a version addressing the above mentioned.

I have some couple picky notes.

> +    if (name_len != norm_len)
> +        pfree(norm_name);

I'm not sure here. Is it save to assume that if it was allocated a new 
norm_name name_len and norm_len always will differ?

>  static const char *const months_full[] = {
>      "January", "February", "March", "April", "May", "June", "July",
> -    "August", "September", "October", "November", "December", NULL
> +    "August", "September", "October", "November", "December"
>  };

Unfortunately I didn't see from the patch why it was necessary to remove 
last null element from months_full, days_short, adbc_strings, 
adbc_strings_long and other arrays. I think it is not good because 
unnecessarily increases the patch and adds code like the following:

> +                from_char_seq_search(&value, &s, months, localized_names,
> +                                     ONE_UPPER, MAX_MON_LEN, n, have_error,
> +                                     lengthof(months_full));

Here it passes "months" from datetime.c to from_char_seq_search() and 
calculates its length using "months_full" array from formatting.c.

-- 
Arthur



Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:


On Wed, Jan 15, 2020 at 11:20 AM Arthur Zakirov <zaartur@gmail.com> wrote:

I have some couple picky notes.


Thanks for the review.
 
> +     if (name_len != norm_len)
> +             pfree(norm_name);

I'm not sure here. Is it save to assume that if it was allocated a new
norm_name name_len and norm_len always will differ?

Good call, that check can be more robust.
 

>  static const char *const months_full[] = {
>       "January", "February", "March", "April", "May", "June", "July",
> -     "August", "September", "October", "November", "December", NULL
> +     "August", "September", "October", "November", "December"
>  };

Unfortunately I didn't see from the patch why it was necessary to remove
last null element from months_full, days_short, adbc_strings,
adbc_strings_long and other arrays. I think it is not good because
unnecessarily increases the patch and adds code like the following:

> +                             from_char_seq_search(&value, &s, months, localized_names,
> +                                                                      ONE_UPPER, MAX_MON_LEN, n, have_error,
> +                                                                      lengthof(months_full));

Here it passes "months" from datetime.c to from_char_seq_search() and
calculates its length using "months_full" array from formatting.c.
 

With the introduction of seq_search_localized() that can be avoided, minimizing code churn.

Please, find attached a version addressing the above mentioned.

Regards,

Juan José Santamaría Flecha  
 
Вложения

Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= <juanjo.santamaria@gmail.com> writes:
> [ 0001-Allow-localized-month-names-to_date-v6.patch ]

I took a quick look through this.

One thing I completely don't understand is why it's sufficient for
seq_search_localized to do "initcap" semantics.  Shouldn't it cover
the all-upper and all-lower cases as well, as the existing seq_search
code does?  (That is, why is it okay to ignore the "type" argument?)

On the other hand, you probably *should* be ignoring the "max"
argument, because AFAICS the values that are passed for that are
specific to the English ASCII spellings of the day and month names.
It seems very likely that they could be too small for some sets of
non-English names.

Related to that, the business with

+    mb_max = max * pg_encoding_max_length(encoding);
+    name_len = strlen(name);
+    name_len = name_len < mb_max ? name_len : mb_max;

seems wrong even if we assume that "max" is a sane limit on the number of
characters to consider.  This coding is likely to truncate a long "name"
string in the middle of a multibyte character, leading to bad-encoding
errors from later functions.

I also am confused by the "if (mb_max > max ...)" bit.  It looks to me
like that's an obscure way of writing "if (pg_encoding_max_length() > 1)",
which is a rather pointless check considering that the if() then goes on
to insist on encoding == PG_UTF8.

A bit further down, you have an "if (name_wlen > norm_wlen)"
condition that seems pretty fishy.  Is it really guaranteed that
unicode_normalize_kc cannot transform the string without shortening it?
I don't see that specified in its header comment, for sure.

Also, it's purely accidental if this:

    norm_name = (char *) palloc((norm_wlen + 1) * sizeof(pg_wchar));

allocates a long enough string for the conversion back to multibyte form,
because that output is not pg_wchars.  I think you want something more
like "norm_wlen * MAX_MULTIBYTE_CHAR_LEN + 1".  (I wonder whether we
need to be worrying about integer overflow in any of this.)

It seems like you could eliminate a lot of messiness by extending
localized_abbrev_days[] and related arrays by one element that's
always NULL.  That would waste, hmm, four 8-byte pointers on typical
machines --- but you're eating a lot more than 32 bytes of code to
pass around the array lengths, plus it's really ugly that the plain
and localized arrays are handled differently.

I don't think the way you're managing the "localized_names" variable
in DCH_from_char is very sensible.  The reader has to do a lot of
reverse engineering just to discover that the value is constant NULL
in most references, something that you've not helped by declaring
it outside the loop rather than inside.  I think you could drop
the variable and write constant NULL in most places, with something
like
    S_TM(n->suffix) ? localized_full_days : NULL
in those from_char_seq_search calls where it's relevant.

In general I'm displeased with the lack of attention to comments.
Notably, you added arguments to from_char_seq_search without updating
its header comment ... not that that comment is a great API spec, but
at the least you shouldn't be making it worse.  I think the bug I
complained of above is directly traceable to the lack of a clear spec
here for what "max" means, so failure to keep these comments up to
speed does have demonstrable bad consequences.

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
I wrote:
> One thing I completely don't understand is why it's sufficient for
> seq_search_localized to do "initcap" semantics.  Shouldn't it cover
> the all-upper and all-lower cases as well, as the existing seq_search
> code does?  (That is, why is it okay to ignore the "type" argument?)

I took a closer look at this and decided you were probably doing that
just because the seq_search code uses initcap-style case folding to
match month and day names, relying on the assumption that the constant
arrays it's passed are cased that way.  But we shouldn't (and the patch
doesn't) assume that the localized names we'll get from pg_locale.c are
cased that way.

However ... it seems to me that the way seq_search is defined is
plenty bogus.  In the first place, it scribbles on its source string,
which is surely not okay.  You can see that in error messages that
are thrown on match failures:

regression=# select to_date('3 JANUZRY 1999', 'DD MONTH YYYY');
ERROR:  invalid value "JanuzRY 1" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for this field.

Now, maybe somebody out there thinks this is a cute way of highlighting
how much of the string we examined, but it looks more like a bug from
where I sit.  It's mere luck that what's being clobbered is a local
string copy created by do_to_timestamp(), and not the original input
data passed to to_date().

In the second place, there's really no need at all for seq_search to
rely on any assumptions about the case state of the array it's
searching.  pg_toupper() is pretty cheap already, and we can make it
guaranteed cheap if we use pg_ascii_toupper() instead.  So I think
we ought to just remove the "type" argument altogether and have
seq_search dynamically convert all the strings it examines to upper
case (or lower case would work as well, at least for English).

> On the other hand, you probably *should* be ignoring the "max"
> argument, because AFAICS the values that are passed for that are
> specific to the English ASCII spellings of the day and month names.
> It seems very likely that they could be too small for some sets of
> non-English names.

Closer examination shows that the "max" argument is pretty bogus as
well.  It doesn't do anything except confuse the reader, because there
are no cases where the value passed is less than the maximum array entry
length, so it never acts to change seq_search's behavior.  So we should
just drop that behavior from seq_search, too, and redefine "max" as
having no purpose except to specify how much of the string to show in
error messages.  There's still a question of what that should be for
non-English cases, but at least we now have a clear idea of what we
need the value to do.

I also noted while I was looking at it that from_char_seq_search()
would truncate at "max" bytes even when dealing with multibyte input.
That has a clear risk of generating an invalidly-encoded error message.

The 0001 patch attached cleans up all the above complaints.  I felt
that given the business about scribbling on strings we shouldn't,
it would also be wise to const-ify the string pointers if possible.
That seems not too painful, per 0002 attached.

I'm tempted to go a bit further than 0001 does, and remove the 'max'
argument from from_char_seq_search() altogether.  Since we only need
'max' in error cases, which don't need to be super-quick, we could drop
the requirement for the caller to specify that and instead compute it
when we do need it as the largest of the constant array's string
lengths.  That'd carry over into the localized-names case in a
reasonably straightforward way (though we might need to count characters
not bytes for that to work nicely).

Because of the risk of incorrectly-encoded error messages, I'm rather
tempted to claim that these patches (or at least 0001) are a bug fix
and should be back-patched.  In any case I think we should apply this
to HEAD and then rebase the localized-names patch over it.  It makes
a whole lot more sense, IMO, for the localized-names comparison logic
to match what this is doing than what seq_search does today.

Comments?

            regards, tom lane

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index ca3c48d..3ef10dc 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -317,10 +317,6 @@ static const char *const numth[] = {"st", "nd", "rd", "th", NULL};
  * Flags & Options:
  * ----------
  */
-#define ONE_UPPER        1        /* Name */
-#define ALL_UPPER        2        /* NAME */
-#define ALL_LOWER        3        /* name */
-
 #define MAX_MONTH_LEN    9
 #define MAX_MON_LEN        3
 #define MAX_DAY_LEN        9
@@ -1068,9 +1064,9 @@ static int    from_char_parse_int_len(int *dest, char **src, const int len,
                                     FormatNode *node, bool *have_error);
 static int    from_char_parse_int(int *dest, char **src, FormatNode *node,
                                 bool *have_error);
-static int    seq_search(char *name, const char *const *array, int type, int max, int *len);
+static int    seq_search(const char *name, const char *const *array, int *len);
 static int    from_char_seq_search(int *dest, char **src,
-                                 const char *const *array, int type, int max,
+                                 const char *const *array, int max,
                                  FormatNode *node, bool *have_error);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
                             struct pg_tm *tm, fsec_t *fsec, int *fprec,
@@ -2454,70 +2450,57 @@ from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error)
 }

 /* ----------
- * Sequential search with to upper/lower conversion
+ * Sequentially search null-terminated "array" for a case-insensitive match
+ * to the initial character(s) of "name".
+ *
+ * Returns array index of match, or -1 for no match.
+ *
+ * *len is set to the length of the match, or 0 for no match.
+ *
+ * Case-insensitivity is defined per pg_ascii_toupper, so this is only
+ * suitable for comparisons to ASCII strings.
  * ----------
  */
 static int
-seq_search(char *name, const char *const *array, int type, int max, int *len)
+seq_search(const char *name, const char *const *array, int *len)
 {
-    const char *p;
+    unsigned char firstc;
     const char *const *a;
-    char       *n;
-    int            last,
-                i;

     *len = 0;

+    /* empty string can't match anything */
     if (!*name)
         return -1;

-    /* set first char */
-    if (type == ONE_UPPER || type == ALL_UPPER)
-        *name = pg_toupper((unsigned char) *name);
-    else if (type == ALL_LOWER)
-        *name = pg_tolower((unsigned char) *name);
+    /* we handle first char specially to gain some speed */
+    firstc = pg_ascii_toupper((unsigned char) *name);

-    for (last = 0, a = array; *a != NULL; a++)
+    for (a = array; *a != NULL; a++)
     {
+        int            i;
+        const char *p;
+        const char *n;
+
         /* compare first chars */
-        if (*name != **a)
+        if (pg_ascii_toupper((unsigned char) **a) != firstc)
             continue;

         for (i = 1, p = *a + 1, n = name + 1;; n++, p++, i++)
         {
-            /* search fragment (max) only */
-            if (max && i == max)
-            {
-                *len = i;
-                return a - array;
-            }
-            /* full size */
+            /* return success if we matched whole array entry */
             if (*p == '\0')
             {
                 *len = i;
                 return a - array;
             }
-            /* Not found in array 'a' */
+            /* else, must have another character in "name" ... */
             if (*n == '\0')
                 break;

-            /*
-             * Convert (but convert new chars only)
-             */
-            if (i > last)
-            {
-                if (type == ONE_UPPER || type == ALL_LOWER)
-                    *n = pg_tolower((unsigned char) *n);
-                else if (type == ALL_UPPER)
-                    *n = pg_toupper((unsigned char) *n);
-                last = i;
-            }
-
-#ifdef DEBUG_TO_FROM_CHAR
-            elog(DEBUG_elog_output, "N: %c, P: %c, A: %s (%s)",
-                 *n, *p, *a, name);
-#endif
-            if (*n != *p)
+            /* ... and it must match */
+            if (pg_ascii_toupper((unsigned char) *p) !=
+                pg_ascii_toupper((unsigned char) *n))
                 break;
         }
     }
@@ -2526,8 +2509,8 @@ seq_search(char *name, const char *const *array, int type, int max, int *len)
 }

 /*
- * Perform a sequential search in 'array' for text matching the first 'max'
- * characters of the source string.
+ * Perform a sequential search in 'array' for an entry matching the first
+ * character(s) of the 'src' string case-insensitively.
  *
  * If a match is found, copy the array index of the match into the integer
  * pointed to by 'dest', advance 'src' to the end of the part of the string
@@ -2535,19 +2518,28 @@ seq_search(char *name, const char *const *array, int type, int max, int *len)
  *
  * If the string doesn't match, throw an error if 'have_error' is NULL,
  * otherwise set '*have_error' and return -1.
+ *
+ * 'max' and 'node' are used only for error reports: 'max' is the max
+ * number of bytes of the string to include in the error report, and
+ * node->key->name identifies what we were searching for.
  */
 static int
-from_char_seq_search(int *dest, char **src, const char *const *array, int type,
+from_char_seq_search(int *dest, char **src, const char *const *array,
                      int max, FormatNode *node, bool *have_error)
 {
     int            len;

-    *dest = seq_search(*src, array, type, max, &len);
+    /* Assert that no call gives us a 'max' too large for the error buffer */
+    Assert(max <= DCH_MAX_ITEM_SIZ);
+
+    *dest = seq_search(*src, array, &len);
+
     if (len <= 0)
     {
         char        copy[DCH_MAX_ITEM_SIZ + 1];

-        Assert(max <= DCH_MAX_ITEM_SIZ);
+        /* Use multibyte-aware truncation to avoid generating a bogus string */
+        max = pg_mbcliplen(*src, strlen(*src), max);
         strlcpy(copy, *src, max + 1);

         RETURN_ERROR(ereport(ERROR,
@@ -3279,7 +3271,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_a_m:
             case DCH_p_m:
                 from_char_seq_search(&value, &s, ampm_strings_long,
-                                     ALL_UPPER, n->key->len, n, have_error);
+                                     n->key->len, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->pm, value % 2, n, have_error);
                 CHECK_ERROR;
@@ -3290,7 +3282,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_am:
             case DCH_pm:
                 from_char_seq_search(&value, &s, ampm_strings,
-                                     ALL_UPPER, n->key->len, n, have_error);
+                                     n->key->len, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->pm, value % 2, n, have_error);
                 CHECK_ERROR;
@@ -3403,7 +3395,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_a_d:
             case DCH_b_c:
                 from_char_seq_search(&value, &s, adbc_strings_long,
-                                     ALL_UPPER, n->key->len, n, have_error);
+                                     n->key->len, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->bc, value % 2, n, have_error);
                 CHECK_ERROR;
@@ -3413,7 +3405,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_ad:
             case DCH_bc:
                 from_char_seq_search(&value, &s, adbc_strings,
-                                     ALL_UPPER, n->key->len, n, have_error);
+                                     n->key->len, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->bc, value % 2, n, have_error);
                 CHECK_ERROR;
@@ -3421,7 +3413,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_MONTH:
             case DCH_Month:
             case DCH_month:
-                from_char_seq_search(&value, &s, months_full, ONE_UPPER,
+                from_char_seq_search(&value, &s, months_full,
                                      MAX_MONTH_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3430,7 +3422,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_MON:
             case DCH_Mon:
             case DCH_mon:
-                from_char_seq_search(&value, &s, months, ONE_UPPER,
+                from_char_seq_search(&value, &s, months,
                                      MAX_MON_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3444,7 +3436,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_DAY:
             case DCH_Day:
             case DCH_day:
-                from_char_seq_search(&value, &s, days, ONE_UPPER,
+                from_char_seq_search(&value, &s, days,
                                      MAX_DAY_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->d, value, n, have_error);
@@ -3454,7 +3446,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
             case DCH_DY:
             case DCH_Dy:
             case DCH_dy:
-                from_char_seq_search(&value, &s, days, ONE_UPPER,
+                from_char_seq_search(&value, &s, days,
                                      MAX_DY_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->d, value, n, have_error);
@@ -3572,7 +3564,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
                 break;
             case DCH_RM:
                 from_char_seq_search(&value, &s, rm_months_upper,
-                                     ALL_UPPER, MAX_RM_LEN, n, have_error);
+                                     MAX_RM_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
                                   n, have_error);
@@ -3580,7 +3572,7 @@ DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
                 break;
             case DCH_rm:
                 from_char_seq_search(&value, &s, rm_months_lower,
-                                     ALL_LOWER, MAX_RM_LEN, n, have_error);
+                                     MAX_RM_LEN, n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
                                   n, have_error);
diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 3ef10dc..ef21b7e 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1044,7 +1044,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw,

 static void DCH_to_char(FormatNode *node, bool is_interval,
                         TmToChar *in, char *out, Oid collid);
-static void DCH_from_char(FormatNode *node, char *in, TmFromChar *out,
+static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
                           bool std, bool *have_error);

 #ifdef DEBUG_TO_FROM_CHAR
@@ -1055,17 +1055,17 @@ static void dump_node(FormatNode *node, int max);
 static const char *get_th(char *num, int type);
 static char *str_numth(char *dest, char *num, int type);
 static int    adjust_partial_year_to_2020(int year);
-static int    strspace_len(char *str);
+static int    strspace_len(const char *str);
 static void from_char_set_mode(TmFromChar *tmfc, const FromCharDateMode mode,
                                bool *have_error);
 static void from_char_set_int(int *dest, const int value, const FormatNode *node,
                               bool *have_error);
-static int    from_char_parse_int_len(int *dest, char **src, const int len,
+static int    from_char_parse_int_len(int *dest, const char **src, const int len,
                                     FormatNode *node, bool *have_error);
-static int    from_char_parse_int(int *dest, char **src, FormatNode *node,
+static int    from_char_parse_int(int *dest, const char **src, FormatNode *node,
                                 bool *have_error);
 static int    seq_search(const char *name, const char *const *array, int *len);
-static int    from_char_seq_search(int *dest, char **src,
+static int    from_char_seq_search(int *dest, const char **src,
                                  const char *const *array, int max,
                                  FormatNode *node, bool *have_error);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
@@ -2255,7 +2255,7 @@ adjust_partial_year_to_2020(int year)


 static int
-strspace_len(char *str)
+strspace_len(const char *str)
 {
     int            len = 0;

@@ -2344,12 +2344,12 @@ on_error:
  * and -1 is returned.
  */
 static int
-from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node,
+from_char_parse_int_len(int *dest, const char **src, const int len, FormatNode *node,
                         bool *have_error)
 {
     long        result;
     char        copy[DCH_MAX_ITEM_SIZ + 1];
-    char       *init = *src;
+    const char *init = *src;
     int            used;

     /*
@@ -2366,8 +2366,11 @@ from_char_parse_int_len(int *dest, char **src, const int len, FormatNode *node,
          * This node is in Fill Mode, or the next node is known to be a
          * non-digit value, so we just slurp as many characters as we can get.
          */
+        char       *endptr;
+
         errno = 0;
-        result = strtol(init, src, 10);
+        result = strtol(init, &endptr, 10);
+        *src = endptr;
     }
     else
     {
@@ -2444,7 +2447,7 @@ on_error:
  * required length explicitly.
  */
 static int
-from_char_parse_int(int *dest, char **src, FormatNode *node, bool *have_error)
+from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_error)
 {
     return from_char_parse_int_len(dest, src, node->key->len, node, have_error);
 }
@@ -2524,7 +2527,7 @@ seq_search(const char *name, const char *const *array, int *len)
  * node->key->name identifies what we were searching for.
  */
 static int
-from_char_seq_search(int *dest, char **src, const char *const *array,
+from_char_seq_search(int *dest, const char **src, const char *const *array,
                      int max, FormatNode *node, bool *have_error)
 {
     int            len;
@@ -3158,11 +3161,11 @@ DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid col
  * ----------
  */
 static void
-DCH_from_char(FormatNode *node, char *in, TmFromChar *out, bool std,
+DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
               bool *have_error)
 {
     FormatNode *n;
-    char       *s;
+    const char *s;
     int            len,
                 value;
     bool        fx_mode = std;

Re: Allow to_date() and to_timestamp() to accept localized names

От
Arthur Zakirov
Дата:
On 2020/01/23 7:11, Tom Lane wrote:
> Closer examination shows that the "max" argument is pretty bogus as
> well.  It doesn't do anything except confuse the reader, because there
> are no cases where the value passed is less than the maximum array entry
> length, so it never acts to change seq_search's behavior.  So we should
> just drop that behavior from seq_search, too, and redefine "max" as
> having no purpose except to specify how much of the string to show in
> error messages.  There's still a question of what that should be for
> non-English cases, but at least we now have a clear idea of what we
> need the value to do.

Shouldn't we just show all remaining string instead of truncating it? 
For example I get the following output:

=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH YYYY');
ERROR:  invalid value "ЯНВА" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for 
this field.

This behavior is reproduced without the patch though (on Postgres 12).

And the English case might confuse too I think:

=# select to_date('3 JANUZRY 1999', 'DD MONTH YYYY');
ERROR:  invalid value "JANUZRY 1" for "MONTH"
DETAIL:  The given value did not match any of the allowed values for 
this field.

Users might think what means "1" in "JANUZRY 1" string.

I attached the draft patch, which shows all remaining string. So the 
query above will show the following:

=# select to_timestamp('3 ЯНВАРЯ 1999', 'DD MONTH YYYY');
ERROR:  invalid value "ЯНВАРЯ 1999" for "MONTH"
DETAIL:  The remaining value did not match any of the allowed values for 
this field.

> The 0001 patch attached cleans up all the above complaints.  I felt
> that given the business about scribbling on strings we shouldn't,
> it would also be wise to const-ify the string pointers if possible.
> That seems not too painful, per 0002 attached.

+1 on the patch.

-- 
Arthur

Вложения

Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
Arthur Zakirov <zaartur@gmail.com> writes:
> On 2020/01/23 7:11, Tom Lane wrote:
>> Closer examination shows that the "max" argument is pretty bogus as
>> well.  It doesn't do anything except confuse the reader, because there
>> are no cases where the value passed is less than the maximum array entry
>> length, so it never acts to change seq_search's behavior.  So we should
>> just drop that behavior from seq_search, too, and redefine "max" as
>> having no purpose except to specify how much of the string to show in
>> error messages.  There's still a question of what that should be for
>> non-English cases, but at least we now have a clear idea of what we
>> need the value to do.

> Shouldn't we just show all remaining string instead of truncating it? 

That would avoid a bunch of arbitrary decisions, for sure.
Anybody have an objection?

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Alvaro Herrera
Дата:
On 2020-Jan-22, Tom Lane wrote:

> Arthur Zakirov <zaartur@gmail.com> writes:
> > On 2020/01/23 7:11, Tom Lane wrote:
> >> Closer examination shows that the "max" argument is pretty bogus as
> >> well.  It doesn't do anything except confuse the reader, because there
> >> are no cases where the value passed is less than the maximum array entry
> >> length, so it never acts to change seq_search's behavior.  So we should
> >> just drop that behavior from seq_search, too, and redefine "max" as
> >> having no purpose except to specify how much of the string to show in
> >> error messages.  There's still a question of what that should be for
> >> non-English cases, but at least we now have a clear idea of what we
> >> need the value to do.
> 
> > Shouldn't we just show all remaining string instead of truncating it? 
> 
> That would avoid a bunch of arbitrary decisions, for sure.
> Anybody have an objection?

I think it would be clearer to search for whitespace starting at the
start of the bogus token and stop there.  It might not be perfect,
particularly if any language has whitespace in a month etc name (I don't
know any example but I guess it's not impossible for it to exist;
Portuguese weekday names maybe?), but it seems better than either of the
behaviors shown above.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-Jan-22, Tom Lane wrote:
>> Arthur Zakirov <zaartur@gmail.com> writes:
>>> Shouldn't we just show all remaining string instead of truncating it? 

>> That would avoid a bunch of arbitrary decisions, for sure.
>> Anybody have an objection?

> I think it would be clearer to search for whitespace starting at the
> start of the bogus token and stop there.  It might not be perfect,
> particularly if any language has whitespace in a month etc name (I don't
> know any example but I guess it's not impossible for it to exist;
> Portuguese weekday names maybe?), but it seems better than either of the
> behaviors shown above.

I'm okay with that.  It won't work so well for cases like
"1-Januzry-1999", but it's still better than what happens now:

regression=# select to_date('1-Januzry-1999', 'DD MONTH YYYY');
ERROR:  invalid value "Januzry-1" for "MONTH"

That particular case could be improved by stopping at a dash ... but
since this code is also used to match strings like "A.M.", we can't
just exclude punctuation in general.  Breaking at whitespace seems
like a reasonable compromise.

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Alvaro Herrera
Дата:
On 2020-Jan-23, Tom Lane wrote:

> That particular case could be improved by stopping at a dash ... but
> since this code is also used to match strings like "A.M.", we can't
> just exclude punctuation in general.  Breaking at whitespace seems
> like a reasonable compromise.

Yeah, and there are cases where dashes are used in names -- browsing
through glibc for example I quickly found Akan, for which the month names are:

mon  "Sanda-<U0186>p<U025B>p<U0254>n";/
     "Kwakwar-<U0186>gyefuo";/
     "Eb<U0254>w-<U0186>benem";/

and so on.  Even whitespace is problematic for some languages, such as Afan,

mon      "Qunxa Garablu";/
         "Naharsi Kudo";/
         "Ciggilta Kudo";/
(etc) but I think whitespace-splitting is going to be more comprehensible
in the vast majority of cases, even if not perfect.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Even whitespace is problematic for some languages, such as Afan,

> mon      "Qunxa Garablu";/
>          "Naharsi Kudo";/
>          "Ciggilta Kudo";/

> (etc) but I think whitespace-splitting is going to be more comprehensible
> in the vast majority of cases, even if not perfect.

Interesting.  Given that to_date et al are often willing to ignore
whitespace in input, I wonder whether we won't have some other issues
with names like that --- not so much that basic cases wouldn't work,
as that people might reasonably expect that, say, we'd be flexible
about the amount of whitespace.  Seems like a problem for another day
though.

In the meantime, I agree that "truncate at whitespace" is a better
heuristic for the error messages than what we've got.  I'll go make it so.

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
Here's a v7 patch, rebased over my recent hacking, and addressing
most of the complaints I raised in <31691.1579648824@sss.pgh.pa.us>.
However, I've got some new complaints now that I've looked harder
at the code:

* It's not exactly apparent to me why this code should be concerned
about non-normalized characters when noplace else in the backend is.
I think we should either rip that out entirely, or move the logic
into str_tolower (and hence also str_toupper etc).  I'd tend to favor
the former, but of course I don't speak any languages where this would
be an issue.  Perhaps a better idea is to invent a new SQL function
that normalizes a given string, and expect users to call that first
if they'd like to_date() to match unnormalized text.

* I have no faith in this calculation that decides how long the match
length was:

        *len = element_len + name_len - norm_len;

I seriously doubt that this does the right thing if either the
normalization or the downcasing changed the byte length of the
string.  I'm not actually sure how we can do that correctly.
There's no good way to know whether the changes happened within
the part of the "name" string that we matched, or the part beyond
what we matched, but we only want to count the former.  So this
seems like a pretty hard problem, and even if this logic is somehow
correct as it stands, it needs a comment explaining why.

* I'm still concerned about the risk of integer overflow in the
string allocations in seq_search_localized().  Those need to be
adjusted to be more paranoid, similar to the code in e.g. str_tolower.

            regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 6c4359d..ff44496 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5934,7 +5934,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
        </row>
        <row>
         <entry><literal>TM</literal> prefix</entry>
-        <entry>translation mode (print localized day and month names based on
+        <entry>translation mode (use localized day and month names based on
          <xref linkend="guc-lc-time"/>)</entry>
         <entry><literal>TMMonth</literal></entry>
        </row>
@@ -5966,8 +5966,13 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
      <listitem>
       <para>
        <literal>TM</literal> does not include trailing blanks.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
        <function>to_timestamp</function> and <function>to_date</function> ignore
-       the <literal>TM</literal> modifier.
+       the case when receiving names as an input.
       </para>
      </listitem>

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index 792f9ca..2f39050 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -87,6 +87,7 @@

 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
+#include "common/unicode_norm.h"
 #include "mb/pg_wchar.h"
 #include "parser/scansup.h"
 #include "utils/builtins.h"
@@ -1059,9 +1060,11 @@ static int    from_char_parse_int_len(int *dest, const char **src, const int len,
                                     FormatNode *node, bool *have_error);
 static int    from_char_parse_int(int *dest, const char **src, FormatNode *node,
                                 bool *have_error);
-static int    seq_search(const char *name, const char *const *array, int *len);
+static int    seq_search_ascii(const char *name, const char *const *array, int *len);
+static int    seq_search_localized(const char *name, char **array, int *len);
 static int    from_char_seq_search(int *dest, const char **src,
                                  const char *const *array,
+                                 char **localized_array,
                                  FormatNode *node, bool *have_error);
 static void do_to_timestamp(text *date_txt, text *fmt, bool std,
                             struct pg_tm *tm, fsec_t *fsec, int *fprec,
@@ -2459,7 +2462,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er
  * suitable for comparisons to ASCII strings.
  */
 static int
-seq_search(const char *name, const char *const *array, int *len)
+seq_search_ascii(const char *name, const char *const *array, int *len)
 {
     unsigned char firstc;
     const char *const *a;
@@ -2505,8 +2508,92 @@ seq_search(const char *name, const char *const *array, int *len)
 }

 /*
- * Perform a sequential search in 'array' for an entry matching the first
- * character(s) of the 'src' string case-insensitively.
+ * Sequentially search an array of possibly non-English words for
+ * a case-insensitive match to the initial character(s) of "name".
+ *
+ * This has the same API as seq_search_ascii(), but we use a more general
+ * downcasing transformation to achieve case-insensitivity.
+ *
+ * The array is treated as const, but we don't declare it that way because
+ * the arrays exported by pg_locale.c aren't const.
+ */
+static int
+seq_search_localized(const char *name, char **array, int *len)
+{
+    char      **a;
+    char       *lower_name;
+    char       *norm_name;
+    int            name_len;
+    int            norm_len;
+
+    *len = 0;
+
+    /* empty string can't match anything */
+    if (!*name)
+        return -1;
+
+    name_len = strlen(name);
+
+    /* Normalize name, if working in Unicode */
+    if (GetDatabaseEncoding() == PG_UTF8)
+    {
+        pg_wchar   *wchar_name;
+        pg_wchar   *norm_wname;
+        size_t        name_wlen;
+        size_t        norm_wlen;
+
+        wchar_name = (pg_wchar *) palloc((name_len + 1) * sizeof(pg_wchar));
+        name_wlen = pg_mb2wchar_with_len(name, wchar_name, name_len);
+        norm_wname = unicode_normalize_kc(wchar_name);
+        pfree(wchar_name);
+        norm_wlen = pg_wchar_strlen(norm_wname);
+        norm_name = (char *) palloc(norm_wlen * MAX_MULTIBYTE_CHAR_LEN + 1);
+        norm_len = pg_wchar2mb_with_len(norm_wname, norm_name, norm_wlen);
+        pfree(norm_wname);
+    }
+    else
+    {
+        norm_name = unconstify(char *, name);
+        norm_len = name_len;
+    }
+
+    /* Now lower-case it */
+    lower_name = str_tolower(norm_name, norm_len, DEFAULT_COLLATION_OID);
+    if ((const char *) norm_name != name)
+        pfree(norm_name);
+
+    for (a = array; *a != NULL; a++)
+    {
+        char       *lower_element;
+        int            element_len;
+
+        /* Lower-case array element, assuming it is normalized */
+        lower_element = str_tolower(*a, strlen(*a), DEFAULT_COLLATION_OID);
+        element_len = strlen(lower_element);
+
+        /* Match? */
+        if (strncmp(lower_name, lower_element, element_len) == 0)
+        {
+            *len = element_len + name_len - norm_len;
+            pfree(lower_element);
+            pfree(lower_name);
+            return a - array;
+        }
+        pfree(lower_element);
+    }
+
+    pfree(lower_name);
+    return -1;
+}
+
+/*
+ * Perform a sequential search in 'array' (or 'localized_array', if that's
+ * not NULL) for an entry matching the first character(s) of the 'src'
+ * string case-insensitively.
+ *
+ * The 'array' is presumed to be English words (all-ASCII), but
+ * if 'localized_array' is supplied, that might be non-English
+ * so we need a more expensive downcasing transformation.
  *
  * If a match is found, copy the array index of the match into the integer
  * pointed to by 'dest', advance 'src' to the end of the part of the string
@@ -2520,11 +2607,15 @@ seq_search(const char *name, const char *const *array, int *len)
  */
 static int
 from_char_seq_search(int *dest, const char **src, const char *const *array,
+                     char **localized_array,
                      FormatNode *node, bool *have_error)
 {
     int            len;

-    *dest = seq_search(*src, array, &len);
+    if (localized_array == NULL)
+        *dest = seq_search_ascii(*src, array, &len);
+    else
+        *dest = seq_search_localized(*src, localized_array, &len);

     if (len <= 0)
     {
@@ -3172,6 +3263,9 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
     /* number of extra skipped characters (more than given in format string) */
     int            extra_skip = 0;

+    /* cache localized days and months */
+    cache_locale_time();
+
     for (n = node, s = in; n->type != NODE_TYPE_END && *s != '\0'; n++)
     {
         /*
@@ -3272,7 +3366,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_P_M:
             case DCH_a_m:
             case DCH_p_m:
-                from_char_seq_search(&value, &s, ampm_strings_long,
+                from_char_seq_search(&value, &s, ampm_strings_long, NULL,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3283,7 +3377,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_PM:
             case DCH_am:
             case DCH_pm:
-                from_char_seq_search(&value, &s, ampm_strings,
+                from_char_seq_search(&value, &s, ampm_strings, NULL,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3396,7 +3490,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_B_C:
             case DCH_a_d:
             case DCH_b_c:
-                from_char_seq_search(&value, &s, adbc_strings_long,
+                from_char_seq_search(&value, &s, adbc_strings_long, NULL,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3406,7 +3500,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_BC:
             case DCH_ad:
             case DCH_bc:
-                from_char_seq_search(&value, &s, adbc_strings,
+                from_char_seq_search(&value, &s, adbc_strings, NULL,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3416,6 +3510,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Month:
             case DCH_month:
                 from_char_seq_search(&value, &s, months_full,
+                                     S_TM(n->suffix) ? localized_full_months : NULL,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3425,6 +3520,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Mon:
             case DCH_mon:
                 from_char_seq_search(&value, &s, months,
+                                     S_TM(n->suffix) ? localized_abbrev_months : NULL,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3439,6 +3535,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Day:
             case DCH_day:
                 from_char_seq_search(&value, &s, days,
+                                     S_TM(n->suffix) ? localized_full_days : NULL,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->d, value, n, have_error);
@@ -3449,6 +3546,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Dy:
             case DCH_dy:
                 from_char_seq_search(&value, &s, days_short,
+                                     S_TM(n->suffix) ? localized_abbrev_days : NULL,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->d, value, n, have_error);
@@ -3566,7 +3664,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
                 break;
             case DCH_RM:
             case DCH_rm:
-                from_char_seq_search(&value, &s, rm_months_lower,
+                from_char_seq_search(&value, &s, rm_months_lower, NULL,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 25fb7e2..64fd3ae 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -96,11 +96,17 @@ char       *locale_monetary;
 char       *locale_numeric;
 char       *locale_time;

-/* lc_time localization cache */
-char       *localized_abbrev_days[7];
-char       *localized_full_days[7];
-char       *localized_abbrev_months[12];
-char       *localized_full_months[12];
+/*
+ * lc_time localization cache.
+ *
+ * We use only the first 7 or 12 entries of these arrays.  The last array
+ * element is left as NULL for the convenience of outside code that wants
+ * to sequentially scan these arrays.
+ */
+char       *localized_abbrev_days[7 + 1];
+char       *localized_full_days[7 + 1];
+char       *localized_abbrev_months[12 + 1];
+char       *localized_full_months[12 + 1];

 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
@@ -922,6 +928,8 @@ cache_locale_time(void)
         cache_single_string(&localized_full_days[i], bufptr, encoding);
         bufptr += MAX_L10N_DATA;
     }
+    localized_abbrev_days[7] = NULL;
+    localized_full_days[7] = NULL;

     /* localized months */
     for (i = 0; i < 12; i++)
@@ -931,6 +939,8 @@ cache_locale_time(void)
         cache_single_string(&localized_full_months[i], bufptr, encoding);
         bufptr += MAX_L10N_DATA;
     }
+    localized_abbrev_months[12] = NULL;
+    localized_full_months[12] = NULL;

     CurrentLCTimeValid = true;
 }
diff --git a/src/test/regress/expected/collate.linux.utf8.out b/src/test/regress/expected/collate.linux.utf8.out
index 37c6add..af3762e 100644
--- a/src/test/regress/expected/collate.linux.utf8.out
+++ b/src/test/regress/expected/collate.linux.utf8.out
@@ -461,6 +461,22 @@ SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");
  01 NİS 2010
 (1 row)

+-- to_date
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY'); -- normalized \u015E
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY'); -- not normalized \u0053+\u0327
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+ERROR:  invalid value "1234567890ab" for "MONTH"
+DETAIL:  The given value did not match any of the allowed values for this field.
 -- backwards parsing
 CREATE VIEW collview1 AS SELECT * FROM collate_test1 WHERE b COLLATE "C" >= 'bbc';
 CREATE VIEW collview2 AS SELECT a, b FROM collate_test1 ORDER BY b COLLATE "C";
diff --git a/src/test/regress/sql/collate.linux.utf8.sql b/src/test/regress/sql/collate.linux.utf8.sql
index 8c26f16..b0a85c2 100644
--- a/src/test/regress/sql/collate.linux.utf8.sql
+++ b/src/test/regress/sql/collate.linux.utf8.sql
@@ -182,6 +182,12 @@ SELECT to_char(date '2010-02-01', 'DD TMMON YYYY' COLLATE "tr_TR");
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");

+-- to_date
+
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY'); -- normalized \u015E
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY'); -- not normalized \u0053+\u0327
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+

 -- backwards parsing


Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:


On Thu, Jan 23, 2020 at 11:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Thank you for your time looking into this.

Here's a v7 patch, rebased over my recent hacking, and addressing
most of the complaints I raised in <31691.1579648824@sss.pgh.pa.us>.
However, I've got some new complaints now that I've looked harder
at the code:

* It's not exactly apparent to me why this code should be concerned
about non-normalized characters when noplace else in the backend is.
I think we should either rip that out entirely, or move the logic
into str_tolower (and hence also str_toupper etc).  I'd tend to favor
the former, but of course I don't speak any languages where this would
be an issue.  Perhaps a better idea is to invent a new SQL function
that normalizes a given string, and expect users to call that first
if they'd like to_date() to match unnormalized text.


There is an open patch that will make the normalization functionality user visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD TMMON YYYY') I would vote to drop the normalization logic inside this patch altogether.

* I have no faith in this calculation that decides how long the match
length was:

                *len = element_len + name_len - norm_len;

I seriously doubt that this does the right thing if either the
normalization or the downcasing changed the byte length of the
string.  I'm not actually sure how we can do that correctly.
There's no good way to know whether the changes happened within
the part of the "name" string that we matched, or the part beyond
what we matched, but we only want to count the former.  So this
seems like a pretty hard problem, and even if this logic is somehow
correct as it stands, it needs a comment explaining why.


The proper logic would come from do_to_timestamp() receiving a normalized "date_txt" input, so we do not operate with unnormalize and normalize strings at the same time.
 
* I'm still concerned about the risk of integer overflow in the
string allocations in seq_search_localized().  Those need to be
adjusted to be more paranoid, similar to the code in e.g. str_tolower.

Please find attached a patch with the normalization logic removed, thus no direct allocations in seq_search_localized().

I would like to rise a couple of questions myself:

* When compiled with DEBUG_TO_FROM_CHAR, there is a warning "‘dump_node’ defined but not used". Should we drop this function or uncomment its usage?

* Would it be worth moving str_tolower(localized_name) from seq_search_localized() into cache_locale_time()?


Regards,

Juan José Santamaría Flecha

Вложения

Re: Allow to_date() and to_timestamp() to accept localized names

От
Alvaro Herrera
Дата:
On 2020-Jan-24, Juan José Santamaría Flecha wrote:

> There is an open patch that will make the normalization functionality user
> visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD
> TMMON YYYY') I would vote to drop the normalization logic inside this patch
> altogether.

I was reading the SQL standard on this point, and it says this (4.2.8
Universal character sets):

  An SQL-implementation may assume that all UCS strings are normalized
  in one of [Unicode normalization forms].

which seems to agree with what you're saying.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= <juanjo.santamaria@gmail.com> writes:
> On Thu, Jan 23, 2020 at 11:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * It's not exactly apparent to me why this code should be concerned
>> about non-normalized characters when noplace else in the backend is.

> There is an open patch that will make the normalization functionality user
> visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD
> TMMON YYYY') I would vote to drop the normalization logic inside this patch
> altogether.

Works for me.

> * I have no faith in this calculation that decides how long the match
>> length was:
>>     *len = element_len + name_len - norm_len;

> The proper logic would come from do_to_timestamp() receiving a normalized
> "date_txt" input, so we do not operate with unnormalize and normalize
> strings at the same time.

No, that only solves half the problem, because the downcasing
transformation can change the string length too.  Two easy examples:

* In German, I believe "ß" downcases to "ss".  In Latin-1 encoding that's
a length change, though I think it might accidentally not be in UTF8.

* The Turks distinguish dotted and dotless "i", so that "İ" downcases to
"i", and conversely "I" downcases to "ı".  Those are length changes in
UTF8, though not in whichever Latin-N encoding works for Turkish.

Even if these cases happen not to apply to any month or day name of
the relevant language, we still have a problem, arising from the fact
that we're downcasing the whole remaining string --- so length changes
after the match could occur anyway.

> I would like to rise a couple of questions myself:

> * When compiled with DEBUG_TO_FROM_CHAR, there is a warning "‘dump_node’
> defined but not used". Should we drop this function or uncomment its usage?

Maybe, but I don't think it belongs in this patch.

> * Would it be worth moving str_tolower(localized_name)
> from seq_search_localized() into cache_locale_time()?

I think it'd complicate tracking when that cache has to be invalidated
(i.e. it now depends on more than just LC_TIME).  On the whole I wouldn't
bother unless someone does the measurements to show there'd be a useful
speedup.

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Peter Eisentraut
Дата:
On 2020-01-24 12:44, Alvaro Herrera wrote:
> On 2020-Jan-24, Juan José Santamaría Flecha wrote:
> 
>> There is an open patch that will make the normalization functionality user
>> visible [1]. So, if a user can call to_date(normalize('01 ŞUB 2010'), 'DD
>> TMMON YYYY') I would vote to drop the normalization logic inside this patch
>> altogether.
> 
> I was reading the SQL standard on this point, and it says this (4.2.8
> Universal character sets):
> 
>    An SQL-implementation may assume that all UCS strings are normalized
>    in one of [Unicode normalization forms].
> 
> which seems to agree with what you're saying.

When you interact with glibc locale functions, you essentially have to 
assume that everything is in NFC.  When using ICU locale functions 
(which we don't here, but just for the sake of argument), then I would 
expect that ICU does appropriate normalization itself when I call 
icu_what_month_is_this(string) returns int.  So I think it is 
appropriate to not deal with normalization in this patch.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Alvaro Herrera
Дата:
On 2020-Jan-24, Peter Eisentraut wrote:

> When you interact with glibc locale functions, you essentially have to
> assume that everything is in NFC.  When using ICU locale functions (which we
> don't here, but just for the sake of argument), then I would expect that ICU
> does appropriate normalization itself when I call
> icu_what_month_is_this(string) returns int.  So I think it is appropriate to
> not deal with normalization in this patch.

But that's a different POV.  The input to this function could come from
arbitrary user input from any application whatsoever.  So the only
reason we can get away with that is because the example regression case
Juan José added (which uses non-normals) does not conform to the
standard.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> But that's a different POV.  The input to this function could come from
> arbitrary user input from any application whatsoever.  So the only
> reason we can get away with that is because the example regression case
> Juan José added (which uses non-normals) does not conform to the
> standard.

I'm unsure about "conforming to standard", but I think it's reasonable
to put the onus of doing normalization when necessary on the user.
Otherwise, we need to move normalization logic into basically all
the string processing functions (even texteq), which seems like a
pretty huge cost that will benefit only a small minority of people.
(If it's not a small minority, then where's the bug reports complaining
that we don't do it today?)

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Peter Eisentraut
Дата:
On 2020-01-24 17:22, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> But that's a different POV.  The input to this function could come from
>> arbitrary user input from any application whatsoever.  So the only
>> reason we can get away with that is because the example regression case
>> Juan José added (which uses non-normals) does not conform to the
>> standard.
> 
> I'm unsure about "conforming to standard", but I think it's reasonable
> to put the onus of doing normalization when necessary on the user.
> Otherwise, we need to move normalization logic into basically all
> the string processing functions (even texteq), which seems like a
> pretty huge cost that will benefit only a small minority of people.
> (If it's not a small minority, then where's the bug reports complaining
> that we don't do it today?)

These reports do exist, and this behavior is known.  However, the impact 
is mostly that results "look wrong" (looks the same but doesn't compare 
as equal) rather than causing inconsistency and corruption, so it's 
mostly shrugged off.  The nondeterministic collation feature was 
introduced in part to be able to deal with this; the pending 
normalization patch is another.  However, this behavior is baked deeply 
into Unicode, so no single feature or facility will simply make it go away.

AFAICT, we haven't so far had any code that does a lookup of non-ASCII 
strings in a table, so that's why we haven't had this discussion yet.

Now that I think about it, you could also make an argument that this 
should be handled through collation, so the function that looks up the 
string in the locale table should go through texteq.  However, this 
would mostly satisfy the purists but create a bizarre user experience.

Looking through the patch quickly, if you want to get Unicode-fancy, 
doing a case-insensitive comparison by running lower-case on both 
strings is also wrong in corner cases.  All the Greek month names end in 
sigma, so I suspect that this patch might not work correctly in such cases.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> Looking through the patch quickly, if you want to get Unicode-fancy, 
> doing a case-insensitive comparison by running lower-case on both 
> strings is also wrong in corner cases.  All the Greek month names end in 
> sigma, so I suspect that this patch might not work correctly in such cases.

Hm.  That's basically what citext does, and I don't recall hearing
complaints about that.  What other definition of "case insensitive"
would you suggest?

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:

On Fri, Jan 24, 2020 at 5:46 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> Looking through the patch quickly, if you want to get Unicode-fancy,
> doing a case-insensitive comparison by running lower-case on both
> strings is also wrong in corner cases.  All the Greek month names end in
> sigma, so I suspect that this patch might not work correctly in such cases.

Hm.  That's basically what citext does, and I don't recall hearing
complaints about that.  What other definition of "case insensitive"
would you suggest?


To illustrate the issue, it does not work as expected:

postgres=# select lower(to_char(now(),'TMMONTH'));
   lower
------------
 ιανουάριοσ
(1 row)

postgres=# select to_char(now(),'TMmonth');
  to_char
------------
 ιανουάριος
(1 row)

Regards,

Juan José Santamaría Flecha
 

Re: Allow to_date() and to_timestamp() to accept localized names

От
Peter Eisentraut
Дата:
On 2020-01-24 18:25, Juan José Santamaría Flecha wrote:
> To illustrate the issue, it does not work as expected:
> 
> postgres=# select lower(to_char(now(),'TMMONTH'));
>     lower
> ------------
>   ιανουάριοσ
> (1 row)
> 
> postgres=# select to_char(now(),'TMmonth');
>    to_char
> ------------
>   ιανουάριος
> (1 row)

Well, this is interesting, because on macOS and Debian stable I get

postgres=# select to_char(now(),'TMmonth');
   to_char
------------
  ιανουαρίου
(1 row)

which is the genitive of ιανουάριος.  You use the genitive form for a 
date (24th of January) but the nominative otherwise.  But the reverse 
mapping can only take one of these forms.  So here

select to_date('Ιανουαριος', 'TMMonth');

fails, which is bad.

In the glibc locale data sources they have both forms listed, but 
apparently the API were are using only accepts one of them.

(I don't know any Greek, I'm just extrapolating from Wikipedia and 
locale data.)

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Peter Eisentraut
Дата:
On 2020-01-24 19:01, Peter Eisentraut wrote:
> postgres=# select to_char(now(),'TMmonth');
>     to_char
> ------------
>    ιανουαρίου
> (1 row)
> 
> which is the genitive of ιανουάριος.  You use the genitive form for a
> date (24th of January) but the nominative otherwise.  But the reverse
> mapping can only take one of these forms.  So here
> 
> select to_date('Ιανουαριος', 'TMMonth');
> 
> fails, which is bad.

For the record, the correct form of that would appear to be

select to_date('Ιανουάριος', 'TMMonth');

with the accent.  I had tried different variations of that and they all 
failed.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> For the record, the correct form of that would appear to be
> select to_date('Ιανουάριος', 'TMMonth');
> with the accent.  I had tried different variations of that and they all
> failed.

OK, so for anyone who is as confused as I was, the main point here
seems to be this: the upper case form of Greek sigma is 'Σ',
and the lower case form is 'σ' ... except as the final letter of
a word, where it is supposed to be written like 'ς'.

If I set lc_collate, lc_ctype, and lc_time to 'el_GR.utf8',
then (on a somewhat hoary glibc platform) I get

u8=# select to_char('2020-01-01'::timestamptz, 'TMMONTH');
       to_char
----------------------
 ΙΑΝΟΥΆΡΙΟΣ
(1 row)

u8=# select to_char('2020-01-01'::timestamptz, 'TMMonth');
       to_char
----------------------
 Ιανουάριος
(1 row)

u8=# select to_char('2020-01-01'::timestamptz, 'TMmonth');
       to_char
----------------------
 ιανουάριος
(1 row)

which is correct AFAICS ... but

u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH'));
        lower
----------------------
 ιανουάριοσ
(1 row)

So what we actually have here, ISTM, is a bug in lower() not to_char().
The bug is unsurprising because str_tolower() simply applies towlower_l()
to each character independently, so there's no way for it to account for
the word-final rule.  I'm not aware that glibc provides any API whereby
that could be done correctly.  On the other hand, we get it right when
using an ICU collation for lower():

u8=# select lower(to_char('2020-01-01'::timestamptz, 'TMMONTH') collate "el-gr-x-icu");
        lower
----------------------
 ιανουάριος
(1 row)

because that code path passes the whole string to ICU at once, and
of course getting this right is ICU's entire job.

I haven't double-checked, but I imagine that the reason that to_char
gets the month name case-folding right is that what comes out of
strftime(..."%B"...) is "Ιανουάριος" which we are able to upcase
correctly, while the downcasing code paths don't affect 'ς'.

I thought for a little bit about trying to dodge this issue in the
patch by folding to upper case, not lower, before comparing month/day
names.  I fear that that would just shift the problem cases to some
other language(s).  However, it makes Greek better, and I think it
makes German better (does 'ß' appear in any month/day names there?),
so maybe we should just roll with that.  In the end, it doesn't seem
right to reject this patch just because lower() is broken on some
platforms.

The other question your example raises is whether we should be trying
to de-accent before comparison, ie was it right for 'Ιανουάριος' to
be treated differently from 'Ιανουαριος'.  I don't know enough Greek
to say, but it kind of feels like that should be outside to_date's
purview.

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Mark Dilger
Дата:

> On Jan 27, 2020, at 6:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> and I think it
> makes German better (does 'ß' appear in any month/day names there?),
> so maybe we should just roll with that.

To my knowledge, neither /ß/ nor /ss/ occur in day or month names in German, neither presently nor in recent times, so
Iwouldn’t expect it to matter for German whether you use uppercase or lowercase. 

>
> The other question your example raises is whether we should be trying
> to de-accent before comparison, ie was it right for 'Ιανουάριος' to
> be treated differently from 'Ιανουαριος'.  I don't know enough Greek
> to say, but it kind of feels like that should be outside to_date's
> purview.

I’m getting a little lost here.  German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a
listof values anyway?  I don’t know Norwegian, but Wikipedia shows both sundag and søndag for Sunday in Norwegian
Nynorsk.  Faroese seems to have a few similar cases.  Whether those languages have alternate day names both in common
usage,I can’t say, but both Sonnabend and Samstag are still in use in the German speaking world.  If you need to
compareagainst lists, then I would think you could put both ιανουάριοσ and ιανουάριος into a list, even if only one of
themis grammatically correct Greek.  I’d think that unaccented versions of names might also go into the list, depending
onwhether speakers of that language consider them equivalent.  That sounds like something for the translators to hash
out.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Allow to_date() and to_timestamp() to accept localized names

От
Peter Eisentraut
Дата:
On 2020-01-28 04:05, Mark Dilger wrote:
> German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway?

Yeah, good point.  If it doesn't accept both "Sonnabend" and "Samstag", 
then it's not really usable.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Peter Eisentraut
Дата:
On 2020-01-28 03:11, Tom Lane wrote:
> The other question your example raises is whether we should be trying
> to de-accent before comparison, ie was it right for 'Ιανουάριος' to
> be treated differently from 'Ιανουαριος'.  I don't know enough Greek
> to say, but it kind of feels like that should be outside to_date's
> purview.

To clarify, nothing in my examples was meant to imply that de-accenting 
might be necessary.  AFAICT, the accented forms are the only 
linguistically acceptable forms and all the locale libraries accept them 
correctly in their accented forms.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> On 2020-01-28 04:05, Mark Dilger wrote:
>> German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway?

> Yeah, good point.  If it doesn't accept both "Sonnabend" and "Samstag",
> then it's not really usable.

If we're going to insist on that, then the entire patch is junk.
Indeed, I don't even know where we could get the knowledge of which
name(s) to accept, because strftime is surely only going to tell us
one of them.

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Mark Dilger
Дата:

> On Jan 28, 2020, at 6:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
>> On 2020-01-28 04:05, Mark Dilger wrote:
>>> German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway?
>
>> Yeah, good point.  If it doesn't accept both "Sonnabend" and "Samstag",
>> then it's not really usable.
>
> If we're going to insist on that, then the entire patch is junk.

I’m not insisting, just asking about the design.  If it only works with one name supported per weekday per language,
andper month per language, I’m not going to object.  I was just curious if you were going to support multiple names
anyway,and if that made the question about the Greek case folding less pressing. 

> Indeed, I don't even know where we could get the knowledge of which
> name(s) to accept, because strftime is surely only going to tell us
> one of them.

I haven’t played with this yet, but per the manual page for strftime_l:

>      %A    is replaced by national representation of the full weekday name.
>
>      %a    is replaced by national representation of the abbreviated weekday name.
>
>      %B    is replaced by national representation of the full month name.
>
>      %b    is replaced by national representation of the abbreviated month name.

Which I think gives you just the preferred name, by whatever means the library decides which of Sonnabend/Samstag is
thepreferred name.  But then the manual page goes on to say: 

>      %E* %O*
>            POSIX locale extensions.  The sequences %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH %OI %Om %OM %OS %Ou %OU %OV
%Ow%OW %Oy are supposed to provide alternate representations. 
>
>            Additionally %OB implemented to represent alternative months names (used standalone, without day
mentioned).

This is the part I haven’t played with, but it sounds like it can handle at least one alternate name.  Perhaps you can
getthe alternates this way? 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:


On Tue, Jan 28, 2020 at 4:06 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:

I’m not insisting, just asking about the design.  If it only works with one name supported per weekday per language, and per month per language, I’m not going to object.  I was just curious if you were going to support multiple names anyway, and if that made the question about the Greek case folding less pressing.


This patch targets to do something symmetrical to to_char(), which will just return a single value.

The issue with the greek locale is that it cannot hold this equivalent behaviour, as in:

postgres=# select to_date(to_char(now(), 'TMMONTH'), 'TMMONTH');
ERROR:  invalid value "ΙΑΝΟΥΆΡΙΟΣ" for "MONTH"

Because of the particular behaviour sigma (Σσς) casing has, which is also user visible with a lower().
 

>      %E* %O*
>            POSIX locale extensions.  The sequences %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy are supposed to provide alternate representations.
>
>            Additionally %OB implemented to represent alternative months names (used standalone, without day mentioned).

This is the part I haven’t played with, but it sounds like it can handle at least one alternate name.  Perhaps you can get the alternates this way?


This looks like a POSIX feature that some systems might not like (Windows [1]). But if this is something that the patch should aim to, I am fine with a RWF and give it another try in the future.


Regards,

Juan José Santamaría Flecha 

Re: Allow to_date() and to_timestamp() to accept localized names

От
Mark Dilger
Дата:

> On Jan 28, 2020, at 7:47 AM, Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
>
> This looks like a POSIX feature that some systems might not like (Windows [1]). But if this is something that the
patchshould aim to, I am fine with a RWF and give it another try in the future. 

As long as this implementation doesn’t create a backward-compatibility problem when doing a more complete
implementationlater, I’m fine with this patch not tackling the whole problem. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Allow to_date() and to_timestamp() to accept localized names

От
Alvaro Herrera
Дата:
On 2020-Jan-28, Peter Eisentraut wrote:

> On 2020-01-28 04:05, Mark Dilger wrote:
> > German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway?
> 
> Yeah, good point.  If it doesn't accept both "Sonnabend" and "Samstag", then
> it's not really usable.

The string "Sonnabend" never appears in the glibc sources, so that will
certainly not work.  I vote not to care about that, but of course my
language is not one that has alternate weekday/month names.  I guess if
we're intent on recognizing alternate names, we'll have to build our own
list of them :-(

I don't have the ICU sources here to check the status there.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> But then the manual page goes on to say:

>> %E* %O*
>> POSIX locale extensions.  The sequences %Ec %EC %Ex %EX %Ey %EY %Od %Oe %OH %OI %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy
aresupposed to provide alternate representations. 
>>
>> Additionally %OB implemented to represent alternative months names (used standalone, without day mentioned).

> This is the part I haven’t played with, but it sounds like it can handle at least one alternate name.  Perhaps you
canget the alternates this way? 

This sounded promising, but the POSIX strftime spec doesn't mention %OB,
so I'm afraid we can't count on it to do much.  At this point I'm not
really convinced that there are no languages with more than two forms,
anyway :-(.

I also wondered whether we could get any further by using strptime() to
convert localized month and day names on-the-fly, rather than the patch's
current approach of re-using strftime() results.  If strptime() fails
to support alternative names, it's their bug not ours.  Unfortunately,
glibc has got said bug (AFAICS anyway), so in practice this would only
offer us plausible deniability and not much of any real functionality.

In the end it seems like we could only handle alternative names by
keeping our own lists of them.  There are probably few enough cases
that that wouldn't be a tremendous maintenance problem, but what
I'm not quite seeing is how we'd decide which list to use when.
Right now, locale identifiers are pretty much opaque to us ... do
we really want to get into the business of recognizing that such a
name refers to German, or Greek?

A brute-force answer, if there are few enough cases, is to recognize
them regardless of the specific value of LC_TIME.  Do we think
anybody would notice or care if to_date('Sonnabend', 'TMDay') works
even when in a non-German locale?

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Mark Dilger
Дата:

> On Jan 28, 2020, at 9:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> A brute-force answer, if there are few enough cases, is to recognize
> them regardless of the specific value of LC_TIME.  Do we think
> anybody would notice or care if to_date('Sonnabend', 'TMDay') works
> even when in a non-German locale?

I think this only becomes a problem if there are weekday or month name collisions between languages where they have
differentmeanings.  As an absurd hypothetical, if “Sunday” in Tagalog means what “Tuesday” means in English, then
you’vegot a problem. 

This does happen for month abbreviations.  “Mar” is short for “March” and variations in a number of languages, but
shortfor “November” in Finnish. 

For day abbreviations, “Su” collides between fi_FI and hr_HR, and “tor” collides between sl_SL and no_NO.

I don’t see any collisions with full month or full weekday names, but I’m also only looking at the 53 locales installed
in/usr/share/locale/.*/LC_TIME on my mac. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Jan 28, 2020, at 9:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> A brute-force answer, if there are few enough cases, is to recognize
>> them regardless of the specific value of LC_TIME.  Do we think
>> anybody would notice or care if to_date('Sonnabend', 'TMDay') works
>> even when in a non-German locale?

> I think this only becomes a problem if there are weekday or month name collisions between languages where they have
differentmeanings.  As an absurd hypothetical, if “Sunday” in Tagalog means what “Tuesday” means in English, then
you’vegot a problem. 

> This does happen for month abbreviations.  “Mar” is short for “March” and variations in a number of languages, but
shortfor “November” in Finnish. 

> For day abbreviations, “Su” collides between fi_FI and hr_HR, and “tor” collides between sl_SL and no_NO.

But none of those cases are alternative names, so we wouldn't have
entries for them in this hypothetical list.  They'd only be recognized
when strftime() returns them.  I suspect also that the abbreviated
names have few if any alternatives, so we may only need this whole hack
for full names.

A possible way of tightening things up without explicitly decoding the
locale name is to make the entries of the list be string pairs: "if
strftime() returned this, then also consider that".  That puts a bit
of a premium on knowing which names strftime considers primary, but
I think we'd have had to know that anyway.

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Peter Eisentraut
Дата:
On 2020-01-28 16:47, Juan José Santamaría Flecha wrote:
> This patch targets to do something symmetrical to to_char(), which will 
> just return a single value.

I didn't fully realize while reading this thread that to_char() already 
supports localized output and this patch indeed just wants to do the 
opposite.

So I'm withdrawing my concerns with respect to this patch.  As long as 
it can do a roundtrip conversion with to_char(), it's fine.

It's pretty clear that this interface cannot be useful for producing or 
parsing fully general localized dates.  But since it exists now (and 
it's quasi SQL standard now), we might as well fill in this gap.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:


On Tue, Jan 28, 2020 at 5:21 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2020-Jan-28, Peter Eisentraut wrote:

> On 2020-01-28 04:05, Mark Dilger wrote:
> > German uses both Sonnabend and Samstag for Saturday, so don’t you have to compare to a list of values anyway?
>
> Yeah, good point.  If it doesn't accept both "Sonnabend" and "Samstag", then
> it's not really usable.

The string "Sonnabend" never appears in the glibc sources, so that will
certainly not work.  I vote not to care about that, but of course my
language is not one that has alternate weekday/month names.  I guess if
we're intent on recognizing alternate names, we'll have to build our own
list of them :-(

I don't have the ICU sources here to check the status there.


"Sonnabend" is neither available in ICU.

What is available are both genitive and nominative forms for months, as reported up thread by Peter. See formats "M" and "L" in:


Regards,

Juan José Santamaría Flecha 

Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:


On Tue, Jan 28, 2020 at 9:35 PM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 2020-01-28 16:47, Juan José Santamaría Flecha wrote:
> This patch targets to do something symmetrical to to_char(), which will
> just return a single value.

I didn't fully realize while reading this thread that to_char() already
supports localized output and this patch indeed just wants to do the
opposite.

So I'm withdrawing my concerns with respect to this patch.  As long as
it can do a roundtrip conversion with to_char(), it's fine.


We can avoid issues with non injective case conversion languages with a double conversion, so both strings in the comparison end up in the same state.

I propose an upper/lower conversion as in the attached patch.

Regards,

Juan José Santamaría Flecha
 
Вложения

Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= <juanjo.santamaria@gmail.com> writes:
> On Tue, Jan 28, 2020 at 9:35 PM Peter Eisentraut <
> peter.eisentraut@2ndquadrant.com> wrote:
>> So I'm withdrawing my concerns with respect to this patch.  As long as
>> it can do a roundtrip conversion with to_char(), it's fine.

> We can avoid issues with non injective case conversion languages with a
> double conversion, so both strings in the comparison end up in the same
> state.
> I propose an upper/lower conversion as in the attached patch.

This seems pretty reasonable to me, with a couple of caveats:

* I don't think it's appropriate to hard-wire DEFAULT_COLLATION_OID
as the collation to do case-folding with.  For to_date/to_timestamp,
we have collatable text input so we can rely on the function's input
collation instead, at the cost of having to pass down the collation
OID through a few layers of subroutines :-(.  For parse_datetime,
I punted for now and let it use DEFAULT_COLLATION_OID, because that's
currently only called by JSONB code that probably hasn't got a usable
input collation anyway (since jsonb isn't considered collatable).

* I think it's likely worthwhile to do a quick check for an exact
match before we go through all these case-folding pushups.  If the
expected use-case is reversing to_char() output, that will win all
the time.  We're probably ahead of the game even if it only matches
a few percent of the time.

Attached v10 incorporates those improvements, plus a bit of
polishing of docs and comments.  Barring objections, this seems
committable to me.

            regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 28035f1..8b73e05 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5968,7 +5968,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
        </row>
        <row>
         <entry><literal>TM</literal> prefix</entry>
-        <entry>translation mode (print localized day and month names based on
+        <entry>translation mode (use localized day and month names based on
          <xref linkend="guc-lc-time"/>)</entry>
         <entry><literal>TMMonth</literal></entry>
        </row>
@@ -5999,9 +5999,20 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');

      <listitem>
       <para>
-       <literal>TM</literal> does not include trailing blanks.
-       <function>to_timestamp</function> and <function>to_date</function> ignore
-       the <literal>TM</literal> modifier.
+       <literal>TM</literal> suppresses trailing blanks whether or
+       not <literal>FM</literal> is specified.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       <function>to_timestamp</function> and <function>to_date</function>
+       ignore letter case in the input; so for
+       example <literal>MON</literal>, <literal>Mon</literal>,
+       and <literal>mon</literal> all accept the same strings.  When using
+       the <literal>TM</literal> modifier, case-folding is done according to
+       the rules of the function's input collation (see
+       <xref linkend="collation"/>).
       </para>
      </listitem>

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index d029468..db99a6b 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1038,7 +1038,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw,
 static void DCH_to_char(FormatNode *node, bool is_interval,
                         TmToChar *in, char *out, Oid collid);
 static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
-                          bool std, bool *have_error);
+                          Oid collid, bool std, bool *have_error);

 #ifdef DEBUG_TO_FROM_CHAR
 static void dump_index(const KeyWord *k, const int *index);
@@ -1057,11 +1057,14 @@ static int    from_char_parse_int_len(int *dest, const char **src, const int len,
                                     FormatNode *node, bool *have_error);
 static int    from_char_parse_int(int *dest, const char **src, FormatNode *node,
                                 bool *have_error);
-static int    seq_search(const char *name, const char *const *array, int *len);
+static int    seq_search_ascii(const char *name, const char *const *array, int *len);
+static int    seq_search_localized(const char *name, char **array, int *len,
+                                 Oid collid);
 static int    from_char_seq_search(int *dest, const char **src,
                                  const char *const *array,
+                                 char **localized_array, Oid collid,
                                  FormatNode *node, bool *have_error);
-static void do_to_timestamp(text *date_txt, text *fmt, bool std,
+static void do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std,
                             struct pg_tm *tm, fsec_t *fsec, int *fprec,
                             uint32 *flags, bool *have_error);
 static char *fill_str(char *str, int c, int max);
@@ -2457,7 +2460,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er
  * suitable for comparisons to ASCII strings.
  */
 static int
-seq_search(const char *name, const char *const *array, int *len)
+seq_search_ascii(const char *name, const char *const *array, int *len)
 {
     unsigned char firstc;
     const char *const *a;
@@ -2503,8 +2506,89 @@ seq_search(const char *name, const char *const *array, int *len)
 }

 /*
- * Perform a sequential search in 'array' for an entry matching the first
- * character(s) of the 'src' string case-insensitively.
+ * Sequentially search an array of possibly non-English words for
+ * a case-insensitive match to the initial character(s) of "name".
+ *
+ * This has the same API as seq_search_ascii(), but we use a more general
+ * case-folding transformation to achieve case-insensitivity.  Case folding
+ * is done per the rules of the collation identified by "collid".
+ *
+ * The array is treated as const, but we don't declare it that way because
+ * the arrays exported by pg_locale.c aren't const.
+ */
+static int
+seq_search_localized(const char *name, char **array, int *len, Oid collid)
+{
+    char      **a;
+    char       *upper_name;
+    char       *lower_name;
+
+    *len = 0;
+
+    /* empty string can't match anything */
+    if (!*name)
+        return -1;
+
+    /*
+     * The case-folding processing done below is fairly expensive, so before
+     * doing that, make a quick pass to see if there is an exact match.
+     */
+    for (a = array; *a != NULL; a++)
+    {
+        int            element_len = strlen(*a);
+
+        if (strncmp(name, *a, element_len) == 0)
+        {
+            *len = element_len;
+            return a - array;
+        }
+    }
+
+    /*
+     * Fold to upper case, then to lower case, so that we can match reliably
+     * even in languages in which case conversions are not injective.
+     */
+    upper_name = str_toupper(unconstify(char *, name), strlen(name), collid);
+    lower_name = str_tolower(upper_name, strlen(upper_name), collid);
+    pfree(upper_name);
+
+    for (a = array; *a != NULL; a++)
+    {
+        char       *upper_element;
+        char       *lower_element;
+        int            element_len;
+
+        /* Likewise upper/lower-case array element */
+        upper_element = str_toupper(*a, strlen(*a), collid);
+        lower_element = str_tolower(upper_element, strlen(upper_element),
+                                    collid);
+        pfree(upper_element);
+        element_len = strlen(lower_element);
+
+        /* Match? */
+        if (strncmp(lower_name, lower_element, element_len) == 0)
+        {
+            *len = element_len;
+            pfree(lower_element);
+            pfree(lower_name);
+            return a - array;
+        }
+        pfree(lower_element);
+    }
+
+    pfree(lower_name);
+    return -1;
+}
+
+/*
+ * Perform a sequential search in 'array' (or 'localized_array', if that's
+ * not NULL) for an entry matching the first character(s) of the 'src'
+ * string case-insensitively.
+ *
+ * The 'array' is presumed to be English words (all-ASCII), but
+ * if 'localized_array' is supplied, that might be non-English
+ * so we need a more expensive case-folding transformation
+ * (which will follow the rules of the collation 'collid').
  *
  * If a match is found, copy the array index of the match into the integer
  * pointed to by 'dest', advance 'src' to the end of the part of the string
@@ -2518,11 +2602,15 @@ seq_search(const char *name, const char *const *array, int *len)
  */
 static int
 from_char_seq_search(int *dest, const char **src, const char *const *array,
+                     char **localized_array, Oid collid,
                      FormatNode *node, bool *have_error)
 {
     int            len;

-    *dest = seq_search(*src, array, &len);
+    if (localized_array == NULL)
+        *dest = seq_search_ascii(*src, array, &len);
+    else
+        *dest = seq_search_localized(*src, localized_array, &len, collid);

     if (len <= 0)
     {
@@ -3147,19 +3235,20 @@ DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid col
     *s = '\0';
 }

-/* ----------
- * Process a string as denoted by a list of FormatNodes.
+/*
+ * Process the string 'in' as denoted by the array of FormatNodes 'node[]'.
  * The TmFromChar struct pointed to by 'out' is populated with the results.
  *
+ * 'collid' identifies the collation to use, if needed.
+ * 'std' specifies standard parsing mode.
+ * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
+ *
  * Note: we currently don't have any to_interval() function, so there
  * is no need here for INVALID_FOR_INTERVAL checks.
- *
- * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
- * ----------
  */
 static void
-DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
-              bool *have_error)
+DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
+              Oid collid, bool std, bool *have_error)
 {
     FormatNode *n;
     const char *s;
@@ -3170,6 +3259,9 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
     /* number of extra skipped characters (more than given in format string) */
     int            extra_skip = 0;

+    /* cache localized days and months */
+    cache_locale_time();
+
     for (n = node, s = in; n->type != NODE_TYPE_END && *s != '\0'; n++)
     {
         /*
@@ -3271,6 +3363,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_a_m:
             case DCH_p_m:
                 from_char_seq_search(&value, &s, ampm_strings_long,
+                                     NULL, InvalidOid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3282,6 +3375,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_am:
             case DCH_pm:
                 from_char_seq_search(&value, &s, ampm_strings,
+                                     NULL, InvalidOid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3395,6 +3489,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_a_d:
             case DCH_b_c:
                 from_char_seq_search(&value, &s, adbc_strings_long,
+                                     NULL, InvalidOid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3405,6 +3500,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_ad:
             case DCH_bc:
                 from_char_seq_search(&value, &s, adbc_strings,
+                                     NULL, InvalidOid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3414,6 +3510,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Month:
             case DCH_month:
                 from_char_seq_search(&value, &s, months_full,
+                                     S_TM(n->suffix) ? localized_full_months : NULL,
+                                     collid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3423,6 +3521,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Mon:
             case DCH_mon:
                 from_char_seq_search(&value, &s, months,
+                                     S_TM(n->suffix) ? localized_abbrev_months : NULL,
+                                     collid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3437,6 +3537,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Day:
             case DCH_day:
                 from_char_seq_search(&value, &s, days,
+                                     S_TM(n->suffix) ? localized_full_days : NULL,
+                                     collid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->d, value, n, have_error);
@@ -3447,6 +3549,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Dy:
             case DCH_dy:
                 from_char_seq_search(&value, &s, days_short,
+                                     S_TM(n->suffix) ? localized_abbrev_days : NULL,
+                                     collid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->d, value, n, have_error);
@@ -3565,6 +3669,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_RM:
             case DCH_rm:
                 from_char_seq_search(&value, &s, rm_months_lower,
+                                     NULL, InvalidOid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
@@ -4031,13 +4136,15 @@ to_timestamp(PG_FUNCTION_ARGS)
 {
     text       *date_txt = PG_GETARG_TEXT_PP(0);
     text       *fmt = PG_GETARG_TEXT_PP(1);
+    Oid            collid = PG_GET_COLLATION();
     Timestamp    result;
     int            tz;
     struct pg_tm tm;
     fsec_t        fsec;
     int            fprec;

-    do_to_timestamp(date_txt, fmt, false, &tm, &fsec, &fprec, NULL, NULL);
+    do_to_timestamp(date_txt, fmt, collid, false,
+                    &tm, &fsec, &fprec, NULL, NULL);

     /* Use the specified time zone, if any. */
     if (tm.tm_zone)
@@ -4072,11 +4179,13 @@ to_date(PG_FUNCTION_ARGS)
 {
     text       *date_txt = PG_GETARG_TEXT_PP(0);
     text       *fmt = PG_GETARG_TEXT_PP(1);
+    Oid            collid = PG_GET_COLLATION();
     DateADT        result;
     struct pg_tm tm;
     fsec_t        fsec;

-    do_to_timestamp(date_txt, fmt, false, &tm, &fsec, NULL, NULL, NULL);
+    do_to_timestamp(date_txt, fmt, collid, false,
+                    &tm, &fsec, NULL, NULL, NULL);

     /* Prevent overflow in Julian-day routines */
     if (!IS_VALID_JULIAN(tm.tm_year, tm.tm_mon, tm.tm_mday))
@@ -4116,7 +4225,13 @@ parse_datetime(text *date_txt, text *fmt, bool strict, Oid *typid,
     int            fprec;
     uint32        flags;

-    do_to_timestamp(date_txt, fmt, strict, &tm, &fsec, &fprec, &flags, have_error);
+    /*
+     * At some point we might wish to have callers supply the collation to
+     * use, but right now it's not clear that they'd be able to do better than
+     * DEFAULT_COLLATION_OID anyway.
+     */
+    do_to_timestamp(date_txt, fmt, DEFAULT_COLLATION_OID, strict,
+                    &tm, &fsec, &fprec, &flags, have_error);
     CHECK_ERROR;

     *typmod = fprec ? fprec : -1;    /* fractional part precision */
@@ -4278,21 +4393,21 @@ on_error:
  * Parse the 'date_txt' according to 'fmt', return results as a struct pg_tm,
  * fractional seconds, and fractional precision.
  *
+ * 'collid' identifies the collation to use, if needed.
+ * 'std' specifies standard parsing mode.
+ * Bit mask of date/time/zone components found in 'fmt' is returned in 'flags',
+ * if that is not NULL.
+ * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
+ *
  * We parse 'fmt' into a list of FormatNodes, which is then passed to
  * DCH_from_char to populate a TmFromChar with the parsed contents of
  * 'date_txt'.
  *
  * The TmFromChar is then analysed and converted into the final results in
- * struct 'tm' and 'fsec'.
- *
- * Bit mask of date/time/zone components found in 'fmt' is returned in 'flags'.
- *
- * 'std' specifies standard parsing mode.
- *
- * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
+ * struct 'tm', 'fsec', and 'fprec'.
  */
 static void
-do_to_timestamp(text *date_txt, text *fmt, bool std,
+do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std,
                 struct pg_tm *tm, fsec_t *fsec, int *fprec,
                 uint32 *flags, bool *have_error)
 {
@@ -4352,7 +4467,7 @@ do_to_timestamp(text *date_txt, text *fmt, bool std,
         /* dump_index(DCH_keywords, DCH_index); */
 #endif

-        DCH_from_char(format, date_str, &tmfc, std, have_error);
+        DCH_from_char(format, date_str, &tmfc, collid, std, have_error);
         CHECK_ERROR;

         pfree(fmt_str);
diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 25fb7e2..64fd3ae 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -96,11 +96,17 @@ char       *locale_monetary;
 char       *locale_numeric;
 char       *locale_time;

-/* lc_time localization cache */
-char       *localized_abbrev_days[7];
-char       *localized_full_days[7];
-char       *localized_abbrev_months[12];
-char       *localized_full_months[12];
+/*
+ * lc_time localization cache.
+ *
+ * We use only the first 7 or 12 entries of these arrays.  The last array
+ * element is left as NULL for the convenience of outside code that wants
+ * to sequentially scan these arrays.
+ */
+char       *localized_abbrev_days[7 + 1];
+char       *localized_full_days[7 + 1];
+char       *localized_abbrev_months[12 + 1];
+char       *localized_full_months[12 + 1];

 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
@@ -922,6 +928,8 @@ cache_locale_time(void)
         cache_single_string(&localized_full_days[i], bufptr, encoding);
         bufptr += MAX_L10N_DATA;
     }
+    localized_abbrev_days[7] = NULL;
+    localized_full_days[7] = NULL;

     /* localized months */
     for (i = 0; i < 12; i++)
@@ -931,6 +939,8 @@ cache_locale_time(void)
         cache_single_string(&localized_full_months[i], bufptr, encoding);
         bufptr += MAX_L10N_DATA;
     }
+    localized_abbrev_months[12] = NULL;
+    localized_full_months[12] = NULL;

     CurrentLCTimeValid = true;
 }
diff --git a/src/test/regress/expected/collate.linux.utf8.out b/src/test/regress/expected/collate.linux.utf8.out
index 37c6add..f06ae54 100644
--- a/src/test/regress/expected/collate.linux.utf8.out
+++ b/src/test/regress/expected/collate.linux.utf8.out
@@ -461,6 +461,22 @@ SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");
  01 NİS 2010
 (1 row)

+-- to_date
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY');
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('01 Şub 2010', 'DD TMMON YYYY');
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+ERROR:  invalid value "1234567890ab" for "MONTH"
+DETAIL:  The given value did not match any of the allowed values for this field.
 -- backwards parsing
 CREATE VIEW collview1 AS SELECT * FROM collate_test1 WHERE b COLLATE "C" >= 'bbc';
 CREATE VIEW collview2 AS SELECT a, b FROM collate_test1 ORDER BY b COLLATE "C";
diff --git a/src/test/regress/sql/collate.linux.utf8.sql b/src/test/regress/sql/collate.linux.utf8.sql
index 8c26f16..cbbd220 100644
--- a/src/test/regress/sql/collate.linux.utf8.sql
+++ b/src/test/regress/sql/collate.linux.utf8.sql
@@ -182,6 +182,12 @@ SELECT to_char(date '2010-02-01', 'DD TMMON YYYY' COLLATE "tr_TR");
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");

+-- to_date
+
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY');
+SELECT to_date('01 Şub 2010', 'DD TMMON YYYY');
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+

 -- backwards parsing


Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
I wrote:
> * I don't think it's appropriate to hard-wire DEFAULT_COLLATION_OID
> as the collation to do case-folding with.  For to_date/to_timestamp,
> we have collatable text input so we can rely on the function's input
> collation instead, at the cost of having to pass down the collation
> OID through a few layers of subroutines :-(.  For parse_datetime,
> I punted for now and let it use DEFAULT_COLLATION_OID, because that's
> currently only called by JSONB code that probably hasn't got a usable
> input collation anyway (since jsonb isn't considered collatable).

On closer look, it's probably a wise idea to change the signature
of parse_datetime() to include a collation argument, because that
function is new in v13 so there's no API-break argument against it.
It will never be cheaper to change it than today.  So v11 below
does that, pushing the use of DEFAULT_COLLATION_OID into the
json-specific code.  Perhaps somebody else would like to look at
whether there's something brighter for that code to do, but I still
suspect there isn't, so I didn't chase it further.

> Barring objections, this seems
> committable to me.

Going once, going twice ...

            regards, tom lane

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 28035f1..8b73e05 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -5968,7 +5968,7 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');
        </row>
        <row>
         <entry><literal>TM</literal> prefix</entry>
-        <entry>translation mode (print localized day and month names based on
+        <entry>translation mode (use localized day and month names based on
          <xref linkend="guc-lc-time"/>)</entry>
         <entry><literal>TMMonth</literal></entry>
        </row>
@@ -5999,9 +5999,20 @@ SELECT regexp_match('abc01234xyz', '(?:(.*?)(\d+)(.*)){1,1}');

      <listitem>
       <para>
-       <literal>TM</literal> does not include trailing blanks.
-       <function>to_timestamp</function> and <function>to_date</function> ignore
-       the <literal>TM</literal> modifier.
+       <literal>TM</literal> suppresses trailing blanks whether or
+       not <literal>FM</literal> is specified.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       <function>to_timestamp</function> and <function>to_date</function>
+       ignore letter case in the input; so for
+       example <literal>MON</literal>, <literal>Mon</literal>,
+       and <literal>mon</literal> all accept the same strings.  When using
+       the <literal>TM</literal> modifier, case-folding is done according to
+       the rules of the function's input collation (see
+       <xref linkend="collation"/>).
       </para>
      </listitem>

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index d029468..95f7d05 100644
--- a/src/backend/utils/adt/formatting.c
+++ b/src/backend/utils/adt/formatting.c
@@ -1038,7 +1038,7 @@ static void parse_format(FormatNode *node, const char *str, const KeyWord *kw,
 static void DCH_to_char(FormatNode *node, bool is_interval,
                         TmToChar *in, char *out, Oid collid);
 static void DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
-                          bool std, bool *have_error);
+                          Oid collid, bool std, bool *have_error);

 #ifdef DEBUG_TO_FROM_CHAR
 static void dump_index(const KeyWord *k, const int *index);
@@ -1057,11 +1057,14 @@ static int    from_char_parse_int_len(int *dest, const char **src, const int len,
                                     FormatNode *node, bool *have_error);
 static int    from_char_parse_int(int *dest, const char **src, FormatNode *node,
                                 bool *have_error);
-static int    seq_search(const char *name, const char *const *array, int *len);
+static int    seq_search_ascii(const char *name, const char *const *array, int *len);
+static int    seq_search_localized(const char *name, char **array, int *len,
+                                 Oid collid);
 static int    from_char_seq_search(int *dest, const char **src,
                                  const char *const *array,
+                                 char **localized_array, Oid collid,
                                  FormatNode *node, bool *have_error);
-static void do_to_timestamp(text *date_txt, text *fmt, bool std,
+static void do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std,
                             struct pg_tm *tm, fsec_t *fsec, int *fprec,
                             uint32 *flags, bool *have_error);
 static char *fill_str(char *str, int c, int max);
@@ -2457,7 +2460,7 @@ from_char_parse_int(int *dest, const char **src, FormatNode *node, bool *have_er
  * suitable for comparisons to ASCII strings.
  */
 static int
-seq_search(const char *name, const char *const *array, int *len)
+seq_search_ascii(const char *name, const char *const *array, int *len)
 {
     unsigned char firstc;
     const char *const *a;
@@ -2503,8 +2506,89 @@ seq_search(const char *name, const char *const *array, int *len)
 }

 /*
- * Perform a sequential search in 'array' for an entry matching the first
- * character(s) of the 'src' string case-insensitively.
+ * Sequentially search an array of possibly non-English words for
+ * a case-insensitive match to the initial character(s) of "name".
+ *
+ * This has the same API as seq_search_ascii(), but we use a more general
+ * case-folding transformation to achieve case-insensitivity.  Case folding
+ * is done per the rules of the collation identified by "collid".
+ *
+ * The array is treated as const, but we don't declare it that way because
+ * the arrays exported by pg_locale.c aren't const.
+ */
+static int
+seq_search_localized(const char *name, char **array, int *len, Oid collid)
+{
+    char      **a;
+    char       *upper_name;
+    char       *lower_name;
+
+    *len = 0;
+
+    /* empty string can't match anything */
+    if (!*name)
+        return -1;
+
+    /*
+     * The case-folding processing done below is fairly expensive, so before
+     * doing that, make a quick pass to see if there is an exact match.
+     */
+    for (a = array; *a != NULL; a++)
+    {
+        int            element_len = strlen(*a);
+
+        if (strncmp(name, *a, element_len) == 0)
+        {
+            *len = element_len;
+            return a - array;
+        }
+    }
+
+    /*
+     * Fold to upper case, then to lower case, so that we can match reliably
+     * even in languages in which case conversions are not injective.
+     */
+    upper_name = str_toupper(unconstify(char *, name), strlen(name), collid);
+    lower_name = str_tolower(upper_name, strlen(upper_name), collid);
+    pfree(upper_name);
+
+    for (a = array; *a != NULL; a++)
+    {
+        char       *upper_element;
+        char       *lower_element;
+        int            element_len;
+
+        /* Likewise upper/lower-case array element */
+        upper_element = str_toupper(*a, strlen(*a), collid);
+        lower_element = str_tolower(upper_element, strlen(upper_element),
+                                    collid);
+        pfree(upper_element);
+        element_len = strlen(lower_element);
+
+        /* Match? */
+        if (strncmp(lower_name, lower_element, element_len) == 0)
+        {
+            *len = element_len;
+            pfree(lower_element);
+            pfree(lower_name);
+            return a - array;
+        }
+        pfree(lower_element);
+    }
+
+    pfree(lower_name);
+    return -1;
+}
+
+/*
+ * Perform a sequential search in 'array' (or 'localized_array', if that's
+ * not NULL) for an entry matching the first character(s) of the 'src'
+ * string case-insensitively.
+ *
+ * The 'array' is presumed to be English words (all-ASCII), but
+ * if 'localized_array' is supplied, that might be non-English
+ * so we need a more expensive case-folding transformation
+ * (which will follow the rules of the collation 'collid').
  *
  * If a match is found, copy the array index of the match into the integer
  * pointed to by 'dest', advance 'src' to the end of the part of the string
@@ -2518,11 +2602,15 @@ seq_search(const char *name, const char *const *array, int *len)
  */
 static int
 from_char_seq_search(int *dest, const char **src, const char *const *array,
+                     char **localized_array, Oid collid,
                      FormatNode *node, bool *have_error)
 {
     int            len;

-    *dest = seq_search(*src, array, &len);
+    if (localized_array == NULL)
+        *dest = seq_search_ascii(*src, array, &len);
+    else
+        *dest = seq_search_localized(*src, localized_array, &len, collid);

     if (len <= 0)
     {
@@ -3147,19 +3235,20 @@ DCH_to_char(FormatNode *node, bool is_interval, TmToChar *in, char *out, Oid col
     *s = '\0';
 }

-/* ----------
- * Process a string as denoted by a list of FormatNodes.
+/*
+ * Process the string 'in' as denoted by the array of FormatNodes 'node[]'.
  * The TmFromChar struct pointed to by 'out' is populated with the results.
  *
+ * 'collid' identifies the collation to use, if needed.
+ * 'std' specifies standard parsing mode.
+ * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
+ *
  * Note: we currently don't have any to_interval() function, so there
  * is no need here for INVALID_FOR_INTERVAL checks.
- *
- * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
- * ----------
  */
 static void
-DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
-              bool *have_error)
+DCH_from_char(FormatNode *node, const char *in, TmFromChar *out,
+              Oid collid, bool std, bool *have_error)
 {
     FormatNode *n;
     const char *s;
@@ -3170,6 +3259,9 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
     /* number of extra skipped characters (more than given in format string) */
     int            extra_skip = 0;

+    /* cache localized days and months */
+    cache_locale_time();
+
     for (n = node, s = in; n->type != NODE_TYPE_END && *s != '\0'; n++)
     {
         /*
@@ -3271,6 +3363,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_a_m:
             case DCH_p_m:
                 from_char_seq_search(&value, &s, ampm_strings_long,
+                                     NULL, InvalidOid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3282,6 +3375,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_am:
             case DCH_pm:
                 from_char_seq_search(&value, &s, ampm_strings,
+                                     NULL, InvalidOid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->pm, value % 2, n, have_error);
@@ -3395,6 +3489,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_a_d:
             case DCH_b_c:
                 from_char_seq_search(&value, &s, adbc_strings_long,
+                                     NULL, InvalidOid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3405,6 +3500,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_ad:
             case DCH_bc:
                 from_char_seq_search(&value, &s, adbc_strings,
+                                     NULL, InvalidOid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->bc, value % 2, n, have_error);
@@ -3414,6 +3510,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Month:
             case DCH_month:
                 from_char_seq_search(&value, &s, months_full,
+                                     S_TM(n->suffix) ? localized_full_months : NULL,
+                                     collid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3423,6 +3521,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Mon:
             case DCH_mon:
                 from_char_seq_search(&value, &s, months,
+                                     S_TM(n->suffix) ? localized_abbrev_months : NULL,
+                                     collid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, value + 1, n, have_error);
@@ -3437,6 +3537,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Day:
             case DCH_day:
                 from_char_seq_search(&value, &s, days,
+                                     S_TM(n->suffix) ? localized_full_days : NULL,
+                                     collid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->d, value, n, have_error);
@@ -3447,6 +3549,8 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_Dy:
             case DCH_dy:
                 from_char_seq_search(&value, &s, days_short,
+                                     S_TM(n->suffix) ? localized_abbrev_days : NULL,
+                                     collid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->d, value, n, have_error);
@@ -3565,6 +3669,7 @@ DCH_from_char(FormatNode *node, const char *in, TmFromChar *out, bool std,
             case DCH_RM:
             case DCH_rm:
                 from_char_seq_search(&value, &s, rm_months_lower,
+                                     NULL, InvalidOid,
                                      n, have_error);
                 CHECK_ERROR;
                 from_char_set_int(&out->mm, MONTHS_PER_YEAR - value,
@@ -4031,13 +4136,15 @@ to_timestamp(PG_FUNCTION_ARGS)
 {
     text       *date_txt = PG_GETARG_TEXT_PP(0);
     text       *fmt = PG_GETARG_TEXT_PP(1);
+    Oid            collid = PG_GET_COLLATION();
     Timestamp    result;
     int            tz;
     struct pg_tm tm;
     fsec_t        fsec;
     int            fprec;

-    do_to_timestamp(date_txt, fmt, false, &tm, &fsec, &fprec, NULL, NULL);
+    do_to_timestamp(date_txt, fmt, collid, false,
+                    &tm, &fsec, &fprec, NULL, NULL);

     /* Use the specified time zone, if any. */
     if (tm.tm_zone)
@@ -4072,11 +4179,13 @@ to_date(PG_FUNCTION_ARGS)
 {
     text       *date_txt = PG_GETARG_TEXT_PP(0);
     text       *fmt = PG_GETARG_TEXT_PP(1);
+    Oid            collid = PG_GET_COLLATION();
     DateADT        result;
     struct pg_tm tm;
     fsec_t        fsec;

-    do_to_timestamp(date_txt, fmt, false, &tm, &fsec, NULL, NULL, NULL);
+    do_to_timestamp(date_txt, fmt, collid, false,
+                    &tm, &fsec, NULL, NULL, NULL);

     /* Prevent overflow in Julian-day routines */
     if (!IS_VALID_JULIAN(tm.tm_year, tm.tm_mon, tm.tm_mday))
@@ -4098,25 +4207,31 @@ to_date(PG_FUNCTION_ARGS)
 }

 /*
- * Convert the 'date_txt' input to a datetime type using argument 'fmt' as a format string.
+ * Convert the 'date_txt' input to a datetime type using argument 'fmt'
+ * as a format string.  The collation 'collid' may be used for case-folding
+ * rules in some cases.  'strict' specifies standard parsing mode.
+ *
  * The actual data type (returned in 'typid', 'typmod') is determined by
  * the presence of date/time/zone components in the format string.
  *
- * When timezone component is present, the corresponding offset is set to '*tz'.
+ * When timezone component is present, the corresponding offset is
+ * returned in '*tz'.
  *
  * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set
  * and zero value is returned.
  */
 Datum
-parse_datetime(text *date_txt, text *fmt, bool strict, Oid *typid,
-               int32 *typmod, int *tz, bool *have_error)
+parse_datetime(text *date_txt, text *fmt, Oid collid, bool strict,
+               Oid *typid, int32 *typmod, int *tz,
+               bool *have_error)
 {
     struct pg_tm tm;
     fsec_t        fsec;
     int            fprec;
     uint32        flags;

-    do_to_timestamp(date_txt, fmt, strict, &tm, &fsec, &fprec, &flags, have_error);
+    do_to_timestamp(date_txt, fmt, collid, strict,
+                    &tm, &fsec, &fprec, &flags, have_error);
     CHECK_ERROR;

     *typmod = fprec ? fprec : -1;    /* fractional part precision */
@@ -4278,21 +4393,21 @@ on_error:
  * Parse the 'date_txt' according to 'fmt', return results as a struct pg_tm,
  * fractional seconds, and fractional precision.
  *
+ * 'collid' identifies the collation to use, if needed.
+ * 'std' specifies standard parsing mode.
+ * Bit mask of date/time/zone components found in 'fmt' is returned in 'flags',
+ * if that is not NULL.
+ * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
+ *
  * We parse 'fmt' into a list of FormatNodes, which is then passed to
  * DCH_from_char to populate a TmFromChar with the parsed contents of
  * 'date_txt'.
  *
  * The TmFromChar is then analysed and converted into the final results in
- * struct 'tm' and 'fsec'.
- *
- * Bit mask of date/time/zone components found in 'fmt' is returned in 'flags'.
- *
- * 'std' specifies standard parsing mode.
- *
- * If 'have_error' is NULL, then errors are thrown, else '*have_error' is set.
+ * struct 'tm', 'fsec', and 'fprec'.
  */
 static void
-do_to_timestamp(text *date_txt, text *fmt, bool std,
+do_to_timestamp(text *date_txt, text *fmt, Oid collid, bool std,
                 struct pg_tm *tm, fsec_t *fsec, int *fprec,
                 uint32 *flags, bool *have_error)
 {
@@ -4352,7 +4467,7 @@ do_to_timestamp(text *date_txt, text *fmt, bool std,
         /* dump_index(DCH_keywords, DCH_index); */
 #endif

-        DCH_from_char(format, date_str, &tmfc, std, have_error);
+        DCH_from_char(format, date_str, &tmfc, collid, std, have_error);
         CHECK_ERROR;

         pfree(fmt_str);
diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c
index b6fdd47..1e4dd89 100644
--- a/src/backend/utils/adt/jsonpath_exec.c
+++ b/src/backend/utils/adt/jsonpath_exec.c
@@ -1781,6 +1781,7 @@ executeDateTimeMethod(JsonPathExecContext *cxt, JsonPathItem *jsp,
     JsonbValue    jbvbuf;
     Datum        value;
     text       *datetime;
+    Oid            collid;
     Oid            typid;
     int32        typmod = -1;
     int            tz = 0;
@@ -1797,6 +1798,13 @@ executeDateTimeMethod(JsonPathExecContext *cxt, JsonPathItem *jsp,
     datetime = cstring_to_text_with_len(jb->val.string.val,
                                         jb->val.string.len);

+    /*
+     * At some point we might wish to have callers supply the collation to
+     * use, but right now it's unclear that they'd be able to do better than
+     * DEFAULT_COLLATION_OID anyway.
+     */
+    collid = DEFAULT_COLLATION_OID;
+
     if (jsp->content.arg)
     {
         text       *template;
@@ -1814,7 +1822,7 @@ executeDateTimeMethod(JsonPathExecContext *cxt, JsonPathItem *jsp,
         template = cstring_to_text_with_len(template_str,
                                             template_len);

-        value = parse_datetime(datetime, template, true,
+        value = parse_datetime(datetime, template, collid, true,
                                &typid, &typmod, &tz,
                                jspThrowErrors(cxt) ? NULL : &have_error);

@@ -1858,7 +1866,7 @@ executeDateTimeMethod(JsonPathExecContext *cxt, JsonPathItem *jsp,
                 MemoryContextSwitchTo(oldcxt);
             }

-            value = parse_datetime(datetime, fmt_txt[i], true,
+            value = parse_datetime(datetime, fmt_txt[i], collid, true,
                                    &typid, &typmod, &tz,
                                    &have_error);

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 25fb7e2..64fd3ae 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -96,11 +96,17 @@ char       *locale_monetary;
 char       *locale_numeric;
 char       *locale_time;

-/* lc_time localization cache */
-char       *localized_abbrev_days[7];
-char       *localized_full_days[7];
-char       *localized_abbrev_months[12];
-char       *localized_full_months[12];
+/*
+ * lc_time localization cache.
+ *
+ * We use only the first 7 or 12 entries of these arrays.  The last array
+ * element is left as NULL for the convenience of outside code that wants
+ * to sequentially scan these arrays.
+ */
+char       *localized_abbrev_days[7 + 1];
+char       *localized_full_days[7 + 1];
+char       *localized_abbrev_months[12 + 1];
+char       *localized_full_months[12 + 1];

 /* indicates whether locale information cache is valid */
 static bool CurrentLocaleConvValid = false;
@@ -922,6 +928,8 @@ cache_locale_time(void)
         cache_single_string(&localized_full_days[i], bufptr, encoding);
         bufptr += MAX_L10N_DATA;
     }
+    localized_abbrev_days[7] = NULL;
+    localized_full_days[7] = NULL;

     /* localized months */
     for (i = 0; i < 12; i++)
@@ -931,6 +939,8 @@ cache_locale_time(void)
         cache_single_string(&localized_full_months[i], bufptr, encoding);
         bufptr += MAX_L10N_DATA;
     }
+    localized_abbrev_months[12] = NULL;
+    localized_full_months[12] = NULL;

     CurrentLCTimeValid = true;
 }
diff --git a/src/include/utils/formatting.h b/src/include/utils/formatting.h
index 357deb9..8f62aa3 100644
--- a/src/include/utils/formatting.h
+++ b/src/include/utils/formatting.h
@@ -26,7 +26,7 @@ extern char *asc_tolower(const char *buff, size_t nbytes);
 extern char *asc_toupper(const char *buff, size_t nbytes);
 extern char *asc_initcap(const char *buff, size_t nbytes);

-extern Datum parse_datetime(text *date_txt, text *fmt, bool std,
+extern Datum parse_datetime(text *date_txt, text *fmt, Oid collid, bool strict,
                             Oid *typid, int32 *typmod, int *tz,
                             bool *have_error);

diff --git a/src/test/regress/expected/collate.linux.utf8.out b/src/test/regress/expected/collate.linux.utf8.out
index 37c6add..f06ae54 100644
--- a/src/test/regress/expected/collate.linux.utf8.out
+++ b/src/test/regress/expected/collate.linux.utf8.out
@@ -461,6 +461,22 @@ SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");
  01 NİS 2010
 (1 row)

+-- to_date
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY');
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('01 Şub 2010', 'DD TMMON YYYY');
+  to_date
+------------
+ 02-01-2010
+(1 row)
+
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+ERROR:  invalid value "1234567890ab" for "MONTH"
+DETAIL:  The given value did not match any of the allowed values for this field.
 -- backwards parsing
 CREATE VIEW collview1 AS SELECT * FROM collate_test1 WHERE b COLLATE "C" >= 'bbc';
 CREATE VIEW collview2 AS SELECT a, b FROM collate_test1 ORDER BY b COLLATE "C";
diff --git a/src/test/regress/sql/collate.linux.utf8.sql b/src/test/regress/sql/collate.linux.utf8.sql
index 8c26f16..cbbd220 100644
--- a/src/test/regress/sql/collate.linux.utf8.sql
+++ b/src/test/regress/sql/collate.linux.utf8.sql
@@ -182,6 +182,12 @@ SELECT to_char(date '2010-02-01', 'DD TMMON YYYY' COLLATE "tr_TR");
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY');
 SELECT to_char(date '2010-04-01', 'DD TMMON YYYY' COLLATE "tr_TR");

+-- to_date
+
+SELECT to_date('01 ŞUB 2010', 'DD TMMON YYYY');
+SELECT to_date('01 Şub 2010', 'DD TMMON YYYY');
+SELECT to_date('1234567890ab 2010', 'TMMONTH YYYY'); -- fail
+

 -- backwards parsing


Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:


On Sun, Mar 1, 2020 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Barring objections, this seems
> committable to me.

Going once, going twice ...

You have moved this to better place, so none from me, and thank you.

Regards,

Juan José Santamaría Flecha
 

Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?= <juanjo.santamaria@gmail.com> writes:
> On Sun, Mar 1, 2020 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Going once, going twice ...

> You have moved this to better place, so none from me, and thank you.

Pushed then.

While looking over the patch one more time, I noticed some pretty poor
English in docs and error messages for the related jsonb code, so
I took it on myself to fix that while at it.

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
James Coleman
Дата:
On Tue, Mar 3, 2020 at 11:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> writes:
> On Sun, Mar 1, 2020 at 11:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Going once, going twice ...

> You have moved this to better place, so none from me, and thank you.

Pushed then.

On master with a clean build (and configure re-run) and a fresh init-db, I'm seeing the collate.linux.utf8 test fail with the attached diff.

Is there something I need to reconfigure, install on my machine (elementary os, so based on Ubuntu) for this to pass?

Thanks,
James
Вложения

Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
James Coleman <jtc331@gmail.com> writes:
> On master with a clean build (and configure re-run) and a fresh init-db,
> I'm seeing the collate.linux.utf8 test fail with the attached diff.

 -- to_char
 SET lc_time TO 'tr_TR';
+ERROR:  invalid value for parameter "lc_time": "tr_TR"
 SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');

Looks like you may not have Turkish locale installed?  Try

locale -a | grep tr_TR

If you don't see "tr_TR.utf8" or some variant spelling of that,
the collate.linux.utf8 test is not gonna pass.  The required
package is probably some sub-package of glibc.

A workaround if you don't want to install more stuff is to run the
regression tests in C locale, so that that test script gets skipped.

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
James Coleman
Дата:
On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> On master with a clean build (and configure re-run) and a fresh init-db,
> I'm seeing the collate.linux.utf8 test fail with the attached diff.

 -- to_char
 SET lc_time TO 'tr_TR';
+ERROR:  invalid value for parameter "lc_time": "tr_TR"
 SELECT to_char(date '2010-02-01', 'DD TMMON YYYY');

Looks like you may not have Turkish locale installed?  Try

locale -a | grep tr_TR

If you don't see "tr_TR.utf8" or some variant spelling of that,
the collate.linux.utf8 test is not gonna pass.  The required
package is probably some sub-package of glibc.

A workaround if you don't want to install more stuff is to run the
regression tests in C locale, so that that test script gets skipped.

                        regards, tom lane

Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the utf8 version is acceptable? Or is there a non-utf8 variant?

Thanks,
James

Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
James Coleman <jtc331@gmail.com> writes:
> On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Looks like you may not have Turkish locale installed?  Try
>> locale -a | grep tr_TR

> Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
> utf8 version is acceptable? Or is there a non-utf8 variant?

Hmm ... I'm far from an expert on the packaging of locale data, but
the simplest explanation I can think of is that the tr_TR locale exists
to some extent on your machine but the LC_TIME component of that is
missing.

Do you get different results from "date" depending on the locale?
I get

$ LANG=C date
Sat Mar  7 21:44:24 EST 2020
$ LANG=tr_TR.utf8 date
Cts Mar  7 21:44:26 EST 2020

on my Fedora 30 box.

Another possibility perhaps is that you have partial locale settings
in your environment that are bollixing the test.  Try

$ env | grep ^LANG
$ env | grep ^LC_

If there's more than one relevant environment setting, and they
don't all agree, I'm not sure what would happen with our
regression tests.

BTW, what platform are you using anyway?

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Juan José Santamaría Flecha
Дата:


On Sun, Mar 8, 2020 at 3:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Looks like you may not have Turkish locale installed?  Try
>> locale -a | grep tr_TR

> Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
> utf8 version is acceptable? Or is there a non-utf8 variant?

Hmm ... I'm far from an expert on the packaging of locale data, but
the simplest explanation I can think of is that the tr_TR locale exists
to some extent on your machine but the LC_TIME component of that is
missing.

 AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), is not the same as 'tr_TR.utf8'.


BTW, what platform are you using anyway?

I have just checked in a Debian Stretch  

Regards,

Juan José Santamaría Flecha 

Re: Allow to_date() and to_timestamp() to accept localized names

От
James Coleman
Дата:
On Sat, Mar 7, 2020 at 9:48 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Looks like you may not have Turkish locale installed?  Try
>> locale -a | grep tr_TR

> Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
> utf8 version is acceptable? Or is there a non-utf8 variant?

Hmm ... I'm far from an expert on the packaging of locale data, but
the simplest explanation I can think of is that the tr_TR locale exists
to some extent on your machine but the LC_TIME component of that is
missing.

Do you get different results from "date" depending on the locale?
I get

$ LANG=C date
Sat Mar  7 21:44:24 EST 2020
$ LANG=tr_TR.utf8 date
Cts Mar  7 21:44:26 EST 2020

$ LANG=C date
Sun Mar  8 09:27:50 EDT 2020
$ LANG=tr_TR.utf8 date
Paz Mar  8 09:27:54 EDT 2020
$ LANG=tr_TR date
Sun Mar  8 09:27:57 EDT 2020

I'm curious if you get something different for that last one (no utf8 qualifier).

on my Fedora 30 box.

Another possibility perhaps is that you have partial locale settings
in your environment that are bollixing the test.  Try

$ env | grep ^LANG

This gives me:
LANG=en_US.UTF-8
LANGUAGE=en_US
 
$ env | grep ^LC_

Nothing for this.
 
If there's more than one relevant environment setting, and they
don't all agree, I'm not sure what would happen with our
regression tests.

There's LANG and LANGUAGE but they look like they effectively agree to me?
 
BTW, what platform are you using anyway?

ElementaryOS 5.1 Hera, which is built on top of Ubuntu 18.04.3 LTS (and Linux 4.15.0-72-generic).

This suggests I have the standard Ubuntu Turkish language packages installed:

$ dpkg -l | grep language-pack-tr
ii  language-pack-tr                              1:18.04+20180712                                all          translation updates for language Turkish
ii  language-pack-tr-base                         1:18.04+20180712                                all          translations for language Turkish

James

Re: Allow to_date() and to_timestamp() to accept localized names

От
James Coleman
Дата:
On Sun, Mar 8, 2020 at 7:17 AM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:


On Sun, Mar 8, 2020 at 3:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Looks like you may not have Turkish locale installed?  Try
>> locale -a | grep tr_TR

> Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
> utf8 version is acceptable? Or is there a non-utf8 variant?

Hmm ... I'm far from an expert on the packaging of locale data, but
the simplest explanation I can think of is that the tr_TR locale exists
to some extent on your machine but the LC_TIME component of that is
missing.

 AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), is not the same as 'tr_TR.utf8'.

The test name implies it's about utf8, though, which makes me wonder if the test should be testing utf8 instead?

That being said, a bit more googling based on your node about the proper ISO encoding turned up this page: https://unix.stackexchange.com/a/446762

And I confirmed that the locale you mentioned is available:
$ grep "tr_TR" /usr/share/i18n/SUPPORTED
tr_TR.UTF-8 UTF-8    
tr_TR ISO-8859-9

So I tried:
echo tr_TR.ISO-8859-9 >> /var/lib/locales/supported.d/local # In a root session
sudo dpkg-reconfigure locales

That didn't seem to fix it, though `locale -a` still only lists tr_TR.utf8, so I'm still at a loss, and also unclear why a test names utf8 is actually relying on an ISO encoding.

James

Re: Allow to_date() and to_timestamp() to accept localized names

От
James Coleman
Дата:
On Sun, Mar 8, 2020 at 10:42 AM James Coleman <jtc331@gmail.com> wrote:
On Sun, Mar 8, 2020 at 7:17 AM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:


On Sun, Mar 8, 2020 at 3:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> On Sat, Mar 7, 2020 at 9:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Looks like you may not have Turkish locale installed?  Try
>> locale -a | grep tr_TR

> Hmm, when I grep the locales I see `tr_TR.utf8` in the output. I assume the
> utf8 version is acceptable? Or is there a non-utf8 variant?

Hmm ... I'm far from an expert on the packaging of locale data, but
the simplest explanation I can think of is that the tr_TR locale exists
to some extent on your machine but the LC_TIME component of that is
missing.

 AFAICS, the locale 'tr_TR' uses the encoding ISO-8859-9 (LATIN5), is not the same as 'tr_TR.utf8'.

The test name implies it's about utf8, though, which makes me wonder if the test should be testing utf8 instead?

That being said, a bit more googling based on your node about the proper ISO encoding turned up this page: https://unix.stackexchange.com/a/446762

And I confirmed that the locale you mentioned is available:
$ grep "tr_TR" /usr/share/i18n/SUPPORTED
tr_TR.UTF-8 UTF-8    
tr_TR ISO-8859-9

So I tried:
echo tr_TR.ISO-8859-9 >> /var/lib/locales/supported.d/local # In a root session
sudo dpkg-reconfigure locales

That didn't seem to fix it, though `locale -a` still only lists tr_TR.utf8, so I'm still at a loss, and also unclear why a test names utf8 is actually relying on an ISO encoding.

Another update:

Since sudo dpkg-reconfigure locales opens up an ncurses gui on my machine, I tried selecting the tr_TR.ISO-8859-9 option there and removed the /var/lib/locales/supported.d/local file. Now I get:

$ locale -a | grep tr_TR
tr_TR
tr_TR.iso88599
tr_TR.utf8

And now `make check` passes.

I'm still interested in understanding why we're using the ISO locale instead of the utf8 one in a utf8-labeled test though.

Thanks,
James

Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
James Coleman <jtc331@gmail.com> writes:
> Since sudo dpkg-reconfigure locales opens up an ncurses gui on my machine,
> I tried selecting the tr_TR.ISO-8859-9 option there and removed the
> /var/lib/locales/supported.d/local file. Now I get:

> $ locale -a | grep tr_TR
> tr_TR
> tr_TR.iso88599
> tr_TR.utf8

> And now `make check` passes.

Hm.

> I'm still interested in understanding why we're using the ISO locale
> instead of the utf8 one in a utf8-labeled test though.

We are not.  My understanding of the rules about this is that the
active LC_CTYPE setting determines the encoding that libc uses,
period.  The encoding suffix on the locale name only makes a
difference when LC_CTYPE is being specified (or derived from LANG or
LC_ALL), not any other LC_XXX setting --- although for consistency
they'll let you include it in any LC_XXX value.

We could of course choose to spell LC_TIME as 'tr_TR.utf8' in this
test, but that would open up the question of whether that causes
problems on platforms where the canonical spelling of the encoding
suffix is different (eg "UTF-8").  I'm disinclined to take that risk
without positive proof that it helps on some platform.

My guess about what was actually happening on your machine is that
the underlying locale data only exists in iso-8859-9 form, and that
glibc normally translates that to utf8 on-the-fly when it's demanded
... but if the data isn't installed at all, nothing happens.

On my Red Hat platforms, this installation choice seems pretty
all-or-nothing, but it sounds like Debian lets you chop it up
more finely (and maybe slice off your finger while at it :-().

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
I wrote:
> James Coleman <jtc331@gmail.com> writes:
>> I'm still interested in understanding why we're using the ISO locale
>> instead of the utf8 one in a utf8-labeled test though.

> We are not.  My understanding of the rules about this is that the
> active LC_CTYPE setting determines the encoding that libc uses,
> period.  The encoding suffix on the locale name only makes a
> difference when LC_CTYPE is being specified (or derived from LANG or
> LC_ALL), not any other LC_XXX setting --- although for consistency
> they'll let you include it in any LC_XXX value.

Oh wait --- I'm wrong about that.  Looking at the code in pg_locale.c,
what actually happens is that we get data in the codeset implied by
the LC_TIME setting and then translate it to the database encoding
(cf commit 7ad1cd31b).  So if bare "tr_TR" is taken as implying
iso-8859-9, which seems likely (it appears to work that way here,
anyway) then this test is exercising the codeset translation path.
We could change the test to say 'tr_TR.utf8' but that would give us
less test coverage.

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
James Coleman
Дата:
On Sun, Mar 8, 2020 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I wrote:
> James Coleman <jtc331@gmail.com> writes:
>> I'm still interested in understanding why we're using the ISO locale
>> instead of the utf8 one in a utf8-labeled test though.

> We are not.  My understanding of the rules about this is that the
> active LC_CTYPE setting determines the encoding that libc uses,
> period.  The encoding suffix on the locale name only makes a
> difference when LC_CTYPE is being specified (or derived from LANG or
> LC_ALL), not any other LC_XXX setting --- although for consistency
> they'll let you include it in any LC_XXX value.

Oh wait --- I'm wrong about that.  Looking at the code in pg_locale.c,
what actually happens is that we get data in the codeset implied by
the LC_TIME setting and then translate it to the database encoding
(cf commit 7ad1cd31b).  So if bare "tr_TR" is taken as implying
iso-8859-9, which seems likely (it appears to work that way here,
anyway) then this test is exercising the codeset translation path.
We could change the test to say 'tr_TR.utf8' but that would give us
less test coverage.

So just to confirm I understand, that implies that the issue is solely that only the utf8 tr_TR set is installed by default on this machine, and the iso-8859-9 set is a hard requirement (that is, the test is explicitly testing a codepath that generates utf8 results from a non-utf8 source)?

If so, I'm going to try a bare Ubuntu install on a VM and see what locales are installed by default for Turkish.

If in fact Ubuntu doesn't install this locale by default, then is this a caveat we should add to developer docs somewhere? It seems odd to me that I'd be the only one encountering it, but OTOH I would have thought this a fairly vanilla install too...

James

Re: Allow to_date() and to_timestamp() to accept localized names

От
Tom Lane
Дата:
James Coleman <jtc331@gmail.com> writes:
> So just to confirm I understand, that implies that the issue is solely that
> only the utf8 tr_TR set is installed by default on this machine, and the
> iso-8859-9 set is a hard requirement (that is, the test is explicitly
> testing a codepath that generates utf8 results from a non-utf8 source)?

It's not "explicitly" testing that; the fact that "tr_TR" is treated
as "tr_TR.iso88599" is surely an implementation artifact.  (On macOS,
some experimentation shows that "tr_TR" is treated as "tr_TR.UTF-8".)
But yeah, I think it's intentional that we want the codeset translation
path to be exercised here on at least some platforms.

> If in fact Ubuntu doesn't install this locale by default, then is this a
> caveat we should add to developer docs somewhere? It seems odd to me that
> I'd be the only one encountering it, but OTOH I would have thought this a
> fairly vanilla install too...

Not sure.  The lack of prior complaints points to this not being a
common situation.  It does seem weird that they'd set things up so
that "tr_TR.utf8" exists but not "tr_TR"; even if that's not an
outright packaging mistake, it seems like a POLA violation from here.

            regards, tom lane



Re: Allow to_date() and to_timestamp() to accept localized names

От
James Coleman
Дата:
On Sun, Mar 8, 2020 at 6:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
James Coleman <jtc331@gmail.com> writes:
> So just to confirm I understand, that implies that the issue is solely that
> only the utf8 tr_TR set is installed by default on this machine, and the
> iso-8859-9 set is a hard requirement (that is, the test is explicitly
> testing a codepath that generates utf8 results from a non-utf8 source)?

It's not "explicitly" testing that; the fact that "tr_TR" is treated
as "tr_TR.iso88599" is surely an implementation artifact.  (On macOS,
some experimentation shows that "tr_TR" is treated as "tr_TR.UTF-8".)
But yeah, I think it's intentional that we want the codeset translation
path to be exercised here on at least some platforms.

I idly wonder if there could/should be some tests for the implicit case, some explicitly testing the codeset translation (if possible), and some testing the explicit utf8 case...but I don't know enough about this area to push for anything.
 
> If in fact Ubuntu doesn't install this locale by default, then is this a
> caveat we should add to developer docs somewhere? It seems odd to me that
> I'd be the only one encountering it, but OTOH I would have thought this a
> fairly vanilla install too...

Not sure.  The lack of prior complaints points to this not being a
common situation.  It does seem weird that they'd set things up so
that "tr_TR.utf8" exists but not "tr_TR"; even if that's not an
outright packaging mistake, it seems like a POLA violation from here.

                        regards, tom lane

All right. I downloaded an Ubuntu 18.04.3 server VM from OSBoxes, and it had very few locales installed by default...so that wasn't all that helpful.

I think at this point I'll just leave this as a mystery, much as I hate that.

James